Testing samples.earth with Gleaner

Douglas Fils edited this page May 26, 2020 · 3 revisions

About

A quick example of running Gleaner (https://gleaner.io) against the IGSN Sprint test site at https://samples.earth. I will update this page as things change during the sprint.

Run

This run uses the latest build from the sitemap_update branch [1] (version 2.0.21). The configuration file I used is at the bottom of this page. I won't detail how to use Gleaner here; see https://gleaner.io for that, which also has a screencast of a run with a slightly older version.

Run output

Just a dump of the run output from my laptop:

go run ../../cmd/gleaner/main.go -cfg config
main.go:30: EarthCube Gleaner
main.go:89: Validating access to object store
check.go:38: Verfied Gleaner bucket: gleaner.
summoner.go:16: Summoner start time: 2020-05-26 07:51:35.001556173 -0500 CDT m=+0.007901988 
resources.go:42: Parsing sitemap: https://samples.earth/sitemap.xml
resources.go:52: map[after: mode:diff]
resources.go:64: Get with no date
resources.go:73: samplesearth : 10010
acquire.go:29: Queuing URLs for samplesearth 
samplesearth    2m48s [--------------------------------------------------------------------] 100%
summoner.go:30: Summoner end time: 2020-05-26 07:54:24.867379638 -0500 CDT m=+169.873725573 
millers.go:28: Miller start time: 2020-05-26 07:54:24.867455482 -0500 CDT m=+169.873801336 
samplesearth    2m48s [--------------------------------------------------------------------] 100%
miller            --- [                                                                    ]   0%
samplesearth    2m48s [--------------------------------------------------------------------] 100%
miller          1m11s [------------------------------------------------------------------->] 100%
graphng.go:82: Assembling result graph for prefix: summoned/samplesearth to: milled/samplesearth
samplesearth    2m48s [--------------------------------------------------------------------] 100%
miller          1m11s [--------------------------------------------------------------------] 100%
graphng.go:89: Pipe copy for graph done
millers.go:70: Miller end time: 2020-05-26 07:56:01.18513208 -0500 CDT m=+266.191477866 
millers.go:71: Miller run time: 1.605295 
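
The timings in the log can be cross-checked from the m=+ monotonic offsets that Go appends to each printed timestamp. The sketch below is a small Python helper, not part of Gleaner; the function names are made up for illustration, and it works on a few lines copied from the output above.

```python
import re

# A few lines copied from the run output above (trailing spaces trimmed).
LOG = """\
summoner.go:16: Summoner start time: 2020-05-26 07:51:35.001556173 -0500 CDT m=+0.007901988
resources.go:73: samplesearth : 10010
summoner.go:30: Summoner end time: 2020-05-26 07:54:24.867379638 -0500 CDT m=+169.873725573
millers.go:28: Miller start time: 2020-05-26 07:54:24.867455482 -0500 CDT m=+169.873801336
millers.go:70: Miller end time: 2020-05-26 07:56:01.18513208 -0500 CDT m=+266.191477866
"""


def offsets(log):
    """Map (stage, 'start' | 'end') to the m=+ monotonic offset in seconds."""
    pattern = re.compile(r"(Summoner|Miller) (start|end) time: .* m=\+([0-9.]+)")
    return {(m.group(1), m.group(2)): float(m.group(3)) for m in pattern.finditer(log)}


def elapsed_seconds(log, stage):
    """Seconds between the start and end lines of one stage."""
    t = offsets(log)
    return t[(stage, "end")] - t[(stage, "start")]


def url_count(log, source):
    """Number of sitemap URLs reported for a source, e.g. 'samplesearth : 10010'."""
    match = re.search(rf"{re.escape(source)} : (\d+)", log)
    return int(match.group(1))


summon = elapsed_seconds(LOG, "Summoner")
mill = elapsed_seconds(LOG, "Miller")
print(f"summon: {summon:.1f} s, {url_count(LOG, 'samplesearth') / summon:.1f} URLs/s")
print(f"mill:   {mill:.1f} s = {mill / 60:.6f} min")
```

The Miller offsets differ by about 96.3 seconds, which is about 1.605 minutes, so the "Miller run time" value in the log appears to be reported in minutes.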

Results

The results of this run are loaded in the sprint graph at http://graph.openknowledge.network/blazegraph/#splash and are used for the simple search UI at https://samples.earth. I'm using the "samplesearth" namespace for this work. Feel free to clone it or make your own namespace if you want to load a graph and test it.
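
For anyone who wants to poke at the namespace from a script rather than the workbench, here is a rough Python sketch. It assumes the usual Blazegraph REST layout of <base>/namespace/<name>/sparql, which I have not confirmed for this particular server, and the helper names are hypothetical. Building the URL needs no network access; run_query is only defined here, not called.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Base taken from the workbench URL above; the /namespace/<name>/sparql path
# is an ASSUMPTION based on Blazegraph's usual REST layout.
BASE = "http://graph.openknowledge.network/blazegraph"
NAMESPACE = "samplesearth"

# A simple sanity check: how many triples did the load produce?
COUNT_TRIPLES = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"


def sparql_url(query, base=BASE, namespace=NAMESPACE):
    """Build a GET URL for a SPARQL query against one namespace."""
    return f"{base}/namespace/{namespace}/sparql?" + urlencode({"query": query})


def run_query(query, timeout=30):
    """Send the query and return the parsed SPARQL JSON results (needs network)."""
    request = Request(sparql_url(query), headers={"Accept": "application/sparql-results+json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


print(sparql_url(COUNT_TRIPLES))
```

If you make your own namespace, pass its name as the namespace argument to sparql_url.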

Goals

The goal here is a setup where we can fully round-trip the testing. We can:

  • Explore the generation of JSON-LD metadata graphs for samples by some of the providers
  • Test workflows to publish these JSON-LD data graphs to landing pages for the samples
  • Evaluate approaches that use robots.txt and sitemap.xml to expose these resources, along with basic modification dates and access control
  • Harvest these exposed resources to a graph, with some optional validation approaches
  • Load the generated graph into a graph database for testing
  • Test some simple searches (just some simple stuff at the samples.earth site for now)
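
The sitemap and landing-page steps in the list above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Gleaner does it: the sitemap entries, the example.org URLs, the landing page, and all function and class names are made up, and nothing here touches the network.

```python
import json
import xml.etree.ElementTree as ET
from datetime import date
from html.parser import HTMLParser

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content; the real one is at https://samples.earth/sitemap.xml.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/id/sample-1</loc><lastmod>2020-05-20</lastmod></url>
  <url><loc>https://example.org/id/sample-2</loc><lastmod>2020-05-25</lastmod></url>
  <url><loc>https://example.org/id/sample-3</loc></url>
</urlset>"""

# Hypothetical landing page carrying its metadata as embedded JSON-LD.
LANDING_PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Thing",
 "@id": "https://example.org/id/sample-2", "name": "Sample 2"}
</script>
</head><body>Sample 2</body></html>"""


def sitemap_entries(xml_text):
    """Yield (loc, lastmod date or None) for every <url> in a sitemap."""
    root = ET.fromstring(xml_text)
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        # Keep only the YYYY-MM-DD part so full timestamps also parse.
        yield loc, (date.fromisoformat(lastmod[:10]) if lastmod else None)


def modified_after(entries, cutoff):
    """Keep URLs changed after cutoff; URLs without lastmod cannot be ruled out."""
    return [loc for loc, lastmod in entries if lastmod is None or lastmod > cutoff]


class JSONLDExtractor(HTMLParser):
    """Collect every <script type="application/ld+json"> block as parsed JSON."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


changed = modified_after(sitemap_entries(SITEMAP_XML), date(2020, 5, 22))
print(changed)

extractor = JSONLDExtractor()
extractor.feed(LANDING_PAGE)
print(extractor.documents[0]["@id"])
```

Keeping URLs that have no lastmod is a deliberate choice in this sketch: without a date there is no way to tell whether the resource changed, so it is safer to fetch it again.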

References

Configuration

---
minio:
  address: 192.168.86.45
  port: 32768
  accessKey: youraccesskey
  secretKey: yoursecretkey
  ssl: false
  bucket: gleaner
gleaner:
  runid: sprint1 # this will be the bucket the output is placed in...
  summon: true # do we want to visit the web sites and pull down the files
  mill:  true
context:
  cache: true
contextmaps:
- prefix: "https://schema.org/"
  file: "./jsonldcontext.jsonld"
- prefix: "http://schema.org/"
  file: "./jsonldcontext.jsonld"
summoner:
  after: ""  
  mode: diff  # [time, hash, none] diff: look for difference or full: delete old objects and replace
millers:
  graph: true
  shacl: false
  #geojson: false
shaclservice:
  url: https://1bzh4a0lbd.execute-api.us-east-1.amazonaws.com/dev/verify 
shapefiles:
    #- ref: https://raw.githubusercontent.com/geoschemas-org/geoshapes/master/shapegraphs/googleRequired.ttl
    #- ref: https://raw.githubusercontent.com/geoschemas-org/geoshapes/master/shapegraphs/googleRecommendedCoverageCheck.ttl
- ref: https://gist.githubusercontent.com/fils/77d40a917a7af8020693678be30e87dd/raw/f2480cd2f304f2bc74143d4a72e72d43727942c4/westerntropicalshape.ttl
sources:
- name: samplesearth
  url: https://samples.earth/sitemap.xml   # XML version available?
  headless: false