Testing samples.earth with Gleaner
Just a quick example of using Gleaner (https://gleaner.io) against the IGSN Sprint test site at https://samples.earth. As things change during the sprint I will update this page.
This run uses the latest build from the update_sitemap branch [1] (version 2.0.21). The configuration file I used is at the bottom of this page. I won't detail using Gleaner here; see https://gleaner.io for that, where there is also a screencast of a run with a slightly older version.
Just a dump of the run output from my laptop:
go run ../../cmd/gleaner/main.go -cfg config
main.go:30: EarthCube Gleaner
main.go:89: Validating access to object store
check.go:38: Verfied Gleaner bucket: gleaner.
summoner.go:16: Summoner start time: 2020-05-26 07:51:35.001556173 -0500 CDT m=+0.007901988
resources.go:42: Parsing sitemap: https://samples.earth/sitemap.xml
resources.go:52: map[after: mode:diff]
resources.go:64: Get with no date
resources.go:73: samplesearth : 10010
acquire.go:29: Queuing URLs for samplesearth
samplesearth 2m48s [--------------------------------------------------------------------] 100%
summoner.go:30: Summoner end time: 2020-05-26 07:54:24.867379638 -0500 CDT m=+169.873725573
millers.go:28: Miller start time: 2020-05-26 07:54:24.867455482 -0500 CDT m=+169.873801336
samplesearth 2m48s [--------------------------------------------------------------------] 100%
miller --- [ ] 0%
samplesearth 2m48s [--------------------------------------------------------------------] 100%
miller 1m11s [------------------------------------------------------------------->] 100%
graphng.go:82: Assembling result graph for prefix: summoned/samplesearth to: milled/samplesearth
samplesearth 2m48s [--------------------------------------------------------------------] 100%
miller 1m11s [--------------------------------------------------------------------] 100%
graphng.go:89: Pipe copy for graph done
millers.go:70: Miller end time: 2020-05-26 07:56:01.18513208 -0500 CDT m=+266.191477866
millers.go:71: Miller run time: 1.605295
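The `samplesearth : 10010` line in the output above is the number of URLs found in the sitemap. As a rough illustration of that step (not Gleaner's actual implementation, which is written in Go), the following sketch parses a sitemap document and collects its `<loc>` entries. The sample document and its URLs are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content; the real one lives at
# https://samples.earth/sitemap.xml and is much larger.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://samples.earth/id/example-1</loc><lastmod>2020-05-20</lastmod></url>
  <url><loc>https://samples.earth/id/example-2</loc></url>
</urlset>"""

urls = sitemap_urls(SAMPLE)
print(len(urls), "URLs found")
```

Fetching the live sitemap and passing its text to `sitemap_urls` should give a count comparable to the one Gleaner logs, assuming the site has not changed since the run.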
The results of this run are loaded in the sprint graph at http://graph.openknowledge.network/blazegraph/#splash and are used for the simple search UI present at https://samples.earth. I'm using the "samplesearth" namespace for this work. Feel free to clone it or make your own namespace if you want to load a graph and test it.
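For anyone who wants to poke at the namespace from a script, here is a minimal sketch of building a SPARQL request against it. It assumes the usual Blazegraph layout of `<base>/namespace/<name>/sparql` for the query endpoint, which I have not confirmed for this server; the function names are just for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: standard Blazegraph per-namespace endpoint layout.
BASE = "http://graph.openknowledge.network/blazegraph"
NAMESPACE = "samplesearth"


def sparql_url(query, base=BASE, namespace=NAMESPACE):
    """Build a GET URL for a SPARQL query against one namespace."""
    endpoint = f"{base}/namespace/{namespace}/sparql"
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def run_query(query, timeout=30):
    """Send the query and return the parsed SPARQL JSON results."""
    request = urllib.request.Request(
        sparql_url(query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# A simple query that counts the triples in the namespace.
COUNT_TRIPLES = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"

print(sparql_url(COUNT_TRIPLES))
```

Calling `run_query(COUNT_TRIPLES)` would issue the request; it is left out of the sketch so the example runs without network access.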
The goal here is to provide a setup where we can fully round-trip the testing. We can:
- Explore the generation of JSON-LD metadata graphs for samples by some of the providers
- Test workflows to publish these JSON-LD data graphs to landing pages for the samples
- Evaluate approaches leveraging robots.txt and sitemap.xml to expose these resources and basic modification dates and access control approaches
- Harvest these exposed resources to a graph with some optional validation approaches
- Load the generated graph into a graph database for testing
- Test some simple searches (just some simple stuff at the samples.earth site for now)
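To make the harvesting step in the list above a bit more concrete, here is a small sketch of pulling JSON-LD out of a landing page. It assumes the metadata is embedded in a `<script type="application/ld+json">` element, which is the common publishing pattern; it is not a description of how Gleaner does it internally, and the sample page is invented for the example.

```python
import json
from html.parser import HTMLParser


class JSONLDExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.documents.append(json.loads("".join(self._buffer)))


def extract_jsonld(html):
    """Return a list of JSON-LD documents found in an HTML string."""
    parser = JSONLDExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# Hypothetical landing page with one embedded JSON-LD document.
PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Thing", "name": "Example sample"}
</script>
</head><body>Landing page</body></html>"""

for doc in extract_jsonld(PAGE):
    print(doc["@type"], doc["name"])
```

Each extracted document could then be checked (for example against SHACL shapes) before being loaded into the graph.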
- [1] https://github.com/earthcubearchitecture-project418/gleaner/tree/update_sitemap
- [2] https://github.com/fils/goobjectweb
---
minio:
  address: 192.168.86.45
  port: 32768
  accessKey: youraccesskey
  secretKey: yoursecretkey
  ssl: false
  bucket: gleaner
gleaner:
  runid: sprint1 # this will be the bucket the output is placed in...
  summon: true # do we want to visit the web sites and pull down the files
  mill: true
context:
  cache: true
contextmaps:
- prefix: "https://schema.org/"
  file: "./jsonldcontext.jsonld"
- prefix: "http://schema.org/"
  file: "./jsonldcontext.jsonld"
summoner:
  after: ""
  mode: diff # [time, hash, none] diff: look for difference or full: delete old objects and replace
millers:
  graph: true
  shacl: false
  #geojson: false
shaclservice:
  url: https://1bzh4a0lbd.execute-api.us-east-1.amazonaws.com/dev/verify
shapefiles:
#- ref: https://raw.githubusercontent.com/geoschemas-org/geoshapes/master/shapegraphs/googleRequired.ttl
#- ref: https://raw.githubusercontent.com/geoschemas-org/geoshapes/master/shapegraphs/googleRecommendedCoverageCheck.ttl
- ref: https://gist.githubusercontent.com/fils/77d40a917a7af8020693678be30e87dd/raw/f2480cd2f304f2bc74143d4a72e72d43727942c4/westerntropicalshape.ttl
sources:
- name: samplesearth
  url: https://samples.earth/sitemap.xml # XML version available?
  headless: false