A Ruby toolkit for managing geospatial metadata, including:
- tasks for cloning, updating, and indexing OpenGeoMetadata metadata
- library for converting metadata between standards
Add this line to your application's Gemfile:
gem 'geo_combine'
And then execute:
$ bundle install
Or install it yourself as:
$ gem install geo_combine
# Create a new ISO19139 object
> iso_metadata = GeoCombine::Iso19139.new('./tmp/opengeometadata/edu.stanford.purl/bb/338/jh/0716/iso19139.xml')
# Convert ISO to GeoBlacklight
> iso_metadata.to_geoblacklight
# Convert that to JSON
> iso_metadata.to_geoblacklight.to_json
# Convert ISO (or FGDC) to HTML
> iso_metadata.to_html
$ bundle exec rake geocombine:clone
Will clone all edu.*, org.*, and uk.* OpenGeoMetadata repositories into ./tmp/opengeometadata. Location of the OpenGeoMetadata repositories can be configured using the OGM_PATH environment variable.
$ OGM_PATH='my/custom/location' bundle exec rake geocombine:clone
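Scripts that work with the cloned repositories outside of the rake tasks can mirror the same default-and-override behaviour in plain Ruby. The following is a minimal sketch, not GeoCombine's own code: `ogm_path` and `cloned_repositories` are hypothetical helper names, and the only details taken from this README are the OGM_PATH variable and the ./tmp/opengeometadata default.

```ruby
# Hypothetical helper (not part of GeoCombine): resolve the clone directory,
# falling back to ./tmp/opengeometadata when OGM_PATH is not set.
def ogm_path(env = ENV)
  env.fetch('OGM_PATH', './tmp/opengeometadata')
end

# Hypothetical helper: names of the repository directories found under that path.
def cloned_repositories(env = ENV)
  Dir.glob(File.join(ogm_path(env), '*'))
     .select { |path| File.directory?(path) }
     .map { |path| File.basename(path) }
     .sort
end

puts "Looking for repositories in #{ogm_path}"
puts cloned_repositories
```

Passing the environment in as an argument keeps the helpers easy to exercise with a plain Hash instead of the real ENV.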
You can also specify a single repository:
$ bundle exec rake geocombine:clone[edu.stanford.purl]
Note: If you are using zsh, you will need to use escape characters in front of the brackets:
$ bundle exec rake geocombine:clone\[edu.stanford.purl\]
$ bundle exec rake geocombine:pull
Runs git pull origin master on all cloned repositories in ./tmp/opengeometadata (or custom path with configured environment variable OGM_PATH).
You can also specify a single repository:
$ bundle exec rake geocombine:pull[edu.stanford.purl]
Note: If you are using zsh, you will need to use escape characters in front of the brackets:
$ bundle exec rake geocombine:pull\[edu.stanford.purl\]
To index into Solr, GeoCombine requires a Solr instance that is running the GeoBlacklight schema:
$ bundle exec rake geocombine:index
Indexes the geoblacklight.json files in cloned repositories to a Solr index running at http://127.0.0.1:8983/solr. Solr location can also be specified by an environment variable SOLR_URL.
$ SOLR_URL=http://www.example.com:1234/solr/collection bundle exec rake geocombine:index
Depending on your Solr instance's performance characteristics, you may want to change the commitWithin parameter (in milliseconds):
$ SOLR_COMMIT_WITHIN=100 bundle exec rake geocombine:index
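To make the two settings concrete, here is a rough sketch of how they could map onto a Solr update request. It is an illustration only: `solr_update_url` is a hypothetical helper, and this README does not say how GeoCombine itself builds its requests. The sketch relies on Solr's update handler accepting commitWithin as a request parameter, and leaves the parameter out when SOLR_COMMIT_WITHIN is not set.

```ruby
require 'uri'

# Hypothetical helper (not part of GeoCombine): build a Solr update URL from
# the SOLR_URL and SOLR_COMMIT_WITHIN environment variables described above.
def solr_update_url(env = ENV)
  base = env.fetch('SOLR_URL', 'http://127.0.0.1:8983/solr')
  # A trailing slash keeps URI.join from replacing the last path segment.
  base = "#{base}/" unless base.end_with?('/')
  uri = URI.join(base, 'update')

  commit_within = env['SOLR_COMMIT_WITHIN']
  # Integer() raises on non-numeric input rather than silently sending garbage.
  uri.query = URI.encode_www_form(commitWithin: Integer(commit_within)) if commit_within
  uri.to_s
end

puts solr_update_url
```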
GeoCombine provides a Harvester class and rake task to harvest and index content from GeoBlacklight sites (or any site that follows the Blacklight API format). Given that the configurations can change from consumer to consumer and site to site, the class provides a relatively simple configuration API. This can be configured in an initializer, a wrapping rake task, or any other Ruby context where the rake task or class would be invoked.
$ bundle exec rake geocombine:geoblacklight_harvester:index[YOUR_CONFIGURED_SITE_KEY]
Only the sites themselves must be configured. The other configuration options are optional and modify the harvester's behavior.
GeoCombine::GeoBlacklightHarvester.configure do
  {
    commit_within: '10000',
    crawl_delay: 1, # All sites
    debug: true,
    SITE1: {
      crawl_delay: 2, # SITE1 only
      host: 'https://geoblacklight.example.edu',
      params: {
        f: {
          dct_provenance_s: ['Institution']
        }
      }
    },
    SITE2: {
      host: 'https://geoportal.example.edu',
      params: {
        q: '*'
      }
    }
  }
end
Crawl delays can be configured (in seconds) either globally for all sites or on a per-site basis. The harvester then waits that number of seconds between each page of search results. Note that Blacklight 7 requires many requests per results page, and the delay applies only once per page of results, not per request.
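The interaction between the global and per-site settings can be sketched in plain Ruby. This is an illustration under assumptions rather than the harvester's actual implementation: `crawl_delay_for` and `each_page_with_delay` are hypothetical names, and the precedence shown (site value first, then the global value, then no delay) is inferred from the "All sites" and "SITE1 only" comments in the configuration above.

```ruby
# Hypothetical helper: the per-site crawl_delay wins, then the global one,
# and with neither configured there is no delay.
def crawl_delay_for(config, site_key)
  site = config.fetch(site_key)
  site.fetch(:crawl_delay) { config.fetch(:crawl_delay, 0) }
end

# Hypothetical helper: yield each results page, pausing between pages
# (not before the first one). The sleeper is injectable so it can be stubbed.
def each_page_with_delay(pages, delay, sleeper: method(:sleep))
  pages.each_with_index do |page, index|
    sleeper.call(delay) if index.positive? && delay.positive?
    yield page
  end
end

# Same shape as the configuration shown above, trimmed to the relevant keys.
config = {
  crawl_delay: 1,
  SITE1: { crawl_delay: 2, host: 'https://geoblacklight.example.edu' },
  SITE2: { host: 'https://geoportal.example.edu' }
}

puts crawl_delay_for(config, :SITE1)
puts crawl_delay_for(config, :SITE2)
```

Because the pause happens once per results page, the total added time is roughly the delay multiplied by one less than the number of pages, regardless of how many requests each page triggers.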
Solr's commitWithin option can be configured (in milliseconds) by passing a value under the commit_within key.
The harvester and indexer only print output (via puts) when errors occur. To see some progress information, set the debug configuration option.
You may need to transform harvested documents for various purposes (removing fields, adding fields, omitting a document altogether, etc.). You can configure some Ruby code (a proc) that takes the document in, transforms it, and returns the transformed document. By default the indexer removes the score, timestamp, and _version_ fields from harvested documents. If you provide your own transformer, you'll likely want to remove these fields in addition to applying your own transformations.
GeoCombine::GeoBlacklightIndexer.document_transformer = -> (document) do
  # Removes "bogus_field" from the content we're harvesting
  # in addition to some other solr fields we don't want
  %w[_version_ score timestamp bogus_field].each do |field|
    document.delete(field)
  end
  document
end
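A transformer can also add fields. The sketch below builds a lambda that drops the three Solr-managed fields and stamps each document with the site it came from, then runs it against a sample Hash so it can be tried without a Solr instance. The names here are assumptions for illustration: `build_transformer` is a hypothetical helper, and `harvested_from_s` is a made-up field that is not part of the GeoBlacklight schema.

```ruby
# Hypothetical helper: returns a transformer lambda that removes the
# Solr-managed fields and records which configured site a document came from.
def build_transformer(source_site)
  lambda do |document|
    # reject returns a new Hash, so the harvested document is left unmodified.
    cleaned = document.reject { |field, _value| %w[_version_ score timestamp].include?(field) }
    # 'harvested_from_s' is an invented field name used only for this example.
    cleaned.merge('harvested_from_s' => source_site)
  end
end

transformer = build_transformer('SITE1')

# Sample document; the values are placeholders.
sample = {
  'dct_provenance_s' => 'Institution',
  'score' => 1.0,
  '_version_' => 123,
  'timestamp' => '2020-01-01T00:00:00Z'
}

p transformer.call(sample)
```

A lambda built this way could then be assigned to GeoCombine::GeoBlacklightIndexer.document_transformer in the same way as the example above.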
To run the tests, use:
$ bundle exec rake spec
- Fork it ( https://github.com/[my-github-username]/GeoCombine/fork )
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request