A collection of utilities for reading and modifying EOL graph databases.
Clone the repository.
Copy config/config2.sample.yml
to config/config2.yml
and modify as follows to make it
appropriate for your local plotter
installation.
- Configure neo4j URL, user name, password, and graphdb name
- Set
locations: workspace: path:
to a local directory where the plotter scripts can put results and intermediate files. Personally I set this to directory.plotter_workspace
in my home directory. - Set
prod_pub: token_file:
to be the local file that contains (or will contain) a token for an EOL v3 API endpoint; similarly forbeta_pub
. (see note, below)- Obtain a read token for the graphdb, and if you'll be writing as well, also obtain a write token. The write token must be associated with an admin account; a read token can be associated with any account but using a non-admin account is a bit more secure. If using an admin account the read and write tokens will be the same.
- To obtain a token, go to
https://eol.org/services/authenticate
orhttps://beta.eol.org/services/authenticate
as the case may be (see API documentation). - Put the token into a file.
- Set
prod_pub: token_file:
(orbeta_pub: token_file:
orprod_pub: update_token_file:
etc.) to be the path to that file.
- Configure the staging server
locations: staging:
. The directory on the staging server has to be writable via thersync
command, which is used by plotter scripts, and it has to be readable by the Neo4j server(s) via HTTP. The examples inconfig2.sample.yml
should provide guidance on how to set these variables.rsync_command
specifies thersync
-like command to use to transfer local files to the staging server (this string does not include the source or target).
If not specified, defaults torsync -av
.rsync_specifier
should be either a local path orhost:path
as would be understood by anssh
command.host
is either a DNS name or a name configured in~/.ssh/config
. If a local path, then neo4j should be running locally and the path should be the path to neo4j'simport
directory. If remote, the directory should be one that is exposed by that host's HTTP server.- (It may be possible to use the neo4j
import
directory withfile:///
URLs, but I haven't tried it.) url
gives the prefix designating thersync_specifier
directory when accessed using HTTP. It will be used in the URLs that will occur in neo4jLOAD CSV
commands.- In order for
rsync
to work, appropriatessh
credentials have to be in place. They can be specified in~/.ssh/config
or with an -I argument in thersync_command
.
Most of the rake
commands require a CONF=
parameter to specify
which publishing and/or graphdb server is to be contacted (and, in
turn, which EOL repository (content) server is to be consulted). The
graphdb choices are listed in config2.yml
but are typically test
(for a private testing instance), beta
(EOL beta instance), or
prod
(EOL production instance).
As a simple first test of installation, try
rake resource:info CONF=beta ID=40
which will display a bunch of information about resource 40 on the beta publishing instance.
Plotter can generate a zip file with a dump of all trait records. See doc/trait-bank-dumps.md.
Plotter can do "branch painting" or inference of traits through the taxonomic hierarchy. See doc/branch-painting.md.
There is a script for copying resource metadata into the graphdb. For example:
rake traits:sync_resource_metadata CONF=prod
ensures that there is a neo4j Resource
node for every resource known
to the production publishing server, and sets the resource_id
,
name
, description
, and repository_id
property of each.
Normally it won't be necessary to invoke this script since this information is also transferred on a 'publish'.
There are scripts for adding ranks (see rake --tasks hierarchy:sync_metadata
)
and vernacular names (see rake --tasks vernaculars
)
to the graphdb. It is also possible load a dynamic hierarchy into a test instance
(see rake --tasks hierarchy
).
There are some general scripts for manipulating CSV files, and in particular for comparing them ('delta' or 'diff'). See doc/csv.md.
See rake --tasks resource
.
Workspace root comes from config2.yml (via system.rb). Default is /home/jar/.plotter_workspace, which should be changed.
Sometimes the choice of location is a bit arbitrary or even wrong.
In general artifacts that depend only on the repository goes in
xxx_repo
(for xxx = prod, beta, test); if it depends on the
additionally on the publishing relational database it goes in
xxx_pub
; and if it depends additionally on the graphdb it goes in
xxx
.
(workspace root)/ dwca/ NNN/ ... one directory per DwCA ... id = final 8 chars of uuid properties.json ... metadata for this DwCA dwca.zip or dwca.tgz unpacked: meta.xml (all the other .tsv or .csv files) prod/ resources.csv prod_pub/ resources/ ID/ ... one per publishing resource id inferences/ - temporary directory for intermediate files inferences.csv prod_repo/ resources/ REPO_ID/ ... one per repository resource id page_id_map.csv pages/ accepted.csv.chunks/ beta/ beta_repo/ ... same structure as or production ...
An 'instance' is a triple (graphdb, publishing instance, repository instance).
Lots of things to test. For end to end tests we need to look at:
- painter - branch painting
- traits_dumper - copy traits from graphdb to a set of files
- traits_loader - inverse of traits_dumper
- resource - copy vernaculars into graphdb
- cypher - run a single cypher query
- hierarchy - dynamic hierarchy load, diff, and patch
- instance - flush caches (this has fallen into disrepair)
The 'concordance' feature is not currently working.
- Make sure you have a version of the
bundle
command with version >= 2.2.10. I didgem install bundler
followed bybundle
to accomplish this. - Make a new configuration file with
cp config/config2.sample.yml config/config2.yml
. - Add details to
config/config2.yml
based on what you did for your previous config fileconfig/config.yml
. The structure and syntax are slightly different but overall the details are mostly the same. I hope that this step is self-evident. - Traits dumps now go in workspace subdirectory
prod/trait_dumps
(for production). Modify scripts for this new location as necessary. - Traits dump temp files will now go in workspace subdirectory
prod/trait_dumps/tmp/
- Be sure to include
R
among the rsync flags (seersync_command:
)