ETL pipeline for Alliance of Genome Resources
- Docker
- Docker-compose
- AWS access keys (project is agr_aws). Please contact Stuart Myasato or Olin Blodgett for permission on the AWS project. Instructions for getting a login token and pulling the base images once you have access keys are given below.
- Build the local image with `make build`.
- Start the Neo4j database with `make startdb`. Allow ~10 seconds for Neo4j to initialize.
  - To initialize an empty database after previously using the loader, be sure to run `make removedb` before running `make startdb`.
- Ensure that your local Docker installation has access to at least 5 GB (preferably 8 GB) of memory; otherwise the run_test target will fail with a non-intuitive error, "Cannot resolve address 'neo4j'". The memory limit can be changed in the Docker preferences.
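The memory requirement can be checked from the command line before a run. The following is a minimal sketch, not part of the Makefile: the function names are hypothetical, the 5 GB threshold is interpreted as decimal bytes (an assumption), and it relies on `docker info --format '{{.MemTotal}}'` reporting the daemon's total memory in bytes, which may be slightly lower than the configured limit.

```shell
# Minimum memory from the requirement above: 5 GB, taken here as decimal bytes (assumption).
MIN_DOCKER_MEM_BYTES=5000000000

# mem_ok BYTES: succeed if BYTES meets the minimum.
mem_ok() {
  [ "$1" -ge "$MIN_DOCKER_MEM_BYTES" ]
}

# check_docker_memory: ask the Docker daemon for its total memory and compare.
check_docker_memory() {
  total=$(docker info --format '{{.MemTotal}}') || return 1
  if mem_ok "$total"; then
    echo "Docker memory OK: $total bytes"
  else
    echo "Docker has only $total bytes; raise the limit in the Docker preferences" >&2
    return 1
  fi
}
```

Calling `check_docker_memory` before `make run_test` would surface a low memory limit directly, rather than through the "Cannot resolve address 'neo4j'" error.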
- Initialize a full load with `make run`.
- Alternatively, `make run_test` will launch a much smaller test load; this is useful for development and testing.
- Once the loader has been run (either test load or full load), unit tests can be executed via `make unit_tests`.
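For development, the steps above can be chained into one command. This is only a sketch under stated assumptions: `run_test_cycle` is a hypothetical helper that does not exist in the repository, and the fixed 10 second pause simply mirrors the startup allowance mentioned for Neo4j.

```shell
# run_test_cycle: start Neo4j, wait, run the small test load, then the unit tests.
# Each step runs only if the previous one succeeded.
run_test_cycle() {
  make startdb &&
    sleep 10 &&
    make run_test &&
    make unit_tests
}
```

Because the steps are joined with `&&`, a failed test load stops the chain before the unit tests run.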
- From your command line: `docker exec -ti neo4j bin/cypher-shell`
  - A quick command to count the number of nodes in your db: `match (n) return count (n);`
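The same node count can be run without opening an interactive shell. A minimal sketch, assuming the container is named neo4j as above and that cypher-shell accepts the query on standard input without credentials; `count_nodes` is a hypothetical helper name.

```shell
# count_nodes: pipe a Cypher query into cypher-shell inside the neo4j container.
# -i keeps stdin open for the pipe; -t is omitted because no terminal is attached.
count_nodes() {
  echo 'MATCH (n) RETURN count(n);' | docker exec -i neo4j bin/cypher-shell
}
```

This form is convenient in scripts, for example to confirm that a load produced a non-empty database.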
- Remove the database with `make removedb`.
`make reload` will re-run the Installation and Running the Loader steps from above.
`make reload_test` will re-run the same steps using a test subset of data.
- Note: reload_test will not re-download the file bolus.
- There are 3 loader configuration files that come with the system (in `src/config`). Each is set up for a particular environment and differs in the default number of threads used for downloading files and for loading the database:
  - test.yml is used when running the load with the test data set.
  - default.yml is the configuration used on all the shared systems and on production.
  - develop.yml is used for the full data set on a development system.

  Each can be modified to remove or add data types (e.g. Allele, BGI, Expression) and subtypes (e.g. ZFIN, SGD, RGD) as needed for development purposes.
- When adding a new data load, be sure to add to validation.yml as well so the system knows the expected data types and subtypes.
- local_submission_system.json is a file consumed in addition to the submission system data (from the submission system API) that is used to customize non-submission system files like ontology files.
- ALLIANCE_RELEASE - the release version that this code acts on.
- FMS_API_URL - the host from which this code pulls its available file paths (the submission system host). Note: the submission system host relies on the ferret file grabber, the pipeline responsible for keeping ontology files and GAF files up to date. The submission system also requires a snapshot to be taken in order to fetch 'latest' files.
- TEST_SCHEMA_BRANCH - if set, that branch of agr_schema will be used instead of master.
- If the site is built with docker-compose, these will be set automatically to the 'dev' versions of all these variables.
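When running outside docker-compose, it can help to confirm these variables before starting a load. The sketch below is illustrative only: `check_loader_env` is a hypothetical helper, and treating ALLIANCE_RELEASE and FMS_API_URL as required is an assumption rather than something the loader documents.

```shell
# check_loader_env: report the variables described above.
# ALLIANCE_RELEASE and FMS_API_URL are treated as required (an assumption);
# TEST_SCHEMA_BRANCH is optional and falls back to master when unset.
check_loader_env() {
  status=0
  for name in ALLIANCE_RELEASE FMS_API_URL; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set" >&2
      status=1
    else
      echo "$name=$value"
    fi
  done
  echo "TEST_SCHEMA_BRANCH=${TEST_SCHEMA_BRANCH:-master}"
  return "$status"
}
```

The function prints each value it finds and returns non-zero when one of the assumed-required variables is missing.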
AWS ECR uses a token-based authentication system, in which tokens automatically expire after 12 hours. This means frequent authentication is required to access the base linux and neo4j env images. To enable this:
- Optionally install the AWS CLI locally. Alternatively, you can use the Amazon-provided Docker image (this is what the Makefile uses). To use the Docker image for all steps below, replace the `aws` part of all commands with `docker run --rm -it -v ~/.aws:/root/.aws amazon/aws-cli`.
- Make sure you have AWS login credentials for the agr_aws account, with the AWS permission group that grants ECR access.
- Upon setup, run `aws configure` (which will generate or append/update `~/.aws/config` and `~/.aws/credentials`) and provide the following details when asked:
  - AWS Access Key ID: provide your personal access key ID.
  - AWS Secret Access Key: provide your personal Secret Access Key (only accessible on access key creation; you may need to generate a new access key if you did not store it).
  - Default region name: `us-east-1`
  - Default output format: `<enter>` (accept default)
- To test that your credentials are working correctly, run `aws ecr get-login-password` and verify a token is returned.
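If the AWS CLI is not installed locally, the substitution described above can be wrapped in a shell function so that the `aws` commands can be typed unchanged. This is a sketch, not something shipped with the repository: the docker arguments are the ones listed above, and `ecr_credentials_ok` is a hypothetical helper for the credential check.

```shell
# aws: run the AWS CLI from the Amazon-provided image instead of a local install.
# The docker arguments are those given above; -it expects an attached terminal.
aws() {
  docker run --rm -it -v "$HOME/.aws:/root/.aws" amazon/aws-cli "$@"
}

# ecr_credentials_ok: succeed only if a non-empty login token comes back.
ecr_credentials_ok() {
  token=$(aws ecr get-login-password) || return 1
  [ -n "$token" ]
}
```

With the function defined in the current shell, `aws configure` and `ecr_credentials_ok` both go through the container. Do not define it if a local AWS CLI is already installed, since the function would shadow it.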
The Makefile includes all required recipes and dependencies to automatically perform AWS authentication when required for any make target, provided that the above AWS CLI configuration is in place.
Alternatively, to manually renew the authentication token stored in .docker/config.json (for custom debugging and development), you can execute the following make command (export or pass the AWS_PROFILE environment variable value as your aws-profile name if you use a named aws profile for agr):
make registry-docker-login
Reminder: authentication needs to be renewed every time you get an error like this (usually about every 12 hours):
Error response from daemon: pull access denied for 100225593120.dkr.ecr.us-east-1.amazonaws.com/agr_neo4j_env, repository does not exist or may require 'docker login': denied: Your authorization token has expired. Reauthenticate and try again.
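For custom docker commands run outside the Makefile, the expiry error can be detected and the login renewed automatically. The following is a sketch under stated assumptions: `token_expired` and `run_with_relogin` are hypothetical helpers, and the match relies on the wording of the error message shown above.

```shell
# token_expired MESSAGE: succeed if MESSAGE contains the expiry text from the error above.
token_expired() {
  case "$1" in
    *"authorization token has expired"*) return 0 ;;
    *) return 1 ;;
  esac
}

# run_with_relogin CMD [ARGS...]: run a command; if it fails with the expiry
# error, renew the login once with the make target above and retry.
run_with_relogin() {
  if output=$("$@" 2>&1); then
    printf '%s\n' "$output"
    return 0
  fi
  if token_expired "$output"; then
    make registry-docker-login && "$@"
  else
    printf '%s\n' "$output" >&2
    return 1
  fi
}
```

A call such as `run_with_relogin docker pull <image>` retries once after renewing the token; any other failure is printed and returned unchanged.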