
Getting Started As a Developer


Language

This repository is written in Python 3. It accesses Neo4j directly, using the Cypher LOAD CSV and/or UNWIND clauses as necessary. It can run against a small "test" set of data or the entire data load set, depending on the make command that is chosen. Please see the README in this repo for how to remove the database, build the entire data set, or build a test set.
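
For orientation, the two Cypher clauses mentioned above look roughly like the sketch below, written against the official neo4j Python driver. The URI, credentials, CSV file name, and node properties are placeholders for illustration, not values from this repo.

```python
# A rough illustration of the two Cypher bulk-loading styles mentioned
# above, using the official "neo4j" Python driver. The URI, credentials,
# CSV file name, and node properties are placeholders, not repo values.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

load_csv = """
    LOAD CSV WITH HEADERS FROM 'file:///genes.csv' AS row
    MERGE (g:Gene {primaryKey: row.primaryId})
"""

unwind = """
    UNWIND $genes AS row
    MERGE (g:Gene {primaryKey: row.primaryId})
"""

with driver.session() as session:
    session.run(load_csv)  # streams rows from a CSV file on the server
    session.run(unwind, genes=[{"primaryId": "TEST:1"}])  # expands a Python list

driver.close()
```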

General Code Structure

The purpose of the agr_loader repository is to push data from the participating organizations into a data store that is regenerated completely on each build.

src/

holds all the loader code

schemas/

is a submodule of the agr_schemas repo, used to validate the data files as they are loaded.
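
As a rough illustration, validating a data file against one of those schemas could look like the sketch below; the jsonschema package and both file paths are assumptions for illustration, not necessarily this repo's actual mechanism.

```python
# A rough sketch of validating an input file against a schema from the
# agr_schemas submodule. The jsonschema package and both file paths are
# illustrative assumptions, not this repo's actual code.
import json
from jsonschema import validate, ValidationError

with open("schemas/gene/geneMetaData.json") as schema_file:
    schema = json.load(schema_file)
with open("tmp/gene_data.json") as data_file:
    data = json.load(data_file)

try:
    validate(data, schema)  # raises ValidationError on a bad file
except ValidationError as e:
    print("Input file failed schema validation:", e.message)
    raise
```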

src/fetch_index.py

has a name left over from the prototype, where this script controlled generation of the ES (Elasticsearch) index from the data files. Now it controls the load scripts only.

It executes 3 main routines:

- create_indicies - makes the indexes in the Neo4j data store
- load_from_ontologies - loads up the ontologies used in the rest of the data load
- load_from_mods - loads up the MOD data from S3, including the GAF files from GO

These three methods are found in aggregate_loader.py.

src/aggregate_loader.py

contains the control structures for executing the load. Start here when you want to add a new data load.
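
As a rough sketch of the control flow, the three routines above are invoked in order; the AggregateLoader class name and constructor here are assumptions for illustration, while the method names come from the description above.

```python
# A rough sketch of the control flow driven by src/fetch_index.py. The
# AggregateLoader class name and constructor are illustrative assumptions;
# the three method names come from the description above.
from aggregate_loader import AggregateLoader

al = AggregateLoader()
al.create_indicies()       # indexes must exist before any data is loaded
al.load_from_ontologies()  # ontologies are referenced by the MOD data
al.load_from_mods()        # MOD data from S3, including GO GAF files
```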

src/mods

species-specific classes that inherit from MOD.py, the base MOD class.
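
A hypothetical sketch of one such class is below. The base class comes from MOD.py as described above; the subclass name, attribute, and method are illustrative assumptions, not the repo's actual API.

```python
# A hypothetical sketch of a species-specific class in src/mods. The base
# class comes from MOD.py as described above; the subclass name, attribute,
# and method shown here are illustrative assumptions.
from .MOD import MOD

class ZFIN(MOD):
    species = "Danio rerio"

    def load_genes(self, batch_size, test_set):
        # Fetch this MOD's gene file from S3, parse it, and hand the
        # resulting maps to the appropriate loader (see src/extractors
        # and src/loaders below).
        ...
```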

src/files

generic methods for parsing different kinds of files
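
A hypothetical example of the kind of generic helper that lives here; the class and method names are assumptions, not the repo's actual API.

```python
# A hypothetical example of a generic file-parsing helper in src/files;
# the class and method names are assumptions, not the repo's actual API.
import gzip
import json

class JSONFile:
    def get_data(self, filename):
        # Transparently handle both gzipped and plain JSON files.
        opener = gzip.open if filename.endswith(".gz") else open
        with opener(filename, "rt") as f:
            return json.load(f)
```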

src/extractors

extractors for each data source; each parses its source and passes the data, as Python maps (dicts), to the appropriate loader in src/loaders/.
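
A hypothetical sketch of an extractor: it reshapes parsed source records into plain Python maps keyed the way the loader's query expects. All names here are illustrative assumptions.

```python
# A hypothetical sketch of an extractor in src/extractors. It reshapes
# parsed source records into plain Python maps (dicts) for the loader.
# The class name and record keys are illustrative assumptions.
class GeneExt:
    def get_data(self, records):
        gene_maps = []
        for record in records:
            gene_maps.append({
                "primaryId": record["primaryId"],
                "symbol": record["symbol"],
                "taxonId": record.get("taxonId"),
            })
        return gene_maps
```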

src/loaders

just containers that initiate a transaction (via a Transaction object) and pass the data on to the corresponding class in src/loaders/transactions.
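
A hypothetical sketch of one such container is below; the class names, module path, and constructor arguments are illustrative assumptions.

```python
# A hypothetical sketch of a loader in src/loaders. It is little more than
# a container: it creates a transaction object and passes the maps along.
# The class names, module path, and arguments are illustrative assumptions.
from loaders.transactions import GeneTransaction  # hypothetical module path

class GeneLoader:
    def __init__(self, graph):
        self.graph = graph  # a Neo4j driver/connection handle

    def load_genes(self, gene_maps):
        # Delegate the actual Cypher query to src/loaders/transactions.
        tx = GeneTransaction(self.graph)
        tx.gene_tx(gene_maps)
```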

src/loaders/transactions

each holds the Neo4j (Cypher) query that loads the data directly from its Python map.
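
A hypothetical sketch of such a transaction class, assuming the official neo4j Python driver; the class name, method name, and node properties are illustrative assumptions.

```python
# A hypothetical sketch of a transaction class in src/loaders/transactions,
# assuming the official "neo4j" Python driver. The class name, method name,
# and node properties are illustrative assumptions.
class GeneTransaction:
    def __init__(self, graph):
        self.graph = graph  # a neo4j driver instance

    def gene_tx(self, gene_maps):
        query = """
            UNWIND $genes AS row
            MERGE (g:Gene {primaryKey: row.primaryId})
            SET g.symbol = row.symbol,
                g.taxonId = row.taxonId
        """
        with self.graph.session() as session:
            # The whole batch of maps is sent as one query parameter;
            # UNWIND expands it into one row per gene inside the database.
            session.run(query, genes=gene_maps)
```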