Status badges: Travis build for ala-dev branch | GBIF Jenkins build for dev branch | Sonar
This module adds the functionality required by the Living Atlases to facilitate the replacement of biocache-store for data ingress.

For details on the GBIF implementation, see the pipelines GitHub repository. This project is focussed on extensions to that architecture to support use by the Living Atlases.
Above is a representation of the data flow from source data in Darwin Core archives supplied by data providers, through to API access to these data via the biocache-service component.

Within the "Interpreted AVRO" box is a list of "transforms", each of which takes the source data and produces an isolated output in an AVRO-formatted file.

GBIF's pipelines project already supports a number of core transforms for handling biodiversity occurrence data. The Living Atlas pipelines extensions make use of these transforms "as-is" where possible and extend existing transforms where required.

For information on how the architectures of the legacy biocache-store system and pipelines differ, see this page.
The pipelines work has necessitated some minor API additions and changes to the following components:

A version 3.x of biocache-service (pipelines branch) is in development. This will not use Cassandra for the storage of occurrence records, but Cassandra is still required for the storage of user assertions and query identifiers (used to store large query parameters such as WKT strings).

A simple Dropwizard wrapper around the ala-name-matching library has been developed to support integration with pipelines. This service is packaged in a Docker container and is deployed as a service using Ansible. For testing locally, use the docker-compose files.

A simple Dropwizard wrapper around the ala-sensitive-data-service library has been developed to support integration with pipelines. This service is packaged in a Docker container and is deployed as a service using Ansible. For testing locally, use the docker-compose files.
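Once the name matching container is running, a quick smoke test can be done from the command line. This is a hedged sketch: the port (9179) and the `/api/search` endpoint are assumptions based on a typical local setup, so verify them against your docker-compose file and configuration.

```
# Hypothetical smoke test of the locally running name matching service
# (port 9179 and /api/search?q= are assumptions — check your docker-compose setup)
curl "http://localhost:9179/api/search?q=Acacia"
```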
Ansible scripts have been developed and are available here. Below are some instructions for setting up a local development environment for pipelines. These steps will load a dataset into a SOLR index.
- Java 8 - this is mandatory (see GBIF pipelines documentation)
- Maven needs to run with OpenJDK 1.8: edit `~/.mavenrc` and add `export JAVA_HOME=[JDK1.8 PATH]` (see the sketch after this list)
- Docker Desktop
- The Lombok plugin for IntelliJ needs to be installed for the `@Slf4j` annotation
- Install `docopts` using the prebuilt binary option
- Install `yq` via Brew (`brew install yq`)
- Optionally install the `avro-tools` package via Brew (`brew install avro-tools`)
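For reference, a minimal `~/.mavenrc` might look like the following. This is an example only: the `java_home` lookup is macOS-specific, so point `JAVA_HOME` at wherever your JDK 1.8 actually lives.

```
# ~/.mavenrc — make Maven run with JDK 1.8 (example only, adjust for your machine)
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
```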
- Download shape files from here and expand into the `/data/pipelines-shp` directory
- Download a test Darwin Core archive (e.g. https://archives.ala.org.au/archives/gbif/dr893/dr893.zip)
- Create the following directory: `/data/pipelines-data`
- Build with Maven: `mvn clean package`
- Start the required docker containers:

  ```
  docker-compose -f pipelines/src/main/docker/ala-name-service.yml up -d
  docker-compose -f pipelines/src/main/docker/solr8.yml up -d
  docker-compose -f pipelines/src/main/docker/ala-sensitive-data-service.yml up -d
  ```

  Note: `ala-sensitive-data-service.yml` can be omitted if you don't need to run the SDS pipeline, but you'll then need to add the following to `configs/la-pipelines-local.yaml`:

  ```
  index:
    includeSensitiveDataChecks: false
  ```

- `cd scripts`
- To convert the DwCA to AVRO, run `./la-pipelines dwca-avro dr893`
- To interpret, run `./la-pipelines interpret dr893 --embedded`
- To mint UUIDs, run `./la-pipelines uuid dr893 --embedded`
- (Optional) To sample, run `./la-pipelines sample dr893 --embedded`
- To setup SOLR:
  - Run `cd ../solr/scripts` and then run `./update-solr-config.sh`
  - Run `cd ../../scripts`
- To create index AVRO files, run `./la-pipelines index dr893 --embedded`
- To generate the SOLR index, run `./la-pipelines solr dr893 --embedded`
- Check the SOLR index has records by visiting http://localhost:8983/solr/#/biocache/query and clicking the "Execute query" button. It should show a non-zero number for `numFound` in the JSON response (a command-line alternative is sketched after the help output below).
- Run `./la-pipelines -h` for help and more steps:
```
LA-Pipelines data ingress utility.

The la-pipelines can be executed to run all the ingress steps or only a few of them:

Pipeline ingress steps:

    ┌───── do-all ───────────────────────────────────────────────┐
    │                                                             │
    dwca-avro --> interpret --> validate --> uuid --> image-sync ... │
        --> image-load --> sds --> index --> sample --> jackknife --> solr
(...)
```
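As a command-line alternative to the admin UI check above, the same `numFound` sanity check can be done by querying SOLR directly. This is a minimal sketch assuming the default local SOLR port and the `biocache` collection used in the steps above:

```
# Should return a JSON response with a non-zero "numFound"
curl "http://localhost:8983/solr/biocache/select?q=*:*&rows=0"
```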
Tests follow the GBIF/failsafe/surefire convention.

All integration tests have a suffix of "IT".

All JUnit tests are run with `mvn package` and integration tests are run with `mvn verify`.
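For example, the commands below show the two flavours of run, plus running a single integration test class via failsafe's `it.test` property; the class name used is a placeholder, not a real test in this repository:

```
# Run unit tests only
mvn package

# Run unit and integration tests
mvn verify

# Run a single integration test class (placeholder name)
mvn verify -Dit.test=SomePipelineIT
```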
The integration tests will automatically start docker containers for the following:
- SOLR
- Elastic
- Name matching service
- SDS
For code style and tooling, see the recommendations on the GBIF pipelines project. In particular, note the project uses Project Lombok; please install the Lombok plugin for IntelliJ IDEA.
`avro-tools` is recommended to aid development by providing quick views of AVRO outputs. This can be installed on Macs with Homebrew like so:

```
brew install avro-tools
```
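For example, to dump an interpreted output as JSON (the path below is illustrative only; the actual layout under `/data/pipelines-data` depends on your dataset and the pipeline step):

```
# Pretty-print records from an interpreted AVRO file (path is a placeholder)
avro-tools tojson /data/pipelines-data/dr893/1/interpreted/<transform>/<file>.avro | head
```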