Status badges: Travis build for ala-dev branch | GBIF Jenkins build for dev branch | Sonar
This module adds the functionality required by the Living Atlases to facilitate the replacement of biocache-store for data ingress.

For details on the GBIF implementation, see the pipelines GitHub repository. This project is focussed on extensions to that architecture to support use by the Living Atlases.
Above is a representation of the data flow from source data in Darwin Core archives supplied by data providers, through to API access to these data via the biocache-service component.

Within the "Interpreted AVRO" box is a list of "transforms", each of which takes the source data and produces an isolated output in an AVRO-formatted file.

GBIF's pipelines project already supports a number of core transforms for handling biodiversity occurrence data. The Living Atlas pipelines extensions make use of these transforms "as-is" where possible and extend existing transforms where required.

For information on how the architectures of the legacy biocache-store system and pipelines differ, see this page.
The pipelines work has necessitated some minor API additions and changes to the following components:

A version 3.x of biocache-service (pipelines branch) is in development. This will not use Cassandra for the storage of occurrence records, but Cassandra is still required for the storage of user assertions and query identifiers (used to store large query parameters such as WKT strings).

A simple Dropwizard wrapper around the ala-name-matching library has been developed to support integration with pipelines. This service is packaged in a Docker container and is deployed as a service using Ansible. For testing locally, use the docker-compose files.

A simple Dropwizard wrapper around the ala-sensitive-data-service library has been developed to support integration with pipelines. This service is packaged in a Docker container and is deployed as a service using Ansible. For testing locally, use the docker-compose files.
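Once the name matching container is running, a quick smoke test can be done from the command line. This is a hedged sketch: the port (9179) and the `/api/search` endpoint are assumptions based on a typical local setup, so verify them against your docker-compose file and configuration.

```
# Hypothetical smoke test of the locally running name matching service
# (port 9179 and /api/search?q= are assumptions — check your docker-compose setup)
curl "http://localhost:9179/api/search?q=Acacia"
```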
Ansible scripts have been developed and are available here. Below are some instructions for setting up a local development environment for pipelines. These steps will load a dataset into a SOLR index.
- Java 8 - this is mandatory (see GBIF pipelines documentation)
- Maven needs to run with OpenJDK 1.8: edit `~/.mavenrc` and add `export JAVA_HOME=[JDK1.8 PATH]` (see the sketch after this list)
- Docker Desktop
- The Lombok plugin for IntelliJ needs to be installed for the `@Slf4j` annotation
- Install `docopts` using the prebuilt binary option
- Install `yq` via Brew (`brew install yq`)
- Optionally install the `avro-tools` package via Brew (`brew install avro-tools`)
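For reference, a minimal `~/.mavenrc` might look like the following. This is an example only: the `java_home` lookup is macOS-specific, so point `JAVA_HOME` at wherever your JDK 1.8 actually lives.

```
# ~/.mavenrc — make Maven run with JDK 1.8 (example only, adjust for your machine)
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
```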
- Download shape files from here and expand into the `/data/pipelines-shp` directory
- Download a test Darwin Core archive (e.g. https://archives.ala.org.au/archives/gbif/dr893/dr893.zip)
- Create the following directory: `/data/pipelines-data`
- Build with Maven: `mvn clean package`
- Start the required docker containers:

  ```
  docker-compose -f pipelines/src/main/docker/ala-name-service.yml up -d
  docker-compose -f pipelines/src/main/docker/solr8.yml up -d
  docker-compose -f pipelines/src/main/docker/ala-sensitive-data-service.yml up -d
  ```

  Note: `ala-sensitive-data-service.yml` can be omitted if you don't need to run the SDS pipeline, but you'll then need to add the following to `configs/la-pipelines-local.yaml`:

  ```
  index:
    includeSensitiveDataChecks: false
  ```

- `cd scripts`
- To convert the DwCA to AVRO, run `./la-pipelines dwca-avro dr893`
- To interpret, run `./la-pipelines interpret dr893 --embedded`
- To mint UUIDs, run `./la-pipelines uuid dr893 --embedded`
- (Optional) To sample, run `./la-pipelines sample dr893 --embedded`
- To setup SOLR:
  - Run `cd ../solr/scripts` and then run `./update-solr-config.sh`
  - Run `cd ../../scripts`
- To create index AVRO files, run `./la-pipelines index dr893 --embedded`
- To generate the SOLR index, run `./la-pipelines solr dr893 --embedded`
- Check the SOLR index has records by visiting http://localhost:8983/solr/#/biocache/query and clicking the "Execute query" button. It should show a non-zero number for `numFound` in the JSON response (a command-line alternative is sketched after the help output below).
- Run `./la-pipelines -h` for help and more steps:
```
LA-Pipelines data ingress utility.

The la-pipelines can be executed to run all the ingress steps or only a few of them:

Pipeline ingress steps:

    ┌───── do-all ───────────────────────────────────────────────┐
    │                                                             │
    dwca-avro --> interpret --> validate --> uuid --> image-sync ... │
        --> image-load --> sds --> index --> sample --> jackknife --> solr
(...)
```
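As a command-line alternative to the admin UI check above, the same `numFound` sanity check can be done by querying SOLR directly. This is a minimal sketch assuming the default local SOLR port and the `biocache` collection used in the steps above:

```
# Should return a JSON response with a non-zero "numFound"
curl "http://localhost:8983/solr/biocache/select?q=*:*&rows=0"
```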
Tests follow the GBIF/failsafe/surefire convention.

All integration tests have a suffix of "IT".

All JUnit tests are run with `mvn package` and integration tests are run with `mvn verify`.
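For example, the commands below show the two flavours of run, plus running a single integration test class via failsafe's `it.test` property; the class name used is a placeholder, not a real test in this repository:

```
# Run unit tests only
mvn package

# Run unit and integration tests
mvn verify

# Run a single integration test class (placeholder name)
mvn verify -Dit.test=SomePipelineIT
```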
The integration tests will automatically start docker containers for the following:
- SOLR
- Elastic
- Name matching service
- SDS
For code style and tooling, see the recommendations on the GBIF pipelines project. In particular, note the project uses Project Lombok; please install the Lombok plugin for IntelliJ IDEA.
`avro-tools` is recommended to aid development by providing quick views of AVRO outputs. This can be installed on Macs with Homebrew like so:

```
brew install avro-tools
```
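For example, to dump an interpreted output as JSON (the path below is illustrative only; the actual layout under `/data/pipelines-data` depends on your dataset and the pipeline step):

```
# Pretty-print records from an interpreted AVRO file (path is a placeholder)
avro-tools tojson /data/pipelines-data/dr893/1/interpreted/<transform>/<file>.avro | head
```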