
saved-models/rap-service


SAVED RAP service: model validation pipeline

SAVED: Sustainable Aquaculture: Validating Sea Lice Dispersal [models]

SAVED is a SAIC-funded effort led by the Scottish Government Marine Directorate with academic and industrial partners. The aim is to develop a standardised way to validate sea lice dispersal models.

About RAP

"RAP" stands for reproducible analytical pipeline. The term is commonly used in the civil service and public sector, and it is useful because it is succinct and largely self-explanatory. The civil service's conception of RAP is a set of working practices, with particular emphasis on open source tools (e.g. R) and collaboration (version control, perhaps continuous integration). Although the term typically refers to the production and interpretation of statistics, our use here is somewhat more expansive. Let's RAP!

This repository hosts the pipeline component of the RAP, which we are using to validate dispersal models. Specifically for SAVED, we apply RAP to the entire set of tooling which we developed, and which forms a pipeline from start to finish:

  1. Our data model/ontology, which we developed to describe data in an agreed, common way;
  2. Our local Python upload utilities (fisdat(1) and fisup(1)), which let us validate data against schemata written in YAML (using LinkML);
  3. This model validation pipeline program (the "RAP service"), implemented using Elixir and Erlang/OTP, which we use to validate dispersal model output against observations, such as the 2011-2013 sentinel cages sampling exercise.

Pipeline technical design

Input data are RDF job descriptions prepared using our data upload utilities. Their RDF structure is designed to be general enough to apply to a variety of job types and data shapes and formats. Jobs are external scripts or programs which share a common calling convention.

The pipeline is written in Elixir, a fairly new programming language implemented on top of Erlang/OTP, and uses the GenStage library. This worked well in practice: stages are processes running on the Erlang BEAM virtual machine, and GenStage provides the machinery to handle back-pressure and demand, much as OTP provides the machinery for message passing and fault tolerance. Additionally, the RDF tooling for Elixir is fairly mature and worked really well, especially its mapping between RDF data schemas and Elixir structs, which fits naturally with the declarative style of functional languages like Elixir and Erlang.
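To illustrate the idea of demand-driven back-pressure, here is a minimal sketch in Python rather than Elixir. It is only an analogy: it does not use GenStage, and the names `producer`, `stage` and `consume` are hypothetical and do not appear in this repository. Python generators are pull-based, so an upstream stage only does work when a downstream consumer asks for it, which is the essence of what GenStage's demand mechanism provides between processes.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def producer(log: List[int]) -> Iterator[int]:
    """Emit job identifiers lazily; nothing is produced until it is asked for."""
    n = 0
    while True:
        log.append(n)  # record what was actually produced upstream
        yield n
        n += 1


def stage(events: Iterable[int]) -> Iterator[str]:
    """Hypothetical processing stage: transform each event as it is pulled."""
    for event in events:
        yield f"processed job {event}"


def consume(events: Iterator[str], demand: int) -> List[str]:
    """Ask upstream for at most `demand` events, as a consumer would."""
    return list(islice(events, demand))


produced: List[int] = []
pipeline = stage(producer(produced))

# The consumer asks for three events, so the producer emits exactly three.
first_batch = consume(pipeline, demand=3)
# A later request for two more resumes where the pipeline left off.
second_batch = consume(pipeline, demand=2)
```

Although the producer is an infinite loop, it never runs ahead of the consumer: the `produced` log only ever contains as many events as were demanded. GenStage achieves the same effect across concurrent BEAM processes, which a single-threaded generator chain does not attempt to model.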

Modelling work and results

As well as model validation results, the pipeline outputs a description of processing or work done by the pipeline, as an RDF graph. This uses the PROV ontology, which is particularly neat, as its semantics map remarkably well to Elixir and Erlang/OTP. Specifically:

  1. PROV Agents (specifically, SoftwareAgents) map closely to GenStage's stages, as well as to the pipeline OTP application. The mapping may apply even more generally than this, probably to GenServer, and perhaps to any process running on the Erlang BEAM VM.
  2. PROV Activities model the work or processing done by stages on an event, as well as the invocation of stages and of the pipeline OTP application.
  3. PROV Entities model the final output produced by a pass of a submitted data manifest through the pipeline, in addition to output produced by individual stages and the results of jobs.
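The mapping above can be sketched with a few triples. The following is a minimal illustration in Python using only the standard library; it is not the pipeline's actual output or code. The base IRI `https://example.org/rap/` and the function names are hypothetical, and the identifiers passed in are made up for the example. The PROV-O terms themselves (`prov:SoftwareAgent`, `prov:Activity`, `prov:Entity`, `prov:wasAssociatedWith`, `prov:wasGeneratedBy`, `prov:wasAttributedTo`) are taken from the W3C PROV ontology.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "https://example.org/rap/"  # hypothetical base IRI, for illustration only


def describe_stage_run(stage, activity, output):
    """Return PROV triples for one stage handling one event.

    stage    -> prov:SoftwareAgent (the pipeline stage)
    activity -> prov:Activity      (the work done on the event)
    output   -> prov:Entity        (what that work produced)
    """
    agent_iri = EX + stage
    activity_iri = EX + activity
    entity_iri = EX + output
    return [
        (agent_iri, RDF_TYPE, PROV + "SoftwareAgent"),
        (activity_iri, RDF_TYPE, PROV + "Activity"),
        (entity_iri, RDF_TYPE, PROV + "Entity"),
        (activity_iri, PROV + "wasAssociatedWith", agent_iri),
        (entity_iri, PROV + "wasGeneratedBy", activity_iri),
        (entity_iri, PROV + "wasAttributedTo", agent_iri),
    ]


def to_ntriples(triples):
    """Serialise IRI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Made-up identifiers standing in for a stage, its work, and its result.
triples = describe_stage_run("stage/runner", "activity/job-1", "result/job-1")
graph_text = to_ntriples(triples)
```

Printing `graph_text` gives six N-Triples statements that any RDF tool can load. A real provenance graph would also carry timestamps and links between activities, which this sketch leaves out.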

Output is 'baked' into a web page, which is the primary way that end-users receive feedback. This web page describes the data which were submitted and, depending on the job type, visualises the results and any descriptive statistics.

Pipeline demo

We have a demo running, kindly hosted on a machine in Edinburgh. We anticipate that this will form part of a data catalogue. If you wish to submit data to the demo, please get in touch with the maintainer of this repository, at the email address against which commits were made.

[Diagrams: saved_fisdat, saved_rap]

Bibliography

W. Waites, P. Gillibrand, T. Adams, D. Guthrie, C. Revie, and M. Moriarty, Infection pressure on fish in cages, 2024. Pre-print

Software components and licensing

  • This RAP service/model validation pipeline is licensed under AGPL v3 or later. The boilerplate file under lib/manifest/vocabulary.ex was derived from https://github.com/marcelotto/rdf_vocab and is thus licensed under the MIT license, like the original.
  • Vocabulary imports are flat files derived from various ontologies, and are provided under their own licenses respectively.
  • The Python density count ODE model included as contrib/bin/density_count_ode.py is licensed as GPL v3 or later.
  • Project documentation (which, in this repository proper, includes the README file and included diagrams) is under the CC-BY-SA license, since the U.K. does not have a concept of public domain. Modifications should be under this license and provide attribution.
  • The generated web pages use two JavaScript libraries, which are included as assets: D3 and Plotly. Both are licensed under permissive licenses: D3 under ISC, and Plotly under MIT.