Documentation-only, focus on use of the object store (#42)
* link the documentation together, remove older parts

* note on running the API first in the pipeline docs
metazool authored Nov 5, 2024
1 parent 7d55e6c commit 40e9bf3
Showing 3 changed files with 63 additions and 34 deletions.
27 changes: 23 additions & 4 deletions DVC.md
@@ -1,13 +1,25 @@
# Data Version Control

We're trying DVC (Data Version Control) in this project, for versioning data and ML models.
We tried out [DVC (Data Version Control)](https://dvc.org/) in this project, for versioning data and ML models.

There's little here on the DVC side as yet - links and notes in the README about following the approach being used here for LLM testing and fine-tuning, and how we might set it up to manage the collection "externally" (keeping the data on s3 and the metadata in source control).
* Manage image collections as "external" sources (keeping the data on s3 and the metadata in a git repository).
* Create simple reproducible pipelines for processing data, training and fine-tuning models.
* Potential integration with [CML](https://cml.dev/doc/cml-with-dvc) for "continuous machine learning" - see the [llm-eval](https://github.com/NERC-CEH/llm-eval) project for a properly developed take on this.

Other ecologies like [RO-crate](https://www.researchobject.org/ro-crate/) and [outpack](https://github.com/mrc-ide/outpack_server) share a lot of the same aims, but are more focused on research data and with possibly more community connections. For ML pipeline projects though, DVC is mature.
Other ecologies like [RO-crate](https://www.researchobject.org/ro-crate/) and [outpack](https://github.com/mrc-ide/outpack_server) share some of the same aims as DVC, but are more focused on research data and possibly have more community connections. For ML pipeline projects, though, DVC is mature.

## Summary

Our data transfer to s3 storage is [managed via an API](PIPELINES.md) and we don't have frequent changes to the source data. Keeping the `dvc.lock` files in git and using `dvc` to synchronise training data downloads between development machines and hosts in JASMIN is a good pattern for other projects, but not for us here.
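
For projects where that pattern does fit, the day-to-day workflow is a short sequence of commands; the sketch below uses an illustrative `data/images` path.

```
dvc add data/images                     # track the dataset; writes data/images.dvc
git add data/images.dvc .gitignore      # version the metadata, not the data
git commit -m "Track image collection with DVC"
dvc push                                # upload the data to the s3 remote

# on another machine (e.g. a JASMIN host):
git pull && dvc pull                    # fetch the metadata, then the data itself
```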

The data pipeline included here is minimal (just a chain of scripts!). We wanted to show several different image collections and resulting models trained on their embeddings. `dvc repro` wants to destroy and recreate directories used as input/output between stages, so those have been commented out of the [example dvc.yaml](scripts/dvc.yaml).

For publishing an experiment that is reproducible as a pipeline with a couple of commands, and with _little to no adaptation of existing code_ needed to get it working, it's a decent fit.

## Walkthrough

### Setting up a "DVC remote" in object storage

Following the [DVC Getting Started](https://github.com/iterative/dvc.org/blob/main/content/docs/start/index.md)
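
A condensed sketch of the remote setup; the bucket name, endpoint URL and credential environment variables below are placeholders - follow the guide itself for the details:

```
dvc init
dvc remote add -d jasmin s3://my-bucket/dvc-store
dvc remote modify jasmin endpointurl https://my-object-store-endpoint
# keep credentials out of source control with --local
dvc remote modify --local jasmin access_key_id "$AWS_ACCESS_KEY_ID"
dvc remote modify --local jasmin secret_access_key "$AWS_SECRET_ACCESS_KEY"
```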

@@ -108,9 +120,16 @@ Add a script that fits a K-means model from the image embeddings and saves it (h

`dvc stage add -n cluster -d ../vectors -o ../models cluster.py`
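
For reference, the stage that the `dvc stage add` command above appends to `dvc.yaml` looks roughly like this:

```yaml
stages:
  cluster:
    cmd: cluster.py
    deps:
    - ../vectors
    outs:
    - ../models
```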

`dvc repro` at this point does want to run the image embeddings again, it's not clear why... code change?
`dvc repro` at this point does want to run the image embeddings again.


## References

* [DVC with s3](https://github.com/NERC-CEH/llm-eval/blob/main/dvc.md) condensed walkthrough as part of the LLM evaluation project - complete this up to `dvc remote modify...` to set up the s3 connection.

* [Tutorial: versioning data and models: What's next?](https://dvc.org/doc/use-cases/versioning-data-and-models/tutorial#whats-next)

* [Importing external data: Avoiding duplication](https://dvc.org/doc/user-guide/data-management/importing-external-data#avoiding-duplication) - is it this pattern?



32 changes: 14 additions & 18 deletions PIPELINES.md
@@ -33,28 +33,13 @@ The pipeline consists of the following Luigi tasks:
- **Purpose**: A wrapper task that runs all the above tasks in sequence.
- **Dependencies**: It manages the dependencies and order of execution of the entire pipeline.

## Prerequisites

- Python 3.7 or above
- The following Python packages:
- `luigi`
- `pandas`
- `numpy`
- `scikit-image`
- `requests`
- `pytest` (for testing)
- `boto3` (for S3 interactions)
- `aioboto3` (for async S3 interactions)
- `fastapi` and `uvicorn` (for the external API)

## Setup and Installation

1. **Clone the Repository**
1. **Installation and dependencies**

   Follow the [main README](README.md) to create a Python environment and install our dependencies into it.

```bash
git clone https://github.com/your_username/plankton_pipeline_luigi.git
cd flowcam-pipeline
```

2. **Setup JASMIN credentials**

@@ -68,6 +53,17 @@ The pipeline consists of the following Luigi tasks:

## Running the pipeline

0. **Start the object store API**

The pipeline uses the separate [object_store_api](https://github.com/NERC-CEH/object_store_api/) to manage data in s3.

   Please see the README in that project for the different modes of running it. The shortest version is:

   * `git clone https://github.com/NERC-CEH/object_store_api.git`
   * `cd object_store_api && pip install -e .[all]`
   * Add a `.env` file with your object storage credentials, as above
   * `fastapi run --workers 4 src/os_api/api.py`

1. **Start the Luigi Central Scheduler**

   Passing `--logdir` is optional; set it if you don't have permission to write to `/var/log`.
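
   A minimal sketch, assuming a local log directory is acceptable:

   ```bash
   mkdir -p ./luigi-logs
   luigid --background --logdir ./luigi-logs --pidfile ./luigid.pid
   ```
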
38 changes: 26 additions & 12 deletions README.md
@@ -57,20 +57,42 @@ git clone https://github.com/exiftool/exiftool.git
export PATH=$PATH:exiftool
```

### Object store connection
## Object store

### Connection details

`.env` contains environment variable names for S3 connection details for the [JASMIN object store](https://github.com/NERC-CEH/object_store_tutorial/). Fill these in with your own credentials. If you're not sure what the `AWS_URL_ENDPOINT` should be, please reach out to one of the project contributors listed below.
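
As an illustration, a filled-in `.env` might look like the sketch below; `AWS_URL_ENDPOINT` is the variable mentioned above, while the other names are assumptions based on the usual AWS conventions:

```
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
AWS_URL_ENDPOINT=https://your-tenancy-endpoint
```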

### Object store API

The [object_store_api](https://github.com/NERC-CEH/object_store_api) project provides a web-based API to help manage your image data, for use with JASMIN's s3 store.

Please [see its documentation](https://github.com/NERC-CEH/object_store_api) for different modes of running the API. The simplest, for single-user or testing purposes, is:

`python src/os_api/api.py`

## Pipelines

### DVC

Please see [DVC.md](DVC.md) for notes and walkthroughs on different ways of using [Data Version Control](https://dvc.org/), both to manage data within a git repository and to manage sets of scripts as a reproducible pipeline with minimal intervention.

This _very basic_ setup has several stages - build an index of images in an object store (s3 bucket), extract and store their embeddings using a pre-trained neural network, and train and save a classifier based on the embeddings.

`cd scripts`
`dvc repro`

### Luigi

Please see [PIPELINES.md](PIPELINES.md) for detailed documentation about a pipeline that slices up images exported from a FlowCam instrument, adds spatial and temporal metadata into their EXIF headers based on a directory naming convention agreed with researchers, and uploads them to object storage.


### Running tests

`pytest` or `py.test`

## Contents

### Catalogue creation

`scripts/intake_metadata.py` is a proof of concept that creates a configuration file for an [intake](https://intake.readthedocs.io/en/latest/) catalogue - a utility to make reading analytical datasets into analysis workflows more reproducible and less effortful.
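
A hypothetical sketch of the kind of catalogue entry the script produces (bucket, path and endpoint are placeholders):

```yaml
sources:
  plankton_image_index:
    description: Index of FlowCam images and metadata held in object storage
    driver: csv
    args:
      urlpath: "s3://example-bucket/metadata/index.csv"
      storage_options:
        client_kwargs:
          endpoint_url: "https://example-endpoint"
```

Reading it back is then a one-liner: `intake.open_catalog("catalogue.yml").plankton_image_index.read()`.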

### Feature extraction

@@ -99,17 +121,9 @@ streamlit run src/cyto_ml/visualisation/app.py

The demo should automatically open in your browser when you run streamlit. If it does not, connect using: http://localhost:8501.

### Object Store API

See the [Object Store API](https://github.com/NERC-CEH/object_store_api) project - RESTful interface to manage a data collection held in s3 object storage.

## Data Version Control

* [DVC with s3](https://github.com/NERC-CEH/llm-eval/blob/main/dvc.md) condensed walkthrough as part of the LLM evaluation project - complete this up to `dvc remote modify...` to set up the s3 connection.

* [Tutorial: versioning data and models: What's next?](https://dvc.org/doc/use-cases/versioning-data-and-models/tutorial#whats-next)

* [Importing external data: Avoiding duplication](https://dvc.org/doc/user-guide/data-management/importing-external-data#avoiding-duplication) - is it this pattern?

DAG / pipeline elements

