diff --git a/DVC.md b/DVC.md
index 93d26f8..b39106a 100644
--- a/DVC.md
+++ b/DVC.md
@@ -1,13 +1,25 @@
 # Data Version Control
 
-We're trying DVC (Data Version Control) in this project, for versioning data and ML models.
+We tried out [DVC (Data Version Control)](https://dvc.org/) in this project, for versioning data and ML models:
 
-There's little here on the DVC side as yet - links and notes in the README about following the approach being used here for LLM testing and fine-tuning, and how we might set it up to manage the collection "externally" (keeping the data on s3 and the metadata in source control).
+* Manage image collections as "external" sources (keeping the data on s3 and the metadata held in a git repository).
+* Create simple reproducible pipelines for processing data, training and fine-tuning models.
+* Potential integration with [CML](https://cml.dev/doc/cml-with-dvc) for "continuous machine learning" - see the [llm-eval](https://github.com/NERC-CEH/llm-eval) project for a properly developed take on this.
 
-Other ecologies like [RO-crate](https://www.researchobject.org/ro-crate/) and [outpack](https://github.com/mrc-ide/outpack_server) share a lot of the same aims, but are more focused on research data and with possibly more community connections. For ML pipeline projects though, DVC is mature.
+Other ecologies like [RO-crate](https://www.researchobject.org/ro-crate/) and [outpack](https://github.com/mrc-ide/outpack_server) share some of the same aims as DVC, but are more focused on research data and possibly have more community connections. For ML pipeline projects though, DVC is mature.
+
+## Summary
+
+Our data transfer to s3 storage is being [managed via an API](PIPELINES.md) and we don't have frequent changes to the source data. Keeping the `dvc.lock` files in git and using `dvc` to synchronise training data downloads between development machines and hosts in JASMIN is a good pattern for other projects, but not for us here.
+
+The data pipeline included here is minimal (just a chain of scripts!). We wanted to show several different image collections and the resulting models trained on their embeddings. `dvc repro` wants to destroy and recreate directories used as input/output between stages, so those have been commented out of the [example dvc.yaml](scripts/dvc.yaml).
+
+For publishing an experiment, reproducible as a pipeline with a couple of commands and with _little to no adaptation of existing code_ needed to get it to work, it's a decent fit.
 
 ## Walkthrough
 
+### Setting up a "DVC remote" in object storage
+
 Following the [DVC Getting Started](https://github.com/iterative/dvc.org/blob/main/content/docs/start/index.md)
 
 ```
@@ -108,9 +120,16 @@
 Add a script that fits a K-means model from the image embeddings and saves it
 
 `dvc stage add -n cluster -d ../vectors -o ../models cluster.py`
 
-`dvc repro` at this point does want to run the image embeddings again, it's not clear why... code change?
+`dvc repro` at this point does want to run the image embeddings again.
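+
+A few standard DVC commands help to narrow down why, without changing the pipeline itself:
+
+```
+dvc status       # which stages DVC considers changed, and which dependencies triggered it
+dvc dag          # the stage graph, to check the dependencies are wired as expected
+dvc repro --dry  # list what would be re-run, without executing anything
+```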
+
+## References
+
+* [DVC with s3](https://github.com/NERC-CEH/llm-eval/blob/main/dvc.md) - condensed walkthrough as part of the LLM evaluation project; complete this up to `dvc remote modify...` to set up the s3 connection.
+* [Tutorial: versioning data and models: What's next?](https://dvc.org/doc/use-cases/versioning-data-and-models/tutorial#whats-next)
+* [Importing external data: Avoiding duplication](https://dvc.org/doc/user-guide/data-management/importing-external-data#avoiding-duplication) - is it this pattern?
diff --git a/PIPELINES.md b/PIPELINES.md
index c63fe59..3a56c44 100644
--- a/PIPELINES.md
+++ b/PIPELINES.md
@@ -33,28 +33,13 @@ The pipeline consists of the following Luigi tasks:
 - **Purpose**: A wrapper task that runs all the above tasks in sequence.
 - **Dependencies**: It manages the dependencies and order of execution of the entire pipeline.
 
-## Prerequisites
-
-- Python 3.7 or above
-- The following Python packages:
-  - `luigi`
-  - `pandas`
-  - `numpy`
-  - `scikit-image`
-  - `requests`
-  - `pytest` (for testing)
-  - `boto3` (for S3 interactions)
-  - `aioboto3` (for async S3 interactions)
-  - `fastapi` and `uvicorn` (for the external API)
-
 ## Setup and Installation
 
-1. **Clone the Repository**
+1. **Installation and dependencies**
+
+Follow the [main README](README.md) to create a python environment and install our dependencies into it.
 
-   ```bash
-   git clone https://github.com/your_username/plankton_pipeline_luigi.git
-   cd flowcam-pipeline
-   ```
 
 2. **Setup JASMIN credentials**
 
@@ -68,6 +53,17 @@
 
 ## Running the pipeline
 
+0. **Start the object store API**
+
+The pipeline uses the separate [object_store_api](https://github.com/NERC-CEH/object_store_api/) to manage data in s3.
+
+Please see the README in that project for different modes of running it. The shortest version is:
+
+* `git clone https://github.com/NERC-CEH/object_store_api.git`
+* `pip install -e .[all]`
+* Add a `.env` file with your credentials for object storage, as above
+* `fastapi run --workers 4 src/os_api/api.py`
+
 1. **Start the Luigi Central Scheduler**
 
 Path to `--logdir` is optional, if you don't have permissions to write to `/var/log`
diff --git a/README.md b/README.md
index 5e48882..c5c113c 100644
--- a/README.md
+++ b/README.md
@@ -57,10 +57,35 @@
 git clone https://github.com/exiftool/exiftool.git
 export PATH=$PATH:exiftool
 ```
 
-### Object store connection
+## Object store
+
+### Connection details
 
 `.env` contains environment variable names for S3 connection details for the [JASMIN object store](https://github.com/NERC-CEH/object_store_tutorial/). Fill these in with your own credentials. If you're not sure what the `AWS_URL_ENDPOINT` should be, please reach out to one of the project contributors listed below.
 
+### Object store API
+
+The [object_store_api](https://github.com/NERC-CEH/object_store_api) project provides a web-based API to help manage your image data, for use with JASMIN's s3 store.
+
+Please [see its documentation](https://github.com/NERC-CEH/object_store_api) for different modes of running the API. The simplest, for single-user / testing purposes, is:
+
+`python src/os_api/api.py`
+
+## Pipelines
+
+### DVC
+
+Please see [DVC.md](DVC.md) for notes and walkthroughs on different ways of using [Data Version Control](https://dvc.org/), both to manage data within a git repository and to manage sets of scripts as a reproducible pipeline with minimal intervention.
+
+This _very basic_ setup has several stages - build an index of images in an object store (s3 bucket), extract and store their embeddings using a pre-trained neural network, and train and save a classifier based on the embeddings.
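+
+As a sketch, stages like these chain together with `dvc stage add` - the stage and script names below are illustrative, see the [example dvc.yaml](scripts/dvc.yaml) for the real definitions:
+
+```
+# Each stage declares a command, its dependencies (-d) and outputs (-o);
+# DVC wires stages into a pipeline by matching outputs to dependencies.
+dvc stage add -n index -o index.csv python index.py
+dvc stage add -n embed -d index.csv -o vectors python embed.py
+dvc stage add -n train -d vectors -o models python train.py
+```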
+
+To run the whole pipeline:
+
+`cd scripts`
+`dvc repro`
+
+### Luigi
+
+Please see [PIPELINES.md](PIPELINES.md) for detailed documentation about a pipeline that slices up images exported from a FlowCam instrument, adds spatial and temporal metadata into their EXIF headers based on a directory naming convention agreed with researchers, and uploads them to object storage. A sketch of running it end to end is included at the end of this README.
 
 ### Running tests
 
@@ -68,9 +93,6 @@ export PATH=$PATH:exiftool
 
 ## Contents
 
-### Catalogue creation
-
-`scripts/intake_metadata.py` is a proof of concept that creates a configuration file for an [intake](https://intake.readthedocs.io/en/latest/) catalogue - a utility to make reading analytical datasets into analysis workflows more reproducible and less effortful.
 
 ### Feature extraction
 
@@ -99,17 +121,9 @@ streamlit run src/cyto_ml/visualisation/app.py
 ```
 
 The demo should automatically open in your browser when you run streamlit. If it does not, connect using: http://localhost:8501.
 
-### Object Store API
-
-See the [Object Store API](https://github.com/NERC-CEH/object_store_api) project - RESTful interface to manage a data collection held in s3 object storage.
-
-## Data Version Control
-
-* [DVC with s3](https://github.com/NERC-CEH/llm-eval/blob/main/dvc.md) condensed walkthrough as part of the LLM evaluation project - complete this up to `dvc remote modify...` to set up the s3 connection.
-* [Tutorial: versioning data and models: What's next?](https://dvc.org/doc/use-cases/versioning-data-and-models/tutorial#whats-next)
-* [Importing external data: Avoiding duplication](https://dvc.org/doc/user-guide/data-management/importing-external-data#avoiding-duplication) - is it this pattern?
-
-DAG / pipeline elements
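+
+As a sketch, running the FlowCam pipeline end to end looks something like this - the module path and task name are hypothetical placeholders, see [PIPELINES.md](PIPELINES.md) for the real invocation:
+
+```
+# Start the central scheduler; pass a custom --logdir if /var/log isn't writable
+luigid --background --logdir luigi_logs
+
+# Run the wrapper task that chains the slicing / EXIF-tagging / upload steps
+# ("FlowCamPipeline" and the module path are illustrative names)
+python -m luigi --module cyto_ml.pipeline FlowCamPipeline --scheduler-host localhost
+```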