Documentation-only, focus on use of the object store (#42)
* link the documentation together, remove older parts

* note on running the API first in the pipeline docs
metazool authored Nov 5, 2024
1 parent 7d55e6c commit 40e9bf3
Showing 3 changed files with 63 additions and 34 deletions.
27 changes: 23 additions & 4 deletions DVC.md
@@ -1,13 +1,25 @@
# Data Version Control

We're trying DVC (Data Version Control) in this project, for versioning data and ML models.
We tried out [DVC (Data Version Control)](https://dvc.org/) in this project, for versioning data and ML models.

There's little here on the DVC side as yet - links and notes in the README about following the approach being used here for LLM testing and fine-tuning, and how we might set it up to manage the collection "externally" (keeping the data on s3 and the metadata in source control).
* Manage image collections as "external" sources (keeping the data on s3 and the metadata in a git repository).
* Create simple reproducible pipelines for processing data, training and fine-tuning models.
* Potential integration with [CML](https://cml.dev/doc/cml-with-dvc) for "continuous machine learning" - see the [llm-eval](https://github.com/NERC-CEH/llm-eval) project for a properly developed take on this.

Other ecologies like [RO-crate](https://www.researchobject.org/ro-crate/) and [outpack](https://github.com/mrc-ide/outpack_server) share a lot of the same aims, but are more focused on research data and with possibly more community connections. For ML pipeline projects though, DVC is mature.
Other ecologies like [RO-crate](https://www.researchobject.org/ro-crate/) and [outpack](https://github.com/mrc-ide/outpack_server) share some of the same aims as DVC, but are more focused on research data and possibly have more community connections. For ML pipeline projects, though, DVC is mature.

## Summary

Our data transfer to s3 storage is [managed via an API](PIPELINES.md) and we don't have frequent changes to the source data. Keeping the `dvc.lock` files in git and using `dvc` to synchronise training data downloads between development machines and hosts in JASMIN is a good pattern for other projects, but not for us here.
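
For projects where that pattern does fit, the day-to-day workflow is a short sequence of commands; the sketch below uses an illustrative `data/images` path.

```
dvc add data/images                     # track the dataset; writes data/images.dvc
git add data/images.dvc .gitignore      # version the metadata, not the data
git commit -m "Track image collection with DVC"
dvc push                                # upload the data to the s3 remote

# on another machine (e.g. a JASMIN host):
git pull && dvc pull                    # fetch the metadata, then the data itself
```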

The data pipeline included here is minimal (just a chain of scripts!). We wanted to show several different image collections and resulting models trained on their embeddings. `dvc repro` wants to destroy and recreate directories used as input/output between stages, so those have been commented out of the [example dvc.yaml](scripts/dvc.yaml).

For publishing an experiment that is reproducible as a pipeline with a couple of commands, and with _little to no adaptation of existing code_ needed to get it working, it's a decent fit.

## Walkthrough

### Setting up a "DVC remote" in object storage

Following the [DVC Getting Started](https://github.com/iterative/dvc.org/blob/main/content/docs/start/index.md)
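
A condensed sketch of the remote setup; the bucket name, endpoint URL and credential environment variables below are placeholders - follow the guide itself for the details:

```
dvc init
dvc remote add -d jasmin s3://my-bucket/dvc-store
dvc remote modify jasmin endpointurl https://my-object-store-endpoint
# keep credentials out of source control with --local
dvc remote modify --local jasmin access_key_id "$AWS_ACCESS_KEY_ID"
dvc remote modify --local jasmin secret_access_key "$AWS_SECRET_ACCESS_KEY"
```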

@@ -108,9 +120,16 @@ Add a script that fits a K-means model from the image embeddings and saves it (h

`dvc stage add -n cluster -d ../vectors -o ../models cluster.py`
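
For reference, the stage that the `dvc stage add` command above appends to `dvc.yaml` looks roughly like this:

```yaml
stages:
  cluster:
    cmd: cluster.py
    deps:
    - ../vectors
    outs:
    - ../models
```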

`dvc repro` at this point does want to run the image embeddings again, it's not clear why... code change?
`dvc repro` at this point does want to run the image embeddings again.


## References

* [DVC with s3](https://github.com/NERC-CEH/llm-eval/blob/main/dvc.md) condensed walkthrough as part of the LLM evaluation project - complete this up to `dvc remote modify...` to set up the s3 connection.

* [Tutorial: versioning data and models: What's next?](https://dvc.org/doc/use-cases/versioning-data-and-models/tutorial#whats-next)

* [Importing external data: Avoiding duplication](https://dvc.org/doc/user-guide/data-management/importing-external-data#avoiding-duplication) - is it this pattern?



32 changes: 14 additions & 18 deletions PIPELINES.md
@@ -33,28 +33,13 @@ The pipeline consists of the following Luigi tasks:
- **Purpose**: A wrapper task that runs all the above tasks in sequence.
- **Dependencies**: It manages the dependencies and order of execution of the entire pipeline.

## Prerequisites

- Python 3.7 or above
- The following Python packages:
- `luigi`
- `pandas`
- `numpy`
- `scikit-image`
- `requests`
- `pytest` (for testing)
- `boto3` (for S3 interactions)
- `aioboto3` (for async S3 interactions)
- `fastapi` and `uvicorn` (for the external API)

## Setup and Installation

1. **Clone the Repository**
1. **Installation and dependencies**

   Follow the [main README](README.md) to create a Python environment and install our dependencies into it.

```bash
git clone https://github.com/your_username/plankton_pipeline_luigi.git
cd flowcam-pipeline
```

2. **Setup JASMIN credentials**

@@ -68,6 +53,17 @@ The pipeline consists of the following Luigi tasks:

## Running the pipeline

0. **Start the object store API**

The pipeline uses the separate [object_store_api](https://github.com/NERC-CEH/object_store_api/) to manage data in s3.

   Please see the README in that project for the different modes of running it. The shortest version is:

   * `git clone https://github.com/NERC-CEH/object_store_api.git`
   * `cd object_store_api && pip install -e .[all]`
   * Add a `.env` file with your object storage credentials, as above
   * `fastapi run --workers 4 src/os_api/api.py`

1. **Start the Luigi Central Scheduler**

   Passing `--logdir` is optional; set it if you don't have permission to write to `/var/log`.
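
   A minimal sketch, assuming a local log directory is acceptable:

   ```bash
   mkdir -p ./luigi-logs
   luigid --background --logdir ./luigi-logs --pidfile ./luigid.pid
   ```
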
38 changes: 26 additions & 12 deletions README.md
@@ -57,20 +57,42 @@ git clone https://github.com/exiftool/exiftool.git
export PATH=$PATH:exiftool
```

### Object store connection
## Object store

### Connection details

`.env` contains environment variable names for S3 connection details for the [JASMIN object store](https://github.com/NERC-CEH/object_store_tutorial/). Fill these in with your own credentials. If you're not sure what the `AWS_URL_ENDPOINT` should be, please reach out to one of the project contributors listed below.
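
As an illustration, a filled-in `.env` might look like the sketch below; `AWS_URL_ENDPOINT` is the variable mentioned above, while the other names are assumptions based on the usual AWS conventions:

```
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
AWS_URL_ENDPOINT=https://your-tenancy-endpoint
```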

### Object store API

The [object_store_api](https://github.com/NERC-CEH/object_store_api) project provides a web-based API to help manage your image data, for use with JASMIN's s3 store.

Please [see its documentation](https://github.com/NERC-CEH/object_store_api) for different modes of running the API. The simplest, for single-user or testing purposes, is:

`python src/os_api/api.py`

## Pipelines

### DVC

Please see [DVC.md](DVC.md) for notes and walkthroughs on different ways of using [Data Version Control](https://dvc.org/), both to manage data within a git repository and to manage sets of scripts as a reproducible pipeline with minimal intervention.

This _very basic_ setup has several stages - build an index of images in an object store (s3 bucket), extract and store their embeddings using a pre-trained neural network, and train and save a classifier based on the embeddings.

`cd scripts`
`dvc repro`

### Luigi

Please see [PIPELINES.md](PIPELINES.md) for detailed documentation about a pipeline that slices up images exported from a FlowCam instrument, adds spatial and temporal metadata into their EXIF headers based on a directory naming convention agreed with researchers, and uploads them to object storage.


### Running tests

`pytest` or `py.test`

## Contents

### Catalogue creation

`scripts/intake_metadata.py` is a proof of concept that creates a configuration file for an [intake](https://intake.readthedocs.io/en/latest/) catalogue - a utility to make reading analytical datasets into analysis workflows more reproducible and less effortful.
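
A hypothetical sketch of the kind of catalogue entry the script produces (bucket, path and endpoint are placeholders):

```yaml
sources:
  plankton_image_index:
    description: Index of FlowCam images and metadata held in object storage
    driver: csv
    args:
      urlpath: "s3://example-bucket/metadata/index.csv"
      storage_options:
        client_kwargs:
          endpoint_url: "https://example-endpoint"
```

Reading it back is then a one-liner: `intake.open_catalog("catalogue.yml").plankton_image_index.read()`.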

### Feature extraction

@@ -99,17 +121,9 @@ streamlit run src/cyto_ml/visualisation/app.py

The demo should automatically open in your browser when you run streamlit. If it does not, connect using: http://localhost:8501.

### Object Store API

See the [Object Store API](https://github.com/NERC-CEH/object_store_api) project - RESTful interface to manage a data collection held in s3 object storage.

## Data Version Control

* [DVC with s3](https://github.com/NERC-CEH/llm-eval/blob/main/dvc.md) condensed walkthrough as part of the LLM evaluation project - complete this up to `dvc remote modify...` to set up the s3 connection.

* [Tutorial: versioning data and models: What's next?](https://dvc.org/doc/use-cases/versioning-data-and-models/tutorial#whats-next)

* [Importing external data: Avoiding duplication](https://dvc.org/doc/user-guide/data-management/importing-external-data#avoiding-duplication) - is it this pattern?

DAG / pipeline elements

