Add the Luigi pipeline for image decollaging #40

Merged · 11 commits into main · Oct 4, 2024
Conversation

@metazool metazool commented Oct 4, 2024

Adds the pipeline from the temporary project at https://github.com/NERC-CEH/plankton_pipeline_luigi/ plus its docs, and bugfixes for the associated tests.

  • Replace the copies of utility functions with imports
  • Avoid use of relative paths for stage outputs
  • Make sure the Task functions have return type annotations (the ruff check in our CI pipeline, following the dri-cd checks, requires this)
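For illustration, this is the kind of return-type annotation that ruff's ANN rules flag when missing. The class and method names are illustrative stand-ins, not the real pipeline code:

```python
from pathlib import Path

# Illustrative stand-in for a pipeline task; not the real cyto_ml code.
class DecollageTask:
    def output(self) -> Path:  # annotated return type satisfies the ruff check
        return Path("decollaged/done.txt")

    def run(self) -> None:  # methods with no return value are annotated -> None
        self.output().parent.mkdir(parents=True, exist_ok=True)
        self.output().touch()
```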

I'll leave comments about the specific test issues against the related code; a couple of them were quite gritty.

This is more of an FYI @albags as it's originally your code! I'm happy to merge this and for you to remove the standalone project. #38 has more notes on the benefits of keeping it inside this one; I'm happy to argue about it there :D

@metazool metazool requested a review from albags October 4, 2024 12:11
- """Not very lovely single function that replaces the work of the script."""
+ """Not very lovely single function that replaces the work of the script.
+ See cyto_ml.pipeline.pipeline_decollage - has the same code in it
+ """
@metazool (Collaborator Author)

I looked at this and thought it wasn't worth refactoring any further - the logic belongs in the pipeline task and this FlowCamSession class was a placeholder for it.

We could also just delete this version...


def test_read_metadata(temp_dir):
# Create a mock .lst file for testing
lst_file_content = "001\nnum-fields|value\n"
@metazool (Collaborator Author)

So the issue here was the intricacies of the "csv-like" file format that the FlowCam instrument exports with its data. It has a meaningless numeric header, then 53 lines describing the header datatypes before the actual CSV-readable data kicks in. The parsing function assumes this layout and fails if it's not the case. The "fix" here is to flesh out the test fixture until it matches the format expectations.

I would tend not to construct a test fixture in code here (I can see why you would for a simpler data structure), but to use sample data inside tests/fixtures/ for any format with a bit of complexity. There is a cut-down .lst file lying around in the test fixtures for this project.

Perhaps there are good reasons not to! E.g. to deliberately break the fixtures in case the instrument outputs bad metadata, and see how our code handles that. I know my test style tends to be "happy path"...
I've left it as-is though, just padded it out with dummy data until it matches what the microscope generates.
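As a sketch of the format expectation described above (a throwaway numeric header line, then one `field|type` description line per column, then CSV rows), a minimal parser might look like this. The function name and the two-field fixture are hypothetical; the real parsing lives in cyto_ml, and the real files describe 53 fields:

```python
import csv
import io

def read_flowcam_lst(stream, header_lines=53):
    """Parse a FlowCam .lst export: one meaningless numeric header line,
    then `header_lines` lines of field|type descriptions, then CSV rows.
    Hypothetical sketch of the format; not the real cyto_ml parser."""
    stream.readline()  # discard the numeric header, e.g. "001"
    fields = []
    for _ in range(header_lines):
        line = stream.readline()
        if not line:
            break
        fields.append(line.split("|", 1)[0].strip())
    # The remaining lines are plain CSV, one row per decollaged image
    return [dict(zip(fields, row)) for row in csv.reader(stream)]

# Cut-down fixture with just two described fields
fixture = io.StringIO("001\nid|int\nname|text\n1,copepod\n2,diatom\n")
rows = read_flowcam_lst(fixture, header_lines=2)
# rows → [{'id': '1', 'name': 'copepod'}, {'id': '2', 'name': 'diatom'}]
```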

out.write("blah")
# The task `requires` DecollageImages, but that requires other tasks, which run first
# Rather than mock its output, or the whole chain, require a mock task that replaces it
mock_output = mocker.patch('cyto_ml.pipeline.pipeline_decollage.UploadDecollagedImagesToS3.requires')
@metazool (Collaborator Author), Oct 4, 2024
This one was more interesting! As it stood previously, the code mocked the output of the pipeline task that this one requires.

That output was a str where a subclass of luigi.Target was required (and the task runner triggered a failure because of that). I tried replacing the output with a MockTarget, but then all the sub-tasks in the requires function of DecollageImages started to run before it ever got to returning its mocked output!

So I did this instead: replaced the whole requires chain with a single MockTask that returns an arbitrary file output, which we create first.

This was a good path into understanding Luigi! These complex dependencies with different data stages are hard to unit test like this; I'm impressed by the subtlety with which you did it. In your place I'd probably have written a big baggy "integration"-style test running the whole task graph against moto-server, and that would have been much less efficient :D
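The pattern described above can be sketched without Luigi itself: pre-create the file a dependency would have produced, then patch the `requires` chain with a stand-in task that just points at it. The class names below echo the real tasks, but this is a toy reconstruction using unittest.mock rather than pytest-mock's `mocker`:

```python
import os
import tempfile
from unittest import mock

# Toy stand-ins for the real Luigi tasks (the real test patches
# cyto_ml.pipeline.pipeline_decollage.UploadDecollagedImagesToS3.requires).
class DecollageImages:
    def requires(self):
        raise RuntimeError("would kick off the whole upstream task graph")

class UploadDecollagedImagesToS3:
    def requires(self):
        return DecollageImages()

class MockTask:
    """Replacement dependency that just points at a pre-created file."""
    def __init__(self, path: str) -> None:
        self.path = path

    def output(self) -> str:
        # A real Luigi test would return a luigi.LocalTarget here.
        return self.path

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "decollaged.txt")
    with open(out_path, "w") as out:
        out.write("blah")  # the arbitrary file output the MockTask serves up

    # Swap the whole requires chain for the single MockTask
    with mock.patch.object(
        UploadDecollagedImagesToS3, "requires", return_value=MockTask(out_path)
    ):
        task = UploadDecollagedImagesToS3()
        with open(task.requires().output()) as f:
            contents = f.read()  # "blah", without running any upstream tasks
```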

@metazool metazool merged commit c8422e9 into main Oct 4, 2024
2 checks passed
@albags (Collaborator) left a comment:

Thanks for putting this code here and fixing the tests.
