Add the Luigi pipeline for image decollaging #40

Merged · 11 commits into main · Oct 4, 2024
Conversation

@metazool metazool commented Oct 4, 2024

Adds the pipeline from the temporary project at https://github.com/NERC-CEH/plankton_pipeline_luigi/ plus its docs, and bugfixes for the associated tests.

  • Replace the copies of utility functions with imports
  • Avoid use of relative paths for stage outputs
  • Make sure the Task functions have return type annotations (the ruff check in our CI pipeline, following the dri-cd checks, requires this)
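For illustration, this is the kind of return-type annotation that ruff's ANN rules flag when missing. The class and method names are illustrative stand-ins, not the real pipeline code:

```python
from pathlib import Path

# Illustrative stand-in for a pipeline task; not the real cyto_ml code.
class DecollageTask:
    def output(self) -> Path:  # annotated return type satisfies the ruff check
        return Path("decollaged/done.txt")

    def run(self) -> None:  # methods with no return value are annotated -> None
        self.output().parent.mkdir(parents=True, exist_ok=True)
        self.output().touch()
```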

I'll leave comments about the specific test issues against the related code; a couple of them were quite gritty.

This is more of an FYI @albags as it's originally your code! I'm happy to merge this and for you to remove the standalone project. #38 has more notes on the benefits of keeping it inside this one; I'm happy to argue about it there :D

@metazool metazool requested a review from albags October 4, 2024 12:11
- """Not very lovely single function that replaces the work of the script."""
+ """Not very lovely single function that replaces the work of the script.
+ See cyto_ml.pipeline.pipeline_decollage - has the same code in it
+ """
@metazool (Collaborator Author)

I looked at this and thought it wasn't worth refactoring any further - the logic belongs in the pipeline task and this FlowCamSession class was a placeholder for it.

We could also just delete this version...


def test_read_metadata(temp_dir):
# Create a mock .lst file for testing
lst_file_content = "001\nnum-fields|value\n"
@metazool (Collaborator Author)

So the issue here was the intricacies of the "csv-like" file format that the FlowCam instrument exports with its data. It has a meaningless numeric header, then 53 lines describing the header datatypes before the actual CSV-readable data kicks in. The parsing function assumes this layout and fails if it's not the case. The "fix" here is to flesh out the test fixture until it matches the format expectations.

I would tend not to construct a test fixture in code here (I can see why you would for a simpler data structure), but to use sample data inside tests/fixtures/ for any format with a bit of complexity. There is a cut-down .lst file lying around in the test fixtures for this project.

Perhaps there are good reasons not to! E.g. to deliberately break the fixtures in case the instrument outputs bad metadata, and see how our code handles that. I know my test style tends to be "happy path"...
I've left it as-is though, just padded it out with dummy data until it matches what the microscope generates.
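As a sketch of the format expectation described above (a throwaway numeric header line, then one `field|type` description line per column, then CSV rows), a minimal parser might look like this. The function name and the two-field fixture are hypothetical; the real parsing lives in cyto_ml, and the real files describe 53 fields:

```python
import csv
import io

def read_flowcam_lst(stream, header_lines=53):
    """Parse a FlowCam .lst export: one meaningless numeric header line,
    then `header_lines` lines of field|type descriptions, then CSV rows.
    Hypothetical sketch of the format; not the real cyto_ml parser."""
    stream.readline()  # discard the numeric header, e.g. "001"
    fields = []
    for _ in range(header_lines):
        line = stream.readline()
        if not line:
            break
        fields.append(line.split("|", 1)[0].strip())
    # The remaining lines are plain CSV, one row per decollaged image
    return [dict(zip(fields, row)) for row in csv.reader(stream)]

# Cut-down fixture with just two described fields
fixture = io.StringIO("001\nid|int\nname|text\n1,copepod\n2,diatom\n")
rows = read_flowcam_lst(fixture, header_lines=2)
# rows → [{'id': '1', 'name': 'copepod'}, {'id': '2', 'name': 'diatom'}]
```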

out.write("blah")
# The task `requires` DecollageImages, but that requires other tasks, which run first
# Rather than mock its output, or the whole chain, require a mock task that replaces it
mock_output = mocker.patch('cyto_ml.pipeline.pipeline_decollage.UploadDecollagedImagesToS3.requires')
@metazool (Collaborator Author), Oct 4, 2024
This one was more interesting! As it stood previously, the code mocked the output of the pipeline task that this one requires.

That output was a str where a subclass of luigi.Target was required (and the task runner triggered a failure because of that). I tried replacing the output with a MockTarget, but then all the sub-tasks in the requires function of DecollageImages started to run before it ever got to returning its mocked output!

So I did this instead: replaced the whole requires chain with a single MockTask that returns an arbitrary file output, which we create first.

This was a good path into understanding Luigi! These complex dependencies with different data stages are hard to unit test like this; I'm impressed by the subtlety with which you did it. In your place I'd probably have written a big baggy "integration"-style test running the whole task graph against moto-server, and that would have been much less efficient :D
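The pattern described above can be sketched without Luigi itself: pre-create the file a dependency would have produced, then patch the `requires` chain with a stand-in task that just points at it. The class names below echo the real tasks, but this is a toy reconstruction using unittest.mock rather than pytest-mock's `mocker`:

```python
import os
import tempfile
from unittest import mock

# Toy stand-ins for the real Luigi tasks (the real test patches
# cyto_ml.pipeline.pipeline_decollage.UploadDecollagedImagesToS3.requires).
class DecollageImages:
    def requires(self):
        raise RuntimeError("would kick off the whole upstream task graph")

class UploadDecollagedImagesToS3:
    def requires(self):
        return DecollageImages()

class MockTask:
    """Replacement dependency that just points at a pre-created file."""
    def __init__(self, path: str) -> None:
        self.path = path

    def output(self) -> str:
        # A real Luigi test would return a luigi.LocalTarget here.
        return self.path

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "decollaged.txt")
    with open(out_path, "w") as out:
        out.write("blah")  # the arbitrary file output the MockTask serves up

    # Swap the whole requires chain for the single MockTask
    with mock.patch.object(
        UploadDecollagedImagesToS3, "requires", return_value=MockTask(out_path)
    ):
        task = UploadDecollagedImagesToS3()
        with open(task.requires().output()) as f:
            contents = f.read()  # "blah", without running any upstream tasks
```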

@metazool metazool merged commit c8422e9 into main Oct 4, 2024
2 checks passed
@albags (Collaborator) left a comment:

Thanks for putting this code here and fixing the tests.
