Add the Luigi pipeline for image decollaging #40
Conversation
"""Not very lovely single function that replaces the work of the script.""" | ||
"""Not very lovely single function that replaces the work of the script. | ||
See cyto_ml.pipeline.pipeline_decollage - has the same code in it | ||
""" |
I looked at this and thought it wasn't worth refactoring any further - the logic belongs in the pipeline task, and this `FlowCamSession` class was a placeholder for it.

We could also just delete this version...
```python
def test_read_metadata(temp_dir):
    # Create a mock .lst file for testing
    lst_file_content = "001\nnum-fields|value\n"
```
So the issue here was the intricacies of the "csv-like" file format that the FlowCam instrument exports with its data. It has a meaningless numeric header, and 53 lines describing the header datatypes, before the actual CSV-readable data kicks in. The parsing function assumes this and fails if it's not the case. The "fix" here is to flesh out the test fixture until it matches the format expectations.

I would tend not to construct a test fixture in code here (I can see why you would for a simpler data structure), but to use sample data inside `tests/fixtures/` for any format with a bit of complexity. There is a cut-down `.lst` file lying around in the test fixtures for this project.

Perhaps there are good reasons not to! E.g. to deliberately break the fixtures in case the instrument outputs bad metadata, and see how our code handles that. I know my test style tends to be "happy path" ...

I've left it as-is though, just padded it out with dummy data until it matches what the microscope generates.
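For illustration, padding a fixture like that out in code might look something like this - a minimal sketch where the field names, the numeric header value, and the exact layout are assumptions based on the description above, not the parser's actual contract:

```python
from pathlib import Path


def make_lst_fixture(path: Path, n_fields: int = 53) -> None:
    """Write a minimal FlowCam-style .lst file: a meaningless numeric
    header line, a block of field|type descriptions, then one
    pipe-delimited data row. All values here are invented dummies.
    """
    lines = ["001"]  # numeric header line the parser skips over
    lines.append(f"num-fields|{n_fields}")
    # lines describing header datatypes (hypothetical field names)
    lines += [f"field_{i}|float" for i in range(n_fields - 1)]
    # one CSV-readable data row to keep the parser happy
    lines.append("|".join("0" for _ in range(n_fields)))
    path.write_text("\n".join(lines) + "\n")
```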
out.write("blah") | ||
# The task `requires` DecollageImages, but that requires other tasks, which run first | ||
# Rather than mock its output, or the whole chain, require a mock task that replaces it | ||
mock_output = mocker.patch(f'cyto_ml.pipeline.pipeline_decollage.UploadDecollagedImagesToS3.requires') |
This one was more interesting! As it stood previously, the code had mocked output for the pipeline task that this one requires. That output was a `str` where a subclass of `luigi.Target` was required (and the task runner triggered a failure because of that). I tried replacing the output with a `MockTarget`, but then all the sub-tasks that `DecollageImages` has in its `requires` function started to run before it ever got to returning its mocked output!

So I did this instead: replaced the whole `requires` chain with a single `MockTask` that returns an arbitrary file output, which we create first.
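In outline, the pattern looks like this - a minimal sketch assuming pytest-mock and Luigi's built-in `MockTarget`; the task body and target name are illustrative rather than the project's exact code:

```python
import luigi
from luigi.mock import MockTarget


class MockTask(luigi.Task):
    """Stand-in for the whole DecollageImages dependency chain."""

    def output(self) -> luigi.Target:
        return MockTarget("decollaged_images")  # illustrative target name

    def run(self) -> None:
        # Create the arbitrary file output up front, so the downstream
        # task sees a complete dependency and the real chain never runs
        with self.output().open("w") as out:
            out.write("blah")


def test_upload_task(mocker):
    # Swap the real `requires` chain for the single mock task
    mocker.patch(
        "cyto_ml.pipeline.pipeline_decollage.UploadDecollagedImagesToS3.requires",
        return_value=[MockTask()],
    )
```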
This was a good path into understanding Luigi! These complex dependencies with different data stages are hard to unit test like this; I'm impressed by the subtlety with which you did it. In your place I'd probably have written a big baggy "integration"-style test running the whole task graph and using `moto-server`, and that would have been much less efficient :D
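For comparison, that heavier alternative might look something like the sketch below, assuming moto's threaded server mode and an S3 client that can be pointed at a custom endpoint; the bucket name and task parameters are hypothetical:

```python
import boto3
import luigi
from moto.server import ThreadedMotoServer

from cyto_ml.pipeline.pipeline_decollage import UploadDecollagedImagesToS3


def test_full_pipeline_against_fake_s3():
    # Stand up a local fake S3 endpoint for the whole task graph to use
    server = ThreadedMotoServer(port=5000)
    server.start()
    try:
        s3 = boto3.client("s3", endpoint_url="http://127.0.0.1:5000")
        s3.create_bucket(Bucket="test-bucket")  # hypothetical bucket name
        # Run the real dependency chain end-to-end on the local scheduler
        luigi.build(
            [UploadDecollagedImagesToS3(directory="tests/fixtures")],  # hypothetical params
            local_scheduler=True,
        )
    finally:
        server.stop()
```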
Thanks for putting this code here and fixing the tests.
Adds the pipeline from the temporary project at https://github.com/NERC-CEH/plankton_pipeline_luigi/ plus its docs, and bugfixes for the associated tests.
`Task` functions have return types (`ruff check`, which we have in the CI pipeline, following dri-cd checks for this).

I'll leave comments about the specific test issues against the related code; a couple of them were quite gritty.
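In practice that annotation style on a Luigi task looks like this - a small sketch where the task name, parameter, and target path are invented for illustration:

```python
import luigi


class ExampleTask(luigi.Task):
    """Illustrative task with the return type annotations ruff checks for."""

    directory = luigi.Parameter()  # hypothetical parameter

    def output(self) -> luigi.LocalTarget:
        return luigi.LocalTarget(f"{self.directory}/done.txt")

    def run(self) -> None:
        with self.output().open("w") as out:
            out.write("done")
```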
This is more FYI @albags as it's originally your code! I'm happy to merge this and for you to remove the standalone project. #38 has more notes on the benefits we get from keeping it inside this one; I'm happy to argue about it there :D