
create a data-processing pipeline #49

Closed · 4 tasks done · Tracked by #48 · Fixed by #52
Geet-George opened this issue Nov 2, 2023 · 1 comment

Comments

Geet-George (Owner) commented Nov 2, 2023

What do I mean by pipeline?

A "pipeline" simply means a sequence of data-processing steps executed in series, where the output of one element is the input of the next one.

Each step in the pipeline corresponds to a level of the data product (L1, L2, L3, L4) and is associated with a set of substeps, each of which involves a set of functions that are executed to process the data and move it to the next step.
The default pipeline is defined in a separate Python module (pipeline.py), which maps each level to a list of functions (or substeps) that should be executed to reach that level from the previous one.
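
As an illustration, the default pipeline module might look roughly like the sketch below. The level-to-substep layout follows the description above, but the substep and function names are hypothetical placeholders, not the package's actual API:

```python
# pipeline.py -- a minimal sketch, not the actual module.
# Each level maps substep names to the list of functions executed for that
# substep; the output of one function becomes the input of the next.

def run_qc_checks(ds):
    """Hypothetical substep function: run quality-control checks on the data."""
    return ds


def interpolate_to_grid(ds, vertical_spacing=10):
    """Hypothetical substep function: put profiles on a common vertical grid."""
    return ds


default_pipeline = {
    "L1": {"quality_control": [run_qc_checks]},
    "L2": {"gridding": [interpolate_to_grid]},
    # "L3": {...},
    # "L4": {...},
}
```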

This default pipeline can be modified by the user through the configuration file, which lets the user specify the functions for each substep in the pipeline. The smallest unit a user can change in the pipeline is a substep: if a substep is defined in the config file, it completely replaces the default substep; if a substep is not included, its default functions are executed. The argument values for the functions in the pipeline can also be configured by the user through the configuration file (this part is explained by commit 0612bc1).
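
A minimal sketch of that replace-whole-substeps rule, assuming the default pipeline is a dict like the one above and that the parsed config exposes a level → substep → functions mapping (both are assumptions, not the actual config schema):

```python
def resolve_pipeline(default_pipeline, config_substeps):
    """Return the pipeline to run, replacing whole substeps found in the config.

    A substep present in `config_substeps` replaces the default substep
    entirely; substeps absent from the config keep their default functions.
    """
    resolved = {}
    for level, substeps in default_pipeline.items():
        resolved[level] = {
            name: config_substeps.get(level, {}).get(name, functions)
            for name, functions in substeps.items()
        }
    return resolved
```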

The pipeline is executed by iterating over the levels in order and, for each level, executing the associated functions with the provided arguments. The arguments for each function are retrieved from the configuration file. The pipeline thus defines the flow of processing the data through the levels into the final dataset.
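
Sketching that execution loop under the same assumptions (the `get_args` helper below, which looks up a function's argument values in the parsed config, is hypothetical):

```python
def get_args(function, config):
    """Hypothetical helper: return the configured argument values for `function`
    (an empty dict if the config does not override anything)."""
    return config.get("args", {}).get(function.__name__, {})


def execute_pipeline(pipeline, dataset, config):
    """Run the levels in order; the output of each function feeds the next."""
    for level in ["L1", "L2", "L3", "L4"]:
        for functions in pipeline.get(level, {}).values():
            for function in functions:
                dataset = function(dataset, **get_args(function, config))
    return dataset
```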


The main() function currently does the following:
✅ gets default values of all args in all functions in the package
✅ checks for any user-provided non-default values to the args in the functions from the config file
✅ gets mandatory arg values from the config file

What the main() function should do next is:

  • access the default pipeline
  • get non-default substeps for the pipeline from the configuration file
  • execute the pipeline after accounting for the non-default parts of the pipeline and function args (see the sketch after this list)
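
Putting those three items together, and reusing the sketches above, main() could be wired up roughly like this (read_config and the JSON format are assumptions for illustration; the actual config handling already lives in the package):

```python
import json


def read_config(path):
    """Hypothetical stand-in for the package's config parsing (format assumed)."""
    with open(path) as f:
        return json.load(f)


def main(config_path, dataset):
    config = read_config(config_path)
    # non-default substeps the user put in the config file, if any
    substeps = config.get("pipeline", {})
    pipeline = resolve_pipeline(default_pipeline, substeps)
    return execute_pipeline(pipeline, dataset, config)
```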

For the above, we must:

Geet-George (Owner) commented:

Allowing the user to configure parts of the pipeline will be tricky, because the way one function's output feeds into the input of the next is not consistent. Therefore, it is best not to let the user play around with the pipeline. The only part within the pipeline that the user can modify will be deciding which QC checks to run (which is already done by commit dfca228 and is being implemented in PR #46). Most other things that I can think of (e.g. changing the vertical grid spacing, deciding between filtering out QC-failed sondes and merely flagging them, Gaussian or regression for L4, etc.) can be changed by the user by changing the default values of function arguments. Therefore, let's keep it simple and not make the pipeline flexible.
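
For illustration, the kind of user configuration this leaves on the table might look like the snippet below once parsed; the key names and QC-check names are placeholders, not the actual config schema:

```python
# The only pipeline choice left to the user is which QC checks to run;
# everything else is tuned by overriding function-argument defaults.
user_config = {
    "qc": {"checks": ["profile_fullness", "near_surface_coverage"]},  # hypothetical check names
    "args": {
        "interpolate_to_grid": {"vertical_spacing": 10},  # override a default argument value
    },
}
```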

Geet-George linked a pull request on Nov 3, 2023 that will close this issue.