What do I mean by pipeline?
A "pipeline" simply means a sequence of data-processing steps executed in series, where the output of one step is the input of the next.
Each step in the pipeline corresponds to a level of the data product (L1, L2, L3, L4) and is associated with a set of substeps, each of which involves a set of functions that are executed to process the data and move it to the next level.
The default pipeline is defined in a separate Python module (pipeline.py), which maps each level to a list of functions (or substeps) that should be executed to reach that level from the previous one.
This default pipeline can be modified by the user through the configuration file, which lets the user specify the functions for each substep in the pipeline. If a substep is not included in the configuration file, its default functions are executed; if a substep is defined in the config file, it completely replaces the default substep. The substep is therefore the smallest unit of the pipeline that a user can change. The argument values for the functions in the pipeline can also be configured by the user through the configuration file (this part is explained by commit 0612bc1).
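As a rough illustration of the config-file mechanism described above, a user override might look like the following. This is only a sketch: the file format (YAML) and all key and function names are assumptions, not the package's actual schema.

```yaml
# Hypothetical config.yaml -- section and function names are illustrative.
pipeline:
  L2:
    sort_profile:              # naming a substep replaces its default functions entirely
      - my_module.custom_sort
arguments:
  interpolate_to_grid:         # non-default value for a function argument
    dz: 5
```

Substeps absent from the `pipeline` section would fall back to their defaults, while `arguments` entries only override individual argument values.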
The pipeline is executed by iterating over the levels in order and, for each level, executing the associated functions with the provided arguments, which are retrieved from the configuration file. The pipeline thus defines the flow of data through the levels into the final dataset.
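The level-to-substeps mapping and the execution loop described above can be sketched as follows. This is a minimal illustration of the idea behind `pipeline.py`; the level names, substep functions, and data shapes are placeholders, not the package's actual API.

```python
# Minimal sketch of a default pipeline: each level maps to the ordered
# list of substeps needed to reach it from the previous level.

def strip_invalid_records(data):
    """Placeholder L1 substep: drop obviously invalid records."""
    return [x for x in data if x is not None]

def interpolate_to_grid(data):
    """Placeholder L2 substep: stand-in for regridding the data."""
    return sorted(data)

default_pipeline = {
    "L1": [strip_invalid_records],
    "L2": [interpolate_to_grid],
}

def run_pipeline(data, pipeline):
    """Execute levels in order; each substep's output feeds the next."""
    for level in sorted(pipeline):
        for substep in pipeline[level]:
            data = substep(data)
    return data

print(run_pipeline([3, None, 1, 2], default_pipeline))  # -> [1, 2, 3]
```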
The main() function currently does the following:
✅ gets default values of all args in all functions in the package
✅ checks for any user-provided non-default values to the args in the functions from the config file
✅ gets mandatory arg values from the config file
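The first two checklist items above (collecting default argument values and applying user overrides) could be implemented along these lines. The function and config key names are hypothetical; only the use of `inspect.signature` to read defaults reflects a concrete mechanism.

```python
# Sketch of how main() can collect default argument values for functions
# in the package and apply user overrides from the config file.
import inspect

def interpolate_to_grid(data, dz=10, method="linear"):
    """Hypothetical example function with keyword defaults."""
    return data

def collect_defaults(functions):
    """Map each function name to its keyword arguments' default values."""
    defaults = {}
    for fn in functions:
        sig = inspect.signature(fn)
        defaults[fn.__name__] = {
            name: p.default
            for name, p in sig.parameters.items()
            if p.default is not inspect.Parameter.empty
        }
    return defaults

def apply_config(defaults, config):
    """Override defaults with any user-provided values from the config."""
    merged = {fn: dict(args) for fn, args in defaults.items()}
    for fn, overrides in config.items():
        merged.setdefault(fn, {}).update(overrides)
    return merged

defaults = collect_defaults([interpolate_to_grid])
user_config = {"interpolate_to_grid": {"dz": 5}}
print(apply_config(defaults, user_config))
# -> {'interpolate_to_grid': {'dz': 5, 'method': 'linear'}}
```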
Allowing the user to configure parts of the pipeline will be tricky, because the way one function's output feeds into the input of the next is not consistent. Therefore, it is best not to let the user play around with the pipeline. The only part of the pipeline the user can modify will be deciding which QC checks to run (which is already done by commit dfca228 and is being implemented in PR #46). Most other things that I can think of (e.g. changing vertical grid spacing, deciding between filtering QC-failed sondes and flagging them, Gaussian or regression for L4, etc.) can be changed by the user by means of changing the default values of function arguments. Therefore, let's keep it simple and not make the pipeline flexible.
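The one configurable choice discussed above (which QC checks to run) could look roughly like this. The check names, the sonde data shape, and the config format are all illustrative assumptions; this does not reflect the implementation in commit dfca228 or PR #46.

```python
# Hedged sketch: the user selects which QC checks run; the pipeline
# itself stays fixed. Check names and thresholds are invented.

def check_near_surface_coverage(sonde):
    """Placeholder QC check: enough measurements near the surface?"""
    return sonde.get("n_near_surface", 0) > 10

def check_profile_coverage(sonde):
    """Placeholder QC check: enough of the profile covered?"""
    return sonde.get("fraction_profile", 0.0) > 0.8

available_qc = {
    "near_surface_coverage": check_near_surface_coverage,
    "profile_coverage": check_profile_coverage,
}

def run_qc(sonde, enabled):
    """Run only the QC checks the user enabled in the config."""
    return {name: available_qc[name](sonde) for name in enabled}

sonde = {"n_near_surface": 25, "fraction_profile": 0.5}
print(run_qc(sonde, ["near_surface_coverage"]))  # -> {'near_surface_coverage': True}
```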
What the main() function should do next is:
- get non-default substeps for the pipeline from the configuration file
- the pipeline and fn args

For the above, we must:
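Splicing the non-default substeps from the config into the default pipeline could be sketched as below. The pipeline structure (level → substep name → list of functions) and all names are assumptions for illustration.

```python
# Sketch of the next step for main(): a substep named in the config
# completely replaces the default substep of the same name.

def default_sort(data):
    return sorted(data)

def user_sort(data):
    return sorted(data, reverse=True)

default_pipeline = {"L2": {"sort_profile": [default_sort]}}

def build_pipeline(default_pipeline, config_substeps):
    """Return a pipeline with user-defined substeps substituted in."""
    pipeline = {lvl: dict(subs) for lvl, subs in default_pipeline.items()}
    for level, substeps in config_substeps.items():
        for name, fns in substeps.items():
            pipeline.setdefault(level, {})[name] = fns
    return pipeline

built = build_pipeline(default_pipeline, {"L2": {"sort_profile": [user_sort]}})
print(built["L2"]["sort_profile"][0]([1, 3, 2]))  # -> [3, 2, 1]
```

Substeps absent from the config keep their default function lists, matching the replacement rule described earlier.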