
create a data-processing pipeline #49

Closed · 4 tasks done · Tracked by #48 · Fixed by #52
Geet-George opened this issue Nov 2, 2023 · 1 comment

Comments

Geet-George (Owner) commented Nov 2, 2023

What do I mean by pipeline?

A "pipeline" simply means a sequence of data-processing steps executed in series, where the output of one element is the input of the next one.

Each step in the pipeline corresponds to a level of the data product (L1, L2, L3, L4) and is associated with a set of substeps, each of which involves a set of functions that are executed to process the data and move it to the next step.
The default pipeline is defined in a separate Python module (pipeline.py), which maps each level to a list of functions (or substeps) that should be executed to reach that level from the previous one.
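
As an illustration, the default pipeline module might look roughly like the sketch below. The level-to-substep layout follows the description above, but the substep and function names are hypothetical placeholders, not the package's actual API:

```python
# pipeline.py -- a minimal sketch, not the actual module.
# Each level maps substep names to the list of functions executed for that
# substep; the output of one function becomes the input of the next.

def run_qc_checks(ds):
    """Hypothetical substep function: run quality-control checks on the data."""
    return ds


def interpolate_to_grid(ds, vertical_spacing=10):
    """Hypothetical substep function: put profiles on a common vertical grid."""
    return ds


default_pipeline = {
    "L1": {"quality_control": [run_qc_checks]},
    "L2": {"gridding": [interpolate_to_grid]},
    # "L3": {...},
    # "L4": {...},
}
```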

This default pipeline can be modified by the user through the configuration file, which lets the user specify the functions for each substep in the pipeline. The smallest unit a user can change in the pipeline is a substep: if a substep is defined in the config file, it completely replaces the default substep; if a substep is not included, its default functions are executed. The argument values for the functions in the pipeline can also be configured by the user through the configuration file (this part is explained by commit 0612bc1).
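
A minimal sketch of that replace-whole-substeps rule, assuming the default pipeline is a dict like the one above and that the parsed config exposes a level → substep → functions mapping (both are assumptions, not the actual config schema):

```python
def resolve_pipeline(default_pipeline, config_substeps):
    """Return the pipeline to run, replacing whole substeps found in the config.

    A substep present in `config_substeps` replaces the default substep
    entirely; substeps absent from the config keep their default functions.
    """
    resolved = {}
    for level, substeps in default_pipeline.items():
        resolved[level] = {
            name: config_substeps.get(level, {}).get(name, functions)
            for name, functions in substeps.items()
        }
    return resolved
```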

The pipeline is executed by iterating over the levels in order and, for each level, executing the associated functions with the provided arguments. The arguments for each function are retrieved from the configuration file. The pipeline thus defines the flow of processing the data through the levels into the final dataset.
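
Sketching that execution loop under the same assumptions (the `get_args` helper below, which looks up a function's argument values in the parsed config, is hypothetical):

```python
def get_args(function, config):
    """Hypothetical helper: return the configured argument values for `function`
    (an empty dict if the config does not override anything)."""
    return config.get("args", {}).get(function.__name__, {})


def execute_pipeline(pipeline, dataset, config):
    """Run the levels in order; the output of each function feeds the next."""
    for level in ["L1", "L2", "L3", "L4"]:
        for functions in pipeline.get(level, {}).values():
            for function in functions:
                dataset = function(dataset, **get_args(function, config))
    return dataset
```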


The main() function currently does the following:
✅ gets default values of all args in all functions in the package
✅ checks for any user-provided non-default values to the args in the functions from the config file
✅ gets mandatory arg values from the config file

What the main() function should do next is:

  • access the default pipeline
  • get non-default substeps for the pipeline from the configuration file
  • execute the pipeline after accounting for the non-default parts of the pipeline and function args (see the sketch after this list)
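
Putting those three items together, and reusing the sketches above, main() could be wired up roughly like this (read_config and the JSON format are assumptions for illustration; the actual config handling already lives in the package):

```python
import json


def read_config(path):
    """Hypothetical stand-in for the package's config parsing (format assumed)."""
    with open(path) as f:
        return json.load(f)


def main(config_path, dataset):
    config = read_config(config_path)
    # non-default substeps the user put in the config file, if any
    substeps = config.get("pipeline", {})
    pipeline = resolve_pipeline(default_pipeline, substeps)
    return execute_pipeline(pipeline, dataset, config)
```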

For the above, we must:

Geet-George (Owner) commented:

Allowing the user to configure parts of the pipeline will be tricky, because the way one function's output feeds into the input of the next is not consistent. Therefore, it is best not to let the user play around with the pipeline. The only part within the pipeline that the user can modify will be deciding which QC checks to run (which is already done by commit dfca228 and is being implemented in PR #46). Most other things that I can think of (e.g. changing the vertical grid spacing, deciding between filtering out QC-failed sondes and merely flagging them, Gaussian or regression for L4, etc.) can be changed by the user by changing the default values of function arguments. Therefore, let's keep it simple and not make the pipeline flexible.
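
For illustration, the kind of user configuration this leaves on the table might look like the snippet below once parsed; the key names and QC-check names are placeholders, not the actual config schema:

```python
# The only pipeline choice left to the user is which QC checks to run;
# everything else is tuned by overriding function-argument defaults.
user_config = {
    "qc": {"checks": ["profile_fullness", "near_surface_coverage"]},  # hypothetical check names
    "args": {
        "interpolate_to_grid": {"vertical_spacing": 10},  # override a default argument value
    },
}
```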

Geet-George linked a pull request on Nov 3, 2023 that will close this issue.