📝 Various documentation improvement especially about PipelineML filtering (#554)
Galileo-Galilei committed Aug 26, 2024
1 parent d4267a9 commit 0e4fe31
Showing 3 changed files with 12 additions and 7 deletions.
6 changes: 3 additions & 3 deletions docs/source/05_framework_ml/01_why_framework.md
@@ -14,7 +14,7 @@ Actually, there is confusion about what "deployment" means, especially in big ent

- **scalability and cost control**: in many cases, you need to be able to deal with a lot of (possibly concurrent) requests (likely many more than during the training phase). It may be hard to ensure that the app will be able to deal with such an important amount of data. Another issue is that ML often needs specific infrastructure (e.g., GPUs) which is very expensive. Since the requests against the model often vary wildly over time, it may be important to adapt the infrastructure in real time to avoid a lot of infrastructure costs
- **speed**: A lot of recent *state of the art* (SOTA) deep learning / ensemble models are computationally heavy and this may hurt inference speed, which is critical for some systems.
- **disponibility and resilience**: Machine learning systems are more complex than traditional software because they have more moving parts (i.e. data and parameters). This increases the risk of errors, and since ML systems are used in critical systems, making both the infrastructure and the code robust is key.
- **availability and resilience**: Machine learning systems are more complex than traditional software because they have more moving parts (i.e. data and parameters). This increases the risk of errors, and since ML systems are used in critical systems, making both the infrastructure and the code robust is key.
- **portability / ease of integration with external components**: ML models are not intended to be used directly by the end users, but rather to be consumed by another part of your system (e.g. a call to an API). To speed up deployment, your model must be easy to consume, i.e. *as self contained as possible*. As a consequence, you must **deploy a ML pipeline which handles business objects instead of only a ML model**. If the other part which consumes your API needs to do a lot of data preprocessing *before* using your model, it makes it:
- very risky to use, because preprocessing and model are decoupled: any change in your model must be reflected in this other data pipeline, and there is a huge mismatch risk when redeploying
- slow and costly to deploy, because each deployment of your model requires new development on the client side
@@ -40,7 +40,7 @@ Since it is a more mature industry, efficient tools exist to manage these items
- parameters
- data

As ML is a much less mature field, efficient tooling to address these items is very recent and not completely standardized yet (e.g. Mlflow to track parameters, DVC to version data, `great-expectations` to monitor data which goes through your pipelines, `tensorboard` to monitor your model metrics...)
As machine learning is a much less mature field, efficient tooling to address these items is very recent and not completely standardized yet (e.g. ``Mlflow`` to track parameters, ``DVC`` to version data, `great-expectations` to ensure data quality checks along your pipelines, `tensorboard` to monitor your model metrics...)

> **Mlflow is one of the most mature tools to manage these new moving parts.**
@@ -112,7 +112,7 @@ As stated in the previous paragraph, the inference pipeline is not a primary concern wh
- in the best case, you have trained the model from a git sha which is logged in mlflow. Any potential user can recreate the exact inference pipeline from your source code (though it takes time), and retrieve all necessary artifacts from mlflow. This is tedious, error prone, and gives a lot of responsibility and work to your end user, but at least it makes your model usable.
- most likely, you did not train your model from a version control commit. While experimenting / debugging, it is very common to modify the code and retrain without committing. The exact code associated with a given model will likely be impossible to find out later.

> `kedro-mlflow` offers a `PipelineML` class (and its helper `pipeline_ml_factory`) which binds the `training` and `inference` pipelines, and a hook which autologs such pipelines when they are run. This enables data scientists to ensure that each trained model is logged with its associated inference pipeline, and is ready to use for any end user. This greatly decreases the cognitive complexity needed to ensure coherence between training and inference.
> `kedro-mlflow` offers a `PipelineML` class (and its helper `pipeline_ml_factory`) which binds the `training` and `inference` pipelines (similarly to the ``scikit-learn`` ``Pipeline`` object), and a hook which autologs such pipelines when they are run. This enables data scientists to ensure that each trained model is logged with its associated inference pipeline, and is ready to use for any end user. This greatly decreases the cognitive complexity needed to ensure coherence between training and inference.
### Issue 4: Data scientists do not handle business objects

5 changes: 3 additions & 2 deletions docs/source/05_pipeline_serving/01_mlflow_models.md
@@ -4,7 +4,6 @@

[Mlflow Models are a standardised agnostic format to store machine learning models](https://www.mlflow.org/docs/latest/models.html). They are intended to be standalone and as portable as possible, so that they can be deployed virtually anywhere, and mlflow provides built-in CLI commands to deploy an mlflow model to most common cloud platforms or to create an API.


A Mlflow Model is composed of:
- a ``MLmodel`` file which is a configuration file indicating to mlflow how to load the model. This file may also contain the ``Signature`` of the model (i.e. the ``Schema`` of the input and output of your model, including the column names and order) as well as example data.
- a ``conda.yml`` file which contains the specifications of the virtual conda environment inside which the model should run. It contains the packages versions necessary for your model to be executed.
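For illustration, a simplified ``MLmodel`` file might look roughly like the sketch below (all field values here are hypothetical):

```yaml
artifact_path: model
run_id: 0123456789abcdef
flavors:
  python_function:
    loader_module: mlflow.pyfunc.model
    python_version: 3.10.14
    env: conda.yml
signature:
  inputs: '[{"name": "text", "type": "string"}]'
  outputs: '[{"name": "prediction", "type": "string"}]'
```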
@@ -19,4 +18,6 @@ You can log any Kedro ``Pipeline`` matching the following requirements:
- one of its inputs must be a ``pandas.DataFrame``, a ``spark.DataFrame`` or a ``numpy.array``. This is the **input which contains the data to predict on**. This can be any Kedro ``AbstractDataset`` which loads data in one of the previous three formats. It can also be a ``MemoryDataset`` and not be persisted in the ``catalog.yml``.
- all its other inputs must be persisted on disk (e.g. the machine learning model must already be trained and saved so that it can be exported).

*Note: if the pipeline has parameters, they will be persisted before exporting the model, which implies that you will not be able to modify them at runtime. This is a limitation of ``mlflow``.*
```{note}
If the pipeline has parameters, they will be persisted before exporting the model, which implies that you will not be able to modify them at runtime. This is a limitation of ``mlflow<2.6.0`` which has recently been relaxed and will be addressed by https://github.com/Galileo-Galilei/kedro-mlflow/issues/445.
```
8 changes: 6 additions & 2 deletions docs/source/07_python_objects/03_Pipelines.md
@@ -39,7 +39,11 @@ Note that:
- the `inference` pipeline `inputs` must belong to training `outputs` (vectorizer, binarizer, machine learning model...)
- the `inference` pipeline must have one and only one `output`

*Note: If you want to log a ``PipelineML`` object in ``mlflow`` programmatically, you can use the following code snippet:*
```{caution}
``PipelineML`` objects do not implement all filtering methods of a regular ``Pipeline``, and you cannot add or subtract two ``PipelineML`` objects. The rationale is that a filtered ``PipelineML`` is not, in general, a ``PipelineML``, because the [filtering is not consistent between training and inference](https://github.com/Galileo-Galilei/kedro-mlflow/issues/554). You can see the supported methods [in the code](https://github.com/Galileo-Galilei/kedro-mlflow/blob/master/kedro_mlflow/pipeline/pipeline_ml.py#L162).
```

You can also directly log a ``PipelineML`` object in ``mlflow`` programmatically:

```python
from pathlib import Path
@@ -72,4 +76,4 @@ It is also possible to pass arguments to `KedroPipelineModel` to specify the run
KedroPipelineModel(pipeline=pipeline_training, catalog=catalog, copy_mode="assign")
```

Available `copy_mode` are ``assign``, ``copy`` and ``deepcopy``. It is possible to pass a dictionary to specify different copy mode fo each dataset.
Available `copy_mode` are ``assign``, ``copy`` and ``deepcopy``. It is possible to pass a dictionary to specify different copy mode for each dataset.
