Add documentation for dataset factories feature (#2670)
* Add documentation for dataset factories

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Apply suggestions from code review

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs with more examples

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Minor edits

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Apply suggestions from code review

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>

* Add namespace example

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Minor edit: Add explanation to example 2

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Update example

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

---------

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
3 people authored Jul 6, 2023
1 parent 6da8bde commit 9500d60
Showing 1 changed file with 206 additions and 1 deletion: docs/source/data/data_catalog.md
@@ -404,7 +404,7 @@ CSVDataSet(
```


## Load multiple datasets with similar configuration using YAML anchors

Different datasets might use the same file format, load and save arguments, and be stored in the same folder. [YAML has a built-in syntax](https://yaml.org/spec/1.2.1/#Syntax) for factorising parts of a YAML file, which means that you can decide what is generalisable across your datasets, so that you need not spend time copying and pasting dataset configurations in the `catalog.yml` file.

@@ -461,6 +461,211 @@ airplanes:
In this example, the default `csv` configuration is inserted into `airplanes` and then the `load_args` block is overridden. Normally, that would replace the whole dictionary. In order to extend `load_args`, the defaults for that block are then re-inserted.

## Load multiple datasets with similar configuration using dataset factories
For catalog entries that share configuration details, you can also use the dataset factories introduced in Kedro 0.18.11. This syntax allows you to generalise the configuration and
reduce the number of similar catalog entries by matching datasets used in your project's pipelines to dataset factory patterns.

### Example 1: Generalise datasets with similar names and types into one dataset factory
Consider the following catalog entries:
```yaml
factory_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/factory_data.csv
process_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/process_data.csv
```
The datasets in this catalog can be generalised to the following dataset factory:
```yaml
"{name}_data":
type: pandas.CSVDataSet
filepath: data/01_raw/{name}_data.csv
```
When `factory_data` or `process_data` is used in your pipeline, it is matched to the factory pattern `{name}_data`. The factory pattern must always be enclosed in
quotes to avoid YAML parsing errors.
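
For instance, with only the factory pattern above in `catalog.yml`, a pipeline can reference both names directly. The following is a minimal sketch; `clean_factory_data` and `clean_process_data` are hypothetical node functions, not part of the example project:
```python
from kedro.pipeline import node, pipeline

# Hypothetical node functions, assumed to live in this pipeline's nodes.py.
from .nodes import clean_factory_data, clean_process_data


def create_pipeline(**kwargs):
    return pipeline(
        [
            # "factory_data" and "process_data" both match "{name}_data",
            # so they are loaded from data/01_raw/ via the factory pattern.
            node(clean_factory_data, inputs="factory_data", outputs="factory_features"),
            node(clean_process_data, inputs="process_data", outputs="process_features"),
        ]
    )
```
The outputs do not end in `_data`, so they do not match the pattern and remain `MemoryDataSet`s by default.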


### Example 2: Generalise datasets of the same type into one dataset factory
You can also combine all the datasets with the same type and configuration details. For example, consider the following
catalog with three datasets named `boats`, `cars` and `planes` of the type `pandas.CSVDataSet`:
```yaml
boats:
  type: pandas.CSVDataSet
  filepath: data/01_raw/boats.csv
cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/cars.csv
planes:
  type: pandas.CSVDataSet
  filepath: data/01_raw/planes.csv
```
These datasets can be combined into the following dataset factory:
```yaml
"{dataset_name}#csv":
type: pandas.CSVDataSet
filepath: data/01_raw/{dataset_name}.csv
```
You will then need to update the pipelines in your project, located at `src/<project_name>/pipelines/<pipeline_name>/pipeline.py`, to refer to these datasets as `boats#csv`,
`cars#csv` and `planes#csv`. Adding a suffix or a prefix, like `#csv` here, to both the dataset names and the factory pattern ensures that dataset
names are only matched with the intended pattern.
```python
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import (
    create_model_input_table,
    preprocess_boats,
    preprocess_cars,
    preprocess_planes,
)


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_boats,
                inputs="boats#csv",
                outputs="preprocessed_boats",
                name="preprocess_boats_node",
            ),
            node(
                func=preprocess_cars,
                inputs="cars#csv",
                outputs="preprocessed_cars",
                name="preprocess_cars_node",
            ),
            node(
                func=preprocess_planes,
                inputs="planes#csv",
                outputs="preprocessed_planes",
                name="preprocess_planes_node",
            ),
            node(
                func=create_model_input_table,
                inputs=[
                    "preprocessed_boats",
                    "preprocessed_planes",
                    "preprocessed_cars",
                ],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ]
    )
```
### Example 3: Generalise datasets using namespaces into one dataset factory
You can also generalise the catalog entries for datasets belonging to namespaced modular pipelines. Consider the
following pipeline which takes in a `model_input_table` and outputs two regressors belonging to the
`active_modelling_pipeline` and the `candidate_modelling_pipeline` namespaces:
```python
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline

from .nodes import split_data, train_model


def create_pipeline(**kwargs) -> Pipeline:
    pipeline_instance = pipeline(
        [
            node(
                func=split_data,
                inputs=["model_input_table", "params:model_options"],
                outputs=["X_train", "y_train"],
                name="split_data_node",
            ),
            node(
                func=train_model,
                inputs=["X_train", "y_train"],
                outputs="regressor",
                name="train_model_node",
            ),
        ]
    )
    ds_pipeline_1 = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace="active_modelling_pipeline",
    )
    ds_pipeline_2 = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace="candidate_modelling_pipeline",
    )
    return ds_pipeline_1 + ds_pipeline_2
```
You can now have one dataset factory pattern in your catalog instead of two separate entries for `active_modelling_pipeline.regressor`
and `candidate_modelling_pipeline.regressor`:
```yaml
"{namespace}.regressor":
  type: pickle.PickleDataSet
  filepath: data/06_models/regressor_{namespace}.pkl
  versioned: true
```
During a run, `active_modelling_pipeline.regressor` is then saved to `data/06_models/regressor_active_modelling_pipeline.pkl` and `candidate_modelling_pipeline.regressor` to `data/06_models/regressor_candidate_modelling_pipeline.pkl`.
### Example 4: Generalise datasets of the same type in different layers into one dataset factory with multiple placeholders

You can use multiple placeholders in the same pattern. For example, consider the following catalog where the dataset
entries share `type`, `file_format` and `save_args`:
```yaml
processing.factory_data:
  type: spark.SparkDataSet
  filepath: data/processing/factory_data.pq
  file_format: parquet
  save_args:
    mode: overwrite
processing.process_data:
  type: spark.SparkDataSet
  filepath: data/processing/process_data.pq
  file_format: parquet
  save_args:
    mode: overwrite
modelling.metrics:
  type: spark.SparkDataSet
  filepath: data/modelling/metrics.pq
  file_format: parquet
  save_args:
    mode: overwrite
```
This could be generalised to the following pattern:
```yaml
"{layer}.{dataset_name}":
type: spark.SparkDataSet
filepath: data/{layer}/{dataset_name}.pq
file_format: parquet
save_args:
mode: overwrite
```
All the placeholders used in the catalog entry body must exist in the factory pattern name.
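
To see what gets extracted from a dataset name, here is a rough illustration using the standalone [`parse`](https://pypi.org/project/parse/) package, which Kedro uses under the hood for pattern matching. This is not the catalog API itself, which applies patterns for you:
```python
from parse import parse

# Decompose the dataset name against the factory pattern.
match = parse("{layer}.{dataset_name}", "processing.factory_data")
print(match["layer"])  # processing
print(match["dataset_name"])  # factory_data

# Substituting both fields into the filepath template gives the resolved path.
print("data/{layer}/{dataset_name}.pq".format(**match.named))
# data/processing/factory_data.pq
```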

### Example 5: Generalise datasets using multiple dataset factories
You can have multiple dataset factories in your catalog. For example:
```yaml
"{namespace}.{dataset_name}@spark":
type: spark.SparkDataSet
filepath: data/{namespace}/{dataset_name}.pq
file_format: parquet
"{dataset_name}@csv":
type: pandas.CSVDataSet
filepath: data/01_raw/{dataset_name}.csv
```

Having multiple dataset factories in your catalog can lead to a situation where a dataset name from your pipeline might
match multiple patterns. To overcome this, Kedro sorts all the potential matches for the dataset name in the pipeline and picks the best match.
The matches are ranked according to the following criteria, illustrated in the sketch after this list:
1. Number of exact character matches between the dataset name and the factory pattern. For example, a dataset named `factory_data$csv` would match `{dataset}_data$csv` over `{dataset_name}$csv`.
2. Number of placeholders. For example, the dataset `preprocessing.shuttles+csv` would match `{namespace}.{dataset}+csv` over `{dataset}+csv`.
3. Alphabetical order.
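
The following is a minimal sketch of this ranking, not Kedro's actual implementation or API. It assumes the `parse` package for matching and counts "exact character matches" as the characters outside placeholders:
```python
import re

from parse import parse


def specificity(pattern):
    # Count the characters outside placeholders, e.g. "{dataset}_data$csv" -> 9.
    return len(re.sub(r"\{.*?\}", "", pattern))


def best_match(dataset_name, patterns):
    """Return the highest-ranked pattern matching `dataset_name`, if any."""
    matches = [p for p in patterns if parse(p, dataset_name)]
    if not matches:
        return None
    # Rank by: 1. most exact characters, 2. most placeholders, 3. alphabetical.
    return sorted(matches, key=lambda p: (-specificity(p), -p.count("{"), p))[0]


patterns = ["{namespace}.{dataset}+csv", "{dataset}+csv"]
print(best_match("preprocessing.shuttles+csv", patterns))
# {namespace}.{dataset}+csv
```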

### Example 6: Generalise all datasets with a catch-all dataset factory to override the default `MemoryDataSet`
You can use dataset factories to define a catch-all pattern that overrides the default `MemoryDataSet` creation.
```yaml
"{default_dataset}":
type: pandas.CSVDataSet
filepath: data/{default_dataset}.csv
```
Kedro will now treat every dataset mentioned in your project's pipelines that does not match a more specific pattern or an explicit catalog entry
as a `pandas.CSVDataSet`. Note that this includes intermediate datasets that would otherwise be ephemeral `MemoryDataSet`s, so a catch-all pattern causes every such dataset to be saved to disk.

## Transcode datasets

