Add documentation for dataset factories feature (#2670)
* Add documentation for dataset factories

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Apply suggestions from code review

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs with more examples

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Minor edits

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Apply suggestions from code review

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>

* Add namespace example

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Minor edit: Add explanation to example 2

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Update example

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

---------

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
3 people authored Jul 6, 2023
1 parent 6da8bde commit 9500d60
Showing 1 changed file with 206 additions and 1 deletion: docs/source/data/data_catalog.md
@@ -404,7 +404,7 @@ CSVDataSet(
```


## Load multiple datasets with similar configuration using YAML anchors

Different datasets might use the same file format, load and save arguments, and be stored in the same folder. [YAML has a built-in syntax](https://yaml.org/spec/1.2.1/#Syntax) for factorising parts of a YAML file, which means that you can decide what is generalisable across your datasets, so that you need not spend time copying and pasting dataset configurations in the `catalog.yml` file.

@@ -461,6 +461,211 @@ airplanes:
In this example, the default `csv` configuration is inserted into `airplanes` and then the `load_args` block is overridden. Normally, that would replace the whole dictionary. In order to extend `load_args`, the defaults for that block are then re-inserted.

## Load multiple datasets with similar configuration using dataset factories
For catalog entries that share configuration details, you can also use the dataset factories introduced in Kedro 0.18.11. This syntax allows you to generalise the configuration and
reduce the number of similar catalog entries by matching datasets used in your project's pipelines to dataset factory patterns.

### Example 1: Generalise datasets with similar names and types into one dataset factory
Consider the following catalog entries:
```yaml
factory_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/factory_data.csv
process_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/process_data.csv
```
The datasets in this catalog can be generalised to the following dataset factory:
```yaml
"{name}_data":
type: pandas.CSVDataSet
filepath: data/01_raw/{name}_data.csv
```
When `factory_data` or `process_data` is used in your pipeline, it is matched to the factory pattern `{name}_data`. The factory pattern must always be enclosed in
quotes to avoid YAML parsing errors.
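
For instance, with only the factory pattern above in `catalog.yml`, a pipeline can reference both names directly. The following is a minimal sketch; `clean_factory_data` and `clean_process_data` are hypothetical node functions, not part of the example project:
```python
from kedro.pipeline import node, pipeline

# Hypothetical node functions, assumed to live in this pipeline's nodes.py.
from .nodes import clean_factory_data, clean_process_data


def create_pipeline(**kwargs):
    return pipeline(
        [
            # "factory_data" and "process_data" both match "{name}_data",
            # so they are loaded from data/01_raw/ via the factory pattern.
            node(clean_factory_data, inputs="factory_data", outputs="factory_features"),
            node(clean_process_data, inputs="process_data", outputs="process_features"),
        ]
    )
```
The outputs do not end in `_data`, so they do not match the pattern and remain `MemoryDataSet`s by default.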


### Example 2: Generalise datasets of the same type into one dataset factory
You can also combine all the datasets with the same type and configuration details. For example, consider the following
catalog with three datasets named `boats`, `cars` and `planes` of the type `pandas.CSVDataSet`:
```yaml
boats:
  type: pandas.CSVDataSet
  filepath: data/01_raw/boats.csv
cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/cars.csv
planes:
  type: pandas.CSVDataSet
  filepath: data/01_raw/planes.csv
```
These datasets can be combined into the following dataset factory:
```yaml
"{dataset_name}#csv":
type: pandas.CSVDataSet
filepath: data/01_raw/{dataset_name}.csv
```
You will then need to update the pipelines in your project, located at `src/<project_name>/pipelines/<pipeline_name>/pipeline.py`, to refer to these datasets as `boats#csv`,
`cars#csv` and `planes#csv`. Adding a suffix or a prefix, like `#csv` here, to both the dataset names and the factory pattern ensures that dataset
names are only matched with the intended pattern.
```python
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import (
    create_model_input_table,
    preprocess_boats,
    preprocess_cars,
    preprocess_planes,
)


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_boats,
                inputs="boats#csv",
                outputs="preprocessed_boats",
                name="preprocess_boats_node",
            ),
            node(
                func=preprocess_cars,
                inputs="cars#csv",
                outputs="preprocessed_cars",
                name="preprocess_cars_node",
            ),
            node(
                func=preprocess_planes,
                inputs="planes#csv",
                outputs="preprocessed_planes",
                name="preprocess_planes_node",
            ),
            node(
                func=create_model_input_table,
                inputs=[
                    "preprocessed_boats",
                    "preprocessed_planes",
                    "preprocessed_cars",
                ],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ]
    )
```
### Example 3: Generalise datasets using namespaces into one dataset factory
You can also generalise the catalog entries for datasets belonging to namespaced modular pipelines. Consider the
following pipeline which takes in a `model_input_table` and outputs two regressors belonging to the
`active_modelling_pipeline` and the `candidate_modelling_pipeline` namespaces:
```python
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline

from .nodes import split_data, train_model


def create_pipeline(**kwargs) -> Pipeline:
    pipeline_instance = pipeline(
        [
            node(
                func=split_data,
                inputs=["model_input_table", "params:model_options"],
                outputs=["X_train", "y_train"],
                name="split_data_node",
            ),
            node(
                func=train_model,
                inputs=["X_train", "y_train"],
                outputs="regressor",
                name="train_model_node",
            ),
        ]
    )
    ds_pipeline_1 = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace="active_modelling_pipeline",
    )
    ds_pipeline_2 = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace="candidate_modelling_pipeline",
    )
    return ds_pipeline_1 + ds_pipeline_2
```
You can now have one dataset factory pattern in your catalog instead of two separate entries for `active_modelling_pipeline.regressor`
and `candidate_modelling_pipeline.regressor`:
```yaml
"{namespace}.regressor":
  type: pickle.PickleDataSet
  filepath: data/06_models/regressor_{namespace}.pkl
  versioned: true
```
During a run, `active_modelling_pipeline.regressor` is then saved to `data/06_models/regressor_active_modelling_pipeline.pkl` and `candidate_modelling_pipeline.regressor` to `data/06_models/regressor_candidate_modelling_pipeline.pkl`.
### Example 4: Generalise datasets of the same type in different layers into one dataset factory with multiple placeholders

You can use multiple placeholders in the same pattern. For example, consider the following catalog where the dataset
entries share `type`, `file_format` and `save_args`:
```yaml
processing.factory_data:
  type: spark.SparkDataSet
  filepath: data/processing/factory_data.pq
  file_format: parquet
  save_args:
    mode: overwrite
processing.process_data:
  type: spark.SparkDataSet
  filepath: data/processing/process_data.pq
  file_format: parquet
  save_args:
    mode: overwrite
modelling.metrics:
  type: spark.SparkDataSet
  filepath: data/modelling/metrics.pq
  file_format: parquet
  save_args:
    mode: overwrite
```
This could be generalised to the following pattern:
```yaml
"{layer}.{dataset_name}":
type: spark.SparkDataSet
filepath: data/{layer}/{dataset_name}.pq
file_format: parquet
save_args:
mode: overwrite
```
All the placeholders used in the catalog entry body must exist in the factory pattern name.
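
To see what gets extracted from a dataset name, here is a rough illustration using the standalone [`parse`](https://pypi.org/project/parse/) package, which Kedro uses under the hood for pattern matching. This is not the catalog API itself, which applies patterns for you:
```python
from parse import parse

# Decompose the dataset name against the factory pattern.
match = parse("{layer}.{dataset_name}", "processing.factory_data")
print(match["layer"])  # processing
print(match["dataset_name"])  # factory_data

# Substituting both fields into the filepath template gives the resolved path.
print("data/{layer}/{dataset_name}.pq".format(**match.named))
# data/processing/factory_data.pq
```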

### Example 5: Generalise datasets using multiple dataset factories
You can have multiple dataset factories in your catalog. For example:
```yaml
"{namespace}.{dataset_name}@spark":
type: spark.SparkDataSet
filepath: data/{namespace}/{dataset_name}.pq
file_format: parquet
"{dataset_name}@csv":
type: pandas.CSVDataSet
filepath: data/01_raw/{dataset_name}.csv
```

Having multiple dataset factories in your catalog can lead to a situation where a dataset name from your pipeline might
match multiple patterns. To overcome this, Kedro sorts all the potential matches for the dataset name in the pipeline and picks the best match.
The matches are ranked according to the following criteria, illustrated in the sketch after this list:
1. Number of exact character matches between the dataset name and the factory pattern. For example, a dataset named `factory_data$csv` would match `{dataset}_data$csv` over `{dataset_name}$csv`.
2. Number of placeholders. For example, the dataset `preprocessing.shuttles+csv` would match `{namespace}.{dataset}+csv` over `{dataset}+csv`.
3. Alphabetical order.
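
The following is a minimal sketch of this ranking, not Kedro's actual implementation or API. It assumes the `parse` package for matching and counts "exact character matches" as the characters outside placeholders:
```python
import re

from parse import parse


def specificity(pattern):
    # Count the characters outside placeholders, e.g. "{dataset}_data$csv" -> 9.
    return len(re.sub(r"\{.*?\}", "", pattern))


def best_match(dataset_name, patterns):
    """Return the highest-ranked pattern matching `dataset_name`, if any."""
    matches = [p for p in patterns if parse(p, dataset_name)]
    if not matches:
        return None
    # Rank by: 1. most exact characters, 2. most placeholders, 3. alphabetical.
    return sorted(matches, key=lambda p: (-specificity(p), -p.count("{"), p))[0]


patterns = ["{namespace}.{dataset}+csv", "{dataset}+csv"]
print(best_match("preprocessing.shuttles+csv", patterns))
# {namespace}.{dataset}+csv
```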

### Example 6: Generalise all datasets with a catch-all dataset factory to override the default `MemoryDataSet`
You can use dataset factories to define a catch-all pattern that overrides the default `MemoryDataSet` creation.
```yaml
"{default_dataset}":
type: pandas.CSVDataSet
filepath: data/{default_dataset}.csv
```
Kedro will now treat every dataset mentioned in your project's pipelines that does not match a more specific pattern or an explicit catalog entry
as a `pandas.CSVDataSet`. Note that this includes intermediate datasets that would otherwise be ephemeral `MemoryDataSet`s, so a catch-all pattern causes every such dataset to be saved to disk.

## Transcode datasets

