What are the user needs for the catalog API? #1978

merelcht · 2022-10-26T13:24:31Z

Description

The catalog API is an old component of Kedro and could do with refactoring. In order to redesign the API, we need to find out what the user needs are for it.

Some of the questions that need to be answered:

Why do users need to access datasets from the catalog directly?
Why do users need to get the dataset filepath?

Context

#1778

Target users

Advanced users: those who interact with the API directly
Plugin developers + kedro-viz team


astrojuanlu · 2024-02-21T14:43:29Z

To be expanded: kedro-org/kedro-plugins#557 (comment)

astrojuanlu · 2024-02-22T10:57:49Z

Introduction

The DataCatalog Python class centralises access to data in Kedro projects. Its use is invisible to regular users (who mostly interact with the YAML API by defining a catalog.yml file), but it is extensively accessed by plugins. Instances of DataCatalog are passed to several hooks, including after_catalog_created, the Node hooks, and the Pipeline hooks. DataCatalog has existed in its current form since basically forever (private link).

Note

From now on, users refers to users of the DataCatalog Python class, which we could consider "advanced users".

Internally, a DataCatalog instance is essentially a dictionary that maps dataset names to dataset objects:

kedro/kedro/io/data_catalog.py

Line 142 in 436d17e

datasets: dict[str, AbstractDataset] | None = None,

Said dataset objects, at present subclasses of AbstractDataset (unimportant implementation detail), contain the load and save logic for specific data formats and Python libraries. For example, pandas.CSVDataset knows how to load and save CSV files using the pandas library.
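To make the load/save contract concrete, here is a minimal sketch of what a dataset object looks like from the outside. ToyCSVDataset is a hypothetical stand-in written with the standard library's csv module; it is not Kedro's pandas.CSVDataset and does not subclass AbstractDataset.

```python
from __future__ import annotations

import csv
import tempfile
from pathlib import Path


class ToyCSVDataset:
    """Hypothetical stand-in for a dataset: it owns the load/save logic for one format."""

    def __init__(self, filepath: str) -> None:
        self._filepath = Path(filepath)

    def save(self, rows: list[dict[str, str]]) -> None:
        # Write the rows as CSV, using the keys of the first row as the header.
        with self._filepath.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)

    def load(self) -> list[dict[str, str]]:
        # Read the CSV back as a list of dictionaries (all values are strings).
        with self._filepath.open(newline="") as f:
            return list(csv.DictReader(f))


with tempfile.TemporaryDirectory() as tmp:
    dataset = ToyCSVDataset(str(Path(tmp) / "cars.csv"))
    dataset.save([{"model": "a", "mpg": "21"}, {"model": "b", "mpg": "33"}])
    loaded = dataset.load()
```

The caller only sees load and save; the file format and the library used to read it stay inside the dataset.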

Problem

The DataCatalog has a number of public methods and properties. Notably, DataCatalog.load and DataCatalog.save allow users to load and save data. The way this works is by delegating the actual load and save operations to the underlying dataset:

kedro/kedro/io/data_catalog.py

Line 490 in 436d17e

result = dataset.load()
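The delegation described above can be sketched roughly as follows. ToyCatalog and ToyMemoryDataset are hypothetical names used for illustration only; the real DataCatalog does considerably more (versioning, logging, error handling) than this sketch shows.

```python
from __future__ import annotations

from typing import Any


class ToyMemoryDataset:
    """Hypothetical dataset that keeps its data in memory."""

    def __init__(self, data: Any = None) -> None:
        self._data = data

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data


class ToyCatalog:
    """Hypothetical catalog: a mapping of names to datasets plus load/save."""

    def __init__(self, datasets: dict[str, ToyMemoryDataset] | None = None) -> None:
        self._datasets = dict(datasets or {})

    def load(self, name: str) -> Any:
        # Look the dataset up by name, then delegate the actual work to it.
        dataset = self._datasets[name]
        return dataset.load()

    def save(self, name: str, data: Any) -> None:
        # Same pattern for saving: the catalog never touches the data format.
        dataset = self._datasets[name]
        dataset.save(data)


catalog = ToyCatalog({"cars": ToyMemoryDataset()})
catalog.save("cars", [1, 2, 3])
result = catalog.load("cars")
```

In this sketch the caller works with dataset names and data only, and never has to hold the dataset object itself.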

To access the underlying dataset objects, on the other hand, DataCatalog offers only one undocumented public property, aptly named .datasets, that retrieves a frozen version of the loaded datasets:

kedro/kedro/io/data_catalog.py

Line 187 in 436d17e

self.datasets = _FrozenDatasets(self._datasets)

This was introduced in https://github.com/McK-Private/private-kedro/pull/84 (private link) to "Addresses issue raised / feature requested regarding the ability to view which datasets are available in the data catalog via tab completion in an ipython or jupyter session."

The full rationale behind having this frozen got lost ~~like tears in rain~~ in a Jira ticket that's no longer accessible. However, allowing the users to inject datasets during the session would probably be a recipe for trouble, and hence this was deliberately disabled:

kedro/kedro/io/data_catalog.py

Lines 121 to 128 in 436d17e

    
    # Don't allow users to add/change attributes on the fly
    def __setattr__(self, key: str, value: Any) -> None:
        msg = "Operation not allowed! "
        if key in self.__dict__:
            msg += "Please change datasets through configuration."
        else:
            msg += "Please use DataCatalog.add() instead."
        raise AttributeError(msg)
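A self-contained sketch of how such a frozen container behaves is shown below. ToyFrozenDatasets is a hypothetical class that reuses the __setattr__ guard quoted above; populating attributes through __dict__ in the constructor is an assumption made for this sketch, not a claim about how _FrozenDatasets is implemented.

```python
from __future__ import annotations

from typing import Any


class ToyFrozenDatasets:
    """Hypothetical read-only namespace exposing datasets as attributes."""

    def __init__(self, datasets: dict[str, Any]) -> None:
        # Write straight into __dict__ so the guarded __setattr__ is bypassed.
        self.__dict__.update(datasets)

    def __setattr__(self, key: str, value: Any) -> None:
        # Same guard as in the snippet quoted above: every assignment is rejected.
        msg = "Operation not allowed! "
        if key in self.__dict__:
            msg += "Please change datasets through configuration."
        else:
            msg += "Please use DataCatalog.add() instead."
        raise AttributeError(msg)


def try_set(target: ToyFrozenDatasets, name: str, value: Any) -> str:
    """Attempt an assignment and return the error message, if any."""
    try:
        setattr(target, name, value)
    except AttributeError as exc:
        return str(exc)
    return "no error"


frozen = ToyFrozenDatasets({"cars": "cars-dataset"})
existing_msg = try_set(frozen, "cars", "other")  # overwrite an existing name
new_msg = try_set(frozen, "planes", "x")  # add a new name
```

Because the names live in the instance's __dict__, they show up in dir(frozen), which is what tab completion in IPython or Jupyter relies on, while any attempt to assign is rejected.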

The problem is that, for some reason, Kedro plugins tend to use the private APIs instead of the public frozen list of datasets. For example:

Kedro-Viz uses DataCatalog._get_dataset (originally introduced in https://github.com/McK-Private/private-kedro/pull/404/ (private link) to serve our versioning) https://github.com/kedro-org/kedro-viz/blob/fdcda72f13e406ed7a2cbac5f454de00c9193cc6/package/kedro_viz/data_access/repositories/catalog.py#L103-L104, https://github.com/kedro-org/kedro-viz/blob/fdcda72f13e406ed7a2cbac5f454de00c9193cc6/package/kedro_viz/data_access/repositories/catalog.py#L128-L131
kedro-mlflow uses DataCatalog._datasets https://github.com/Galileo-Galilei/kedro-mlflow/blob/dc3b7bf98a50f1dc5048b85e42a2620bde1de466/kedro_mlflow/framework/hooks/mlflow_hook.py#L156

This was already discussed in Tech Design in October 2022 #1778 (comment)

The reason it's not straightforward to fetch datasets from the catalog directly is that the catalog was designed to hide the dataset details and implementation. It's meant for loading and saving the data, not for modifying it in any way.

There's a possibility that this is preventing Kedro Viz from fully supporting dataset factories for their experiment tracking functionality kedro-org/kedro-viz#1480.

Research questions

At a minimum, we would like to understand:

Why do users need to access datasets from the catalog directly? (verbatim from @merelcht's comment above)
Why does the current "frozen" .datasets property not serve users who resort to private APIs instead?

astrojuanlu · 2024-02-23T11:23:28Z

One more evidence point:

Vizro uses DataCatalog._get_dataset https://github.com/mckinsey/vizro/blob/7b4131ccd31c46c9012caebdbb0721a1bcb48bc0/vizro-core/src/vizro/integrations/kedro/_data_manager.py#L23-L24

astrojuanlu · 2024-02-23T16:15:40Z

And another one:

v6d https://github.com/v6d-io/v6d/blob/9a15b3760671c82a157349c41538d9be5c631ad8/python/vineyard/contrib/kedro/plugins/hook.py#L47-L54

astrojuanlu · 2024-06-06T11:03:55Z

@iamelijahko opened a new summary ticket here #3934 so we have fulfilled the original intent of understanding "What are the user needs for the catalog API". Closing this one!

merelcht added the "Stage: User Research 🔬" label (Ticket needs to undergo user research before implementation) Oct 26, 2022

merelcht added this to the Redesign Catalog and Datasets milestone Oct 26, 2022

merelcht mentioned this issue Oct 26, 2022

Re-design io.core and io.data_catalog #1778

Open

This was referenced Feb 9, 2024

Improve _FrozenDatasets class #3610

Closed

Make it easy to get the correct file path of a dataset #3611

Closed

astrojuanlu mentioned this issue Feb 21, 2024

fix: telemetry data and add example_pipeline kedro-org/kedro-plugins#557

Merged

4 tasks

merelcht assigned iamelijahko Feb 27, 2024

astrojuanlu self-assigned this Mar 4, 2024

merelcht assigned ElenaKhaustova and unassigned astrojuanlu Apr 10, 2024


astrojuanlu closed this as completed Jun 6, 2024
