What are the user needs for the catalog API? #1978

Closed
merelcht opened this issue Oct 26, 2022 · 6 comments
Labels
Stage: User Research 🔬 Ticket needs to undergo user research before implementation

Comments

@merelcht
Member

merelcht commented Oct 26, 2022

Description

The catalog API is an old component of Kedro and could do with refactoring. To redesign the API, we need to find out what the user needs for it are.

Some of the questions that need to be answered:

  • Why do users need to access datasets from the catalog directly?
  • Why do users need to get the dataset filepath?

Context

#1778

Target users

  • Advanced users: those who interact with the API directly
  • Plugin developers + kedro-viz team
@merelcht added the "Stage: User Research 🔬 Ticket needs to undergo user research before implementation" label Oct 26, 2022
@astrojuanlu
Member

To be expanded: kedro-org/kedro-plugins#557 (comment)

@astrojuanlu
Member

astrojuanlu commented Feb 22, 2024

Introduction

The DataCatalog Python class centralises access to data in Kedro projects. Regular users rarely see it, since they mostly interact with the YAML API by defining a catalog.yml file, but plugins access it extensively. Instances of DataCatalog are passed to several hooks, including after_catalog_created, the Node hooks, and the Pipeline hooks. DataCatalog has existed in its current form since basically forever (private link).

Note

From now on, users refers to users of the DataCatalog Python class, which we could consider "advanced users".

Internally, a DataCatalog instance is essentially a dictionary that maps dataset names to dataset objects:

datasets: dict[str, AbstractDataset] | None = None,

Said dataset objects, at present subclasses of AbstractDataset (unimportant implementation detail), contain the load and save logic for specific data formats and Python libraries. For example, pandas.CSVDataset knows how to load and save CSV files using the pandas library.
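To make the "dictionary of names to dataset objects" idea concrete, here is a minimal sketch. ToyDataset and ToyCSVDataset are hypothetical stand-ins written for this comment, not Kedro's AbstractDataset or pandas.CSVDataset, and they keep the CSV text in memory instead of touching the filesystem.

```python
from __future__ import annotations

import csv
import io
from abc import ABC, abstractmethod
from typing import Any


class ToyDataset(ABC):
    """Hypothetical stand-in for a dataset base class (not Kedro's)."""

    @abstractmethod
    def load(self) -> Any: ...

    @abstractmethod
    def save(self, data: Any) -> None: ...


class ToyCSVDataset(ToyDataset):
    """Keeps rows as CSV text in memory; the format logic lives here."""

    def __init__(self) -> None:
        self._buffer = ""

    def load(self) -> list[dict[str, str]]:
        return list(csv.DictReader(io.StringIO(self._buffer)))

    def save(self, data: list[dict[str, str]]) -> None:
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=list(data[0]))
        writer.writeheader()
        writer.writerows(data)
        self._buffer = out.getvalue()


# The core state of a catalog: dataset names mapped to dataset objects.
datasets: dict[str, ToyDataset] = {"cars": ToyCSVDataset()}
datasets["cars"].save([{"model": "A", "mpg": "30"}])
rows = datasets["cars"].load()
```

Everything format-specific sits in the dataset class, so the mapping itself needs no knowledge of CSV.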

Problem

The DataCatalog has a number of public methods and properties. Notably, DataCatalog.load and DataCatalog.save allow users to load and save data. The way this works is by delegating the actual load and save operations to the underlying dataset:

result = dataset.load()
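As a rough sketch of that delegation (MiniCatalog and InMemoryToyDataset are names invented here, and the real DataCatalog does considerably more), the catalog only resolves a name and hands the call over to the dataset:

```python
from __future__ import annotations

from typing import Any


class InMemoryToyDataset:
    """Hypothetical dataset that keeps its data in memory."""

    def __init__(self, data: Any = None) -> None:
        self._data = data

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data


class MiniCatalog:
    """Toy catalog: looks a dataset up by name and delegates to it."""

    def __init__(self, datasets: dict[str, InMemoryToyDataset] | None = None) -> None:
        self._datasets = dict(datasets or {})

    def _get_dataset(self, name: str) -> InMemoryToyDataset:
        try:
            return self._datasets[name]
        except KeyError:
            raise KeyError(f"Dataset '{name}' not found in the catalog") from None

    def load(self, name: str) -> Any:
        dataset = self._get_dataset(name)
        return dataset.load()  # the catalog itself holds no I/O logic

    def save(self, name: str, data: Any) -> None:
        self._get_dataset(name).save(data)


catalog = MiniCatalog({"scores": InMemoryToyDataset()})
catalog.save("scores", [1, 2, 3])
result = catalog.load("scores")
```

Note that callers of load and save never receive the dataset object itself, only the data.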

To access the underlying dataset objects, on the other hand, DataCatalog offers only one undocumented public property, aptly named .datasets, that retrieves a frozen version of the loaded datasets:

self.datasets = _FrozenDatasets(self._datasets)

This was introduced in https://github.com/McK-Private/private-kedro/pull/84 (private link), which in its own words "Addresses issue raised / feature requested regarding the ability to view which datasets are available in the data catalog via tab completion in an ipython or jupyter session."

The full rationale behind having this frozen got lost like tears in rain in a Jira ticket that's no longer accessible. However, allowing the users to inject datasets during the session would probably be a recipe for trouble, and hence this was deliberately disabled:

# Don't allow users to add/change attributes on the fly
def __setattr__(self, key: str, value: Any) -> None:
    msg = "Operation not allowed! "
    if key in self.__dict__:
        msg += "Please change datasets through configuration."
    else:
        msg += "Please use DataCatalog.add() instead."
    raise AttributeError(msg)
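For readers who want to see the freezing idea in isolation, here is a self-contained sketch. FrozenNamespace is a made-up class for illustration, not Kedro's _FrozenDatasets, and it assumes dataset names are valid Python identifiers.

```python
from __future__ import annotations

from typing import Any


class FrozenNamespace:
    """Toy read-only view; name and behaviour are illustrative only."""

    def __init__(self, datasets: dict[str, Any]) -> None:
        # Write to __dict__ directly to bypass the blocking __setattr__ below.
        self.__dict__.update(datasets)

    def __setattr__(self, key: str, value: Any) -> None:
        raise AttributeError(f"Operation not allowed: cannot set '{key}'")


view = FrozenNamespace({"cars": object(), "boats": object()})

# Attribute access works, and dir() is what tab completion relies on.
available = sorted(name for name in dir(view) if not name.startswith("_"))

try:
    view.planes = object()
    assignment_blocked = False
except AttributeError:
    assignment_blocked = True
```

The view stays browsable in an interactive session while rejecting both new attributes and reassignment of existing ones.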

The problem is that, for some reason, Kedro plugins tend to use the private APIs instead of the public frozen list of datasets. For example:

This was already discussed in Tech Design in October 2022: #1778 (comment)

Fetching datasets from the catalog directly is not straightforward because the catalog was designed to hide the dataset details and implementation. It is meant for loading and saving the data, not for modifying it in any way.
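A hedged illustration of why this design pushes callers towards private attributes. FacadeCatalog, FileBackedToyDataset, and the _filepath attribute are all hypothetical names for this sketch, not Kedro's actual classes or attributes.

```python
from __future__ import annotations

from typing import Any


class FileBackedToyDataset:
    """Hypothetical dataset; pretends to read from a file."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath  # private by convention

    def load(self) -> Any:
        return f"contents of {self._filepath}"


class FacadeCatalog:
    """Toy catalog that exposes data but never the dataset objects."""

    def __init__(self, datasets: dict[str, FileBackedToyDataset]) -> None:
        self._datasets = datasets

    def load(self, name: str) -> Any:
        return self._datasets[name].load()


catalog = FacadeCatalog({"cars": FileBackedToyDataset("data/cars.csv")})

# Supported path: the caller gets data, never the dataset object.
data = catalog.load("cars")

# What a hypothetical plugin that needs the location ends up doing:
# it crosses two private attributes, either of which may change without notice.
path = catalog._datasets["cars"]._filepath
```

The second access works, but it depends on internals that the facade was designed to hide.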

There's a possibility that this is preventing Kedro Viz from fully supporting dataset factories for their experiment tracking functionality (kedro-org/kedro-viz#1480).

Research questions

At a minimum, we would like to understand:

  • Why do users need to access datasets from the catalog directly? (verbatim from @merelcht's comment above)
  • Why does the current "frozen" .datasets property not serve users who resort to private APIs instead?

@astrojuanlu
Member

@iamelijahko

This comment was marked as outdated.

@astrojuanlu
Member

@iamelijahko opened a new summary ticket here #3934 so we have fulfilled the original intent of understanding "What are the user needs for the catalog API". Closing this one!
