
[DataCatalog]: Spike - Catalog serialization and deserialization support #3932

Open
ElenaKhaustova opened this issue Jun 5, 2024 · 2 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@ElenaKhaustova
Contributor

ElenaKhaustova commented Jun 5, 2024

Description

  1. Users point out the lack of persistence in the add workflow, as there is no built-in functionality to save modified catalogs.
  2. Users express the need for an API to save and load catalogs after compilation or modification, by converting catalogs to YAML format and back.
  3. Users encounter difficulties loading pickled DataCatalog objects when the Kedro version has changed since pickling, leading to compatibility issues. They need a way to serialize and deserialize the DataCatalog object without depending on the Kedro version.

We propose exploring the feasibility of implementing to_yaml() and from_yaml() methods on the DataCatalog object to support serialization and deserialization without depending on the Kedro version.
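
Half of this round trip already exists: DataCatalog.from_config builds a catalog from a plain configuration dictionary, so loading YAML configuration is straightforward today. The missing half is producing that configuration back from a (possibly modified) catalog. A minimal sketch of the existing direction, assuming kedro and kedro-datasets are installed; the "reviews" entry is a made-up example:

# loading from plain YAML config already works via the existing DataCatalog.from_config;
# the proposed to_yaml() would be the inverse of this step
import yaml
from kedro.io import DataCatalog

config_yaml = """
reviews:
  type: pandas.CSVDataset
  filepath: data/01_raw/reviews.csv
"""

catalog = DataCatalog.from_config(yaml.safe_load(config_yaml))
# to_yaml() would need to write this configuration back out, including any
# datasets added to the catalog at runtime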

Context

User feedback:

  • The add workflow is missing persistence, so you cannot save a modified catalog: "You have a catalog and then you start adding extra stuff to it; currently we just throw away those added things when they close a notebook."
  • A catalog-to-YAML function is needed to save a modified catalog: "People have always asked for it. Could I have a catalog to YAML function so that you could actually spit out the YAML files that are needed to do this again later on?"
  • Competitors provide functionality to compile the catalog and show the result: "I would point to the dbt compile workflow. If you do dbt run it does dbt compile first and then runs the compiled outputs. Whereas in Kedro, you have your very concise, complicated YAML, and all that compilation happens at run time and there's no way for the user to see it."
  • When pickling the DataCatalog object, users experience difficulties loading it back if the Kedro version is different: "Serialization is an issue because I often pickle a catalog (mostly as part of an mlflow model). Pickling the catalog is really something that leads to a lot of problems because if I don't have the exact same Kedro version when I want to load the catalog, if the object has any change inside - a private method or attribute - it will lead to an error."

https://github.com/Galileo-Galilei/kedro-mlflow/blob/64b8e94e1dafa02d979e7753dab9b9dfd4d7341c/kedro_mlflow/mlflow/kedro_pipeline_model.py#L143

# pseudo code
import pickle

pickled = pickle.dumps(catalog)
catalog = pickle.loads(pickled)  # this will fail if I reload with a newer Kedro version and any attribute (even a private one) has changed; this breaks much more often than you would expect

"It would be much more robust to be able to do this":

# pseudo code
catalog.serialize("path/catalog.yml")  # name TBD: serialize? to_config? to_yaml? to_json? to_dict?
catalog = DataCatalog.deserialize("path/catalog.yml")  # much more robust since it is not stored as a Python object -> maybe catalog.from_config?
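
One possible shape for this is a thin wrapper rather than a change to DataCatalog itself: keep the raw configuration alongside the catalog so it can be written back out as YAML and rebuilt on any Kedro version through the existing from_config entry point. A rough sketch only; the class, method names and details are placeholders, not existing Kedro API:

# sketch: round-trip the catalog as YAML configuration instead of pickling the object
import yaml
from kedro.io import DataCatalog

class SerializableCatalog:
    def __init__(self, config: dict, credentials: dict | None = None):
        self._config = config
        self._credentials = credentials
        self.catalog = DataCatalog.from_config(config, credentials=credentials)

    def to_yaml(self, path: str) -> None:
        # persist the configuration, not the Python object
        with open(path, "w") as f:
            yaml.safe_dump(self._config, f)

    @classmethod
    def from_yaml(cls, path: str, credentials: dict | None = None) -> "SerializableCatalog":
        # rebuild from configuration, independent of the Kedro version that wrote it
        with open(path) as f:
            return cls(yaml.safe_load(f), credentials=credentials)

The obvious limitation is that datasets added programmatically after construction are not captured unless the wrapper also records them, which is exactly the gap a built-in to_yaml()/from_yaml() would need to close.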

Extra context: #3995 (comment)

@ElenaKhaustova ElenaKhaustova added the Issue: Feature Request New feature or improvement to existing feature label Jun 5, 2024
@astrojuanlu
Member

Very similar to the DataCatalog.from_file proposal discussed in #2967

@datajoely
Contributor

I like to_yaml() and from_yaml() personally.

  • It would be nice if we preserved comments and the way the user organised their files before. I appreciate this increases complexity - but it does match the mental model of how the user thinks about their project.
  • I'm working with Pydantic a lot at the moment; I wonder if it makes sense to use it, or at least take some inspiration from it (rough sketch below).
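
For illustration, a Pydantic-style round trip for a single catalog entry could look like the sketch below. The DatasetConfig model and its fields are invented for this example; model_dump and model_validate are standard Pydantic v2 API.

# illustration only: validating and round-tripping one catalog entry with Pydantic
import yaml
from pydantic import BaseModel

class DatasetConfig(BaseModel):
    type: str
    filepath: str

entry = DatasetConfig(type="pandas.CSVDataset", filepath="data/01_raw/reviews.csv")

# serialize: model -> plain dict -> YAML text
text = yaml.safe_dump({"reviews": entry.model_dump()})

# deserialize with validation: YAML text -> dict -> model
restored = DatasetConfig.model_validate(yaml.safe_load(text)["reviews"])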

@merelcht merelcht changed the title [DataCatalog]: Catalog serialization and deserialization support [DataCatalog]: Spike - Catalog serialization and deserialization support Oct 21, 2024