Data has been accidentally overwritten in the past, after copy-pasting a catalog entry to derive a new one and forgetting to change the filepath. I feel it would be useful to protect against this kind of situation by expecting catalog entries to have unique filepaths by default, and throwing an error when this is not the case, with certain sensible opt-outs the user / developer can add.
Context
This would prevent some accidental overwriting of data by users, while leaving behaviour unchanged for cases where catalog entries are expected to share filepaths (e.g. SQLDatasets, transcoded entries).
Possible Implementation
By default, check for duplicate filepaths across the entire catalog and throw an error when any are found, with the following exceptions:
- Ignore transcoded entries (these are expected to share filepaths).
- Ignore entries flagged with `overwrite: True` (or something like this). This might be a flag to add to datasets (e.g. SQLDataset) rather than catalog entries; the dataset setting could then be overruled by catalog entry flagging, so that CSV files, say, can be allowed to overwrite where desired, and SQL tables can be prevented from overwriting where desired.
So for an example catalog.yml, there would be:
- Errors for my_first_csv_dataset and my_first_edited_csv_dataset sharing filepaths, but not for my_first_alt_edited_csv_dataset
- NO errors for my_second_csv_dataset@pandas and my_second_csv_dataset@spark
- An error for my_alt_edited_sql_dataset, but not for my_sql_dataset or my_alt_edited_sql_dataset
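To make the CSV and transcoding expectations above concrete, a catalog.yml along these lines would illustrate them (a hypothetical reconstruction: the dataset types and filepaths here are assumptions, not taken from the original example):

```yaml
# Hypothetical catalog.yml illustrating the expectations above.
my_first_csv_dataset:
  type: pandas.CSVDataset
  filepath: data/01_raw/first.csv

my_first_edited_csv_dataset:        # same filepath as above -> error
  type: pandas.CSVDataset
  filepath: data/01_raw/first.csv

my_first_alt_edited_csv_dataset:    # unique filepath -> no error
  type: pandas.CSVDataset
  filepath: data/02_intermediate/first_edited.csv

my_second_csv_dataset@pandas:       # transcoded pair sharing a filepath -> no error
  type: pandas.ParquetDataset
  filepath: data/01_raw/second.parquet

my_second_csv_dataset@spark:
  type: spark.SparkDataset
  filepath: data/01_raw/second.parquet
```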
Possible Alternatives
Add a flag for running with no duplicate filepaths expected: throw an error if they are detected, otherwise don't. This could be made the default behaviour at a later date if it sees popular use. However, this is not a versatile solution, as some pipelines may have a mixture of catalog entries they would and would not expect to be overwritten.
I'm trying to think about how this could work. As part of @ElenaKhaustova's and @iamelijahko's excellent DataCatalog research (#3934), there is now an initiative to make a consistent API for datasets to expose the file path as a public method: #3929
I think once the public API ticket is in, it would be really easy to write some sort of after_catalog_created validation hook where you collect all the filepath attributes and throw an error if you see more than one instance of any path. The only complication I can see with this pattern is ensuring we validate the rendered file path at runtime, rather than any templated / factory file paths, which are expressed differently at rest.
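A minimal sketch of what such a check could look like, assuming the proposed public `filepath` attribute from #3929 and a catalog object exposing `list()` and `_get_dataset()` (the method names and hook wiring here are assumptions, not the final API):

```python
from collections import defaultdict


def validate_unique_filepaths(catalog):
    """Raise if two logically distinct catalog entries resolve to the same
    filepath. Hypothetical sketch: assumes datasets expose a public
    `filepath` attribute (#3929); in practice this would run inside an
    `after_catalog_created` hook.
    """
    entries_by_path = defaultdict(set)
    for name in catalog.list():
        dataset = catalog._get_dataset(name)  # private API, illustration only
        filepath = getattr(dataset, "filepath", None)
        if filepath is None:  # e.g. SQL datasets with no filepath
            continue
        # Strip any transcoding suffix so name@pandas / name@spark count
        # as one logical entry and do not trigger an error.
        entries_by_path[str(filepath)].add(name.split("@")[0])
    duplicates = {p: sorted(n) for p, n in entries_by_path.items() if len(n) > 1}
    if duplicates:
        raise ValueError(f"Catalog entries share filepaths: {duplicates}")
```

Inside a hook class this would be called from an `@hook_impl`-decorated `after_catalog_created` method; validating rendered rather than templated paths would still need the factory resolution caveat mentioned above.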
Maybe these checks could be performed by a separate optional function that does catalog validation. This:
- Allows users to validate the catalog if needed.
- Does not restrict the cases where multiple datasets point to the same file to a specific subset of allowed cases. This reduces the risk that users need something we are not thinking about and the whole catalog breaks by default.
- Does not force users to add more flags, like the overwrite flag suggested above, unless they specifically decide to run the checks.

For example, a common way to update datasets in Kedro is to define an "input_dataset" and an "updated_dataset" pointing to the same file, so you can have a function that takes one and saves back to the other.
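That same-file update pattern could look like this in a catalog.yml (the dataset type and path are illustrative assumptions):

```yaml
# Hypothetical: two entries deliberately sharing one file, so a node can
# read `input_dataset`, modify the data, and save it as `updated_dataset`.
input_dataset:
  type: pandas.CSVDataset
  filepath: data/02_intermediate/records.csv

updated_dataset:
  type: pandas.CSVDataset
  filepath: data/02_intermediate/records.csv
```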