Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add documentation for dataset factories feature #2670

Merged
merged 10 commits into from
Jul 6, 2023
Merged

Conversation

ankatiyar
Copy link
Contributor

NOTE: Kedro datasets are moving from kedro.extras.datasets to a separate kedro-datasets package in
kedro-plugins repository. Any changes to the dataset implementations
should be done by opening a pull request in that repository.

Description

Resolve #2666
To be merged with/after #2635

Development notes

Documentation for the dataset factories feature to the "The Data Catalog" page.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start for the docs!

Some comments: I think we should refer to the feature as "dataset factories" instead of "dataset factory patterns". IMO the complete dataset block, e.g.:

"{namespace}.{dataset_name}#csv":
   type: pandas.CSVDataSet
   filepath: data/{namespace}/{dataset_name}.csv

is the dataset factory and the name: "{namespace}.{dataset_name}#csv" is the pattern.

Several users mentioned in the interview that they would like to see more examples, specifically how the patterns are referenced in the pipeline. The data catalog page is super long and I know @stichbury is keen to restructure it, so I can imagine dataset factories will perhaps get a separate page, but for now I think it's important to include lots of examples even if that increases this page more.

docs/source/data/data_catalog.md Outdated Show resolved Hide resolved
docs/source/data/data_catalog.md Outdated Show resolved Hide resolved
docs/source/data/data_catalog.md Outdated Show resolved Hide resolved
docs/source/data/data_catalog.md Outdated Show resolved Hide resolved
docs/source/data/data_catalog.md Outdated Show resolved Hide resolved
docs/source/data/data_catalog.md Outdated Show resolved Hide resolved
docs/source/data/data_catalog.md Outdated Show resolved Hide resolved
Copy link
Contributor

@stichbury stichbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, with the suggestions. I made a few edits to @merelcht's to add a few more minor fixes but they're all good and this is such a great feature. I think we could do a mini blog post about this with some examples, maybe? WDYT?

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
@merelcht
Copy link
Member

I think we could do a mini blog post about this with some examples, maybe? WDYT?

A blog post would be great! I think we should also do a show and tell session for our users with this feature to drive adoption. There was so much interest for the interviews so I can imagine lots of people would like to know about this.

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
@ankatiyar
Copy link
Contributor Author

@merelcht & @stichbury Thanks for the feedback! I've incorporated all the edits. I've also added more examples and structured them under separate subheadings. Would you mind taking another look and letting me know if I've missed anything? :)

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great now! 👍 Left some small edits and suggestions.

docs/source/data/data_catalog.md Outdated Show resolved Hide resolved
docs/source/data/data_catalog.md Outdated Show resolved Hide resolved
docs/source/data/data_catalog.md Outdated Show resolved Hide resolved
docs/source/data/data_catalog.md Outdated Show resolved Hide resolved
docs/source/data/data_catalog.md Outdated Show resolved Hide resolved
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
@merelcht
Copy link
Member

Actually one more addition after having gone through all the interviews again: an example of how to use namespaces. E.g. a pattern that's "{namespace}.{dataset_name}#csv" and then in the pipeline they wouldn't have to explicitly add the namespace because Kedro adds that.

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @ankatiyar! Just added one nitpick, but approved anyway

```
The datasets in this catalog can be generalised to the following dataset factory:
```yaml
"{name}_data":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is the first example, I'm wondering if it would be good to add an "info" admonition saying

"If you don't use a suffix the default dataset will be overridden, see [this section below]"

Wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the first example basic enough to get started but the second example actually shows a situation where you might need to add a suffix/prefix. I added a small explanation there as to why.

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
```
These datasets can be combined into the following dataset factory:
```yaml
"{dataset_name}#csv":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is # a convention we recommend? I know it doesn't matter what character it is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah maybe something else like @, $, %? (Sorry if this has been treated already elsewhere)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not something we recommend afaik, I've tried to use different delimiters for each example, underscore, hash symbol, period etc

Comment on lines 494 to 496
shuttles:
type: pandas.CSVDataSet
filepath: data/01_raw/shuttles.csv
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is just an example, but it reuses a lot of spaceflights tutorial, if I create a spaceflights project and do this I get error because I only have shuttles.xlsx but not shuttles.csv

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I changed it to be csv just for the example but you are right, it might cause confusion. Changed it!

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
@ankatiyar ankatiyar mentioned this pull request Jun 20, 2023
9 tasks
@ankatiyar ankatiyar enabled auto-merge (squash) July 6, 2023 15:06
@ankatiyar ankatiyar merged commit 9500d60 into main Jul 6, 2023
@ankatiyar ankatiyar deleted the docs/dataset-factories branch July 6, 2023 15:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add documentation for the dataset factories feature
5 participants