feat(datasets): Improved compatibility, functionality and testing for SnowflakeTableDataset #881

Open: tdhooghe wants to merge 40 commits into main

@tdhooghe tdhooghe commented Oct 10, 2024

Description

  • Enable saving a pandas DataFrame directly to a SnowflakeTable, for example to ingest a .csv file into Snowflake from within Kedro
  • Update tests to use Snowpark's local testing framework
  • Update dependencies
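
To illustrate the first point, here is a minimal sketch of how a save path could accept either a pandas or a Snowpark DataFrame. It is not the code in this PR: `SessionLike`, `is_pandas_dataframe` and `to_snowpark` are hypothetical names, and the only real Snowpark call assumed is `Session.create_dataframe`, which accepts a pandas DataFrame. The check is done on the type's module name so the sketch runs without pandas or Snowpark installed.

```python
from typing import Any, Protocol


class SessionLike(Protocol):
    """Hypothetical: the one Snowpark Session method this sketch relies on."""

    def create_dataframe(self, data: Any) -> Any: ...


def is_pandas_dataframe(data: Any) -> bool:
    """Return True for a pandas DataFrame, without importing pandas."""
    cls = type(data)
    return cls.__name__ == "DataFrame" and cls.__module__.split(".")[0] == "pandas"


def to_snowpark(data: Any, session: SessionLike) -> Any:
    """Convert pandas input through the session; pass anything else through."""
    if is_pandas_dataframe(data):
        # Assumption: the session turns a pandas frame into a Snowpark frame.
        return session.create_dataframe(data)
    return data
```

With a real `snowflake.snowpark.Session`, the returned frame could then be written with `DataFrame.write.save_as_table`; how the dataset in this PR actually does it may differ.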

Development notes

Tests run with pytest on Python versions up to 3.11 (Snowpark is unfortunately not compatible with 3.12 yet).
See https://github.com/kedro-org/kedro-plugins/actions/runs/11443609488

Bumping cloudpickle was required to allow snowflake-snowpark-python >= 1.23.
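
The Python version limit noted above could be expressed as a test gate along these lines. This is a sketch under assumptions, not the configuration used in this PR: `snowpark_supported` and the test class are hypothetical, and the standard library's `unittest.skipIf` is used only to keep the example self-contained (with pytest, `pytest.mark.skipif` plays the same role).

```python
import sys
import unittest


def snowpark_supported(version=sys.version_info[:2]) -> bool:
    """Assumption from the note above: Snowpark works on Python below 3.12."""
    return tuple(version) < (3, 12)


@unittest.skipIf(
    not snowpark_supported(),
    "Snowpark is not compatible with Python 3.12 yet",
)
class SnowflakeDatasetTests(unittest.TestCase):
    """Hypothetical placeholder for the Snowflake dataset tests."""

    def test_placeholder(self):
        self.assertTrue(snowpark_supported())
```

Keeping the version check in one helper means the gate can be lifted in a single place once Snowpark supports newer Python versions.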

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@tdhooghe tdhooghe changed the title from "(feature)Datasets: save pandas df directly to SnowflakeTable" to "feat(datasets): save pandas df directly to SnowflakeTable" on Oct 10, 2024
@tdhooghe tdhooghe force-pushed the feature/save-pd-to-snowflaketable branch 2 times, most recently from 0e281c2 to c9c1872 on October 10, 2024 19:47
@tdhooghe tdhooghe force-pushed the feature/save-pd-to-snowflaketable branch from dc88a84 to 0a14bb7 on October 21, 2024 14:39
tdhooghe and others added 23 commits October 21, 2024 17:22
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
* feat(datasets): create separate `ibis.FileDataset`

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): deprecate `TableDataset` file I/O

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* feat(datasets): implement `FileDataset` versioning

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): try `os.path.exists`, for Windows

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* revert(datasets): use pathlib, ignore Windows test

Refs: b7ff0c7

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* docs(datasets): add `ibis.FileDataset` to contents

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): add docstring for `hashable` func

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): add docstring for `hashable` func

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* feat(datasets)!: expose `load` and `save` publicly

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): remove second filepath assignment

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

---------

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Update error code in e2e test

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
…kedro-org#891)

* Update PR template with checkbox for core dataset contribution

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

* Update .github/PULL_REQUEST_TEMPLATE.md

Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>

* Fix lint

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

---------

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
* fix(datasets): default to DuckDB in in-memory mode

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* test(datasets): use `object()` sentinel as default

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* docs(datasets): add default database to RELEASE.md

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

---------

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
…ro-org#896)

* Add GH action to check for TSC votes on core dataset changes
* Ignore TSC vote action in gatekeeper
* Trigger TSC vote action only on changes in core dataset

---------

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: Thomas <thomas.dhooghe95@gmail.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
@tdhooghe tdhooghe force-pushed the feature/save-pd-to-snowflaketable branch from cee8467 to 2c804c9 on October 21, 2024 15:22
tdhooghe and others added 6 commits October 21, 2024 18:05
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
@tdhooghe tdhooghe force-pushed the feature/save-pd-to-snowflaketable branch 2 times, most recently from a40aa99 to fc2b3b8 on October 21, 2024 16:23
@tdhooghe tdhooghe marked this pull request as ready for review October 21, 2024 16:24
merelcht and others added 4 commits October 21, 2024 17:29
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
…snowflaketable' into feature/save-pd-to-snowflaketable
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: Thomas <thomas.dhooghe95@gmail.com>
tdhooghe and others added 3 commits October 21, 2024 18:37
Signed-off-by: Thomas <thomas.dhooghe95@gmail.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
@tdhooghe tdhooghe changed the title from "feat(datasets): save pandas df directly to SnowflakeTable" to "feat(datasets): Improved compatibility, functionality and testing for SnowflakeTableDataset" on Oct 21, 2024
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
4 participants