Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(datasets): Refactored the ManagedTableDataset by Separating Out Common Table Logic #827

Merged

Conversation

MinuraPunchihewa
Copy link
Contributor

@MinuraPunchihewa MinuraPunchihewa commented Sep 8, 2024

Description

This PR refactors the ManagedTableDataset by separating our the common 'table' (managed and external) logic to a separate BaseTableDataset.

Development notes

The common table logic that is used to execute operations on Databricks Unity Catalog has been separated out here, with the primary purpose to be the implementation of an ExternalTableDataset in the near future.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
…xist

Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
@MinuraPunchihewa MinuraPunchihewa force-pushed the feature/databricks_external_dataset branch from 880f599 to dc3550e Compare September 8, 2024 09:16
@MinuraPunchihewa MinuraPunchihewa force-pushed the feature/databricks_external_dataset branch from 40cf9dc to fef688f Compare October 1, 2024 15:07
@MinuraPunchihewa
Copy link
Contributor Author

@noklam As requested, I have updated this PR to only contain the changes related to the refactor. From the CI runs, it is not entirely clear to me why the lint and unit tests are failing though. Is there anything else that I need to do here?

Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks fine for me. If you have to add new argument, make sure it's a non-breaking one, i.e. add it in the end instead of changing the existing order.

Comment on lines 137 to 136
catalog=catalog,
database=database,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the order be preserved? this would be a breaking change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed a commit. Does this resolve your concern? Do the order of operations matter here since I am passing in keyword arguments?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dataset class order doesn't matter because it is already a keyword only class.

The manage table class however is technically still a public class so change of order is a breaking change. It's not gonna affect too much since it's rare but I prefer only introduce breaking change when necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noklam Very clear. Thank you.

@MinuraPunchihewa MinuraPunchihewa force-pushed the feature/databricks_external_dataset branch 2 times, most recently from 7f81222 to fef688f Compare October 1, 2024 16:18
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
@ankatiyar
Copy link
Contributor

ankatiyar commented Oct 2, 2024

Hey @MinuraPunchihewa, the refactoring looks good to me but the test coverage for kedro_datasets/databricks/_base_table_dataset.py is not complete which is causing the CI to fail. You might also need to run make plugin=kedro-datasets lint and push the linter changes.

@MinuraPunchihewa
Copy link
Contributor Author

@ankatiyar @noklam I am working on the tests now. Can you please let me know how I can run them?
I tried make plugin=kedro-datasets test and pytest tests/databricks/test_managed_table_dataset.py, but I get the following error:

ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --cov-report --cov-report term-missing --cov kedro_datasets --cov tests --no-cov-on-fail tests --cov-config pyproject.toml --numprocesses 4 --dist loadfile
  inifile: /home/minura/Documents/work/kedro/repos/kedro-plugins/kedro-datasets/pyproject.toml
  rootdir: /home/minura/Documents/work/kedro/repos/kedro-plugins/kedro-datasets

Is it possible to run only a particular test module with make?

@noklam
Copy link
Contributor

noklam commented Oct 4, 2024

Have you run make plugin=kedro-datasets install-test-requirements already? You are missing the testing dependencies (the error is complainign pytest-cov)

Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
@MinuraPunchihewa MinuraPunchihewa force-pushed the feature/databricks_external_dataset branch from 4941ab5 to 5a05a34 Compare October 4, 2024 14:34
@MinuraPunchihewa
Copy link
Contributor Author

@noklam Thank you; I was able to get it running after installing the dependencies.
@ankatiyar added some tests; what I basically did was take the existing tests for managed_table_dataset.py and moved it to _base_table_dataset.py. Then I added a simple tests for the describe() method of managed_table_dataset.py because that is the only place where it differs. Let me know what you think.
I've fixed the lint issues as well.

MinuraPunchihewa and others added 5 commits October 4, 2024 20:17
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <49385643+MinuraPunchihewa@users.noreply.github.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
@MinuraPunchihewa
Copy link
Contributor Author

Hey @noklam, @ankatiyar,
I needed to add a couple of other tests to get the coverage to 100%, but a couple of these tests require an external location to save the external tables in. I've added this in as a fixture which takes the location from an environment variable called DATABRICKS_EXTERNAL_LOCATION. Is it possible to set this up in Databricks environment configured for testing or is there a better way to go about this?

ankatiyar and others added 3 commits October 10, 2024 15:03
Signed-off-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Copy link
Contributor

@ankatiyar ankatiyar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ankatiyar
Copy link
Contributor

I dont believe we have a test databricks setup but we can look at how to deal with the tests when you're ExternalTableDataset PR is ready. I'll try to get this refactor in for now.

@lrcouto lrcouto merged commit d0d9f86 into kedro-org:main Oct 10, 2024
12 checks passed
@MinuraPunchihewa
Copy link
Contributor Author

I dont believe we have a test databricks setup but we can look at how to deal with the tests when you're ExternalTableDataset PR is ready. I'll try to get this refactor in for now.

Thank you, @ankatiyar. Super stoked to have this merged.
Cool. May I know how the unit tests are running though? The unit tests that existed for ManagedTableDataset seemed to require a Spark environment to run from what I could gather.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants