
deduplicate macro for Databricks now uses the QUALIFY clause, which fixes NULL columns issues from the default natural join logic #786

Merged
13 commits merged into main on Jul 8, 2023

Conversation

@graciegoheen (Contributor) commented on Apr 26, 2023

Related to issues described in #713 and #621

It looks like we don't have CircleCI running integration tests for Databricks - should we add that? I can alternatively add this to spark_utils. I'd also like to add integration tests to confirm this macro will work for null values, but I think we would need to update the default version of this macro and remove the natural join entirely.
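
For context, a minimal illustration of the failure mode (the table and column names are made up, and the query only mirrors the rough shape of the default deduplicate logic): in SQL, NULL = NULL never evaluates to true, so a natural join silently drops any row that has a NULL in one of the shared columns.

```sql
-- Illustrative only (made-up names); roughly the shape of the default natural-join dedup.
with source_data as (

    select 1 as user_id, 'a@example.com' as email
    union all
    select 1 as user_id, 'a@example.com' as email                -- exact duplicate to remove
    union all
    select 2 as user_id, cast(null as varchar(50)) as email      -- row that gets lost

),

row_numbered as (

    select
        _inner.*,
        row_number() over (partition by user_id order by user_id) as rn
    from source_data as _inner

)

select distinct source_data.*
from source_data
natural join row_numbered    -- implicit equality join on user_id AND email
where rn = 1
-- Only user_id = 1 survives: NULL = NULL is never true, so the user_id = 2 row
-- fails the implicit email join condition and is dropped from the result.
```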

This is a:

  • documentation update
  • bug fix with no breaking changes
  • new functionality
  • a breaking change

All pull requests from community contributors should target the main branch (default).

Description & motivation

In Databricks, the natural join was causing issues for a customer - we can use the QUALIFY clause, now supported in Databricks, to update the deduplicate macro.
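
For reference, a minimal sketch of the QUALIFY-based, adapter-prefixed implementation this describes, using the same arguments as the existing deduplicate macro (treat this as an outline rather than the exact merged diff):

```sql
{%- macro databricks__deduplicate(relation, partition_by, order_by) -%}

    select *
    from {{ relation }}
    qualify
        row_number() over (
            partition by {{ partition_by }}
            order by {{ order_by }}
        ) = 1

{%- endmacro -%}
```

Because QUALIFY filters on the window function directly, no self-join is needed, so rows containing NULLs are kept.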

Checklist

  • This code is associated with an Issue which has been triaged and accepted for development.
  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses; this helps us understand what has been covered)
    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
  • I followed guidelines to ensure that my changes will work on "non-core" adapters by:
    • dispatching any new macro(s) so non-core adapters can also use them (e.g. the star() source)
    • using the limit_zero() macro in place of the literal string: limit 0
    • using dbt.type_* macros instead of explicit datatypes (e.g. dbt.type_timestamp() instead of TIMESTAMP)
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to CHANGELOG.md

@NortySpock commented

@graciegoheen Thank you very much for creating this pull request. Dropping this macro into my project's macros folder saved me at least a half-day of "fix it" work and frantic testing after I stumbled into #713.

To the maintainers: This worked for me as a one-off solution in a Databricks-powered project at work, and while I have not exhaustively tested it, it definitely fixed the "I am missing most of the data for some strange reason" problems I was seeing post-deduplication from #713.
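
For anyone following along, a typical call site for the macro in a model looks roughly like this (the model, relation, and column names are placeholders; whether it resolves to the package version or a local copy depends on how your project's dispatch is configured):

```sql
-- models/stg_events_deduped.sql  (placeholder names)
{{ dbt_utils.deduplicate(
    relation=ref('stg_events'),
    partition_by='event_id',
    order_by='loaded_at desc'
) }}
```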

@dbeatty10 changed the title from "Updated deduplicate macro to use QUALIFY for databricks" to "deduplicate macro for Databricks now uses the QUALIFY clause, which fixes NULL columns issues from the default natural join logic" on Jul 6, 2023
@dbeatty10 (Contributor) left a comment


@graciegoheen I'm going to merge this as-is. Rationale below.


> It looks like we don't have CircleCI running integration tests for Databricks - should we add that?

I made an initial attempt at adding Databricks to CI, but it didn't work. The cause appears related to the integration tests using pre-releases of dbt-core (1.6.0bx) that are somehow incompatible with the latest available version of dbt-databricks (1.5.x).

So I'm going to defer that decision to a later date.

> I can alternatively add this to spark_utils.

Since we already have databricks__get_table_types_sql (#769), it seems reasonable to add databricks__deduplicate here also (rather than putting it in spark_utils).
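
For context, the reason an adapter-prefixed macro is enough on its own: the package-level deduplicate dispatches by adapter, so Databricks projects pick up databricks__deduplicate automatically while every other warehouse keeps the default. A rough sketch of that dispatch wrapper (argument names assumed to match the existing macro):

```sql
{% macro deduplicate(relation, partition_by, order_by) -%}
    {{ return(adapter.dispatch('deduplicate', 'dbt_utils')(relation, partition_by, order_by)) }}
{%- endmacro %}
```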

> I'd also like to add integration tests to confirm this macro will work for null values, but I think we would need to update the default version of this macro and remove the natural join entirely.

There's a simple test case in #713 that would work great. But if we add it to CI without changing the default implementation, then Redshift will start failing CI.

Since removing the natural join will take more time and thought, I'm going to defer adding new integration tests to a later date as well.

@dbeatty10 added this pull request to the merge queue on Jul 8, 2023
Merged via the queue into main with commit b140256 on Jul 8, 2023