
deduplicate macro for Databricks now uses the QUALIFY clause, which fixes NULL columns issues from the default natural join logic #786

Merged
13 commits merged into main on Jul 8, 2023

Conversation

@graciegoheen (Contributor) commented on Apr 26, 2023

Related to issues described in #713 and #621

It looks like we don't have CircleCI running integration tests for Databricks - should we add that? I can alternatively add this to spark_utils. I'd also like to add integration tests to confirm this macro will work for null values, but I think we would need to update the default version of this macro and remove the natural join entirely.
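
For context, a minimal illustration of the failure mode (the table and column names are made up, and the query only mirrors the rough shape of the default deduplicate logic): in SQL, NULL = NULL never evaluates to true, so a natural join silently drops any row that has a NULL in one of the shared columns.

```sql
-- Illustrative only (made-up names); roughly the shape of the default natural-join dedup.
with source_data as (

    select 1 as user_id, 'a@example.com' as email
    union all
    select 1 as user_id, 'a@example.com' as email                -- exact duplicate to remove
    union all
    select 2 as user_id, cast(null as varchar(50)) as email      -- row that gets lost

),

row_numbered as (

    select
        _inner.*,
        row_number() over (partition by user_id order by user_id) as rn
    from source_data as _inner

)

select distinct source_data.*
from source_data
natural join row_numbered    -- implicit equality join on user_id AND email
where rn = 1
-- Only user_id = 1 survives: NULL = NULL is never true, so the user_id = 2 row
-- fails the implicit email join condition and is dropped from the result.
```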

This is a:

  • documentation update
  • bug fix with no breaking changes
  • new functionality
  • a breaking change

All pull requests from community contributors should target the main branch (default).

Description & motivation

In Databricks, the natural join was causing issues for a customer - we can use the QUALIFY clause, now supported in Databricks, to update the deduplicate macro.
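
For reference, a minimal sketch of the QUALIFY-based, adapter-prefixed implementation this describes, using the same arguments as the existing deduplicate macro (treat this as an outline rather than the exact merged diff):

```sql
{%- macro databricks__deduplicate(relation, partition_by, order_by) -%}

    select *
    from {{ relation }}
    qualify
        row_number() over (
            partition by {{ partition_by }}
            order by {{ order_by }}
        ) = 1

{%- endmacro -%}
```

Because QUALIFY filters on the window function directly, no self-join is needed, so rows containing NULLs are kept.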

Checklist

  • This code is associated with an Issue which has been triaged and accepted for development.
  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses; this helps us understand what has been covered)
    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
  • I followed guidelines to ensure that my changes will work on "non-core" adapters by:
    • dispatching any new macro(s) so non-core adapters can also use them (e.g. the star() source)
    • using the limit_zero() macro in place of the literal string: limit 0
    • using dbt.type_* macros instead of explicit datatypes (e.g. dbt.type_timestamp() instead of TIMESTAMP)
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to CHANGELOG.md

@NortySpock commented

@graciegoheen Thank you very much for creating this pull request. Dropping this macro into my project's macros folder saved me at least a half-day of "fix it" work and frantic testing after I stumbled into #713.

To the maintainers: This worked for me as a one-off solution in a Databricks-powered project at work, and while I have not exhaustively tested it, it definitely fixed the "I am missing most of the data for some strange reason" problems I was seeing post-deduplication from #713.
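
For anyone following along, a typical call site for the macro in a model looks roughly like this (the model, relation, and column names are placeholders; whether it resolves to the package version or a local copy depends on how your project's dispatch is configured):

```sql
-- models/stg_events_deduped.sql  (placeholder names)
{{ dbt_utils.deduplicate(
    relation=ref('stg_events'),
    partition_by='event_id',
    order_by='loaded_at desc'
) }}
```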

@dbeatty10 changed the title from "Updated deduplicate macro to use QUALIFY for databricks" to "deduplicate macro for Databricks now uses the QUALIFY clause, which fixes NULL columns issues from the default natural join logic" on Jul 6, 2023
@dbeatty10 (Contributor) left a comment


@graciegoheen I'm going to merge this as-is. Rationale below.


> It looks like we don't have CircleCI running integration tests for Databricks - should we add that?

I made an initial attempt at adding Databricks to CI, but it didn't work. The cause appears related to the integration tests using pre-releases of dbt-core (1.6.0bx) that are somehow incompatible with the latest available version of dbt-databricks (1.5.x).

So I'm going to defer that decision to a later date.

> I can alternatively add this to spark_utils.

Since we already have databricks__get_table_types_sql (#769), it seems reasonable to add databricks__deduplicate here also (rather than putting it in spark_utils).
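
For context, the reason an adapter-prefixed macro is enough on its own: the package-level deduplicate dispatches by adapter, so Databricks projects pick up databricks__deduplicate automatically while every other warehouse keeps the default. A rough sketch of that dispatch wrapper (argument names assumed to match the existing macro):

```sql
{% macro deduplicate(relation, partition_by, order_by) -%}
    {{ return(adapter.dispatch('deduplicate', 'dbt_utils')(relation, partition_by, order_by)) }}
{%- endmacro %}
```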

> I'd also like to add integration tests to confirm this macro will work for null values, but I think we would need to update the default version of this macro and remove the natural join entirely.

There's a simple test case in #713 that would work great. But if we add it to CI without changing the default implementation, then Redshift will start failing CI.

Since removing the natural join will take more time and thought, I'm going to defer adding new integration tests to a later date as well.

@dbeatty10 added this pull request to the merge queue on Jul 8, 2023
Merged via the queue into main with commit b140256 on Jul 8, 2023