
dbt_utils.deduplicate macro returns zero rows when deduplicating a CTE with NULL column (Redshift) #713

Closed
yauhen-sobaleu opened this issue Oct 25, 2022 · 12 comments · Fixed by #811
Labels: bug, good first issue

Comments

@yauhen-sobaleu
Contributor

Describe the bug

dbt_utils.deduplicate returns zero rows when it deduplicates a CTE with a column that consists entirely of NULL values.

Steps to reproduce

Given model code as:

with data as (

    select 
        '2021-04-03 23:00:26'::timestamp as ts,
        1 AS id,
        NULL AS null_field
    UNION ALL
    select 
        '2021-04-03 23:00:26'::timestamp as ts,
        1 AS id,
        NULL AS null_field
),

data_deduped AS (
    {{ dbt_utils.deduplicate(
        relation='data',
        partition_by='id',
        order_by='ts desc',
    ) }}
)

select * from data_deduped

it is compiled by dbt on the Redshift adapter into the following SQL:

with data as (

    select 
        '2021-04-03 23:00:26'::timestamp as ts,
        1 AS id,
        NULL AS null_field
    UNION ALL
    select 
        '2021-04-03 23:00:26'::timestamp as ts,
        1 AS id,
        NULL AS null_field
),

data_deduped AS (
    with row_numbered as (
        select
            _inner.*,
            row_number() over (
                partition by id
                order by ts desc
            ) as rn
        from data as _inner
    )

    select
        distinct data.*
    from data as data
    
    natural join row_numbered
    where row_numbered.rn = 1
)

select * from data_deduped

Expected results

         ts          | id | null_field 
---------------------+----+------------
 2021-04-03 23:00:26 |  1 |          
(1 row)

Actual results

 ts | id | null_field 
----+----+------------
(0 rows)

Screenshots and log output

System information

The contents of your packages.yml file:

packages:
  - package: mjirv/dbt_datamocktool
    version: 0.2.0

Which database are you using dbt with?

  - [ ] postgres
  - [x] redshift
  - [ ] bigquery
  - [ ] snowflake
  - [ ] other (specify: ____________)

The output of dbt --version:

Core:
  - installed: 1.2.1
  - latest:    1.3.0 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - redshift: 1.2.1 - Update available!
  - postgres: 1.2.1 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation
yauhen-sobaleu added the bug and triage labels on Oct 25, 2022
@joellabes
Contributor

@yauhen-sobaleu thanks for opening this!

If there are two null rows and a non-null row, does it still lose the null rows? I assume so but your repro steps don't say one way or the other.

joellabes added the pending label and removed the triage label on Oct 26, 2022
@giftofjehovah

I have the same issue. I think the natural join is the culprit here: it joins on all of your selected columns, and if any of those columns is NULL in a row, that row will not be returned.
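
For anyone who wants to see this in isolation, here is a minimal sketch (runnable on Postgres or Redshift). NULL = NULL evaluates to NULL rather than TRUE, so the natural join silently drops the row:

with a as (select 1 as id, null as null_field),
     b as (select 1 as id, null as null_field)

-- joins implicitly on the common columns (id, null_field);
-- null = null is not TRUE, so nothing matches
select * from a natural join b;
-- (0 rows)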

@yauhen-sobaleu
Contributor Author

Yes, the problem is the natural join, which is not null-safe. What I had to do was add another macro parameter that explicitly accepts the list of columns to select.

Using the deduplicate macro from dbt-utils is quite dangerous right now. You never know when NULLs will come out and corrupt your data.

{%- macro deduplicate(relation, partition_by, order_by, columns) -%}

    {# Select only the columns passed in, so NULL-valued columns never
       take part in any join; keep the first row of each partition. #}
    WITH row_numbered AS (
        SELECT
            {% for column in columns %}
            {{ column }},
            {% endfor %}
            ROW_NUMBER() OVER (
                PARTITION BY {{ partition_by }}
                ORDER BY {{ order_by }}
            ) AS rn
        FROM {{ relation }} AS _inner
    )

    SELECT * FROM row_numbered WHERE rn = 1
{%- endmacro -%}
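
For the repro case above, it would be invoked with the column list passed explicitly (hypothetical usage of the amended signature):

data_deduped AS (
    {{ deduplicate(
        relation='data',
        partition_by='id',
        order_by='ts desc',
        columns=['ts', 'id', 'null_field'],
    ) }}
)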

@joellabes
Contributor

OK thanks for clarifying! I'd welcome a PR that resolved this (and an addition to the integration tests to ensure it stays fixed)

@PaulGVernon

For the record, another user hit this issue. See the Slack thread here https://getdbt.slack.com/archives/CBSQTAPLG/p1679007195056989
and also my summary here https://getdbt.slack.com/archives/CU4MRJ7QB/p1679053101978089

TL;DR: The switch in #548 to a select distinct natural join not only introduced this NULL bug, but also brought very poor performance compared to a saner approach, plus a latent bug where data types that are not comparable (such as json in Postgres) will cause the natural join to fail.
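
To make the data-type point concrete, here is a minimal sketch against Postgres, whose json type has no equality operator:

-- a natural join compares every common column for equality,
-- which is simply not defined for json:
select '{}'::json = '{}'::json;
-- ERROR:  operator does not exist: json = json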

My humble suggestion is to revert to the old method of using dbt_utils.star

@allenwangs

I also hit the same issue and found this ticket. I am using DuckDB, and the problem likewise comes from the natural join on columns containing NULL values.

@dbeatty10
Contributor

Agreed with @PaulGVernon here that the default implementation of deduplicate that uses a natural join is broken.

Furthermore, I don't think the natural join approach can ever work the way we would want. I believe that deduplicate would need to be implemented (and validated) per database in order to work as expected.

My humble suggestion is to revert to the old method of using dbt_utils.star

This idea is compelling.

A couple alternative ideas:

  • Force a database-specific implementation in dbt-utils (by raising a NotImplementedException) rather than attempting a cross-database one.
  • Move deduplicate to dbt-core so dbt adapters are encouraged to implement it.

@joellabes
Contributor

OK I've just spent an afternoon staring at the old and new versions of this macro, and no simple solution has fallen from the sky.

I would love to be able to revert to using dbt_utils.star. However, that requires a real object to be persisted in the database so that its columns can be found. That means no CTEs, in particular no ephemeral models. And that means we're breaking the backwards-compatibility commitments of utils v1.0, because this version of the macro accepts anything.

A couple alternative ideas:

  • Force a database-specific implementation in dbt-utils (by raising a NotImplementedException) rather than attempting a cross-database one.
  • Move deduplicate to dbt-core so dbt adapters are encouraged to implement it.

Unless I'm missing something, these would also require star (or in the dbt-core timeline, probably adapter.get_columns_in_relation directly) to work correctly. I guess that this would be a roll-forward situation, where the old buggy behaviour remains and people can move to a correctly-implemented alternative. Even so, I don't think this is a fit for dbt-core - it doesn't feel like it's inherently part of SQL. More like width_bucket or something where Snowflake has a one-line solution but is the outlier.

So a new macro then?
Probably! I think we need to chalk up another win for the humble null and abandon this one with a warning message inside of it. Then create a new macro like remove_duplicates() with some to-be-determined signature. It would be nice for it to continue to support arbitrary CTEs (maybe you can pass in a set of columns manually?) but it must handle nulls correctly, so the default implementation goes back to the row_number strategy.

It strikes me that dbt_utils.pivot doesn't try to be clever; it asks for the column values to be passed in. Maybe we should take a leaf from its book: dbt_utils.remove_duplicates(from, return_columns, partition_by, order_by)

For a model:

{{ dbt_utils.remove_duplicates(
  from=ref('some_model'), 
  return_columns=dbt_utils.get_filtered_columns_in_relation(ref('some_model')), 
  partition_by='id', 
  order_by='order_date desc'
)}}

For a CTE:

{{ dbt_utils.remove_duplicates(
  from='my_cte', 
  return_columns=['id', 'order_date', 'status'], 
  partition_by='id', 
  order_by='order_date desc'
)}}

What do y'all think?

@dbeatty10
Contributor

chalk up another win for the humble null

😂

Reflections on the following paths forward:

  1. Move deduplicate to dbt-core
  2. New macro

Move deduplicate to dbt-core

dbt-utils is caught in this funny place where it only supports a small handful of (popular) databases.
For anything inherently part of SQL, we don't need cross-database implementations, 'cause it should "just work".

dbt-utils really works best when it has only a single implementation of a macro and no adapter-specific overrides. It can get caught in difficult places (like this one), when nearly every SQL dialect needs its own implementation. That's where dbt-core + adapters can shine.

Whenever adapter-specific variants are added to a dbt package, it's the perfect moment to stop and consider: there's probably some abstraction that should be in Core!

Adding deduplicate to dbt-core is attractive because many databases can implement the desired behavior. Snowflake, Databricks, and DuckDB all support qualify, and both Postgres and BigQuery have dialect-specific ways to accomplish it.
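
On a dialect with qualify, the whole thing collapses into a single statement, along these lines (a sketch; my_table, id, and ts are stand-in names):

select *
from my_table
-- keep only the first row of each partition, no join required
qualify row_number() over (
    partition by id
    order by ts desc
) = 1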

Since there's not a ubiquitous solution that would work across all dialects, the default implementation in dbt-core would probably be to raise a NotImplementedException.

New macro

To support arbitrary CTEs, there are two main options for databases that don't support something akin to qualify:

  • accept that row_number() over ... will add a single extra column to your result relation (rn by default, but configurable)
  • pass in a set of columns manually

Each has its downsides (the former is sketched below), but the latter seems especially unpleasant.
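
For concreteness, the extra-column variant would look something like this (a sketch; my_cte, id, and ts are stand-in names):

select *
from (
    select
        _inner.*,
        row_number() over (
            partition by id
            order by ts desc
        ) as rn
    from my_cte as _inner
) as numbered
where rn = 1
-- rn is left behind as an extra column in the result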

Overall, the experience for end users of the long tail of dbt adapters is likely to be much better if we move deduplicate to dbt-core rather than wrestle around within dbt-utils.

@dbeatty10
Contributor

Some more info...

Support for qualify

Today I Learned (TIL) that BigQuery supports qualify!

So here's a running list (bound to be non-exhaustive):

  • Snowflake
  • Databricks
  • DuckDB
  • BigQuery
  • the "Can I use..." series by @fatalmind

Included the last one just 'cause I really appreciate @fatalmind's database writings, like the "Can I use..." series that tipped me off to BigQuery's support on this.

Only in dreams

If all data platforms supported both the row_number() window function and select * exclude, then I think we could construct a one-size-fits-all deduplicate. But in a world where qualify is ubiquitous, exclude isn't necessary and the solution is more direct.
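
That dream version would look roughly like this (a sketch, assuming a dialect that supports select * exclude, as Snowflake and DuckDB do today):

with row_numbered as (
    select
        _inner.*,
        row_number() over (
            partition by id
            order by ts desc
        ) as rn
    from my_table as _inner
)

-- drop the helper column on the way out
select * exclude (rn)
from row_numbered
where rn = 1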

@yauhen-sobaleu
Contributor Author

Qualify is supported in Redshift now.
I made a PR: #811
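
For reference, a qualify-based Redshift override could look roughly like this (a sketch of the approach, not necessarily the exact code in the PR):

{%- macro redshift__deduplicate(relation, partition_by, order_by) -%}

    select *
    from {{ relation }}
    -- null-safe: no join against the deduplicated rows is needed
    qualify row_number() over (
        partition by {{ partition_by }}
        order by {{ order_by }}
    ) = 1

{%- endmacro -%}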

@yauhen-sobaleu
Contributor Author

@dbeatty10 can you help review the PR? thank you!
