
dbt_utils.deduplicate macro returns zero rows when deduplicating a CTE with NULL column (Redshift) #713

Closed
yauhen-sobaleu opened this issue Oct 25, 2022 · 12 comments · Fixed by #811
Labels: bug, good first issue

Comments

@yauhen-sobaleu
Contributor

Describe the bug

dbt_utils.deduplicate returns zero rows when it deduplicates a CTE with a column that consists entirely of NULL values.

Steps to reproduce

Given model code as:

with data as (

    select 
        '2021-04-03 23:00:26'::timestamp as ts,
        1 AS id,
        NULL AS null_field
    UNION ALL
    select 
        '2021-04-03 23:00:26'::timestamp as ts,
        1 AS id,
        NULL AS null_field
),

data_deduped AS (
    {{ dbt_utils.deduplicate(
        relation='data',
        partition_by='id',
        order_by='ts desc',
    ) }}
)

select * from data_deduped

it is compiled by dbt on the Redshift adapter into the following SQL:

with data as (

    select 
        '2021-04-03 23:00:26'::timestamp as ts,
        1 AS id,
        NULL AS null_field
    UNION ALL
    select 
        '2021-04-03 23:00:26'::timestamp as ts,
        1 AS id,
        NULL AS null_field
),

data_deduped AS (
    with row_numbered as (
        select
            _inner.*,
            row_number() over (
                partition by id
                order by ts desc
            ) as rn
        from data as _inner
    )

    select
        distinct data.*
    from data as data
    
    natural join row_numbered
    where row_numbered.rn = 1
)

select * from data_deduped

Expected results

         ts          | id | null_field 
---------------------+----+------------
 2021-04-03 23:00:26 |  1 |          
(1 row)

Actual results

 ts | id | null_field 
----+----+------------
(0 rows)

Screenshots and log output

System information

The contents of your packages.yml file:

packages:
  - package: mjirv/dbt_datamocktool
    version: 0.2.0

Which database are you using dbt with?

  - [ ] postgres
  - [x] redshift
  - [ ] bigquery
  - [ ] snowflake
  - [ ] other (specify: ____________)

The output of dbt --version:

Core:
  - installed: 1.2.1
  - latest:    1.3.0 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - redshift: 1.2.1 - Update available!
  - postgres: 1.2.1 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation
yauhen-sobaleu added the bug and triage labels on Oct 25, 2022
@joellabes
Contributor

@yauhen-sobaleu thanks for opening this!

If there are two null rows and a non-null row, does it still lose the null rows? I assume so but your repro steps don't say one way or the other.

joellabes added the pending label and removed the triage label on Oct 26, 2022
@giftofjehovah

I have the same issue. I think the natural join is the culprit here: it joins on all of your selected columns, and if any of those columns is NULL in a row, that row will not be returned.
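
For anyone who wants to see this in isolation, here is a minimal sketch (runnable on Postgres or Redshift). NULL = NULL evaluates to NULL rather than TRUE, so the natural join silently drops the row:

with a as (select 1 as id, null as null_field),
     b as (select 1 as id, null as null_field)

-- joins implicitly on the common columns (id, null_field);
-- null = null is not TRUE, so nothing matches
select * from a natural join b;
-- (0 rows)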

@yauhen-sobaleu
Contributor Author

Yes, the problem is the natural join, which is not null-safe. What I had to do was add another macro parameter that explicitly accepts the list of columns to select.

Using the deduplicate macro from dbt-utils is quite dangerous right now. You never know when NULLs will come out and corrupt your data.

{%- macro deduplicate(relation, partition_by, order_by, columns) -%}

    {# Select only the columns passed in, so NULL-valued columns never
       take part in any join; keep the first row of each partition. #}
    WITH row_numbered AS (
        SELECT
            {% for column in columns %}
            {{ column }},
            {% endfor %}
            ROW_NUMBER() OVER (
                PARTITION BY {{ partition_by }}
                ORDER BY {{ order_by }}
            ) AS rn
        FROM {{ relation }} AS _inner
    )

    SELECT * FROM row_numbered WHERE rn = 1
{%- endmacro -%}
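
For the repro case above, it would be invoked with the column list passed explicitly (hypothetical usage of the amended signature):

data_deduped AS (
    {{ deduplicate(
        relation='data',
        partition_by='id',
        order_by='ts desc',
        columns=['ts', 'id', 'null_field'],
    ) }}
)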

@joellabes
Contributor

OK thanks for clarifying! I'd welcome a PR that resolved this (and an addition to the integration tests to ensure it stays fixed)

@PaulGVernon

For the record, another user hit this issue. See the Slack thread here https://getdbt.slack.com/archives/CBSQTAPLG/p1679007195056989
and also my summary here https://getdbt.slack.com/archives/CU4MRJ7QB/p1679053101978089

TL;DR: The switch in #548 to a select distinct natural join not only introduced this NULL bug, but also brought very poor performance compared to a saner approach, plus a latent bug where data types that are not comparable (such as json in Postgres) will cause the natural join to fail.
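
To make the data-type point concrete, here is a minimal sketch against Postgres, whose json type has no equality operator:

-- a natural join compares every common column for equality,
-- which is simply not defined for json:
select '{}'::json = '{}'::json;
-- ERROR:  operator does not exist: json = json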

My humble suggestion is to revert to the old method of using dbt_utils.star

@allenwangs

I also hit the same issue and found this ticket. I am using DuckDB, and the problem likewise comes from the natural join on columns containing NULL values.

@dbeatty10
Contributor

Agreed with @PaulGVernon here that the default implementation of deduplicate that uses a natural join is broken.

Furthermore, I don't think the natural join approach can ever work the way we would want. I believe that deduplicate would need to be implemented (and validated) per database in order to work as expected.

My humble suggestion is to revert to the old method of using dbt_utils.star

This idea is compelling.

A couple alternative ideas:

  • Force a database-specific implementation in dbt-utils (by raising a NotImplementedException) rather than attempting a cross-database one.
  • Move deduplicate to dbt-core so dbt adapters are encouraged to implement it.

@joellabes
Contributor

OK I've just spent an afternoon staring at the old and new versions of this macro, and no simple solution has fallen from the sky.

I would love to be able to revert to using dbt_utils.star. However, that requires a real object to be persisted in the database so that its columns can be found. That means no CTEs, in particular no ephemeral models. And that means we're breaking the backwards-compatibility commitments of utils v1.0, because this version of the macro accepts anything.

A couple alternative ideas:

  • Force a database-specific implementation in dbt-utils (by raising a NotImplementedException) rather than attempting a cross-database one.
  • Move deduplicate to dbt-core so dbt adapters are encouraged to implement it.

Unless I'm missing something, these would also require star (or in the dbt-core timeline, probably adapter.get_columns_in_relation directly) to work correctly. I guess that this would be a roll-forward situation, where the old buggy behaviour remains and people can move to a correctly-implemented alternative. Even so, I don't think this is a fit for dbt-core - it doesn't feel like it's inherently part of SQL. More like width_bucket or something where Snowflake has a one-line solution but is the outlier.

So a new macro then?
Probably! I think we need to chalk up another win for the humble null and abandon this one with a warning message inside of it. Then create a new macro like remove_duplicates() with some to-be-determined signature. It would be nice for it to continue to support arbitrary CTEs (maybe you can pass in a set of columns manually?) but it must handle nulls correctly, so the default implementation goes back to the row_number strategy.

It strikes me that dbt_utils.pivot doesn't try to be clever; it asks for the column values to be passed in. Maybe we should take a leaf from its book: dbt_utils.remove_duplicates(from, return_columns, partition_by, order_by)

For a model:

{{ dbt_utils.remove_duplicates(
  from=ref('some_model'), 
  return_columns=dbt_utils.get_filtered_columns_in_relation(ref('some_model')), 
  partition_by='id', 
  order_by='order_date desc'
)}}

For a CTE:

{{ dbt_utils.remove_duplicates(
  from='my_cte', 
  return_columns=['id', 'order_date', 'status'], 
  partition_by='id', 
  order_by='order_date desc'
)}}

What do y'all think?

@dbeatty10
Contributor

chalk up another win for the humble null

😂

Reflections on the following paths forward:

  1. Move deduplicate to dbt-core
  2. New macro

Move deduplicate to dbt-core

dbt-utils is caught in this funny place where it only supports a small handful of (popular) databases.
For anything inherently part of SQL, we don't need cross-database implementations, 'cause it should "just work".

dbt-utils really works best when it has only a single implementation of a macro and no adapter-specific overrides. It can get caught in difficult places (like this one), when nearly every SQL dialect needs its own implementation. That's where dbt-core + adapters can shine.

Whenever adapter-specific variants are added to a dbt package, it's the perfect moment to stop and consider: there's probably some abstraction that should be in Core!

Adding deduplicate to dbt-core is attractive because many databases can implement the desired behavior. Snowflake, Databricks, and DuckDB all support qualify, and both Postgres and BigQuery have dialect-specific ways to accomplish it.
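
On a dialect with qualify, the whole thing collapses into a single statement, along these lines (a sketch; my_table, id, and ts are stand-in names):

select *
from my_table
-- keep only the first row of each partition, no join required
qualify row_number() over (
    partition by id
    order by ts desc
) = 1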

Since there's not a ubiquitous solution that would work across all dialects, the default implementation in dbt-core would probably be to raise a NotImplementedException.

New macro

To support arbitrary CTEs, there are two main options for databases that don't support something akin to qualify:

  • accept that row_number() over ... will add a single extra column to your result relation (rn by default, but configurable)
  • pass in a set of columns manually

Each has its downsides (the former is sketched below), but the latter seems especially unpleasant.
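
For concreteness, the extra-column variant would look something like this (a sketch; my_cte, id, and ts are stand-in names):

select *
from (
    select
        _inner.*,
        row_number() over (
            partition by id
            order by ts desc
        ) as rn
    from my_cte as _inner
) as numbered
where rn = 1
-- rn is left behind as an extra column in the result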

Overall, the experience for end users of the long tail of dbt adapters is likely to be much better if we move deduplicate to dbt-core rather than wrestle around within dbt-utils.

@dbeatty10
Contributor

Some more info...

Support for qualify

Today I Learned (TIL) that BigQuery supports qualify!

So here's a running list (bound to be non-exhaustive):

  • Snowflake
  • Databricks
  • DuckDB
  • BigQuery
  • the "Can I use..." series by @fatalmind

Included the last one just 'cause I really appreciate @fatalmind's database writings, like the "Can I use..." series that tipped me off to BigQuery's support on this.

Only in dreams

If all data platforms supported both the row_number() window function and select * exclude, then I think we could construct a one-size-fits-all deduplicate. But in a world where qualify is ubiquitous, exclude isn't necessary and the solution is more direct.
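
That dream version would look roughly like this (a sketch, assuming a dialect that supports select * exclude, as Snowflake and DuckDB do today):

with row_numbered as (
    select
        _inner.*,
        row_number() over (
            partition by id
            order by ts desc
        ) as rn
    from my_table as _inner
)

-- drop the helper column on the way out
select * exclude (rn)
from row_numbered
where rn = 1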

@yauhen-sobaleu
Contributor Author

Qualify is supported in Redshift now.
I made a PR: #811
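
For reference, a qualify-based Redshift override could look roughly like this (a sketch of the approach, not necessarily the exact code in the PR):

{%- macro redshift__deduplicate(relation, partition_by, order_by) -%}

    select *
    from {{ relation }}
    -- null-safe: no join against the deduplicated rows is needed
    qualify row_number() over (
        partition by {{ partition_by }}
        order by {{ order_by }}
    ) = 1

{%- endmacro -%}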

@yauhen-sobaleu
Contributor Author

@dbeatty10 can you help review the PR? thank you!
