Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigQuery deduplicate macro doesn't support downstream partition pruning #928

Open
austinclooney opened this issue Jul 1, 2024 · 1 comment · May be fixed by #929
Open

BigQuery deduplicate macro doesn't support downstream partition pruning #928

austinclooney opened this issue Jul 1, 2024 · 1 comment · May be fixed by #929
Labels
enhancement New feature or request performance

Comments

@austinclooney
Copy link

Describe the feature

The Bigquery deduplicate macro uses array_agg to deduplicate. The way it is currently set up, any query using the macro will not be able to partition prune downstream of the macro due to the way array_agg interacts with partition pruning because the partition column is not explicitly selected separate from the array agg. This can be avoided in a table by doing the partition filtering in a CTE prior to using the deduplicate macro, but it can't be avoided in a view.

Describe alternatives you've considered

For models materialized as tables, we can add the partition filtering above the deduplication. For views, the alternative is to write the deduplication step ourselves to manual select the partition column outside of the array_agg.

Basically it is written like this:

    select unique.*
    from (
        select
            array_agg (
                original
                order by {{ order_by }}
                limit 1
            )[offset(0)] unique
        from {{ relation }} original
        group by {{ partition_by }}
    )

and instead it needs to look like this:

    select <partition pruning column>, unique.* except(<partition pruning column>)
    from (
        select
            <partition pruning column>,
            array_agg (
                original
                order by {{ order_by }}
                limit 1
            )[offset(0)] unique
        from {{ relation }} original
        group by <partition pruning column>, {{ partition_by }}
    )

Additional context

This is specific to the BQ deuplicate macro.

Who will this benefit?

BQ users who want to deduplicate in a view.

Are you interested in contributing this feature?

I can create a PR that implements the fix we used.

@austinclooney austinclooney added enhancement New feature or request triage labels Jul 1, 2024
@austinclooney
Copy link
Author

I've created a PR with an example solution, but will be happy to work with someone if it doesn't meet standards.

@dbeatty10 dbeatty10 changed the title Bigquery deduplicate macro doesn't support downstream partition pruning. BigQuery deduplicate macro doesn't support downstream partition pruning Jul 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants