Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column names with spaces #782

Closed
4 tasks
RobbertDM opened this issue Apr 19, 2023 · 12 comments
Closed
4 tasks

Column names with spaces #782

RobbertDM opened this issue Apr 19, 2023 · 12 comments
Labels

Comments

@RobbertDM
Copy link

RobbertDM commented Apr 19, 2023

Describe the bug

I have some column names with spaces in them.
I use

{{ dbt_utils.generate_surrogate_key( ["column with spaces"] ) }} as id,

which compiles to

lower(to_hex(md5(to_utf8(cast(coalesce(cast(column with spaces as varchar), '_dbt_utils_surrogate_key_null_') as varchar))))) as dp3_id,

Which is not valid SQL for my database.

Steps to reproduce

Run generate_surrogate_key on a column name that contains spaces.

Expected results

Working SQL

Actual results

SQL error

Screenshots and log output

System information

The contents of your packages.yml file:

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • [ x ] other (specify: Trino)

The output of dbt --version:

Core:
  - installed: 1.4.5
  - latest:    1.4.5 - Up to date!

Plugins:
  - duckdb:   1.4.1 - Up to date!
  - trino:    1.4.2 - Up to date!
  - postgres: 1.4.5 - Up to date!

Additional context

I think we can add some " quotes here, but I'm not sure if that would break other databases:
https://github.com/dbt-labs/dbt-utils/blob/main/macros/sql/generate_surrogate_key.sql#L18

Maybe it's also a possibility to use {{ adapter.quote(column) }}

Are you interested in contributing the fix?

Time limitations :(

@RobbertDM RobbertDM added bug Something isn't working triage labels Apr 19, 2023
@joellabes
Copy link
Contributor

Hey @RobbertDM,

You can resolve this at your query's level by including more quotes:

{{ dbt_utils.generate_surrogate_key( ['"column with spaces"'] ) }} as id,

The first set of quotes will be consumed by the Jinja parser but the second set will pass through correctly.

@joellabes joellabes removed bug Something isn't working triage labels Apr 20, 2023
@RobbertDM
Copy link
Author

Thanks @joellabes , that could work in some cases. Tbh it feels a bit like a workaround.

When getting columns from get_filtered_columns_in_relation,

{% set column_names_filtered = dbt_utils.get_filtered_columns_in_relation(
    from=ref('my_model'),
    except=["column_not_to_include_in_id"]) %}

{{ dbt_utils.generate_surrogate_key(
    column_names_filtered
) }} as id,

I would probably need to do something like

{%- set column_names_quoted = [] -%}
{%- for col in column_names_filtered -%}
    {%- do column_names_quoted.append(
        adapter.quote(col)
    ) -%}
{%- endfor -%}

To map those columns to their quoted equivalent, right?

If we could add that adapter.quote(col) to dbt_utils.generate_surrogate_key, then you can use the same code, whether your columns have crazy names or not. What do you think?

@RobbertDM
Copy link
Author

Something like

Modifying
https://github.com/dbt-labs/dbt-utils/blob/main/macros/sql/generate_surrogate_key.sql#L18
"coalesce(cast(" ~ field ~ " as " ~ dbt.type_string() ~ "), '" ~ default_null_value ~"')"

To become
"coalesce(cast(" ~ adapter.quote(field) ~ " as " ~ dbt.type_string() ~ "), '" ~ default_null_value ~"')"

@joellabes
Copy link
Contributor

My first reaction is: "🤯 How many columns are you referencing that it's better to pull them from get_filtered_columns_in_relation?"

What is your use case? (In case it's a snapshot, a reminder that check_cols = all exists for this purpose)

Beyond that, adapter.quote is not a silver bullet; arbitrarily quoting extra columns causes problems. Consider this on Snowflake:

with a as (
    select 1 as id
)

select id, ID, "id", "ID"
from a

Three of the four will work, but "id" will be rejected as an invalid identifier.
image

Quoting all the lowercase columns already in the wild would prevent them from being successfully referenced.

With all that said, I could get on board with a "quote_identifiers" argument – which defaults to False – in the same vein as the existing option on star() and pivot(). If you or someone else wanted to open a PR adding that support and tests to validate that it works correctly, I'll happily review it!

@joellabes joellabes reopened this Apr 20, 2023
@joellabes joellabes added good first issue enhancement New feature or request labels Apr 20, 2023
@RobbertDM
Copy link
Author

RobbertDM commented Apr 21, 2023

Hey Joel,

  • Yes indeed, unfortunately we have some very wide tables (biggest I've seen has 2k columns).
  • It's indeed for snapshots, but a modified version of it that takes an extra timestamp column. We don't want to take that extra timestamp column into account, and it's changing all the time, so all doesn't work for us. We generate dbt code from yaml, and get_columns_in_relation works for any table and we can exclude that one column that we don't need.
  • Thanks for the Snowflake example, helps broadening my (very narrow) view, I've only used Trino so far.
  • quote_identifiers argument seems like an elegant solution 👍 I don't think I'll find the time soon, but I really like the idea.

@hui-jie-lim
Copy link

hmm, i might be able to help out here. Might take a while thought. If its alright with you @joellabes , maybe you can assign this to me?

@NagarajNune
Copy link

NagarajNune commented Sep 25, 2023

Hi Guys,

I am a newbie to DBT.

@RobbertDM why can't we use dbt_utils.slugify on top of the column names which have spaces and replace them with underscore

something like this
{{ dbt_utils.generate_surrogate_key([dbt_utils.slugify('field_a'), dbt_utils.slugify('field_b')]) }}

Everyone please correct me If I am wrong.

@RobbertDM
Copy link
Author

RobbertDM commented Sep 25, 2023

Hey @NagarajNune, say you indeed have a table with one column field a. Then slugify would transform it into field_a and the whole thing would compile to
lower(to_hex(md5(to_utf8(cast(coalesce(cast(field_a as varchar), '_dbt_utils_surrogate_key_null_') as varchar))))) as dp3_id,

When executing on your database, the above would throw something along the lines of field_a not found, since field_a indeed does not exist in your table, whereas field a does.

@NagarajNune
Copy link

Hey @NagarajNune, say you indeed have a table with one column field a. Then slugify would transform it into field_a and the whole thing would compile to lower(to_hex(md5(to_utf8(cast(coalesce(cast(field_a as varchar), '_dbt_utils_surrogate_key_null_') as varchar))))) as dp3_id,

When executing on your database, the above would throw something along the lines of field_a not found, since field_a indeed does not exist in your table, whereas field a does.

Yes, this is correct, but in our project normally we will replace all spaces and special characters in column names while reading them from database itself.

Example:
let take if we have a data point which we loaded in to our warehouse.
data_point:
{
"First name":"Nagaraj",
"Last name":"Nune"
}

we will compile it in the following way

select 
data_point:"First name" as first_name,
data_point:"last name" as last_name,
from raw_data

I have taken this a reference with respect to database structure we have in our current project in snowflake.

@RobbertDM
Copy link
Author

In that case, indeed, slugify might work for you :) But not everyone has that renaming step.

Copy link

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Mar 24, 2024
Copy link

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants