Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAP-633] [Feature] Support Hybrid tables #668

Open
3 tasks done
b-per opened this issue Jun 20, 2023 · 19 comments
Open
3 tasks done

[ADAP-633] [Feature] Support Hybrid tables #668

b-per opened this issue Jun 20, 2023 · 19 comments
Labels
help_wanted Extra attention is needed pkg:dbt-snowflake type:enhancement New feature or request

Comments

@b-per
Copy link
Contributor

b-per commented Jun 20, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-snowflake functionality, rather than a Big Idea better suited to a discussion

Describe the feature

We could add support for Unistore/Hybrid tables by allowing users to materialize hybrid tables from dbt.

I couldn't find an official docs page for the feature but there is a blog post talking about it.

The syntax examples shown are CREATE HYBRID TABLE xxx or CREATE OR REPLACE HYBRID TABLE xxx but I don't know if it supports CREATE HYBRID TABLE AS () or not.

Describe alternatives you've considered

People could create their own materialization or update some of the core macros to allow creating those tables.

Who will this benefit?

Snowflake customers who would like to use the Unistore capabilities.

Are you interested in contributing this feature?

No response

Anything else?

No response

@b-per b-per added type:enhancement New feature or request triage:product labels Jun 20, 2023
@github-actions github-actions bot changed the title [Feature] Support Hybrid tables [ADAP-633] [Feature] Support Hybrid tables Jun 20, 2023
@dataders
Copy link
Contributor

I agree this would be awesome to support. Do you have any materialization as to why node type it should be? table? It's own materialization?

Like you said, provide that Dynamic Tables support the standard DDL, the engineering effort to support them is low and this could in theory be supported in 1.7 or 1.8

my questions:

  1. what does the ideal workflow/ergonomics for using Hybrid tables look like?
  2. any idea what other settings might be needed?
  3. do other data platforms offer a similar feature that would justify including a default implementation in core that adapters may override?

If someone is interested to help out with this, I'd suggest a good starting point is to make a custom hybrid_table materialization, based on dbt-snowflake's table materialization.

@dataders dataders added help_wanted Extra attention is needed awaiting_response and removed triage:product labels Jun 23, 2023
@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Sep 22, 2023
@b-per
Copy link
Contributor Author

b-per commented Sep 22, 2023

Commenting to remove the stale tag! I think it would still be useful.

@dwreeves
Copy link
Contributor

Bumping this as hybrid tables are now in public preview, and this is something we want.

This will involve a little more elbow grease for the DDL statement than with transient vs permanent tables.

@migeraghty
Copy link

Bumping this again as hybrid tables are a perfect use case for our DBT snapshotting processes. We also want the ability to enforce primary key and foreign key constraints, and this should in theory be very low lift to add in DBT.

@GiorgiModebadzeW
Copy link

Usecase:
At docker we want to use hybrid tables as reverse etl backend for our user facing application- where we share some data with clients about docker usage. etc.
because snowflakes original tables are not set for high concurrency read requests we want to utilize hybrid tables as the replacement of postgres and avoid copying data from snowflake tables (generated by dbt) to postgres.

Materialization:
I think supporting both normal table and incremental load for hybrid tables would be ideal usecase for us. So having two new materialization type: hybrid_incremental and hybrid_table would be great. we could even avoid creating new materialization and create new type for snowflake adapter which by default would be normal and if can be specified as hybrid:

{{ config(materialized='table',
   type='hybrid'
 ) }}

{{ config(materialized='incremental',
   type='hybrid'
 ) }}

thank you and looking forward to this

@dakota-kelley-phData
Copy link

Do y'all have someone to take on this development? I wrote a custom materialization to make use of hybrid tables, I can refactor the code I have and I would be more than happy to take on the development of this.

@sl-eagle
Copy link

+1 bump

@jtmcn
Copy link

jtmcn commented Jun 21, 2024

@dakota-kelley-phData is the materializatization code publically available?

@dakota-kelley-phData
Copy link

@jtmcn, I don't know if I do. Let me do some general clean up on it and get it in a repo where others can use it. I'll tackle that this weekend and get it shared here for anyone to use 😄

@nathan-beam
Copy link

Bumping! Would love this feature

@jtmcn
Copy link

jtmcn commented Jul 15, 2024

@dakota-kelley-phData I'll likely be writing our own implementation of this soon. Any code you can share would be helpful. Even if it's not cleaned-up yet.

@aleenprd
Copy link

Bumb bumb bump! Would love to have this feature. Also happy to contribute if some developer can guide me a bit. :)

@b-per
Copy link
Contributor Author

b-per commented Sep 24, 2024

As Hybrid Tables can be created with CTAS it should be quite straightforward for someone to create a custom materialization for it.

This new materialization could be based of (e.g. copied from and then modified) the table one, easier, but not managing incremental loads.
Or it could be based of the incremental one to support incremental loads to Hybrid Tables.

@rothrock
Copy link

yes, I'd use this feature.

@sfc-gh-jdemlow
Copy link

I have come up with something that seems to be working pretty decently obviously there would still need to be a lot of testing and I ran into some oddities with dbt trying to combine my schemas together so I wrote a marco to tell dbt to stop being smart.

This is a very much rough draft, but this does seem to work pretty decently so far if you have questions or adjustment I might be able to make this repo public. Either way I hope this helps someone down the line

custom_macro.sql

{% macro generate_schema_name(custom_schema_name, node) -%}
  {{ custom_schema_name | trim }}
{%- endmacro %}

schema.yml

version: 2

models:
  - name: pre_aggregated_events
    description: "Pre-aggregated impression events for livestreams and ad campaigns using hybrid table"
    config:
      materialized: hybrid_table
      schema: SILVER  # Optional: Specify a custom schema
      database: HYBRID_DB  # Optional: Specify a custom database
      force_ctas: false  # Optional: Set to true when you want to force CTAS
      column_definitions:
        CUSTOMER_ID: VARCHAR(16777216) NOT NULL
        AD_CAMPAIGN_ID: VARCHAR(16777216) NOT NULL
        COUNT_IMPRESSIONS: NUMBER(38,0)
        TOTAL_WATCH_TIME: FLOAT
        INSERT_RAW_TIMESTAMP: TIMESTAMP_NTZ(9)
        INSERT_HYBRID_TIMESTAMP: TIMESTAMP_NTZ(9)
        UPDATED_AT: TIMESTAMP_NTZ(9)
        IS_UPDATED: BOOLEAN
      primary_key: LIVESTREAM_ID, AD_CAMPAIGN_ID
      indexes:
        - name: IDX_CUSTOMER
          columns: [CUSTOMER_ID]
        - name: IDX_AD_CAMPAIGN
          columns: [AD_CAMPAIGN_ID]

SQL CALL

{{ config(materialized='hybrid_table') }}

{% set target_relation = this %}
{% set existing_relation = load_relation(this) %}

{% if existing_relation is none %}
    -- Table doesn't exist, use CTAS
    WITH activity_data AS (
        SELECT
            CUSTOMER_ID,
            AD_CAMPAIGN_ID,
            COUNT(*) AS COUNT_IMPRESSIONS,
            SUM(WATCH_TIME_SECS) AS TOTAL_WATCH_TIME,
            MAX(INSERT_RAW_TIMESTAMP) AS INSERT_RAW_TIMESTAMP
        FROM {{ ref('stg_activities') }}
        WHERE EVENT_NAME = 'impression' AND STREAM_ACTION != 'DELETE'
        GROUP BY CUSTOMER_ID, AD_CAMPAIGN_ID
    )
    SELECT
        ad.CUSTOMER_ID,
        ad.AD_CAMPAIGN_ID,
        ad.COUNT_IMPRESSIONS,
        ad.TOTAL_WATCH_TIME,
        ad.INSERT_RAW_TIMESTAMP,
        CURRENT_TIMESTAMP() AS INSERT_HYBRID_TIMESTAMP,
        CURRENT_TIMESTAMP() AS UPDATED_AT,
        FALSE AS IS_UPDATED
    FROM activity_data ad
{% else %}
    -- Table exists, use MERGE
    MERGE INTO {{ target_relation }} t
    USING (
        SELECT 
            CUSTOMER_ID,
            AD_CAMPAIGN_ID,
            COUNT(*) AS COUNT_IMPRESSIONS,
            SUM(WATCH_TIME_SECS) AS TOTAL_WATCH_TIME,
            MAX(INSERT_RAW_TIMESTAMP) AS INSERT_RAW_TIMESTAMP,
            CURRENT_TIMESTAMP() AS INSERT_HYBRID_TIMESTAMP
        FROM {{ ref('stg_activities') }}
        WHERE EVENT_NAME = 'impression' AND STREAM_ACTION != 'DELETE'
        GROUP BY CUSTOMER_ID, AD_CAMPAIGN_ID
    ) s
    ON t.CUSTOMER_ID = s.CUSTOMER_ID AND t.AD_CAMPAIGN_ID = s.AD_CAMPAIGN_ID
    WHEN MATCHED THEN
        UPDATE SET 
            t.COUNT_IMPRESSIONS = t.COUNT_IMPRESSIONS + s.COUNT_IMPRESSIONS,
            t.TOTAL_WATCH_TIME = t.TOTAL_WATCH_TIME + s.TOTAL_WATCH_TIME,
            t.INSERT_RAW_TIMESTAMP = GREATEST(t.INSERT_RAW_TIMESTAMP, s.INSERT_RAW_TIMESTAMP),
            t.INSERT_HYBRID_TIMESTAMP = s.INSERT_HYBRID_TIMESTAMP,
            t.UPDATED_AT = s.INSERT_HYBRID_TIMESTAMP,
            t.IS_UPDATED = TRUE
    WHEN NOT MATCHED THEN
        INSERT (CUSTOMER_ID, AD_CAMPAIGN_ID, COUNT_IMPRESSIONS, TOTAL_WATCH_TIME, INSERT_RAW_TIMESTAMP, INSERT_HYBRID_TIMESTAMP, UPDATED_AT, IS_UPDATED)
        VALUES (s.CUSTOMER_ID, s.AD_CAMPAIGN_ID, s.COUNT_IMPRESSIONS, s.TOTAL_WATCH_TIME, s.INSERT_RAW_TIMESTAMP, s.INSERT_HYBRID_TIMESTAMP, s.INSERT_HYBRID_TIMESTAMP, FALSE)
{% endif %}

HYBRID TABLES CUSTOM MATERIALIZATION

{% materialization hybrid_table, adapter='snowflake' %}
  {%- set config = model['config'] -%}

  {% do log("Hybrid table materialization for model: " ~ model.name, info=True) %}
  {% do log("Config: " ~ config, info=True) %}
  {% do log("Target schema: " ~ target.schema, info=True) %}
  {% do log("Config schema: " ~ config.get('schema'), info=True) %}
  {% do log("Model schema: " ~ model.schema, info=True) %}

  {%- set target_schema = config.get('schema', target.schema) -%}
  
  {% do log("Final target schema: " ~ target_schema, info=True) %}

  {% set target_relation = api.Relation.create(
    database=target.database,
    schema=config.get('schema', target.schema),
    identifier=model.alias
  ) %}

  {% do log("Target relation: " ~ target_relation, info=True) %}

  {%- set existing_relation = adapter.get_relation(
        database=target_relation.database,
        schema=target_relation.schema,
        identifier=target_relation.identifier) -%}

  {% do log("Existing relation: " ~ existing_relation, info=True) %}

  {%- set column_definitions = config.get('column_definitions', {}) -%}
  {%- set primary_key = config.get('primary_key', []) -%}
  {%- set primary_key = primary_key if primary_key is string else (primary_key | join(', ')) -%}
  {%- set indexes = config.get('indexes', []) -%}
  {%- set force_ctas = config.get('force_ctas', false) -%}

  -- Run pre-hooks
  {{ run_hooks(pre_hooks) }}

  {% if existing_relation is none or force_ctas %}
    -- Table doesn't exist or force_ctas is true, use CTAS
    {% call statement('main') -%}
      CREATE OR REPLACE HYBRID TABLE {{ target_relation }} (
        {% for column, definition in column_definitions.items() %}
          {{ column }} {{ definition }}{% if not loop.last %},{% endif %}
        {% endfor %}
        {% if primary_key %},
        PRIMARY KEY ({{ primary_key }})
        {% endif %}
        {% for index in indexes %}
        , INDEX {{ index.name }}({{ index.columns | join(', ') }})
        {% endfor %}
      ) AS (
        {{ sql }}
      )
    {%- endcall %}
  {% else %}
    -- Table exists, use MERGE
    {% call statement('main') -%}
      {{ sql }}
    {%- endcall %}
  {% endif %}

  -- Run post-hooks
  {{ run_hooks(post_hooks) }}

  -- Return the relations created in this materialization
  {{ return({'relations': [target_relation]}) }}

{% endmaterialization %}

This application is not part of the Snowflake Service and is governed by the terms in LICENSE, unless expressly agreed to in writing. You use this application at your own risk, and Snowflake has no obligation to support your use of this application.

@rodrigoreis22
Copy link

It'd be great if dbt had native support for Snowflake's hybrid tables.

@azouari
Copy link

azouari commented Dec 25, 2024

Looking forward to have support for Hybrid tables! Currently, my solution is replicate the final layer into Hybrid tables.

@sfc-gh-jdemlow
Copy link

@azouari have you tried the custom materialization above this should allow you to use hybrid tables effectively until this feature is fully supported by DBT.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help_wanted Extra attention is needed pkg:dbt-snowflake type:enhancement New feature or request
Projects
None yet
Development

No branches or pull requests