ChrisGutknecht/data-quality-with-stakeholders-and-dbt

Welcome to my session "Building data quality with stakeholders"

(image: noslides)


1. INTRO. Data quality is like a campfire 🔥


  • We want people to stay and spend time on data quality... but HOW?

(image: campfire)

  • How do we get people to gather around a campfire? Marshmallows!

(image: campfire with marshmallows)

  • Marshmallows = data events that stakeholders really care about
    • Example: anomalies on leading indicators that stakeholders don't have to monitor themselves
  • BUT we need to mix in healthy snacks = the boring tests like unique and not_null

2. POLL. Let's have a live poll to get to know you!

Link: https://www.menti.com/alxmw7icd8ks Code: 4553 5553

(image: poll)


3. DEMO. Let's dive into some examples!

Types of heavily used data tests @ Bergzeit 🏔️

| # | Test type | Example use cases | Stakeholder interaction |
|---|-----------|-------------------|-------------------------|
| 1 | Live alerts | 404 errors | High |
| 2 | Volume changes | Googlebot crawl volume | Medium |
| 3 | Date completeness | Missing dates | Low |
| 4 | dbt generic tests | not_null, unique | None |
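
The date completeness case (row 3) can be expressed as a dbt generic test. A minimal sketch using `dbt_utils.recency`, which fails when the newest date in a model is too old (model and column names here are hypothetical):

```yaml
models:
  - name: fct_daily_sessions   # hypothetical model name
    tests:
      # fails if the newest session_date is more than 1 day old
      - dbt_utils.recency:
          datepart: day
          field: session_date
          interval: 1
```

Recency only catches a stale tail; a full gap check for missing dates in the middle would compare the model against a calendar built with `dbt_utils.date_spine`.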

Live alerts demo:

See recording: https://www.linkedin.com/events/buildingdataqualitywithyourstak7243873241482579968/comments/


Our daily data quality alert forwarding pipeline @ Bergzeit 🏔️

  • We use dbt Cloud to manage models and run tests
  • We store the test failures in BigQuery with an on-run-end macro, store_test_results
  • A Pub/Sub topic listens for table updates via a Logs Explorer sink and triggers a Cloud Function
  • A post_dbt_test_result Cloud Function queries the test_results dbt model, converts the dataframe to HTML and sends it to MS Teams
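
The forwarding step can be sketched in plain Python. This is a hypothetical, simplified stand-in for the post_dbt_test_result function: in the real pipeline the rows come from the test_results dbt model in BigQuery, while here they are hard-coded; the Teams webhook URL is an assumption you would supply yourself.

```python
import json
import urllib.request


def rows_to_html_table(rows):
    """Render a list of dicts (query results) as a simple HTML table."""
    if not rows:
        return "<p>No test failures today 🎉</p>"
    headers = list(rows[0].keys())
    head = "".join(f"<th>{h}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{row[h]}</td>" for h in headers) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def post_to_teams(webhook_url, html):
    """Send the rendered table to an MS Teams incoming webhook."""
    payload = json.dumps({"text": html}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


# In the Cloud Function, these rows would come from querying test_results.
failures = [
    {"test_name": "unique_orders_order_id", "status": "fail", "failures": 3},
]
print(rows_to_html_table(failures))
```

The HTML-table rendering is the only part worth unit-testing locally; the BigQuery query and the webhook call are thin I/O wrappers around it.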

This is how ChatGPT visualizes the process :-D

(image: ChatGPT process visualization)


How we resolve data tests with our stakeholders

  • Use dbt Explorer to get the compiled code of the failed test and run the query

  • For stakeholders, provide additional Looker Studio dashboards for data visualization

  • Rotate a daily first-response duty for test failures among the data team

    • First time failure? Threshold too strict? Deduplication via a window function?
    • Is the alert meaningful? Is there an adequate business response, or nothing to be done?
  • Keep improving model descriptions to 1) provide sufficient context, 2) describe worst-case scenarios and 3) list specific resolution steps
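
The three description points in the last bullet can live directly in the model's yml. A sketch with illustrative content (model name and wording are hypothetical):

```yaml
models:
  - name: fct_live_alerts    # hypothetical model name
    description: >
      Context: aggregates 404 errors from the shop logs per hour.
      Worst case: a broken deployment returns 404 for all product pages.
      Resolution: check the latest release, re-run the crawl model;
      if volumes stay high, escalate to the shop team channel.
```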


Best practices for improving data test resolution and collaboration

  • Route different tests into different channels (Teams, Slack) using tags at the model or test level, e.g. run them separately with `dbt test --select tag:analytics-alerts`
```yaml
  - dbt_expectations.expect_table_row_count_to_be_between:
      # The three Cloud Scheduler jobs for DACH query 350 URLs each, so a total minimum is expected
      min_value: 900
      row_condition: "date = current_date()" # (optional)
      strictly: false
      config:
        severity: warn
        tags: ["analytics-alerts"]
```
  • Use an owner in the meta config on all models
```yaml
  - name: stg_gsc_inspection_logs
    description: >
      This model lists the Search Console inspection logs for a list of daily tested URLs.
      See the source description for more details.
    meta:
      owner: "@Chris G"
```
  • Use group configs on all folders as a fallback for model ownership
```yaml
census_syncs:
  +group: data_team

channel_attribution:
  +group: customer_acquisition
```
