- We want people to stay and spend time on data quality... but HOW?
- How do people gather round a campfire? Marshmallows!
- Marshmallows:
- Data events that stakeholders really care about:
- Example. Anomalies on leading indicators, which they don't have to monitor
- BUT we need to mix in healthy snacks = boring tests like
unique
andnot_null
Link: https://www.menti.com/alxmw7icd8ks Code: 4553 5553
# | Test type | Example use cases | Stakeholder interaction |
---|---|---|---|
1 | Live alerts | 404 errors | High |
2 | Volume changes | Googlebot crawl volume | Medium |
3 | Date completeness | Missing dates | Low |
4 | dbt generic tests | not_null, unique | None |
See recording: https://www.linkedin.com/events/buildingdataqualitywithyourstak7243873241482579968/comments/
- We use dbt cloud to manage models and run tests
- We store the test failures with an
on-run-end
macrostore_test_results
in BigQuery - A PubSub topic Listens on table updates with from Logs explorer sink and and trigger a cloud function
- A
post_dbt_test_result
cloud function queries thetest_results
dbt model, converts the dataframe to HTML and sends it to MS teams
This is how ChatGPT visualizes the process: :-D
-
Use dbt Explore to get the compiled code of the failed test and run the query
-
For stakeholders, provide additional Looker Studio Dashboards for Data Visualization
-
Rotate daily first response duty for test failure among data team
- First time failure? Threshold too strict? Deduplication via a window function?
- Is the alert meaningful? Is there an adequate business response, or nothing to be done?
-
Keep improving model description to 1) provide sufficient context, 2) describe worst case Scenarios und 3) specific resolution steps
- Route different tests into differente channels (Teams, Slack) using tags, either on model and test level
- dbt_expectations.expect_table_row_count_to_be_between:
# The three cloud schedulder jobs for DACH query 350 URLs each, thus a total min amount is expected
min_value: 900
row_condition: "date = current_date()" # (Optional)
strictly: false
config:
severity: warn
tags: ["analytics-alerts"]
- use owner tags on all models
- name: stg_gsc_inspection_logs
description: >
This model lists the search console inspection logs for a list daily tested URLs.
See source description for more details
meta:
owner: "@Chris G"
- use group tags for all folders as fallback for model ownership
census_syncs:
+group: data_team
channel_attribution:
+group: customer_acquisition
- dbt expectations: https://github.com/calogica/dbt-expectations