# DQ Diagrams

## Arch

![img.png](arch2.png)

## Entities

### DQMetric
> Represents a generic data quality metric used to assess various aspects of data quality (quantitative).
- **DQCompletenessMetric**
> Measures the completeness of data, ensuring that all required data is present.
- **DQCOUNTMetric**
> A specific completeness metric that counts the number of non-missing values in a dataset.
- **DQNULLPERCENTAGEMetric**
> A specific completeness metric that measures the percentage of null values in a dataset.
- **DQAccuracyMetric**
> Measures the accuracy of data, ensuring that data values are correct and conform to a known standard.
- **DQNULLMetric**
> An accuracy metric that counts the number of NULL values in a dataset.
- **DQPROFILEMetric**
> Measures the profile of data, such as max, min, median, avg, and stddev.
- **DQMAXMetric**
> A profile metric that computes the maximum of the values in a dataset.
- **DQMINMetric**
> A profile metric that computes the minimum of the values in a dataset.
- **DQMEDIANMetric**
> A profile metric that computes the median of the values in a dataset.
- **DQAVGMetric**
> A profile metric that computes the average of the values in a dataset.
- **DQSTDDEVMetric**
> A profile metric that computes the standard deviation of the values in a dataset.
- **DQTOPKMetric**
> A profile metric that lists the top-k most frequent values in a dataset.

- **DQUniquenessMetric**
> Measures the uniqueness of data, ensuring that there are no duplicate records.
- **DQDISTINCTCOUNTMetric**
> A specific uniqueness metric that identifies and counts unique records in a dataset.
- **DQFreshnessMetric**
> Measures the freshness of data, ensuring that the data is up-to-date.
- **DQTTUMetric (Time to Usable)**
> A freshness metric that measures the time taken for data to become usable after it is created or updated.
- **DQDiffMetric**
> Compares data across different datasets or points in time to identify discrepancies.
- **DQTableDiffMetric**
> A specific diff metric that compares entire tables to identify differences.
- **DQFileDiffMetric**
> A specific diff metric that compares files to identify differences.
- **MetricStorageService**
> A service for storing and fetching data quality metrics.

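To make the taxonomy above concrete, here is a minimal Python sketch of how the metric hierarchy could be modeled. The class, field, and method names are illustrative assumptions for this document, not the actual Griffin API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DQResult:
    """A single recorded metric value for a target dataset."""
    metric_name: str
    target: str
    value: float


class DQMetric(ABC):
    """Generic, quantitative data quality metric."""

    def __init__(self, name: str, target: str):
        self.name = name
        self.target = target

    @abstractmethod
    def compute(self, values: Sequence[Optional[float]]) -> DQResult:
        ...


class DQCountMetric(DQMetric):
    """Completeness: number of non-missing values."""

    def compute(self, values):
        non_missing = sum(1 for v in values if v is not None)
        return DQResult(self.name, self.target, float(non_missing))


class DQNullPercentageMetric(DQMetric):
    """Completeness: percentage of NULL values."""

    def compute(self, values):
        nulls = sum(1 for v in values if v is None)
        pct = 100.0 * nulls / len(values) if values else 0.0
        return DQResult(self.name, self.target, pct)


# Example: compute two completeness metrics over a small column sample.
sample = [1.0, None, 3.0, None, 5.0]
print(DQCountMetric("count_of_users", "user_table").compute(sample))
print(DQNullPercentageMetric("null_pct_users", "user_table").compute(sample))
```
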

- **DQJob**
> Abstract data-quality-related job.
- **MetricCollectingJob**
> A job that collects data quality metrics from various sources and stores them for analysis.
- **DQCheckJob**
> A job that performs data quality checks based on predefined rules and metrics.
- **DQAlertJob**
> A job that generates alerts when data quality issues are detected.
- **DQDag**
> A directed acyclic graph that defines the dependencies and execution order of various data quality jobs.
- **Scheduler**
> A system that schedules and manages the execution of data quality jobs.
> This is the default scheduler; it launches data quality jobs periodically.
- **DolphinSchedulerAdapter**
> Connects our planned data quality jobs with Apache DolphinScheduler,
> allowing data quality jobs to be triggered upon completion of the upstream jobs they depend on.
- **AirflowSchedulerAdapter**
> Connects our planned data quality jobs with Apache Airflow,
> so that data quality jobs can be triggered upon completion of the upstream jobs they depend on.
- **Worker**
> Open question: do we need another worker layer, given that most of the work is done on the big data side?
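
A minimal, self-contained sketch of how DQJob, DQDag, and a scheduler adapter could relate; the interfaces below are illustrative assumptions rather than the existing implementation.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence


class DQJob(ABC):
    """Abstract data quality job, executed as one node of a DQDag."""

    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def run(self) -> None:
        ...


class DQDag:
    """Directed acyclic graph of DQJobs; edges point from upstream to downstream."""

    def __init__(self):
        self.jobs: Dict[str, DQJob] = {}
        self.downstream: Dict[str, List[str]] = {}

    def add(self, job: DQJob, upstream: Sequence[str] = ()) -> None:
        self.jobs[job.name] = job
        self.downstream.setdefault(job.name, [])
        for up in upstream:
            self.downstream.setdefault(up, []).append(job.name)

    def topological_order(self) -> List[str]:
        # Kahn's algorithm: repeatedly emit jobs whose upstream jobs are all done.
        indegree = {name: 0 for name in self.jobs}
        for children in self.downstream.values():
            for child in children:
                indegree[child] += 1
        ready = [n for n, deg in indegree.items() if deg == 0]
        order: List[str] = []
        while ready:
            name = ready.pop()
            order.append(name)
            for child in self.downstream[name]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
        return order


class SchedulerAdapter(ABC):
    """Bridges a DQDag into an external scheduler (e.g. DolphinScheduler, Airflow)."""

    @abstractmethod
    def submit(self, dag: DQDag) -> None:
        ...


# Usage: wire the typical metric-collect -> check -> alert chain and run it in order.
class PrintJob(DQJob):
    def run(self) -> None:
        print("running", self.name)


dag = DQDag()
dag.add(PrintJob("metric_collecting_job"))
dag.add(PrintJob("dq_check_job"), upstream=["metric_collecting_job"])
dag.add(PrintJob("dq_alert_job"), upstream=["dq_check_job"])
for job_name in dag.topological_order():
    dag.jobs[job_name].run()
```
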
# Data Quality Tool

## Introduction

In the evolving landscape of data architecture, ensuring data quality remains a critical success factor for all companies.
Data architectures have progressed significantly over recent years, transitioning from relational databases and data
warehouses to data lakes, hybrid data lake and warehouse combinations, and modern lakehouses.

Despite these advancements, data quality issues persist and have become increasingly vital, especially in the era of AI
and data integration. Improving data quality is essential for all organizations, and maintaining it across various
environments requires a combination of people, processes, and technology.

To address these challenges, we will upgrade the data quality tool so that it can be easily adopted by any data organization.
This tool abstracts common data quality problems and integrates seamlessly with diverse data architectures.

## Data Quality Dimensions

1. **Accuracy** – Data should be error-free according to business needs.
2. **Consistency** – Data should not conflict with other values across data sets.
3. **Completeness** – Data should not be missing.
4. **Timeliness** – Data should be up-to-date within a limited time frame.
5. **Uniqueness** – Data should have no duplicates.
6. **Validity** – Data should conform to a specified format.

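If these dimensions are referenced programmatically (for example when tagging metrics or results), they could be encoded as a simple enumeration; this is a hypothetical sketch, not part of the existing codebase.

```python
from enum import Enum


class DQDimension(Enum):
    """The six data quality dimensions listed above."""
    ACCURACY = "accuracy"
    CONSISTENCY = "consistency"
    COMPLETENESS = "completeness"
    TIMELINESS = "timeliness"
    UNIQUENESS = "uniqueness"
    VALIDITY = "validity"
```
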
## Our New Architecture

Our new architecture consists of two primary layers: the Data Quality Layer and the Integration Layer.

### Data Quality Constraints Layer

This constraints layer abstracts the core concepts of the data quality lifecycle, focusing on:

- **Defining Specific Data Quality Constraints**:
  - **Metrics**: Establishing specific data quality metrics.
  - **Anomaly Detection**: Implementing methods for detecting anomalies.
  - **Actions**: Defining actions to be taken based on the data quality assessments.

- **Measuring Data Quality**:
  - Utilizing various connectors such as SQL, HTTP, and CMD to measure data quality across different systems.

- **Unifying Data Quality Results**:
  - Creating a standardized and structured view of data quality results across different dimensions to ensure a consistent understanding.

- **Flexible Data Quality Jobs**:
  - Designing data quality jobs within a generic, topological Directed Acyclic Graph (DAG) framework to facilitate easy plug-and-play functionality.

### Integration Layer

This layer provides a robust framework to enable users to integrate Griffin data quality pipelines seamlessly with their business processes. It includes:

- **Scheduler Integration**:
  - Ensuring seamless integration with typical schedulers for efficient pipeline execution.

- **Apache DolphinScheduler Integration**:
  - Facilitating effortless integration within the Java ecosystem to leverage Apache DolphinScheduler.

- **Apache Airflow Integration**:
  - Enabling smooth integration within the AI ecosystem using Apache Airflow.

This architecture aims to provide a comprehensive and flexible approach to managing data quality
and integrating it into the existing business workflows of data teams.

The enterprise job scheduling system can then launch optional data quality check pipelines after the usual data jobs finish,
and, based on the data quality result, trigger follow-up actions such as retrying a job or stopping downstream scheduling, acting as a circuit breaker.


### Data Quality Layer

#### Data Quality Constraints Definition

This concept has been thoroughly discussed in the original Apache Griffin design documents. Essentially, we aim to quantify
the data quality of a dataset based on the aforementioned dimensions. For example, to measure the count of records in a user
table, our data quality constraint definition could be:

**Simple Version:**

- **Metric**
  - Name: count_of_users
  - Target: user_table
  - Dimension: count
- **Anomaly Condition:** $metric <= 0
- **Post Action:** send alert

**Advanced Version:**

- **Metric**
  - Name: count_of_users
  - Target: user_table
  - Filters: city = 'shanghai' and event_date = '20240601'
  - Dimension: count
- **Anomaly Condition:** $metric <= 0
- **Post Action:** send alert

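As a sketch of how such a constraint definition could be expressed programmatically, the dataclass below mirrors the fields of the simple and advanced versions; the class itself and its field names are hypothetical illustrations, not an existing API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DQConstraint:
    """One data quality constraint: a metric, an anomaly condition, and a post action."""
    metric_name: str
    target: str
    dimension: str
    anomaly_condition: str          # e.g. "metric <= 0"
    post_action: str                # e.g. "send_alert"
    filters: Optional[str] = None   # e.g. "city = 'shanghai' and event_date = '20240601'"


# The "Advanced Version" constraint from above, expressed with this sketch.
count_of_users = DQConstraint(
    metric_name="count_of_users",
    target="user_table",
    dimension="count",
    anomaly_condition="metric <= 0",
    post_action="send_alert",
    filters="city = 'shanghai' and event_date = '20240601'",
)
```
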
#### Data Quality Pipelines (DAG)

We support several typical data quality pipelines:

**One Dataset Profiling Pipeline:**

```plaintext
recording_target_table_metric_job -> anomaly_condition_job -> post_action_job
```

**Dataset Diff Pipeline:**

```plaintext
recording_target_table1_metric_job ->
                                      \
                                       -> anomaly_condition_job -> post_action_job
                                      /
recording_target_table2_metric_job ->
```

**Compute Platform Migration Pipeline:**

```plaintext
run_job_on_platform_v1 -> recording_target_table_metric_job_on_v1 ->
                                                                     \
                                                                      -> anomaly_condition_job -> post_action_job
                                                                     /
run_job_on_platform_v2 -> recording_target_table_metric_job_on_v2 ->
```
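
For example, the dataset diff pipeline above is just a four-node DAG. A minimal standalone sketch of wiring and ordering it (job names taken from the diagram; the scheduling logic itself is illustrative):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each key lists the jobs it depends on (its upstream jobs).
dataset_diff_pipeline = {
    "recording_target_table1_metric_job": set(),
    "recording_target_table2_metric_job": set(),
    "anomaly_condition_job": {
        "recording_target_table1_metric_job",
        "recording_target_table2_metric_job",
    },
    "post_action_job": {"anomaly_condition_job"},
}

# A scheduler would launch jobs in this (or any other valid) topological order.
for job in TopologicalSorter(dataset_diff_pipeline).static_order():
    print("launch", job)
```
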

#### Data Quality Report

- **Meet Expectations**
  + Data Quality Constraint 1: Passed
  + Data Quality Constraint 2: Passed
- **Does Not Meet Expectations**
  + Data Quality Constraint 3: Failed
    - Violation details
    - Possible root cause
  + Data Quality Constraint 4: Failed
    - Violation details
    - Possible root cause

#### Connectors

The executor measures the data quality of the target dataset by recording the metrics. It supports many predefined protocols,
and customers can extend the executor protocols if they want to add their own business logic.

**Predefined Protocols:**

- MySQL: `jdbc:mysql://hostname:port/database_name?user=username&password=password`
- Presto: `jdbc:presto://hostname:port/catalog/schema`
- Trino: `jdbc:trino://hostname:port/catalog/schema`
- HTTP: `http://hostname:port/api/v1/query?query=<prometheus_query>`
- Docker

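A minimal sketch of what a connector interface might look like; the `Connector` protocol below and the SQLite stand-in used for the demo are illustrative assumptions rather than the actual executor API.

```python
import sqlite3
from abc import ABC, abstractmethod


class Connector(ABC):
    """Measures one metric on a target system and returns its numeric value."""

    @abstractmethod
    def measure(self, query: str) -> float:
        ...


class SqlConnector(Connector):
    """SQL-protocol connector; sqlite3 stands in here for a JDBC-style MySQL/Presto/Trino source."""

    def __init__(self, connection):
        self.connection = connection

    def measure(self, query: str) -> float:
        cursor = self.connection.execute(query)
        (value,) = cursor.fetchone()
        return float(value)


# Demo: record the count_of_users metric against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_table (id INTEGER, city TEXT)")
conn.executemany("INSERT INTO user_table VALUES (?, ?)", [(1, "shanghai"), (2, "beijing")])
connector = SqlConnector(conn)
print(connector.measure("SELECT COUNT(*) FROM user_table WHERE city = 'shanghai'"))
```
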
### Integration Layer

Every data team has its own existing scheduler.
While we provide a default scheduler, for greater adoption we will refactor
our Apache Griffin scheduler capabilities to leverage our customers' schedulers.
This involves redesigning our scheduler to either ingest job instances into our customers' schedulers
or bridge our DQ pipelines into their DAGs.

```plaintext
   biz_etl_phase    ||            data_quality_phase
                    ||
business_etl_job -> || recording_target_table1_metric_job ->
                    ||                                       \
                    ||                                        -> anomaly_condition_job -> post_action_job
                    ||                                       /
business_etl_job -> || recording_target_table2_metric_job ->
                    ||
```

- Integration with a generic scheduler
- Integration with Apache DolphinScheduler
- Integration with Apache Airflow (see the sketch below)
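
As one concrete possibility, a data quality check can be appended to a business ETL DAG in Apache Airflow roughly as follows (Airflow 2.4+ assumed); the task callables, metric value, and DAG id are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_business_etl():
    print("business ETL finished")  # placeholder for the real ETL task


def record_metric_and_check():
    metric = 42                     # placeholder: record count_of_users here
    if metric <= 0:                 # anomaly condition from the constraint definition
        raise ValueError("data quality check failed: count_of_users <= 0")


with DAG(
    dag_id="business_etl_with_dq_check",  # hypothetical DAG id
    start_date=datetime(2024, 6, 1),
    schedule=None,
    catchup=False,
) as dag:
    etl = PythonOperator(task_id="business_etl_job", python_callable=run_business_etl)
    dq_check = PythonOperator(task_id="dq_check_job", python_callable=record_metric_and_check)

    # The DQ check runs only after the business job succeeds; a failure here
    # stops downstream scheduling, acting as the circuit breaker described above.
    etl >> dq_check
```
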

# Two tables diff result set

We want to unify the result set for comparing two tables. When the two tables share the same schema,
we can construct the result set as below so that users can quickly find the differences between the two tables.

| diff_type  | col1_src | col1_target | col2_src  | col2_target | col3_src  | col3_target | col4_src   | col4_target |
|------------|----------|-------------|-----------|-------------|-----------|-------------|------------|-------------|
| missing    | prefix1  | NULL        | sug_vote1 | NULL        | pv_total1 | NULL        | 2024-01-01 | NULL        |
| additional | NULL     | prefix1     | NULL      | sug_vote2   | NULL      | pv_total2   | NULL       | 2024-01-01  |
| missing    | prefix3  | NULL        | sug_vote3 | NULL        | pv_total3 | NULL        | 2024-01-03 | NULL        |
| additional | NULL     | prefix4     | NULL      | sug_vote4   | NULL      | pv_total3   | NULL       | 2024-01-03  |
| missing    | prefix5  | NULL        | sug_vote5 | NULL        | pv_total5 | NULL        | 2024-01-05 | NULL        |
| additional | NULL     | prefix5     | NULL      | sug_vote5   | NULL      | pv_total6   | NULL       | 2024-01-05  |
| missing    | prefix7  | NULL        | sug_vote7 | NULL        | pv_total7 | NULL        | 2024-01-07 | NULL        |
| additional | NULL     | prefix8     | NULL      | sug_vote8   | NULL      | pv_total8   | NULL       | 2024-01-07  |
| missing    | prefix9  | NULL        | sug_vote9 | NULL        | pv_total9 | NULL        | 2024-01-09 | NULL        |
| additional | NULL     | prefix10    | NULL      | sug_vote10  | NULL      | pv_total10  | NULL       | 2024-01-09  |
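
A minimal sketch of how such a result set could be produced when both tables fit in memory; the set-difference approach and the sample rows are illustrative, and a real implementation would push this into the engine (for example as a FULL OUTER JOIN) instead.

```python
# Rows are tuples of (col1, col2, col3, col4); hypothetical sample data.
src_rows = {("prefix1", "sug_vote1", "pv_total1", "2024-01-01"),
            ("prefix3", "sug_vote3", "pv_total3", "2024-01-03")}
target_rows = {("prefix1", "sug_vote2", "pv_total2", "2024-01-01"),
               ("prefix4", "sug_vote4", "pv_total3", "2024-01-03")}


def table_diff(src, target):
    """Rows only in src are 'missing' from target; rows only in target are 'additional'."""
    result = []
    for row in sorted(src - target):
        # Interleave (src, NULL) pairs for each column, matching the table layout above.
        result.append(("missing",) + tuple(v for col in row for v in (col, None)))
    for row in sorted(target - src):
        # Interleave (NULL, target) pairs for each column.
        result.append(("additional",) + tuple(v for col in row for v in (None, col)))
    return result


for line in table_diff(src_rows, target_rows):
    print(line)
```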