# DQ Diagrams

## Arch

![img.png](arch2.png)

## Entities

### DQMetric
> Represents a generic data quality metric used to assess various aspects of data quality (quantitative).
- **DQCompletenessMetric**
> Measures the completeness of data, ensuring that all required data is present.
- **DQCOUNTMetric**
> A specific completeness metric that counts the number of non-missing values in a dataset.
- **DQNULLPERCENTAGEMetric**
> A specific completeness metric that measures the percentage of null values in a dataset.
- **DQAccuracyMetric**
> Measures the accuracy of data, ensuring that data values are correct and conform to a known standard.
- **DQNULLMetric**
> An accuracy metric that counts the number of NULL values in a dataset.
- **DQPROFILEMetric**
> Measures the profile of data, such as max, min, median, avg, and stddev.
- **DQMAXMetric**
> A profile metric that computes the maximum of the values in a dataset.
- **DQMINMetric**
> A profile metric that computes the minimum of the values in a dataset.
- **DQMEDIANMetric**
> A profile metric that computes the median of the values in a dataset.
- **DQAVGMetric**
> A profile metric that computes the average of the values in a dataset.
- **DQSTDDEVMetric**
> A profile metric that computes the standard deviation of the values in a dataset.
- **DQTOPKMetric**
> A profile metric that lists the top-k most frequent values in a dataset.

- **DQUniquenessMetric**
> Measures the uniqueness of data, ensuring that there are no duplicate records.
- **DQDISTINCTCOUNTMetric**
> A specific uniqueness metric that identifies and counts unique records in a dataset.
- **DQFreshnessMetric**
> Measures the freshness of data, ensuring that the data is up-to-date.
- **DQTTUMetric (Time to Usable)**
> A freshness metric that measures the time taken for data to become usable after it is created or updated.
- **DQDiffMetric**
> Compares data across different datasets or points in time to identify discrepancies.
- **DQTableDiffMetric**
> A specific diff metric that compares entire tables to identify differences.
- **DQFileDiffMetric**
> A specific diff metric that compares files to identify differences.
- **MetricStorageService**
> A service for storing and fetching data quality metrics.

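To make the taxonomy above concrete, here is a minimal Python sketch of how the metric hierarchy could be modeled. The class, field, and method names are illustrative assumptions for this document, not the actual Griffin API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DQResult:
    """A single recorded metric value for a target dataset."""
    metric_name: str
    target: str
    value: float


class DQMetric(ABC):
    """Generic, quantitative data quality metric."""

    def __init__(self, name: str, target: str):
        self.name = name
        self.target = target

    @abstractmethod
    def compute(self, values: Sequence[Optional[float]]) -> DQResult:
        ...


class DQCountMetric(DQMetric):
    """Completeness: number of non-missing values."""

    def compute(self, values):
        non_missing = sum(1 for v in values if v is not None)
        return DQResult(self.name, self.target, float(non_missing))


class DQNullPercentageMetric(DQMetric):
    """Completeness: percentage of NULL values."""

    def compute(self, values):
        nulls = sum(1 for v in values if v is None)
        pct = 100.0 * nulls / len(values) if values else 0.0
        return DQResult(self.name, self.target, pct)


# Example: compute two completeness metrics over a small column sample.
sample = [1.0, None, 3.0, None, 5.0]
print(DQCountMetric("count_of_users", "user_table").compute(sample))
print(DQNullPercentageMetric("null_pct_users", "user_table").compute(sample))
```
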

- **DQJob**
> Abstract data-quality-related job.
- **MetricCollectingJob**
> A job that collects data quality metrics from various sources and stores them for analysis.
- **DQCheckJob**
> A job that performs data quality checks based on predefined rules and metrics.
- **DQAlertJob**
> A job that generates alerts when data quality issues are detected.
- **DQDag**
> A directed acyclic graph that defines the dependencies and execution order of various data quality jobs.
- **Scheduler**
> A system that schedules and manages the execution of data quality jobs.
> This is the default scheduler; it launches data quality jobs periodically.
- **DolphinSchedulerAdapter**
> Connects our planned data quality jobs with Apache DolphinScheduler,
> allowing data quality jobs to be triggered upon completion of the upstream jobs they depend on.
- **AirflowSchedulerAdapter**
> Connects our planned data quality jobs with Apache Airflow,
> so that data quality jobs can be triggered upon completion of the upstream jobs they depend on.
- **Worker**
> Open question: do we need another worker layer, given that most of the work is done on the big data side?
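
A minimal, self-contained sketch of how DQJob, DQDag, and a scheduler adapter could relate; the interfaces below are illustrative assumptions rather than the existing implementation.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence


class DQJob(ABC):
    """Abstract data quality job, executed as one node of a DQDag."""

    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def run(self) -> None:
        ...


class DQDag:
    """Directed acyclic graph of DQJobs; edges point from upstream to downstream."""

    def __init__(self):
        self.jobs: Dict[str, DQJob] = {}
        self.downstream: Dict[str, List[str]] = {}

    def add(self, job: DQJob, upstream: Sequence[str] = ()) -> None:
        self.jobs[job.name] = job
        self.downstream.setdefault(job.name, [])
        for up in upstream:
            self.downstream.setdefault(up, []).append(job.name)

    def topological_order(self) -> List[str]:
        # Kahn's algorithm: repeatedly emit jobs whose upstream jobs are all done.
        indegree = {name: 0 for name in self.jobs}
        for children in self.downstream.values():
            for child in children:
                indegree[child] += 1
        ready = [n for n, deg in indegree.items() if deg == 0]
        order: List[str] = []
        while ready:
            name = ready.pop()
            order.append(name)
            for child in self.downstream[name]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
        return order


class SchedulerAdapter(ABC):
    """Bridges a DQDag into an external scheduler (e.g. DolphinScheduler, Airflow)."""

    @abstractmethod
    def submit(self, dag: DQDag) -> None:
        ...


# Usage: wire the typical metric-collect -> check -> alert chain and run it in order.
class PrintJob(DQJob):
    def run(self) -> None:
        print("running", self.name)


dag = DQDag()
dag.add(PrintJob("metric_collecting_job"))
dag.add(PrintJob("dq_check_job"), upstream=["metric_collecting_job"])
dag.add(PrintJob("dq_alert_job"), upstream=["dq_check_job"])
for job_name in dag.topological_order():
    dag.jobs[job_name].run()
```
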
# Data Quality Tool

## Introduction

In the evolving landscape of data architecture, ensuring data quality remains a critical success factor for all companies.
Data architectures have progressed significantly over recent years, transitioning from relational databases and data
warehouses to data lakes, hybrid data lake and warehouse combinations, and modern lakehouses.

Despite these advancements, data quality issues persist and have become increasingly vital, especially in the era of AI
and data integration. Improving data quality is essential for all organizations, and maintaining it across various
environments requires a combination of people, processes, and technology.

To address these challenges, we will upgrade the data quality tool so that it can be easily adopted by any data organization.
This tool abstracts common data quality problems and integrates seamlessly with diverse data architectures.

## Data Quality Dimensions

1. **Accuracy** – Data should be error-free according to business needs.
2. **Consistency** – Data should not conflict with other values across data sets.
3. **Completeness** – Data should not be missing.
4. **Timeliness** – Data should be up-to-date within a limited time frame.
5. **Uniqueness** – Data should have no duplicates.
6. **Validity** – Data should conform to a specified format.

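If these dimensions are referenced programmatically (for example when tagging metrics or results), they could be encoded as a simple enumeration; this is a hypothetical sketch, not part of the existing codebase.

```python
from enum import Enum


class DQDimension(Enum):
    """The six data quality dimensions listed above."""
    ACCURACY = "accuracy"
    CONSISTENCY = "consistency"
    COMPLETENESS = "completeness"
    TIMELINESS = "timeliness"
    UNIQUENESS = "uniqueness"
    VALIDITY = "validity"
```
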
## Our New Architecture

Our new architecture consists of two primary layers: the Data Quality Layer and the Integration Layer.

### Data Quality Constraints Layer

This constraints layer abstracts the core concepts of the data quality lifecycle, focusing on:

- **Defining Specific Data Quality Constraints**:
  - **Metrics**: Establishing specific data quality metrics.
  - **Anomaly Detection**: Implementing methods for detecting anomalies.
  - **Actions**: Defining actions to be taken based on the data quality assessments.

- **Measuring Data Quality**:
  - Utilizing various connectors such as SQL, HTTP, and CMD to measure data quality across different systems.

- **Unifying Data Quality Results**:
  - Creating a standardized and structured view of data quality results across different dimensions to ensure a consistent understanding.

- **Flexible Data Quality Jobs**:
  - Designing data quality jobs within a generic, topological Directed Acyclic Graph (DAG) framework to facilitate easy plug-and-play functionality.

### Integration Layer

This layer provides a robust framework to enable users to integrate Griffin data quality pipelines seamlessly with their business processes. It includes:

- **Scheduler Integration**:
  - Ensuring seamless integration with typical schedulers for efficient pipeline execution.

- **Apache DolphinScheduler Integration**:
  - Facilitating effortless integration within the Java ecosystem to leverage Apache DolphinScheduler.

- **Apache Airflow Integration**:
  - Enabling smooth integration within the AI ecosystem using Apache Airflow.

This architecture aims to provide a comprehensive and flexible approach to managing data quality
and integrating it into the existing business workflows of data teams.

The enterprise job scheduling system can then launch optional data quality check pipelines after the usual data jobs finish,
and, based on the data quality result, trigger follow-up actions such as retrying a job or stopping downstream scheduling, acting as a circuit breaker.


### Data Quality Layer

#### Data Quality Constraints Definition

This concept has been thoroughly discussed in the original Apache Griffin design documents. Essentially, we aim to quantify
the data quality of a dataset based on the aforementioned dimensions. For example, to measure the count of records in a user
table, our data quality constraint definition could be:

**Simple Version:**

- **Metric**
  - Name: count_of_users
  - Target: user_table
  - Dimension: count
- **Anomaly Condition:** $metric <= 0
- **Post Action:** send alert

**Advanced Version:**

- **Metric**
  - Name: count_of_users
  - Target: user_table
  - Filters: city = 'shanghai' and event_date = '20240601'
  - Dimension: count
- **Anomaly Condition:** $metric <= 0
- **Post Action:** send alert

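As a sketch of how such a constraint definition could be expressed programmatically, the dataclass below mirrors the fields of the simple and advanced versions; the class itself and its field names are hypothetical illustrations, not an existing API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DQConstraint:
    """One data quality constraint: a metric, an anomaly condition, and a post action."""
    metric_name: str
    target: str
    dimension: str
    anomaly_condition: str          # e.g. "metric <= 0"
    post_action: str                # e.g. "send_alert"
    filters: Optional[str] = None   # e.g. "city = 'shanghai' and event_date = '20240601'"


# The "Advanced Version" constraint from above, expressed with this sketch.
count_of_users = DQConstraint(
    metric_name="count_of_users",
    target="user_table",
    dimension="count",
    anomaly_condition="metric <= 0",
    post_action="send_alert",
    filters="city = 'shanghai' and event_date = '20240601'",
)
```
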
#### Data Quality Pipelines (DAG)

We support several typical data quality pipelines:

**One Dataset Profiling Pipeline:**

```plaintext
recording_target_table_metric_job -> anomaly_condition_job -> post_action_job
```

**Dataset Diff Pipeline:**

```plaintext
recording_target_table1_metric_job ->
                                      \
                                       -> anomaly_condition_job -> post_action_job
                                      /
recording_target_table2_metric_job ->
```

**Compute Platform Migration Pipeline:**

```plaintext
run_job_on_platform_v1 -> recording_target_table_metric_job_on_v1 ->
                                                                     \
                                                                      -> anomaly_condition_job -> post_action_job
                                                                     /
run_job_on_platform_v2 -> recording_target_table_metric_job_on_v2 ->
```
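
For example, the dataset diff pipeline above is just a four-node DAG. A minimal standalone sketch of wiring and ordering it (job names taken from the diagram; the scheduling logic itself is illustrative):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each key lists the jobs it depends on (its upstream jobs).
dataset_diff_pipeline = {
    "recording_target_table1_metric_job": set(),
    "recording_target_table2_metric_job": set(),
    "anomaly_condition_job": {
        "recording_target_table1_metric_job",
        "recording_target_table2_metric_job",
    },
    "post_action_job": {"anomaly_condition_job"},
}

# A scheduler would launch jobs in this (or any other valid) topological order.
for job in TopologicalSorter(dataset_diff_pipeline).static_order():
    print("launch", job)
```
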

#### Data Quality Report

- **Meet Expectations**
  + Data Quality Constraint 1: Passed
  + Data Quality Constraint 2: Passed
- **Does Not Meet Expectations**
  + Data Quality Constraint 3: Failed
    - Violation details
    - Possible root cause
  + Data Quality Constraint 4: Failed
    - Violation details
    - Possible root cause

#### Connectors

The executor measures the data quality of the target dataset by recording the metrics. It supports many predefined protocols,
and customers can extend the executor protocols if they want to add their own business logic.

**Predefined Protocols:**

- MySQL: `jdbc:mysql://hostname:port/database_name?user=username&password=password`
- Presto: `jdbc:presto://hostname:port/catalog/schema`
- Trino: `jdbc:trino://hostname:port/catalog/schema`
- HTTP: `http://hostname:port/api/v1/query?query=<prometheus_query>`
- Docker

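A minimal sketch of what a connector interface might look like; the `Connector` protocol below and the SQLite stand-in used for the demo are illustrative assumptions rather than the actual executor API.

```python
import sqlite3
from abc import ABC, abstractmethod


class Connector(ABC):
    """Measures one metric on a target system and returns its numeric value."""

    @abstractmethod
    def measure(self, query: str) -> float:
        ...


class SqlConnector(Connector):
    """SQL-protocol connector; sqlite3 stands in here for a JDBC-style MySQL/Presto/Trino source."""

    def __init__(self, connection):
        self.connection = connection

    def measure(self, query: str) -> float:
        cursor = self.connection.execute(query)
        (value,) = cursor.fetchone()
        return float(value)


# Demo: record the count_of_users metric against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_table (id INTEGER, city TEXT)")
conn.executemany("INSERT INTO user_table VALUES (?, ?)", [(1, "shanghai"), (2, "beijing")])
connector = SqlConnector(conn)
print(connector.measure("SELECT COUNT(*) FROM user_table WHERE city = 'shanghai'"))
```
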
### Integration Layer

Every data team has its own existing scheduler.
While we provide a default scheduler, for greater adoption we will refactor
our Apache Griffin scheduler capabilities to leverage our customers' schedulers.
This involves redesigning our scheduler to either ingest job instances into our customers' schedulers
or bridge our DQ pipelines into their DAGs.

```plaintext
   biz_etl_phase    ||            data_quality_phase
                    ||
business_etl_job -> || recording_target_table1_metric_job ->
                    ||                                       \
                    ||                                        -> anomaly_condition_job -> post_action_job
                    ||                                       /
business_etl_job -> || recording_target_table2_metric_job ->
                    ||
```

- Integration with a generic scheduler
- Integration with Apache DolphinScheduler
- Integration with Apache Airflow (see the sketch below)
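
As one concrete possibility, a data quality check can be appended to a business ETL DAG in Apache Airflow roughly as follows (Airflow 2.4+ assumed); the task callables, metric value, and DAG id are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_business_etl():
    print("business ETL finished")  # placeholder for the real ETL task


def record_metric_and_check():
    metric = 42                     # placeholder: record count_of_users here
    if metric <= 0:                 # anomaly condition from the constraint definition
        raise ValueError("data quality check failed: count_of_users <= 0")


with DAG(
    dag_id="business_etl_with_dq_check",  # hypothetical DAG id
    start_date=datetime(2024, 6, 1),
    schedule=None,
    catchup=False,
) as dag:
    etl = PythonOperator(task_id="business_etl_job", python_callable=run_business_etl)
    dq_check = PythonOperator(task_id="dq_check_job", python_callable=record_metric_and_check)

    # The DQ check runs only after the business job succeeds; a failure here
    # stops downstream scheduling, acting as the circuit breaker described above.
    etl >> dq_check
```
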

# Two tables diff result set

We want to unify the result set for comparing two tables. When the two tables share the same schema,
we can construct the result set as below so that users can quickly find the differences between the two tables.

| diff_type  | col1_src | col1_target | col2_src  | col2_target | col3_src  | col3_target | col4_src   | col4_target |
|------------|----------|-------------|-----------|-------------|-----------|-------------|------------|-------------|
| missing    | prefix1  | NULL        | sug_vote1 | NULL        | pv_total1 | NULL        | 2024-01-01 | NULL        |
| additional | NULL     | prefix1     | NULL      | sug_vote2   | NULL      | pv_total2   | NULL       | 2024-01-01  |
| missing    | prefix3  | NULL        | sug_vote3 | NULL        | pv_total3 | NULL        | 2024-01-03 | NULL        |
| additional | NULL     | prefix4     | NULL      | sug_vote4   | NULL      | pv_total3   | NULL       | 2024-01-03  |
| missing    | prefix5  | NULL        | sug_vote5 | NULL        | pv_total5 | NULL        | 2024-01-05 | NULL        |
| additional | NULL     | prefix5     | NULL      | sug_vote5   | NULL      | pv_total6   | NULL       | 2024-01-05  |
| missing    | prefix7  | NULL        | sug_vote7 | NULL        | pv_total7 | NULL        | 2024-01-07 | NULL        |
| additional | NULL     | prefix8     | NULL      | sug_vote8   | NULL      | pv_total8   | NULL       | 2024-01-07  |
| missing    | prefix9  | NULL        | sug_vote9 | NULL        | pv_total9 | NULL        | 2024-01-09 | NULL        |
| additional | NULL     | prefix10    | NULL      | sug_vote10  | NULL      | pv_total10  | NULL       | 2024-01-09  |
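
A minimal sketch of how such a result set could be produced when both tables fit in memory; the set-difference approach and the sample rows are illustrative, and a real implementation would push this into the engine (for example as a FULL OUTER JOIN) instead.

```python
# Rows are tuples of (col1, col2, col3, col4); hypothetical sample data.
src_rows = {("prefix1", "sug_vote1", "pv_total1", "2024-01-01"),
            ("prefix3", "sug_vote3", "pv_total3", "2024-01-03")}
target_rows = {("prefix1", "sug_vote2", "pv_total2", "2024-01-01"),
               ("prefix4", "sug_vote4", "pv_total3", "2024-01-03")}


def table_diff(src, target):
    """Rows only in src are 'missing' from target; rows only in target are 'additional'."""
    result = []
    for row in sorted(src - target):
        # Interleave (src, NULL) pairs for each column, matching the table layout above.
        result.append(("missing",) + tuple(v for col in row for v in (col, None)))
    for row in sorted(target - src):
        # Interleave (NULL, target) pairs for each column.
        result.append(("additional",) + tuple(v for col in row for v in (None, col)))
    return result


for line in table_diff(src_rows, target_rows):
    print(line)
```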