Griffin 2.0.0 arch (#654)
* new proposal for data quality tool

* re-polish

* enterprise job scheduler might retry or stop downstream scheduling based on the standardized result

* illustrate data quality etl phase, append after business phase

* elaborate on integrating with the business workflow

* init DQDiagrams

* add metric storage service in DQDiagrams

* triggered on demand

* add more metrics

* update arch diagram

* init two table diff result set

* data platform upgrades data quality checking pipeline

* typo

* update it as data quality constraints
guoyuepeng authored Jul 3, 2024
1 parent 51bb5ac commit e293406
Showing 4 changed files with 294 additions and 0 deletions.
101 changes: 101 additions & 0 deletions griffin-doc/DQDiagrams.md
@@ -0,0 +1,101 @@
# DQ Diagrams
## Arch
![Architecture diagram](arch2.png)

## Entities

### DQMetric
> Represents a generic data quality metric used to assess various aspects of data quality (quantitative).
- **DQCompletenessMetric**
  > Measures the completeness of data, ensuring that all required data is present.
  - **DQCOUNTMetric**
    > A specific completeness metric that counts the number of non-missing values in a dataset.
  - **DQNULLPERCENTAGEMetric**
    > A specific completeness metric that measures the percentage of null values in a dataset.
- **DQAccuracyMetric**
  > Measures the accuracy of data, ensuring that data values are correct and conform to a known standard.
  - **DQNULLMetric**
    > An accuracy metric that counts the number of NULL values in a dataset.
- **DQPROFILEMetric**
  > Measures the profile of data, such as max, min, median, avg, and stddev.
  - **DQMAXMetric**
    > A profile metric that computes the maximum of values in a dataset.
  - **DQMINMetric**
    > A profile metric that computes the minimum of values in a dataset.
  - **DQMEDIANMetric**
    > A profile metric that computes the median of values in a dataset.
  - **DQAVGMetric**
    > A profile metric that computes the average of values in a dataset.
  - **DQSTDDEVMetric**
    > A profile metric that computes the standard deviation of values in a dataset.
  - **DQTOPKMetric**
    > A profile metric that lists the top-k most frequent values in a dataset.
- **DQUniquenessMetric**
  > Measures the uniqueness of data, ensuring that there are no duplicate records.
  - **DQDISTINCTCOUNTMetric**
    > A specific uniqueness metric that identifies and counts unique records in a dataset.
- **DQFreshnessMetric**
  > Measures the freshness of data, ensuring that the data is up-to-date.
  - **DQTTUMetric (Time to Usable)**
    > A freshness metric that measures the time taken for data to become usable after it is created or updated.
- **DQDiffMetric**
  > Compares data across different datasets or points in time to identify discrepancies.
  - **DQTableDiffMetric**
    > A specific diff metric that compares entire tables to identify differences.
  - **DQFileDiffMetric**
    > A specific diff metric that compares files to identify differences.
- **MetricStorageService**
  > A service that stores and retrieves data quality metrics.

- **DQJob**
  > Abstract data-quality-related job.
  - **MetricCollectingJob**
    > A job that collects data quality metrics from various sources and stores them for analysis.
  - **DQCheckJob**
    > A job that performs data quality checks based on predefined rules and metrics.
  - **DQAlertJob**
    > A job that generates alerts when data quality issues are detected.
- **DQDag**
  > A directed acyclic graph that defines the dependencies and execution order of various data quality jobs.
- **Scheduler**
  > A system that schedules and manages the execution of data quality jobs.
  > This is the default scheduler; it launches data quality jobs periodically.
  - **DolphinSchedulerAdapter**
    > Connects our planned data quality jobs with Apache DolphinScheduler,
    > allowing data quality jobs to be triggered upon the completion of dependent upstream jobs.
  - **AirflowSchedulerAdapter**
    > Connects our planned data quality jobs with Apache Airflow,
    > so that data quality jobs can be triggered upon the completion of dependent upstream jobs.
- **Worker**
  > Open question: do we need a separate worker layer, given that most of the work is done on the big data side?
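
To make the entity model concrete, here is a minimal Java sketch of a few of these entities. It is an illustration only: the class, interface, and method names mirror the list above, but the signatures are assumptions, not an existing Griffin API.

```java
// Minimal sketch of the entities above, assuming a plain-Java object model.
abstract class DQMetric {
    protected final String targetDataset;   // e.g. a table or file name

    protected DQMetric(String targetDataset) {
        this.targetDataset = targetDataset;
    }

    /** Computes and returns the metric value for the target dataset. */
    abstract double measure();
}

/** Completeness: counts non-missing values in one column of the target dataset. */
class DQCountMetric extends DQMetric {
    private final String column;

    DQCountMetric(String targetDataset, String column) {
        super(targetDataset);
        this.column = column;
    }

    @Override
    double measure() {
        // A real connector would run e.g. "SELECT COUNT(column) FROM target" here.
        return 0;
    }
}

/** Stores and retrieves recorded metric values. */
interface MetricStorageService {
    void store(String metricName, double value);
    double fetchLatest(String metricName);
}

/** A node in a DQDag: every data quality job exposes a single run() step. */
interface DQJob {
    void run();
}

/** Collects a metric and hands the value to the storage service. */
class MetricCollectingJob implements DQJob {
    private final DQMetric metric;
    private final MetricStorageService storage;

    MetricCollectingJob(DQMetric metric, MetricStorageService storage) {
        this.metric = metric;
        this.storage = storage;
    }

    @Override
    public void run() {
        storage.store(metric.getClass().getSimpleName(), metric.measure());
    }
}
```
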
177 changes: 177 additions & 0 deletions griffin-doc/DataQualityTool.md
@@ -0,0 +1,177 @@
# Data Quality Tool

## Introduction

In the evolving landscape of data architecture, ensuring data quality remains a critical success factor for all companies.
Data architectures have progressed significantly over recent years, transitioning from relational databases and data
warehouses to data lakes, hybrid data lake and warehouse combinations, and modern lakehouses.

Despite these advancements, data quality issues persist and have become increasingly vital, especially in the era of AI
and data integration. Improving data quality is essential for all organizations, and maintaining it across various
environments requires a combination of people, processes, and technology.

To address these challenges, we will upgrade our data quality tool so that it can be easily adopted by any data
organization. The tool abstracts common data quality problems and integrates seamlessly with diverse data architectures.

## Data Quality Dimensions

1. **Accuracy** – Data should be error-free, as defined by business needs.
2. **Consistency** – Data should not conflict with other values across data sets.
3. **Completeness** – Data should not be missing.
4. **Timeliness** – Data should be up-to-date within a defined time frame.
5. **Uniqueness** – Data should have no duplicates.
6. **Validity** – Data should conform to a specified format.

## Our new Architecture

Our new architecture consists of two primary layers: the Data Quality Constraints Layer and the Integration Layer.

### Data Quality Constraints Layer

This constraints layer abstracts the core concepts of the data quality lifecycle, focusing on:

- **Defining Specific Data Quality Constraints**:
- **Metrics**: Establishing specific data quality metrics.
- **Anomaly Detection**: Implementing methods for detecting anomalies.
- **Actions**: Defining actions to be taken based on the data quality assessments.

- **Measuring Data Quality**:
- Utilizing various connectors such as SQL, HTTP, and CMD to measure data quality across different systems.

- **Unifying Data Quality Results**:
- Creating a standardized and structured view of data quality results across different dimensions to ensure a consistent understanding.

- **Flexible Data Quality Jobs**:
- Designing data quality jobs within a generic, topological Directed Acyclic Graph (DAG) framework to facilitate easy plug-and-play functionality.

### Integration Layer

This layer provides a robust framework to enable users to integrate Griffin data quality pipelines seamlessly with their business processes. It includes:

- **Scheduler Integration**:
- Ensuring seamless integration with typical schedulers for efficient pipeline execution.

- **Apache DolphinScheduler Integration**:
- Facilitating effortless integration within the Java ecosystem to leverage Apache DolphinScheduler.

- **Apache Airflow Integration**:
- Enabling smooth integration within the AI ecosystem using Apache Airflow.

This architecture aims to provide a comprehensive and flexible approach to managing data quality
and integrating it into the existing business workflows of data teams.

With this in place, an enterprise job scheduling system can launch optional data quality check pipelines after the usual data jobs finish
and, based on the data quality results, trigger actions such as retrying jobs or stopping downstream scheduling, acting like a circuit breaker.

### Data Quality Layer

#### Data Quality Constraints Definition

This concept has been thoroughly discussed in the original Apache Griffin design documents. Essentially, we aim to quantify
the data quality of a dataset based on the aforementioned dimensions. For example, to measure the count of records in a user
table, our data quality constraint definition could be:

**Simple Version:**

- **Metric**
- Name: count_of_users
- Target: user_table
- Dimension: count
- **Anomaly Condition:** $metric <= 0
- **Post Action:** send alert

**Advanced Version:**

- **Metric**
- Name: count_of_users
- Target: user_table
- Filters: city = 'shanghai' and event_date = '20240601'
- Dimension: count
- **Anomaly Condition:** $metric <= 0
- **Post Action:** send alert
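
As a sketch of how such a constraint definition might be represented in code, assuming a plain-Java model (the class and field names below are hypothetical, not part of the current Griffin codebase):

```java
import java.util.function.DoublePredicate;

// Hypothetical encoding of the "advanced version" constraint above.
class DQConstraint {
    String metricName = "count_of_users";
    String target     = "user_table";
    String filters    = "city = 'shanghai' and event_date = '20240601'";
    String dimension  = "count";

    // Anomaly condition: $metric <= 0
    DoublePredicate anomalyCondition = metric -> metric <= 0;

    // Post action: send alert (stubbed as a log line in this sketch)
    Runnable postAction = () -> System.out.println("ALERT: anomaly for count_of_users on user_table");

    /** Applies the anomaly condition to a recorded metric value and fires the post action. */
    void evaluate(double metricValue) {
        if (anomalyCondition.test(metricValue)) {
            postAction.run();
        }
    }
}
```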

#### Data Quality Pipelines (DAG)

We support several typical data quality pipelines:

**One Dataset Profiling Pipeline:**

```plaintext
recording_target_table_metric_job -> anomaly_condition_job -> post_action_job
```

**Dataset Diff Pipeline:**

```plaintext
recording_target_table1_metric_job ->
                                      \
                                       -> anomaly_condition_job -> post_action_job
                                      /
recording_target_table2_metric_job ->
```

**Compute Platform Migration Pipeline:**

```plaintext
run_job_on_platform_v1 -> recording_target_table_metric_job_on_v1 ->
                                                                     \
                                                                      -> anomaly_condition_job -> post_action_job
                                                                     /
run_job_on_platform_v2 -> recording_target_table_metric_job_on_v2 ->
```
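
A sketch of how the Dataset Diff Pipeline above could be wired as a DAG of jobs. The DQDag API shown here (addJob/addEdge/run) is an assumption for illustration, not an existing interface.

```java
// Hypothetical wiring of the "Dataset Diff Pipeline" as a DQDag.
class DatasetDiffPipeline {
    static void build(DQDag dag) {
        DQJob recordTable1 = dag.addJob("recording_target_table1_metric_job");
        DQJob recordTable2 = dag.addJob("recording_target_table2_metric_job");
        DQJob anomalyCheck = dag.addJob("anomaly_condition_job");
        DQJob postAction   = dag.addJob("post_action_job");

        // Both metric-recording jobs must complete before the anomaly check runs.
        dag.addEdge(recordTable1, anomalyCheck);
        dag.addEdge(recordTable2, anomalyCheck);
        dag.addEdge(anomalyCheck, postAction);

        dag.run();  // executes the jobs in topological order
    }
}
```
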
#### Data Quality Report

- **Meet Expectations**
  + Data Quality Constraint 1: Passed
  + Data Quality Constraint 2: Passed
- **Does Not Meet Expectations**
  + Data Quality Constraint 3: Failed
    - Violation details
    - Possible root cause
  + Data Quality Constraint 4: Failed
    - Violation details
    - Possible root cause

#### Connectors

The executor measures the data quality of the target dataset by recording the metrics. It supports many predefined protocols,
and customers can extend the executor protocol if they want to add their own business logic.

**Predefined Protocols:**

- MySQL: `jdbc:mysql://hostname:port/database_name?user=username&password=password`
- Presto: `jdbc:presto://hostname:port/catalog/schema`
- Trino: `jdbc:trino://hostname:port/catalog/schema`
- HTTP: `http://hostname:port/api/v1/query?query=<prometheus_query>`
- Docker
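
For example, recording a count metric over the MySQL protocol can be done with plain JDBC. The class below is an illustrative sketch, not part of the existing code; only the java.sql calls are standard.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

class MySqlCountProbe {
    /**
     * Records a COUNT metric for the given table via plain JDBC.
     * The table name is assumed to come from trusted configuration, not user input.
     */
    static long countRows(String jdbcUrl, String table) throws Exception {
        // e.g. "jdbc:mysql://hostname:3306/database_name?user=username&password=password"
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM " + table)) {
            rs.next();
            return rs.getLong(1);
        }
    }
}
```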

### Integration Layer

Every data team has its own existing scheduler.
While we provide a default scheduler, for broader adoption we will refactor
Apache Griffin's scheduling capabilities to work with our customers' schedulers.
This involves redesigning our scheduler to either ingest job instances into our customers' schedulers
or bridge our DQ pipelines into their DAGs.

```plaintext
   biz_etl_phase    ||           data_quality_phase
                    ||
business_etl_job ---||-> recording_target_table1_metric_job - ->
                    ||                                           \
                    ||                                            -> anomaly_condition_job -> post_action_job
                    ||                                           /
business_etl_job ---||-> recording_target_table2_metric_job - ->
                    ||
```

- Integration with a generic scheduler

- Integration with Apache DolphinScheduler

- Integration with Apache Airflow
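
A minimal sketch of what a common adapter contract might look like; the interface and method names are assumptions for illustration, echoing the DolphinSchedulerAdapter and AirflowSchedulerAdapter entities in DQDiagrams.md.

```java
// Illustrative contract for bridging DQ pipelines into an external scheduler.
interface SchedulerAdapter {
    /** Registers a DQ pipeline so that it runs after the given upstream business job. */
    void attachAfter(String upstreamJobId, DQDag dqPipeline);

    /** Invoked (or polled) when the external scheduler reports the upstream job finished. */
    void onUpstreamComplete(String upstreamJobId);
}
```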






16 changes: 16 additions & 0 deletions griffin-doc/TwoTablesDiffResult.md
@@ -0,0 +1,16 @@
# Two tables diff result set
We want to unify the result set for two-table comparison. When the two tables' schemas are the same,
we can construct a result set as below to let our users quickly find the differences between the two tables.

| diff_type | col1_src | col1_target | col2_src | col2_target | col3_src | col3_target | col4_src | col4_target |
|------------|--------------|-------------|-----------|-------------|-----------|-------------|------------|-------------|
| missing | prefix1 | NULL | sug_vote1 | NULL | pv_total1 | NULL | 2024-01-01 | NULL |
| additional | NULL | prefix1 | NULL | sug_vote2 | NULL | pv_total2 | NULL | 2024-01-01 |
| missing | prefix3 | NULL | sug_vote3 | NULL | pv_total3 | NULL | 2024-01-03 | NULL |
| additional | NULL | prefix4 | NULL | sug_vote4 | NULL | pv_total3 | NULL | 2024-01-03 |
| missing | prefix5 | NULL | sug_vote5 | NULL | pv_total5 | NULL | 2024-01-05 | NULL |
| additional | NULL | prefix5 | NULL | sug_vote5 | NULL | pv_total6 | NULL | 2024-01-05 |
| missing | prefix7 | NULL | sug_vote7 | NULL | pv_total7 | NULL | 2024-01-07 | NULL |
| additional | NULL | prefix8 | NULL | sug_vote8 | NULL | pv_total8 | NULL | 2024-01-07 |
| missing | prefix9 | NULL | sug_vote9 | NULL | pv_total9 | NULL | 2024-01-09 | NULL |
| additional | NULL | prefix10 | NULL | sug_vote10 | NULL | pv_total10 | NULL | 2024-01-09 |
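
One way to construct this result set, assuming both tables share the same schema, is a pair of EXCEPT queries, sketched below as a Java text block. The table and column names are placeholders, and EXCEPT support varies by engine (Trino and Presto support it; MySQL added it in 8.0.31); this is a sketch, not the shipped implementation.

```java
class TwoTableDiff {
    // Sketch: rows present only in the source table are "missing" from the target;
    // rows present only in the target table are "additional".
    static final String DIFF_QUERY = """
        SELECT 'missing' AS diff_type,
               s.col1 AS col1_src, NULL AS col1_target,
               s.col2 AS col2_src, NULL AS col2_target,
               s.col3 AS col3_src, NULL AS col3_target,
               s.col4 AS col4_src, NULL AS col4_target
        FROM (SELECT * FROM src_table EXCEPT SELECT * FROM target_table) s
        UNION ALL
        SELECT 'additional',
               NULL, t.col1, NULL, t.col2, NULL, t.col3, NULL, t.col4
        FROM (SELECT * FROM target_table EXCEPT SELECT * FROM src_table) t
        """;
}
```
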
Binary file added griffin-doc/arch2.png
