Skip to content

dogucanelci/dogucanelci-GCP_Retail_Airflow_Data_Engineering_Project

Repository files navigation

alt text

Retail Data End to End Data Engineering Project on Google Cloud Platform Orchestrated by Airflow

Dataset

The "Online Retail II" dataset contains transactions from a UK-based online retail company. The data covers the period from 01/12/2009 to 09/12/2011 and includes sales of unique all-occasion giftware. The customer base primarily consists of wholesalers. Dataset Link: Online Retail Dataset

📝 Table of Contents

  1. Project Overview
  2. Project and DAG Architecture
    2.1. Installation and Initialization
    2.2. Data Ingestion
    2.3. Data Loading
    2.4. Data Quality Check (Raw Data)
    2.5. Data Transformation and Modeling
    2.6. Data Quality Check (Transformed Data)
    2.7. Data Transformation (Dimensional to Multi-Dimensional)
    2.8. Data Quality Check (Top Layer Data)
    2.9. Data Reporting
  3. Credits
  4. Contact

🔬 Project Overview

This project is an end-to-end Data Engineering project orchestrated by Airflow, using multiple technologies, and applied on the Google Cloud platform. Online Retail Dataset is used for this project.
In first step, Astro CLI and its dependicies are installed. Astro CLI is a open source commandline interface tool which is used for install, initialize and use airflow in local so easily. It also build file and folder template which is necessary for airflow usage. It creates a container to setup airflow, deploy and run defined dags in the container by using Docker.
In next step,retail dataset is pushed to Google Bucket Storage from defined container environment(local env). Then it is loaded into Google BigQuery Data Warehouse.
In next step, Data Quality check is applied by open source tool SODA. It has useful logging and alert functions to detect any inconsistencies during ETL processing of data with so basic yaml file definitions. After that, Raw Data is investigated and Data Modeling is applied by DBT, which is easy to implement, capable to high-lineage an open source tool used for transformation process of ETL. In this part, fact and dim tables are created into BigQuery Data Warehouse.
After data quality check by SODA, top layer, multi dimensional tables are created by transformation of dbt and last part, Retail Analytic Report is created in Google Looker Studio by using top layer transformed data. Each step which are described above represents a task in Retail DAG. You can see DAG structure and all tasks in detail in below.

📝 Project Architecture

You can find the detailed information on the diagram below:

alt text


Retail Dag Architecture:

alt text


📤 Installation and Initialization

  • Astro CLI and its prerequisites are installed and Astro project is initialized.
  • Astro project image is started and airflow server is published in localhost server.
  • Retail dataset is copied into local 'include' file in retail astro project which is also copied into project container running airflow app.
  • All initialization steps in Google Cloud Platform side is completed.(create new project, bucket, create service account json file with bigquery,storage admin role to access and connect GCP)
  • Requirements are installed.

alt text

📤 Data Ingestion

  • retail.py file is created for Retail DAG implementation.
  • Retail dataset is pushed to Google Bucket Storage from defined container environment(local env) by new task implementation in retail1 DAG as shown below:

alt text

⚙️ Data Loading

  • Raw data pushed into Bucket Storage is loaded into Google BigQuery Data Warehouse by implementing new 2 task in retail1 DAG.
  • First task create an empty table with defined schema informations.
  • Second task apply loading process.

alt text alt text

📥 Data Quality Check(for Raw Data)

  • Data Quality check is applied by open source tool SODA for raw data just loaded in BigQuery by creating a new task in retail1 DAG.

alt text

  • This check consists basic steps like data type and column name.

alt text

📊 Data Modeling and Transformation

  • Data Modeling design is done and transformation scripts are implemented which is used by dbt to create new dimensional model in BigQuery.

alt text

alt text

📊 Data Quality Check(for fact and dim tables)

  • Data Quality check is applied by open source tool SODA for dimensional model data just transformed in BigQuery by creating a new task in retail1 DAG.

alt text

  • This check consists detailed steps like:
    • All weekdays are in range 0-6
    • All datetimes,customers,products are unique
    • Price,total amounts must be greater than zero etc.

alt text

📊 Data Transformation

  • Data transformation scripts are implemented to create multi-dimensional top layer tables to feed the report directly by dbt tool.

alt text

  • Process is implemented in retail1 DAG as task.

alt text

📊 Data Quality Check(for top-layer tables)

  • Data Quality check is applied by open source tool SODA for multi-dimensional model data by creating a new task in retail1 DAG.

alt text

  • This check consists detailed steps like:
    • All customers have a country
    • All products have a stock code etc.

📊 Data Reporting

  • BigQuery is connected with Looker Studio BI , and used the Views of the DB to create interactive and insightful data visualizations.

alt text

🛠️ Technologies Used

  • Data Source: Google Cloud Bucket Storage
  • Orchestration: Airflow, Astro CLI, Docker
  • ETL Process: Python
  • Transformation: dbt tool
  • Data Quality Check: SODA
  • Date Warehousing: Bigquery, t-SQL
  • Storage: Google Cloud Bucket Storage
  • Data Visualization: Looker Studio

📋 Credits

📨 Contact Me

LinkedIn Website Gmail