In this project, I developed a complete data pipeline, guiding data from collection to insights through a series of structured stages.
- Extracted the data, cleaned it, and loaded it into Google Cloud Storage for further processing (a minimal upload sketch follows this list).
- Transformed and modeled the data in Python using fact and dimensional data modeling concepts.
- Orchestrated the ETL pipeline on Mage AI and loaded the transformed data into Google BigQuery.
- Developed a dashboard on Looker Studio.
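As an illustration of the load step above, here is a minimal sketch of uploading the cleaned CSV to Google Cloud Storage with the google-cloud-storage client; the bucket and file names are placeholders, not the project's actual values.

```python
from google.cloud import storage  # pip install google-cloud-storage


def upload_to_gcs(bucket_name: str, source_file: str, destination_blob: str) -> None:
    """Upload a local file to a Google Cloud Storage bucket."""
    client = storage.Client()  # uses application-default credentials
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(destination_blob)
    blob.upload_from_filename(source_file)


# Placeholder names -- substitute the actual bucket and file used in the project.
upload_to_gcs("supply-chain-data-bucket", "cleaned_data.csv", "raw/cleaned_data.csv")
```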
I used the dataset from DataCo Global. The data description is [here](data\Data Dictionary.pdf). It consists of 52 features in areas of provisioning, production, sales, and commercial distribution.
- Language: Python, SQL
- Extraction and transformation: Jupyter Notebook, Google BigQuery
- Storage: Google Cloud Storage
- Orchestration: Mage AI
- Dashboard: Looker Studio
- Step 1: Cleaning and transformation - sc_data.ipynb
- Step 2: Storage
- Step 3: ETL, Orchestration - Mage: Extract, Transform, Load
- Step 4: Analytics - SQL script
- Step 5: Dashboard
The datasets are designed using fact and dimension data modeling principles.
In this step, I loaded the CSV file in VS Code and carried out data cleaning and transformation before organizing the data into fact and dimension tables.
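A minimal sketch of the kind of fact/dimension split done in the notebook, assuming pandas and illustrative column names; the actual schema is defined in the data dictionary.

```python
import pandas as pd

# File and column names are illustrative; see the data dictionary for the actual schema.
df = pd.read_csv("DataCoSupplyChainDataset.csv", encoding="latin-1")

# One row per customer forms a dimension table.
dim_customer = (
    df[["Customer Id", "Customer Segment", "Customer City", "Customer Country"]]
    .drop_duplicates(subset="Customer Id")
    .reset_index(drop=True)
)

# One row per order item forms the fact table, keyed to the dimension tables.
fact_orders = df[["Order Id", "Customer Id", "Order Item Quantity", "Sales"]].copy()
```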
- Launched the VM instance, connected over SSH, and installed the required libraries, including Mage AI.
- Carried out the ETL by accessing the Mage AI UI at the instance's external IP address and the Mage AI port.
- Created a new pipeline with extract, transform, and load stages (one block is sketched below).
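For illustration, a minimal sketch of one transformer block in the pipeline, following Mage AI's standard block template; the transformation body and column names are assumptions, not the project's actual code.

```python
import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def transform(data, *args, **kwargs):
    # 'data' is the output of the upstream loader block (assumed here to be tabular).
    df = pd.DataFrame(data)

    # Illustrative split into one dimension and one fact table; column names are assumptions.
    dim_customer = df[["Customer Id", "Customer Segment"]].drop_duplicates().reset_index(drop=True)
    fact_orders = df[["Order Id", "Customer Id", "Sales"]]

    # Return a dict so the downstream exporter can write each table separately.
    return {
        "dim_customer": dim_customer.to_dict(orient="records"),
        "fact_orders": fact_orders.to_dict(orient="records"),
    }
```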
After running the pipeline in Mage, the fact and dimension tables were generated in Google BigQuery. I performed a few queries and created a sales analytics table. SQL queries here.
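A minimal sketch, assuming the google-cloud-bigquery client and placeholder project, dataset, and table names, of running one such analytics query from Python:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Illustrative query: total sales by market, joining the fact table to a dimension.
# Project, dataset, and table names are placeholders, not the project's actual identifiers.
query = """
    SELECT d.market, SUM(f.sales) AS total_sales
    FROM `my-project.supply_chain.fact_orders` AS f
    JOIN `my-project.supply_chain.dim_market` AS d
      ON f.market_id = d.market_id
    GROUP BY d.market
    ORDER BY total_sales DESC
"""

results = client.query(query).to_dataframe()
print(results.head())
```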
After completing the analysis, I loaded the relevant tables into Looker Studio and created a dashboard. Dashboard here.