On-prem DB to Azure Cloud Pipeline with Data Factory, Data Lake Storage, Spark, Databricks, Synapse, and Power BI
- Project Overview
- Project Architecture
  - 2.1. Data Ingestion
  - 2.2. Data Transformation
  - 2.3. Data Loading
  - 2.4. Data Reporting
- Credits
- Contact
This project is an end-to-end data engineering pipeline built on the Azure cloud. Azure Data Factory ingests raw data from an on-premise SQL Server database into the bronze layer of Azure Data Lake Storage. Azure Databricks then transforms the data with Spark, writing the results to the silver layer; the gold layer holds the fully cleansed data, which is loaded into an Azure Synapse serverless SQL database and visualized in a Power BI report. Azure Active Directory (AAD) and Azure Key Vault are used for monitoring and governance.
You can find detailed information in the diagram below:
- Connected the on-premise SQL Server to Azure using a self-hosted Microsoft Integration Runtime.
- Set up the resource group with the required services (Key Vault, Storage Account, Data Factory, Databricks, Synapse Analytics).
- Migrated the tables from the on-premise SQL Server to Azure Data Lake Storage Gen2 (see the table-enumeration sketch after this list).
- Mounted Azure Blob Storage to Databricks to retrieve raw data from the Data Lake (see the mount sketch below).
- Used a Spark cluster in Azure Databricks to clean and refine the raw data (see the transformation sketch below).
- Saved the cleaned data in Delta format, optimized for further analysis.
- Used Azure Synapse Analytics to load the refined data efficiently.
- Created a serverless SQL database and connected it to the data lake (see the view sketch below).
- Connected Microsoft Power BI to Azure Synapse and used the database views to create interactive, insightful data visualizations.
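The ingestion pipeline first enumerates the source tables before copying them one by one. Below is a minimal Python sketch of the metadata query a Data Factory Lookup activity could run against the on-premise server; the server, database, and connection settings are hypothetical placeholders.

```python
# Sketch: enumerate user tables on the on-premise SQL Server.
# Server and database names are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=AdventureWorksLT2017;"  # hypothetical source DB
    "Trusted_Connection=yes;"
)

# List every user table with its schema; Data Factory iterates over this
# result with a ForEach activity and copies each table to the bronze layer.
query = """
SELECT s.name AS SchemaName, t.name AS TableName
FROM sys.tables t
JOIN sys.schemas s ON t.schema_id = s.schema_id;
"""

for schema_name, table_name in conn.cursor().execute(query):
    print(f"{schema_name}.{table_name}")
```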
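A minimal sketch of the Databricks mount, assuming a Key Vault-backed secret scope named `kv-scope` and a storage account called `medallionstorageacct` (both hypothetical); the storage access key is pulled from Key Vault rather than hard-coded. Note that `dbutils` is only available inside a Databricks notebook.

```python
# Sketch: mount the bronze container into the Databricks file system.
# Scope, key, account, and container names are hypothetical placeholders.
storage_account = "medallionstorageacct"  # hypothetical
container = "bronze"

# The access key is resolved through a Key Vault-backed secret scope.
access_key = dbutils.secrets.get(scope="kv-scope", key="storage-access-key")

dbutils.fs.mount(
    source=f"wasbs://{container}@{storage_account}.blob.core.windows.net/",
    mount_point=f"/mnt/{container}",
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net": access_key
    },
)
```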
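A minimal PySpark sketch of the bronze-to-silver cleanup, assuming the bronze layer holds Parquet files and an AdventureWorks-style `Customer` table with a `ModifiedDate` column (all hypothetical placeholders).

```python
# Sketch: clean a raw bronze table and persist it to the silver layer as Delta.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.getOrCreate()

# Read the raw data copied into the bronze layer by Data Factory.
df = spark.read.format("parquet").load("/mnt/bronze/SalesLT/Customer/")

# Example transformation: cast a string timestamp to a plain date.
cleaned = df.withColumn("ModifiedDate", to_date(col("ModifiedDate")))

# Delta format keeps the table ACID-compliant and queryable downstream.
cleaned.write.format("delta").mode("overwrite").save("/mnt/silver/SalesLT/Customer/")
```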
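Finally, a minimal sketch of exposing the gold layer to Power BI through a Synapse serverless view. The workspace endpoint, database name, and storage URL are hypothetical placeholders; `OPENROWSET` with `FORMAT = 'DELTA'` lets the serverless pool query the Delta files in place, so no data is copied into Synapse itself.

```python
# Sketch: create a serverless SQL view over the gold-layer Delta files.
# Endpoint, database, and storage URL are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"  # hypothetical endpoint
    "DATABASE=gold_db;Authentication=ActiveDirectoryInteractive;",
    autocommit=True,
)

# The view reads the Delta files directly; Power BI connects to it like
# any other SQL view.
conn.cursor().execute("""
CREATE OR ALTER VIEW dbo.Customer AS
SELECT * FROM OPENROWSET(
    BULK 'https://medallionstorageacct.dfs.core.windows.net/gold/SalesLT/Customer/',
    FORMAT = 'DELTA'
) AS result;
""")
```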
- Data Source: On-premise SQL Server
- Orchestration and Ingestion: Azure Data Factory
- Storage: Azure Data Lake Gen2
- Data Warehouse: Azure Synapse Analytics
- Authentication and Secrets Management: Azure Active Directory and Azure Key Vault
- Data Visualization: Power BI
- This project is inspired by a video from the YouTube channel "Mr. K Talks Tech".