
Utilisation of Azure Cloud Services to architect and orchestrate a data pipeline that performs ETL on a Formula 1 racing dataset extracted from the Ergast Developer API.

Overview | Tools | Architecture | ERD | Support | License

Overview

The Ergast Developer API is an experimental web service that provides a historical record of motor racing data for non-commercial purposes. The API provides data for the Formula One series, from the beginning of the world championships in 1950 until now.

This project demonstrates an end-to-end data pipeline built on Azure services. Data is extracted from the Ergast Developer API, and the pipeline relies on Azure Active Directory, a Service Principal, Azure Databricks, Key Vault, Azure Data Factory, and Azure Data Lake Storage Gen2. Within Azure Databricks, powered by Apache Spark, the data goes through the ETL (Extract, Transform, Load) process in three stages. The 'ingestion' folder handles the data as it is first received, the 'transformations' folder refines and enriches it, and the 'analysis' folder organizes and prepares it for analysis. Azure Data Factory orchestrates the whole pipeline.
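
As an illustration of the extraction step, the following is a minimal sketch of how paginated request URLs for the Ergast API could be built. The base URL, the `limit` and `offset` query parameters, and the helper names `build_ergast_url` and `page_offsets` are assumptions made for this example; they are not taken from this repository's scripts.

```python
from urllib.parse import urlencode

# Assumption: public Ergast endpoint; this README does not state the URL.
ERGAST_BASE_URL = "http://ergast.com/api/f1"


def build_ergast_url(resource, season=None, limit=30, offset=0):
    """Build a JSON request URL for a resource such as 'drivers' or 'results'.

    Hypothetical helper: optionally scopes the request to one season and
    adds the paging parameters as a query string.
    """
    parts = [ERGAST_BASE_URL]
    if season is not None:
        parts.append(str(season))
    parts.append(f"{resource}.json")
    query = urlencode({"limit": limit, "offset": offset})
    return "/".join(parts) + "?" + query


def page_offsets(total, limit):
    """Return the offsets needed to page through `total` records."""
    return list(range(0, total, limit))


if __name__ == "__main__":
    # Print the URLs that would cover 250 result rows, 100 per page.
    for offset in page_offsets(250, 100):
        print(build_ergast_url("results", season=2021, limit=100, offset=offset))
```

The sketch only constructs URLs and makes no network calls, so how the responses are downloaded and landed in the data lake is left open.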

The repository directory structure is as follows:

├── README.md              <- The top-level README for developers using this project.
│
├── Raw                    <- Contains script to define table schemas.
│
├── Transformations        <- Scripts to aggregate and transform data.
│
├── analysis               <- Basic analysis of data from the transformations folder.
│
├── include                <- Configuration folder.
│   ├── common_functions.py    <- Common functions used throughout the ETL process.
│   │
│   └── configuration.py       <- Houses configuration settings such as variables.
│
├── ingestion              <- Ingestion scripts for data files from ADLS Gen 2.
│
├── resources              <- Resources for readme file.
│
├── set-up                 <- Script for mounting ADLS Gen 2 to Databricks.
│
└── utils                  <- SQL scripts for incremental load.
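
For readers unfamiliar with the set-up step, here is a minimal sketch of how an ADLS Gen 2 container can be mounted to Databricks with a Service Principal. The function name `build_mount_spec`, the `/mnt/<storage account>/<container>` mount point convention, and all argument values are assumptions for illustration; the actual script in the set-up folder may differ.

```python
def build_mount_spec(storage_account, container, client_id, client_secret, tenant_id):
    """Return (source, mount_point, extra_configs) for an ADLS Gen 2 mount.

    Hypothetical helper: it only assembles the values, so it can be run and
    checked outside Databricks.
    """
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
    source = f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
    # Assumed naming convention for the mount point.
    mount_point = f"/mnt/{storage_account}/{container}"
    return source, mount_point, configs


# Inside a Databricks notebook (dbutils is only defined there), the values
# would typically come from a Key Vault-backed secret scope, for example:
#   client_id = dbutils.secrets.get(scope="<scope>", key="<key>")
#   source, mount_point, configs = build_mount_spec(...)
#   dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=configs)
```

Keeping the credentials in Key Vault and passing them in as arguments means no secret has to be hard-coded in the notebook.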

Tools

To build this project, the following tools were used:

  • Azure Databricks
  • Azure Key Vault
  • Azure Active Directory
  • Azure Data Lake Storage Gen2
  • Azure Data Factory
  • PySpark
  • SQL
  • Git

Architecture

The architecture of this project is inspired by the following reference, taken from the Azure Architecture Center.

ERD

The database structure is shown in the following ER Diagram and explained in the Database User Guide.

Support

If you have any doubts, queries, or suggestions, please connect with me on any of the following platforms:

LinkedIn | Gmail

License

CC BY-NC-SA (Creative Commons Attribution-NonCommercial-ShareAlike)

This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.