This project builds a data warehouse from scratch for incident data from San Francisco. There are two pipelines in this project:
- `full_load_pipeline`: loads all data into the data warehouse on the first load.
- `incremental_load_pipeline`: runs daily, loading new data into the data warehouse and managing changes in the data by implementing SCD Types 1 and 2.
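Purely as an illustration, the two DAG shells might be declared along these lines (the DAG IDs match the pipeline names above; the start date, schedules, and task bodies are assumptions, not the project's exact code):

```python
# Minimal sketch of the two DAG declarations (start date, schedules, and task bodies assumed).
from datetime import datetime

from airflow import DAG

# Runs once to backfill the warehouse with all historical incidents.
with DAG(
    dag_id="full_load_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@once",
    catchup=False,
) as full_load_dag:
    ...  # ingest -> cleanse -> stage -> load tasks

# Runs every day, appending new incidents and applying SCD Types 1 and 2.
with DAG(
    dag_id="incremental_load_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as incremental_load_dag:
    ...  # ingest -> cleanse -> stage -> merge (SCD 1 and 2) tasks
```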
- Staging Area: Amazon S3
- Data warehouse: Amazon Redshift
- Visualization: Power BI
- Orchestration: Apache Airflow with Docker
- Processing: Python, SQL
The architecture of this project is presented as follows:
- Data is sourced from the Socrata API and ingested into the raw zone of the Staging Area hosted on S3.
- Raw data is cleansed and standardized before moving to the cleansed zone (a sketch of this step follows below).
- Cleansed data is transformed into the data model used in the data warehouse and loaded into the stage zone. The data is then ready to move to the data warehouse.
- Reports are created in Power BI from the data in the data warehouse.
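To make the zone-to-zone flow concrete, the raw-to-cleansed step could look roughly like this (the bucket name, object keys, and column names are assumptions for the sketch):

```python
# Hypothetical sketch of the raw -> cleansed step (bucket, keys, and column names assumed).
import io

import boto3
import pandas as pd

BUCKET = "sf-incidents-staging"                          # assumed bucket name
RAW_KEY = "raw/incidents/2022-01-01.json"                # assumed raw-zone key
CLEANSED_KEY = "cleansed/incidents/2022-01-01.parquet"   # assumed cleansed-zone key

s3 = boto3.client("s3")

# Read one day's raw Socrata export from the raw zone.
raw_obj = s3.get_object(Bucket=BUCKET, Key=RAW_KEY)
df = pd.read_json(io.BytesIO(raw_obj["Body"].read()))

# Cleanse and standardize: parse timestamps, normalize text, drop duplicates.
df["incident_datetime"] = pd.to_datetime(df["incident_datetime"], errors="coerce")
df["incident_category"] = df["incident_category"].str.strip().str.title()
df = df.drop_duplicates(subset=["incident_id"])

# Write the standardized data to the cleansed zone as Parquet.
buffer = io.BytesIO()
df.to_parquet(buffer, index=False)
s3.put_object(Bucket=BUCKET, Key=CLEANSED_KEY, Body=buffer.getvalue())
```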
- The data warehouse schema is designed following the star schema model.
- To manage the many-to-many relation between incidents and incident categories, a bridge table is used.
- To manage changes in data, SCD Type 1 is applied to all tables, while SCD Type 2 is applied to the `dim_category` and `dim_intersection` tables.
- These SCD types are implemented using SQL (a sketch follows below).
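As a rough illustration of the SCD Type 2 logic (the staging table and column names here are assumptions, not the project's exact SQL), the merge on `dim_category` could be run against Redshift like this:

```python
# Hypothetical sketch of an SCD Type 2 merge on dim_category (table and column names assumed).
from airflow.providers.postgres.hooks.postgres import PostgresHook

# Step 1: close out current rows whose attributes changed in the newly staged data.
CLOSE_CHANGED_ROWS = """
UPDATE dim_category
SET valid_to = CURRENT_DATE, is_current = FALSE
FROM stage_category s
WHERE dim_category.category_key = s.category_key
  AND dim_category.is_current = TRUE
  AND dim_category.category_name <> s.category_name;
"""

# Step 2: insert a new current row for every key that no longer has a current version
# (covers both brand-new categories and the ones just closed in step 1).
INSERT_NEW_ROWS = """
INSERT INTO dim_category (category_key, category_name, valid_from, valid_to, is_current)
SELECT s.category_key, s.category_name, CURRENT_DATE, NULL, TRUE
FROM stage_category s
LEFT JOIN dim_category d
  ON d.category_key = s.category_key AND d.is_current = TRUE
WHERE d.category_key IS NULL;
"""

# Redshift speaks the Postgres wire protocol, so PostgresHook can execute both statements.
PostgresHook(postgres_conn_id="redshift_default").run([CLOSE_CHANGED_ROWS, INSERT_NEW_ROWS])
```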
- To manage the repetitive jobs of ingesting data from the source to the Staging Area and loading it into the data warehouse, I built two custom Airflow operators: `Socrata_to_S3` and `S3_to_Redshift` (a sketch of the first one appears after this list).
- Full load pipeline architecture
- Full load pipeline architecture (zoom in)
- Incremental load pipeline architecture
- Incremental load pipeline architecture (zoom in)
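For example, the `Socrata_to_S3` operator might be shaped roughly like this (the endpoint, parameter names, and connection IDs are illustrative assumptions, not the project's actual code):

```python
# Hypothetical sketch of a Socrata-to-S3 custom operator (endpoint, params, conn IDs assumed).
import json

import requests
from airflow.models import BaseOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


class SocrataToS3Operator(BaseOperator):
    """Pull one batch of incidents from the Socrata API and land it in the S3 raw zone."""

    # Allow the S3 key and query params to be templated with runtime variables like {{ ds }}.
    template_fields = ("s3_key", "query_params")

    def __init__(self, endpoint, s3_bucket, s3_key, query_params=None,
                 aws_conn_id="aws_default", **kwargs):
        super().__init__(**kwargs)
        self.endpoint = endpoint
        self.s3_bucket = s3_bucket
        self.s3_key = s3_key
        self.query_params = query_params or {}
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        # Fetch the raw records from the Socrata endpoint.
        response = requests.get(self.endpoint, params=self.query_params, timeout=60)
        response.raise_for_status()

        # Write the payload to the raw zone as a single JSON object per run.
        s3 = S3Hook(aws_conn_id=self.aws_conn_id)
        s3.load_string(
            string_data=json.dumps(response.json()),
            key=self.s3_key,
            bucket_name=self.s3_bucket,
            replace=True,
        )
```

In a DAG, the `s3_key` argument could then be templated with a runtime value such as `raw/incidents/{{ ds }}.json`, so the same operator serves both pipelines.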
Some dashboards created from the data in the data warehouse:
- Total report for the year 2022
- Daily report for each day
- Use Python to process data, especially datetime data types. The most-used libraries for data processing are Pandas and NumPy.
- Implement OOP in the project.
- Structure the files in the project.
- Write custom operators for repetitive jobs (code)
- Use runtime variables and template variables (a sketch appears after this list)
- Group tasks that belong to each stage of the pipeline so the DAG graph stays easy to read
- Connect to cloud services through Hooks
- Secure connections and other sensitive values using the Variable and Connection features
- Implement Airflow using Docker
- Components of Airflow
- Use S3 as the Staging Area
- Use Redshift as the data warehouse
- How Redshift works
- Connect to Redshift and create a dashboard
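Tying a few of these points together (runtime/template variables, task groups, and the Variable feature), a fragment of one of the DAGs might look roughly like this; the variable names and task bodies are assumptions:

```python
# Hypothetical DAG fragment showing a templated runtime variable, a TaskGroup, and Variable.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup

# Environment-specific values are kept in Airflow Variables rather than hard-coded.
S3_BUCKET = Variable.get("staging_bucket", default_var="sf-incidents-staging")


def cleanse(ds, **_):
    # `ds` is the templated execution date injected at runtime, e.g. "2022-01-01".
    print(f"Cleansing s3://{S3_BUCKET}/raw/incidents/{ds}.json")


with DAG(
    dag_id="incremental_load_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Grouping the staging-area tasks keeps the graph view compact and readable.
    with TaskGroup(group_id="staging") as staging:
        PythonOperator(task_id="cleanse_raw_data", python_callable=cleanse)
```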