👨🏻💻
- Author: @juansevargasc
- Dataset Source: 2022 U.S. Domestic Flights Departures, Kaggle
- Topics: Data Engineering, ETL, Data Analysis, Data Warehouse
- Environment
conda create --name <env> --file requirements.txt
# or
pip install -r requirements.txt
- Main python file
python src/main.py
This project explores features of U.S. flight departures in 2022 through the analysis of weather conditions, cancellations, dates, locations, and carriers, among others. First, however, an ETL pipeline preprocesses the different data sources and loads them into an OLAP database for BI consumption.
Objectives
- Extract data from different sources. In this case the data comes from 5 CSV files, but two of them are loaded into a relational database and another is converted to a JSON file to simulate different types of sources. See prework.
- Design a data schema that allows querying data for BI purposes.
- Create an ETL Pipeline.
- Clean data by choosing which NaN (empty) values should be dropped.
- Standardize names and establish naming conventions.
- Test and enforce data types and schemas.
- Build a star schema.
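The cleaning objectives above (dropping empty values, standardizing names, enforcing types) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual pipeline: the column names, the `SCHEMA` mapping, and the helpers `snake_case`, `is_missing`, and `clean` are all hypothetical.

```python
import re

# Hypothetical raw rows, shaped like the output of csv.DictReader.
raw_rows = [
    {"Flight Date": "2022-01-01", "Carrier Code": "AA", "Dep Delay (min)": "12"},
    {"Flight Date": "2022-01-01", "Carrier Code": "", "Dep Delay (min)": "5"},
    {"Flight Date": "2022-01-02", "Carrier Code": "DL", "Dep Delay (min)": "NaN"},
]

# Strings treated as "empty" values (an assumption for this sketch).
MISSING = {"", "nan", "null", "none"}

# Assumed target schema: column name -> (type caster, is the column required?)
SCHEMA = {
    "flight_date": (str, True),
    "carrier_code": (str, True),
    "dep_delay_min": (int, False),
}


def snake_case(name: str) -> str:
    """Standardize a column name to lowercase words joined by underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def is_missing(value) -> bool:
    """Return True for None or for strings that represent an empty value."""
    return value is None or str(value).strip().lower() in MISSING


def clean(rows):
    """Rename columns, drop rows missing a required value, and cast types."""
    cleaned = []
    for row in rows:
        row = {snake_case(key): value for key, value in row.items()}
        required = [col for col, (_, req) in SCHEMA.items() if req]
        if any(is_missing(row.get(col)) for col in required):
            continue  # a required value is empty: drop the whole row
        out = {}
        for col, (cast, _) in SCHEMA.items():
            value = row.get(col)
            # Optional empty values are kept as None instead of dropping the row.
            out[col] = None if is_missing(value) else cast(value)
        cleaned.append(out)
    return cleaned


clean_rows = clean(raw_rows)
```

Here the row with an empty carrier code is dropped, while the row with a missing delay is kept with `None`. Which columns count as required is exactly the "which NaN values should be dropped" decision, so the `SCHEMA` flags are the part to adapt.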
Objectives
- Ask interesting questions such as:
  - Is there a correlation between delays and weather?
  - How many flights did a certain airline make during the year?
  - What's the most common route? Does weather have an impact on a route?
- Explore the data and characterize some columns.
- Compute some statistics:
  - What's the average number of flights per day?
  - How many flights are delayed per day?
  - Do the weather events follow a normal distribution? Another type of distribution?
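As a rough sketch of how the per-day statistics above could be computed, the example below counts flights and delayed flights per day with the standard library. The sample rows, the field names, and the 15-minute delay threshold are assumptions made for illustration; they are not taken from the dataset or the project code.

```python
from collections import Counter
from statistics import mean

# Hypothetical cleaned rows; field names are assumptions for this sketch.
flights = [
    {"flight_date": "2022-01-01", "dep_delay_min": 0},
    {"flight_date": "2022-01-01", "dep_delay_min": 25},
    {"flight_date": "2022-01-02", "dep_delay_min": 40},
    {"flight_date": "2022-01-02", "dep_delay_min": None},
    {"flight_date": "2022-01-02", "dep_delay_min": 3},
    {"flight_date": "2022-01-03", "dep_delay_min": 16},
]

# Assumed cutoff: a flight counts as delayed at 15 minutes or more.
DELAY_THRESHOLD_MIN = 15

# Number of flights on each day.
flights_per_day = Counter(f["flight_date"] for f in flights)

# Number of delayed flights on each day; rows with an unknown delay are skipped.
delayed_per_day = Counter(
    f["flight_date"]
    for f in flights
    if f["dep_delay_min"] is not None
    and f["dep_delay_min"] >= DELAY_THRESHOLD_MIN
)

# Average number of flights per day across the days present in the data.
avg_flights_per_day = mean(flights_per_day.values())
```

Note that the average is taken over days that appear in the data. If some calendar days had no flights at all, they would need to be added with a count of zero before averaging.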
Introduction
The project aims to analyze the files that are given in this dataset: 2022 U.S. Domestic Flights Departures
Author: Jacky Luo
Prework
The prework takes some of the original files and exports them to a SQL database and a JSON file, to simulate having different data sources in the project. See more in Prework.
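The idea behind the prework can be illustrated with a small sketch that reads CSV data and writes it both to a SQL table and to JSON. It assumes an in-memory SQLite database and a made-up `carriers` table with `code` and `name` columns; the project's actual database engine, file names, and table layout are not specified here.

```python
import csv
import io
import json
import sqlite3

# Hypothetical CSV content standing in for one of the original files.
csv_text = "code,name\nAA,American Airlines\nDL,Delta Air Lines\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# 1) Export to a relational database (in-memory SQLite for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE carriers (code TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO carriers (code, name) VALUES (:code, :name)", rows)
conn.commit()

# 2) Export the same records as JSON text (could be written to a file instead).
json_text = json.dumps(rows, indent=2)

# Sanity checks: both targets should hold every source row.
db_count = conn.execute("SELECT COUNT(*) FROM carriers").fetchone()[0]
json_count = len(json.loads(json_text))
```

Comparing row counts between the source and each target is a cheap way to confirm that nothing was lost during the export.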
Documentation of Stages
Final Dim - Fact Schema