Flight Delays

Abstract

The team project's goal is to create a supervised machine learning model able to accurately predict which flights will be delayed based on flight number, origin, destination, and day of the week. The team decided on the topic of flight delays because we like to travel, and we want to improve our travel experience. Using the raw data with Tableau, questions such as which city or routes had the most delays. The project uses a dataset, created by Yuanyu 'Wendy' Mu on Kaggle, covering flight delays from 2009 to 2018 [1]. The original data source is from the United States Department of Transportation [2].

Tableau

Questions

Which route had the most delays? or this part of the dashboard, we will be using Tableau to create graph. We placed “Flight Status” and “Origin to Destination” in columns. In rows, we placed the Count (Flight Status). This is continuous For the filter, we also placed the Flight Status but it is filtered to “Delay.” Lastly, we placed “Origin to Destination” in colors. Findings: Flights from Denver to Aspen has the most delays Image- (https://github.com/mjncardenas/Pandas-Project/blob/michelle/Resources/delayed_routes.png)
Which airport had the most delays? Salt Lake International Airport This map shows which US cities has the most delays The larger the circle the more the delays. The spectrum goes from blue to red, where red has the most and the blue the least amount of delays Tableau was also used to create the map. For this map, the Destination Longitude and Latitude were placed for columns and rows We placed flight status in filter, size and color. Findings: Salt Lake City International Airport has the most delays

Image- (https://github.com/mjncardenas/Pandas-Project/blob/michelle/Resources/US_cities_delays.png)

Which day of the week is the best to fly? Daily flight status This graph was created in Tableau This graph will show the total delay hours per day of the week. In order to create bar graph, we added “Flight Status” in columns. We also placed “Flight Status” and “Weekday” in rows. Findings: Best day to fly because it has the least amount of delays is Thursday. Worst day to fly is Friday followed by Monday.

Image- (https://github.com/mjncardenas/Pandas-Project/blob/michelle/Resources/Daily_Flight_Status.png) Which day is best to fly on? Thursday

Interactive Interactive section also created in Tableau For the interactive part of the dashboard, you will be able to choose the Origin Airport and the Destination Airport After choosing the two airports, one will see the average of the total flight. In order to create this, we placed “Origin Airport” and “Destination Airport” in rows. To create filter, we placed “Flight Status,” “Origin Airport,” and “Destination Airport” in filter. This interactive page can be used when you would like to know how likely your flight will be delayed. Image- (https://github.com/mjncardenas/Pandas-Project/blob/michelle/Resources/interactive.png)

Preliminary Data Processing

Flight data from 2009 was selected from the Kaggle dataset due to the sheer data size of each CSV file. Then the team selected a single airline by selecting the callsign, 'OO,' SkyWest Airlines, and dropping the rest of the data. A NaN check was conducted, and all NaNs were dropped. The date needed to be transformed to a timestamp before continuing. Then, the date was converted to a weekday number, and then it was converted to the day's name. Delay status needed to be changed to an integer type for machine learning. Then data frames were created for each table to load into the database. These four tables are flight data, airport, flight status, and days of the week. The cleaned data was then uploaded to the PostgreSQL database and downloaded into the machine learning notebook using SQLalchemy.

Machine Learning Model

The model's target is 'flight status,' and the rest of the data remained as features. Dummy variables were assigned to the data and split into train and test sets using the default ratio. The chosen machine learning model is a random forest model. The random forest was selected to avoid overfitting, its ability to rank the significance of input variables, and efficiency with large data sets. The downside to using this model is the user cannot determine how the random forest model arrived at its answer.

Model Results

First Model

The first random forest model had an accuracy of around 60.81% using 64 n_estimators (Figure 1). The model is mediocre at answering which flights will be delayed. The features used in model one were: flight_id, fl_num, origin, dest, flight_status, and weekday. One error in the model is having flight_id as one of the features. Flight ID does not provide any useful information for the random forest. Another issue is using fl_num, origin, and dest as features as well. Flight number, origin, and destination are collinear because each route is assigned a specific flight number. The issue here is colinearity will confound the model: "Collinearity tends to inflate the variance of at least one estimated regression coefficient,ˆβj. This can cause at least some regression coefficients to have the wrong sign [3]." The sorting features by importance show low percentages for each of the flight numbers and weekdays (Figure 2). The team hoped by removing the collinear features; these importance scores could be increased. Another issue with the model is its low F-1 score. In a travel scenario, being late or early can cause problems for the traveler. Thus, a balance between precision and recall is required [4].

Second Model

The second random forest model had an accuracy of around 60.70% using 100 n_estimators to improve the model's accuracy (Figure 3). It had a similar F-1 score as the first model. The output for features sorted by importance barely changed [Figure 4]. Unfortunately, there was little change in the model's output on its second execution and would not be suitable for field use. One issue that might remain is the discrepancy between the number of delayed, on-time, and early flights [Figure 3]. Due to this issue, both models had low F-1 scores for on-time flights. The second model made a slight improvement in the precision of identifying on-time flights.

Conclusion

The current random forest model is decent at predicting delayed, on-time, and early flights. Due to the early, on-time, and late flight sample size issue, adaptive or gradient boosting should be used rather than bootstrap aggregation. To further mitigate the smaller sample sizes issue, undersampling or SMOTEENN should be used as well. Further modeling should include weather data and adding the month the flight occurred to understand flight schedules better. The project as a whole can be improved by doing the following: using more years and airline data, using JavaScript to provide better in-depth visual data analysis.

Figure 1

Figure 2

Figure 3

Figure 4

References

[1] https://www.kaggle.com/yuanyuwendymu/airline-delay-and-cancellation-data-2009-2018?select=2011.csv

[2] https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time

[3] http://www.stat.tamu.edu/~hart/652/collinear.pdf

[4] https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9

Communication Protocol

Biweekly zoom meetings
Communication via Slack or email will occur daily
In emergencies: call team members

Project Outline

Upload data CSV file into a Jupyter Notebook for cleaning
Send raw data to Tableau for dashboard creation
Clean data using Pandas
Create data frames to send to Postgres database
Crete SQLalchemy connection to database
Write data frames to database
Create new Jupyter Notebook for a machine learning model
Upload clean data from database
Create a machine learning model
Execute model and create confusion matrix

Dashboard: https://public.tableau.com/profile/michelle.cardenas#!/vizhome/skywestproject/PandasProjectStoryboard?publish=yes

Google Slides flashcards for presentation: https://docs.google.com/presentation/d/1OwAqNvC6R_ztFjqak5jTbcJXZBwkI31_YG8IKfRAUPQ/edit?usp=sharing

Sample Dashboard: https://docs.google.com/presentation/d/1TuF1_l5AwxgvbOCYwFYlXY_Ghcyn62s73fqBKCvWS-M/edit?usp=sharing

Name		Name	Last commit message	Last commit date
Latest commit History 72 Commits
Database		Database
Resources		Resources
.Rhistory		.Rhistory
.gitignore		.gitignore
.skywest.csv.icloud		.skywest.csv.icloud
.~skywest project__699.twbr		.~skywest project__699.twbr
2009.7z		2009.7z
Airline Codes.docx		Airline Codes.docx
CC_Attribution_ShareAlike_4.0_International.md		CC_Attribution_ShareAlike_4.0_International.md
Communication Protocol.docx		Communication Protocol.docx
Complete_Updated_M2.csv		Complete_Updated_M2.csv
Conclusion.docx		Conclusion.docx
Data Clean Up Process.docx		Data Clean Up Process.docx
MIT_License.md		MIT_License.md
README.md		README.md
Screenshot_J_N.PNG		Screenshot_J_N.PNG
WEEK 1 Presentation.docx		WEEK 1 Presentation.docx
airport_codes.csv		airport_codes.csv
cleaned_oo_data.7z		cleaned_oo_data.7z
cleaning_data.ipynb		cleaning_data.ipynb
database_preview.PNG		database_preview.PNG
oo_random_forest_model.ipynb		oo_random_forest_model.ipynb
skywest project.twb		skywest project.twb
skywest.ipynb		skywest.ipynb
tableau_data_cleaning.ipynb		tableau_data_cleaning.ipynb
tableau_data_mergeing..ipynb		tableau_data_mergeing..ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Flight Delays

Abstract

Tableau

Preliminary Data Processing

Machine Learning Model

Model Results

First Model

Second Model

Conclusion

References

Communication Protocol

Project Outline

About

Releases

Packages

Contributors 2

Languages

License

mjncardenas/Pandas-Project

Folders and files

Latest commit

History

Repository files navigation

Flight Delays

Abstract

Tableau

Preliminary Data Processing

Machine Learning Model

Model Results

First Model

Second Model

Conclusion

References

Communication Protocol

Project Outline

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages