NYC Bike Accidents

Rutgers University Data Science Bootcamp Final Project

Team name: Met-A-Four

Team Members:

| Member | Role | Responsibilities |
| --- | --- | --- |
| Shirali Obul | Project Manager | Manage the project flow, technology, and communication |
| Moya Heinzelmann | Database Lead | Manage the database and the data loading process |
| Seung-Wook Noh | Machine Learning Lead | Manage the machine learning model and design |
| Vanessa Cartagena | Dashboard Lead | Manage the Tableau dashboard, EDA, and presentation |

Selected Topic: NYC Bike Lane Safety

For this project, we analyze bike accidents across New York City from January 2020 to October 2022, performing comprehensive exploratory data analysis and visualization to gain insight into bike-riding risk in NYC at different hours of the day, weekdays, months, and streets. Bicycle trips make up about one percent of trips in the United States, yet cyclists account for over two percent of people who die in motor vehicle crashes. In a large city such as New York, residents and travelers therefore benefit from knowing which areas are safest at a given time of day. To support this, we built a machine learning model that can be fed static or updated data from our sources after the ETL process and predicts whether an accident happened on a bike lane. We aim to provide information that travelers and residents can use to plan bike rides in NYC, and that first-line responders can use when acting on accidents, whether they occur on bike lanes or not.

Questions We Would Like to Answer:

  • Are there more accidents on or off bike lanes?
  • Which borough and/or streets have the most accidents?
  • What time has the most accidents?
    • Hour
    • Day/Night
    • Weekday
    • Month
  • How do different types of weather affect the frequency of bike accidents?
    • Rain
    • Snow
    • Visibility
    • Humidity
    • Clear
    • Fog
    • Cloudiness

Resources

Description of data and data sources

NYC_Bike_Risk -- This database combines several sources describing each bike accident:

  • 1st dataset contains the location (longitude, latitude), borough, street, severity, time, and date of each accident over the course of three years, 2020-2022:
  • NYC-Crash-Cyclist-2020
  • 2nd dataset contains bike route data:
  • NYC-Bike-Lanes
  • 3rd dataset contains zip codes for the boroughs, which allows us to fill in missing values in the 1st dataset:
  • NYC-ZIPCODE-MAP
  • 4th dataset is NYC weather data from the OpenWeatherMap API for the days on which accidents happened:
  • NYC-Weather-Data

Tools

Creating Database

  • PostgreSQL
  • SQLAlchemy

Analyzing Data

  • Pandas
  • Numpy
  • Geopy
  • Machine Learning
  • Scikit-Learn

Dashboard

  • Tableau
  • Google Slides

^ back to top ^

Data Cleaning and ETL process

We mainly used the Python pandas library in Jupyter notebooks to clean the data from three different sources: NYC_Crash_cyclist 2020-2022, NYC_Weather_2020-2022, and NYC_Bike_Lanes.

  • NYC_Crash_cyclist 2020-2022: We dropped columns without values and rows duplicated across all columns, scraped missing zip codes with geopy based on the accident geolocation, and filled missing borough names from the zip code data. We also truncated the accident TIME to the hour (dropping minutes and seconds) so it could be merged with the hourly weather data (see the sketch after this list).

  • NYC_Weather_2020-2022: Hourly weather data from OpenWeatherMap for the 1004 days on which accidents happened in NYC, cleaned and transformed with the pandas library in Jupyter notebooks. We converted the date and time columns to DateTime format, and from the formatted DATE column we derived the weekday and month names, splitting it into DATE, NAME_OF_WEEKDAY, and MONTH columns.

  • NYC_bike_lanes: First, we dropped duplicated rows based on STREET NAME to get unique streets with bike lanes, then we split the BIKE_GEOME column into four geo parameters, since it holds two pairs of coordinates for each street segment (from street to street). These new LAT1, LAT2, LON1, and LON2 columns were used to derive the bike lane column for the crash data by checking whether the accident location (latitude and longitude) falls within the two pairs of coordinates.

  • Final dataset: contains NYC bike crash data, weather data, and bike lane data, which we use for EDA and visualization with Tableau. Selected features are also used for ML model training.

  • Screen Shot 2022-11-06 at 5 39 48 PM
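
Below is a minimal sketch of the cleaning and matching steps described above. The file names and column names (ZIP CODE, LATITUDE, LONGITUDE, CRASH TIME, LAT1/LON1/LAT2/LON2, and so on) are assumptions for illustration, not the exact schema used in the notebooks.

```python
import pandas as pd
from geopy.geocoders import Nominatim

crashes = pd.read_csv("NYC_Crash_cyclist_2020_2022.csv")

# Drop empty columns and rows duplicated across all columns.
crashes = crashes.dropna(axis=1, how="all").drop_duplicates()

# Fill missing zip codes by reverse geocoding the accident coordinates with geopy.
geolocator = Nominatim(user_agent="nyc_bike_risk")

def lookup_zip(row):
    if pd.notna(row["ZIP CODE"]):
        return row["ZIP CODE"]
    location = geolocator.reverse((row["LATITUDE"], row["LONGITUDE"]))
    return location.raw.get("address", {}).get("postcode") if location else None

crashes["ZIP CODE"] = crashes.apply(lookup_zip, axis=1)

# Keep only the hour of the crash time so it can be merged with the hourly weather data.
crashes["HOUR"] = pd.to_datetime(crashes["CRASH TIME"]).dt.hour

# Weather data: derive weekday and month names from the DATE column.
weather = pd.read_csv("NYC_Weather_2020_2022.csv", parse_dates=["DATE"])
weather["NAME_OF_WEEKDAY"] = weather["DATE"].dt.day_name()
weather["MONTH"] = weather["DATE"].dt.month_name()

# Bike lanes: flag a crash as on a bike lane when its coordinates fall inside the
# box formed by the lane's two endpoints (LAT1/LON1 and LAT2/LON2).
lanes = pd.read_csv("NYC_Bike_Lanes.csv").drop_duplicates(subset="STREET NAME")

def on_bike_lane(lat, lon):
    within_lat = lanes[["LAT1", "LAT2"]].min(axis=1).le(lat) & lanes[["LAT1", "LAT2"]].max(axis=1).ge(lat)
    within_lon = lanes[["LON1", "LON2"]].min(axis=1).le(lon) & lanes[["LON1", "LON2"]].max(axis=1).ge(lon)
    return bool((within_lat & within_lon).any())

crashes["BIKE_LANE"] = [
    on_bike_lane(lat, lon) for lat, lon in zip(crashes["LATITUDE"], crashes["LONGITUDE"])
]
```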

Database:

  • Crash and weather data are loaded into the SQL database and merged with SQL code. Crash_data, weather_data, the merged data, and NYC_borough_zipcode data are stored as tables in our database. We also stored all the tables in our GitHub database folder, and we load merged_data from our local database using SQLite in our Python code for further analysis and modeling (a sketch of this step follows the ERD below).

  • ERD:

QuickERD

  • Tables post merge:
  • Tables_post_merge
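
A rough sketch of how this loading step might look with SQLAlchemy is shown below. The connection string, table names, and file paths are placeholders, and the actual merge is performed with the SQL code stored in the repository.

```python
import pandas as pd
from sqlalchemy import create_engine

# Cleaned tables produced by the ETL step (hypothetical file names).
crashes = pd.read_csv("clean_crash_data.csv")
weather = pd.read_csv("clean_weather_data.csv")

# Load the cleaned crash and weather tables into PostgreSQL.
pg_engine = create_engine("postgresql://user:password@localhost:5432/nyc_bike_risk")
crashes.to_sql("crash_data", pg_engine, if_exists="replace", index=False)
weather.to_sql("weather_data", pg_engine, if_exists="replace", index=False)

# ...the tables are then merged into merged_data with SQL inside the database...

# Read the merged table back through a local SQLite copy for analysis and modeling.
sqlite_engine = create_engine("sqlite:///Database/nyc_bike_risk.sqlite")
merged = pd.read_sql_table("merged_data", sqlite_engine)
print(merged.head())
```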

^ back to top ^

Machine Learning

Random Forest Classifier

  • Code is here

  • After connecting our Jupyter notebook to the database with SQLAlchemy, we performed ETL to prepare the data for ML and loaded it into the database as an ML_data table. From this dynamic database, we read the ML_data table and printed the header of each column to see all of the available features.

Screen Shot 2022-10-30 at 10 57 30 PM

  • First, we checked the categorical and numerical data in the dataset. Then, based on the features, we selected columns that might be important for modeling and dropped columns that duplicate information in other columns or add no value to the prediction (such as DATE and COLLISION ID).

  • We split our data into training and testing sets, using the default 75% to 25% split.

  • Using the supervised learning library imblearn.ensemble, we trained and compared two ensemble classifiers, the Balanced Random Forest Classifier and the Easy Ensemble Classifier, to predict whether an accident occurred on a bike lane based on the features we selected for the model. For both algorithms, we resampled the dataset, viewed the count of the target classes, trained the ensemble classifier, calculated the balanced accuracy score, generated a confusion matrix, and produced a classification report (a condensed code sketch appears at the end of this section). We chose random forest algorithms because they can handle thousands of input variables without variable deletion, are robust to outliers and nonlinear data, and run efficiently on large datasets such as ours. To improve prediction accuracy, we also used Adaptive Boosting (AdaBoost), which is easy to understand: a model is trained and evaluated, and then another model is introduced that gives extra weight to the errors of the previous model. The purpose of this weighting is to minimize similar errors in subsequent models.

  • For each algorithm, we train the model on the training data, calculate the balanced accuracy score from sklearn.metrics, print the confusion matrix from sklearn.metrics, and generate a classification report using classification_report_imbalanced from imbalanced-learn. Finally, we print the feature importances sorted in descending order (most important feature to least important) along with each feature's score.

  • The result of Balanced Random Forest Classifier: Screen Shot 2022-10-30 at 10 50 22 PM

  • With the Easy Ensemble AdaBoost classifier, we improved the accuracy to 80.56% from the 80.26% achieved with the Balanced Random Forest Classifier. However, it took more time to execute, as the screenshots show:

Screen Shot 2022-10-30 at 10 51 28 PM

  • Feature importance from the Balanced Random Forest Classifier with 100 iterations: featureimportance
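
The sketch below condenses the workflow described in this section. It assumes a hypothetical SQLite path, an ML_data table, and a BIKE_LANE target column; the hyperparameters are illustrative rather than the exact values used in the notebook.

```python
import pandas as pd
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier
from imblearn.metrics import classification_report_imbalanced

# Read the prepared ML table from the local database and inspect the features.
engine = create_engine("sqlite:///Database/nyc_bike_risk.sqlite")
ml_data = pd.read_sql_table("ML_data", engine)
print(ml_data.columns.tolist())

# Encode the categorical features and separate the target (on a bike lane or not).
X = pd.get_dummies(ml_data.drop(columns=["BIKE_LANE"]))
y = ml_data["BIKE_LANE"]

# Default 75% / 25% train-test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

def evaluate(model, name):
    """Fit a classifier and report the metrics used in the write-up above."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(name, "balanced accuracy:", balanced_accuracy_score(y_test, predictions))
    print(confusion_matrix(y_test, predictions))
    print(classification_report_imbalanced(y_test, predictions))

# Balanced Random Forest Classifier.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=1)
evaluate(brf, "Balanced Random Forest")

# Feature importances sorted from most to least important.
importances = sorted(zip(brf.feature_importances_, X.columns), reverse=True)
print(importances[:10])

# Easy Ensemble (AdaBoost) classifier for comparison.
eec = EasyEnsembleClassifier(n_estimators=100, random_state=1)
evaluate(eec, "Easy Ensemble AdaBoost")
```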

^ back to top ^

Advantages and Limitations
  • We initially wanted to use ML to predict the risk of bike riding; however, we only have bike accident data, not data on total bike rides. Therefore, we instead used a random forest classifier to predict whether an accident occurred on a bike lane.
  • The Random Forest Classifier provides feature importances, which allowed us to visualize the features by importance.
  • Random forests can be computationally much faster than neural networks.
  • We still have room to improve the accuracy score, as random forests are known to be biased when dealing with categorical variables, and our dataset has more categorical than numerical data.

Neural Network Deep Learning Model (Trial)

  • Random forests are more efficient than the neural network; the neural network also takes longer to perform the analysis.
Limitations:
  • A neural network has more layers to train and test, but more layers do not necessarily bring better results.
  • It requires a very large amount of data in order to perform better than other techniques.
  • It is extremely expensive to train due to the complex data models; deep learning also requires expensive GPUs and many machines, which increases the cost to users.
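
The trial architecture is not documented in this README. Purely as an illustration, assuming TensorFlow/Keras, the same hypothetical ML_data table as above, and a 0/1-encoded BIKE_LANE target, a small dense network for the same prediction could look like this:

```python
import pandas as pd
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split
import tensorflow as tf

# Same hypothetical database path and table as in the modeling sketch above.
engine = create_engine("sqlite:///Database/nyc_bike_risk.sqlite")
ml_data = pd.read_sql_table("ML_data", engine)
X = pd.get_dummies(ml_data.drop(columns=["BIKE_LANE"])).astype("float32")
y = ml_data["BIKE_LANE"]  # assumed to already be encoded as 0/1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# A small dense network for the same binary (bike lane / no bike lane) prediction.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=20, validation_data=(X_test, y_test))
```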

Dashboard

Tableau Link

To visualize the data analysis, we used Tableau. Our dashboard displays comparisons of the data by the following:

  • Borough
  • Bike Lane v. No Bike Lane
  • Date/Time (Hour, Day/Night, Weekday, Month)
  • Weather

Screen Shot 2022-11-02 at 9 08 23 PM

Conclusions

- More accidents occur off bike lanes than on them.
- There are more accidents when it is clear outside, because more people ride bikes in pleasant weather.
- The Bronx has the most accidents, and Staten Island has the fewest.
- Rush hours tend to have more bike accidents, especially on Tuesdays and Fridays.
- Our ML model predicts whether an accident occurred on a bike lane with an accuracy of over 82%.

^ back to top ^

Presentation

Google Slides File Link
