This repository contains my final project for the Data Engineering Zoomcamp by DataTalksClub. I built a batch data pipeline that extracts, transforms, and loads the New York City Motor Vehicle Collisions dataset into a data warehouse on Google Cloud Platform (GCP).
For my course notes and homework solutions, check my DE Zoomcamp repository. For step-by-step instructions on reproducing this project, see HOW-TO-RUN.md.
According to the dataset description, a police report must be filled out in New York City whenever an accident injures or kills a person or causes more than 1,000 dollars worth of damage. The dataset contains accident records since July 2012 and is updated daily. As of March 2023, it has around 1.98 million rows.
This project has the goal of answering the following questions:
- What is the yearly percentage of accidents per borough since 2013?
- How many people were killed per year since 2013?
- How many people were injured per year since 2013?
- What is the contributing factor distribution for accidents that involved a single vehicle since 2013?
Prefect dataflows:

- ETL Web to GCS: fetches data from the NYC Open Data API (Extract), converts the ZIP CODE column to int to avoid mixed-type exceptions and creates the CRASH DATETIME column by merging the CRASH DATE and CRASH TIME columns (Transform), and loads the data into GCS (Load).
- ETL GCS to BigQuery: fetches data from GCS (Extract), cleans string columns by stripping leading and trailing whitespace and collapsing repeated spaces, lowercases all column names (Transform), and loads the data into BigQuery (Load). A sketch of both Transform steps follows this list.
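Purely as an illustration, here is a minimal sketch of those two Transform steps written as Prefect tasks. The column names come straight from the dataset, but the function names, the nullable Int64 cast, and the errors="coerce" handling are my assumptions rather than the project's exact code:

```python
import pandas as pd
from prefect import task


@task
def transform_web_data(df: pd.DataFrame) -> pd.DataFrame:
    """Transform step of ETL Web to GCS (sketch; names are assumptions)."""
    # Cast ZIP CODE to a nullable integer to avoid mixed-type exceptions
    df["ZIP CODE"] = pd.to_numeric(df["ZIP CODE"], errors="coerce").astype("Int64")
    # Merge CRASH DATE and CRASH TIME into a single CRASH DATETIME column
    df["CRASH DATETIME"] = pd.to_datetime(
        df["CRASH DATE"].astype(str) + " " + df["CRASH TIME"].astype(str),
        errors="coerce",
    )
    return df


@task
def transform_gcs_data(df: pd.DataFrame) -> pd.DataFrame:
    """Transform step of ETL GCS to BigQuery (sketch; names are assumptions)."""
    # Strip leading/trailing whitespace and collapse runs of spaces in string columns
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip().str.replace(r"\s+", " ", regex=True)
    # Bring all column names to lowercase
    df.columns = df.columns.str.lower()
    return df
```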
Dbt models:

- stg_crashes: selects a subset of columns from the raw table loaded into BigQuery, keeping only records that happened after January 1st, 2013.
- fact_crashes: selects all data from stg_crashes, partitions it by year and clusters the records by borough and contributing_factor_vehicle_1. Partitioning makes it more efficient to query the data and extract statistics by year; borough and contributing_factor_vehicle_1 are the main categorical values whose distributions I was interested in when building my dashboard. A sketch of this model follows the list.
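For reference, a dbt model with this partitioning and clustering scheme might look roughly like the SQL below. This is a sketch, not the project's exact model: the partition_by and cluster_by settings mirror the description above, while the crash_datetime field name and the table materialization are assumptions:

```sql
-- models/fact_crashes.sql (sketch; the real model may differ)
{{
    config(
        materialized = 'table',
        partition_by = {
            'field': 'crash_datetime',  -- assumed name of the merged datetime column
            'data_type': 'timestamp',
            'granularity': 'year'
        },
        cluster_by = ['borough', 'contributing_factor_vehicle_1']
    )
}}

select *
from {{ ref('stg_crashes') }}
```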
Technologies used:

- Pandas for fetching the dataset from the API endpoint.
- Prefect and Prefect Cloud for dataflow implementation and workflow orchestration.
- Terraform for managing and provisioning infrastructure (GCS bucket, data warehouse and virtual machine) in GCP.
- Docker for encapsulating the dataflows and their dependencies into containers, making them easier to deploy.
- Data build tool (dbt) for transforming, partitioning and clustering the dataset in the data warehouse.
- Google Looker Studio for creating a dashboard to visualize the dataset.
The dashboard is publicly available at this link. Some of the main findings:

- The borough information is null for 32% of each year's records, on average. One way to circumvent this limitation would be to use the NYC Borough Boundaries data to determine where each accident happened from the reported latitude and longitude pair, using Geopandas to assign the respective borough (see the sketch after this list).
- Considering only records that contain the borough information, Brooklyn and Queens account for more than 50% of the accidents.
- From 2013 to 2022, the minimum number of people killed in a single year was 231 (in 2018), while the maximum was 297 (in 2013).
- From 2013 to 2022, the minimum number of people injured in a single year was 44,615 (in 2020), while the maximum was 61,941 (in 2018).
- Besides "unspecified" (56.7%) and "others" (13.2%), the main contributing factor reported for accidents involving a single vehicle since 2013 was "driver inattention/distraction" (12.4%).