
This is a Data Capstone Project from the Python for Data Science and Machine Learning Bootcamp by Jose Portilla. The original dataset covered only the period 2015-2016, with roughly a hundred thousand rows of data. To challenge myself, I am using the newer version, which contains 650K+ rows of data.


Data Capstone Project – 911 Calls

Overview

Data Resource: Kaggle

911 calls (Fire, Traffic, EMS) made in Montgomery County, PA from 2015 to 2020. The records contain the following fields:

image

Based on df.info(), the data contains 663,522 rows and 9 columns (a short loading sketch follows this list):

  • lat: float, Latitude
  • lng: float, Longitude
  • desc: string, Description of the Emergency Call
  • zip: float, Zipcode
  • title: string, Title
  • timeStamp: string, YYYY-MM-DD HH:MM
  • twp: string, Township
  • addr: string, Address
  • e: integer, Dummy variable (always 1)
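
For context, a minimal loading sketch (the filename 911.csv is an assumption; point it at wherever the Kaggle file is saved):

```python
import pandas as pd

df = pd.read_csv('911.csv')  # hypothetical local path to the Kaggle file
df.info()                    # reports the 663,522 rows and 9 columns listed above
```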

Further analysis was conducted to find the top 5 zip codes for 911 calls:

df['zip'].value_counts().head(5)

The results are:

image

Then a similar analysis to get the top 5 townships (twp) for 911 calls:

df['twp'].value_counts().head(5) 

image

To count the unique title codes, I used df['title'].nunique(); there are 148 unique title codes.

Creating new features

In the title column, a "Reason/Department" (EMS, Fire, or Traffic) is specified before the title code. A new column called "Reason" was created using .apply() with a custom lambda expression that extracts this string value. For example, if the title value is EMS: BACK PAINS/INJURY, then the Reason value would be EMS.
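
A minimal sketch of that lambda, splitting each title on the colon (this assumes every title follows the Reason: description pattern):

```python
# Everything before the first ':' is the Reason/Department
df['Reason'] = df['title'].apply(lambda title: title.split(':')[0])
```

The result is: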

image

Then, based on the new Reason column, I checked the most common reason for a 911 call using df['Reason'].value_counts().head(1):

image

Now, to plot 911 calls by Reason, I used the Seaborn package to create a countplot: sns.countplot(x='Reason', data=df)

image

Next, I focused on the time information. I checked the data type of the timeStamp column using df.info(). The result:

image

The dtype of the timeStamp column is object, not a datetime, so the values are actually just strings. I verified this on one element using type(df['timeStamp'].iloc[0]), and the result is:

image

Then I converted the timeStamp column from strings into DateTime objects.
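
A minimal sketch of the conversion, using pandas' pd.to_datetime on the whole column:

```python
# Parse the YYYY-MM-DD HH:MM strings into pandas Timestamp objects
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
```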

image

After the conversion, specific attributes can be pulled off a DateTime object by calling them. For example:

Time = df['timeStamp'].iloc[0]
Time.hour

Jupyter's tab completion is handy for exploring the various attributes available on these objects. Now that the timeStamp column holds actual DateTime objects, I used .apply() to create three new columns, Hour, Month, and Day of Week, based on the timeStamp column:

df['Hour'] = df['timeStamp'].apply(lambda time: time.hour)
df['Month'] = df['timeStamp'].apply(lambda time: time.month)
df['Day of Week'] = df['timeStamp'].apply(lambda time: time.dayofweek)

The new data looks like this:

image

Note that Day of Week is an integer from 0 to 6. I used .map() with this dictionary to map the integers to the string names of the days:

dmap = {0:'Mon', 1:'Tue', 2:'Wed', 3:'Thu', 4:'Fri', 5:'Sat', 6:'Sun'}
df['Day of Week'] = df['Day of Week'].map(dmap)

And the new data looks like this:

image

Now I wanted to see more, so I created a countplot of the Day of Week column: sns.countplot(x='Day of Week', data=df)

image

We can see a slight drop on Sundays. Adding hue='Reason' based on the Reason column recreates the plot like this:

image

I changed the color palette to viridis and the plot becomes:

image

The last thing to note is that the legend ends up inside the plot area, so I relocated it using: plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0)
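
Putting the hue, palette, and legend adjustments together, a sketch of the final plot call (assuming seaborn as sns and matplotlib.pyplot as plt):

```python
sns.countplot(x='Day of Week', data=df, hue='Reason', palette='viridis')
# Anchor the legend just outside the top-right corner of the axes
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0)
```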

image

I then repeated the exact same steps with the Month column; the result is the plot below:

image

Next, I grouped the data by Month:
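
A minimal sketch of that grouping (byMonth is the variable the lmplot below expects):

```python
# Count rows per month; every non-null column becomes a per-month count
byMonth = df.groupby('Month').count()
byMonth.head()
```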

image

Next, I used Seaborn's lmplot() to fit a linear model to the number of calls per month. The Month index must first be reset to a regular column for the plot:

sns.lmplot(x='Month', y='twp', data=byMonth.reset_index())

And I get:

image

From the linear fit, we can see that the number of calls generally decreases from month 1 to month 12. Seaborn also draws a shaded band indicating the error of the fit, and that error grows toward the later months. The trend is noisy because the data spans multiple years. To create a time series plot, I first extracted just the date from the timeStamp column into a new column called Date:

df['Date'] = df['timeStamp'].apply(lambda t: t.date())

image

Using df.groupby('Date').count() makes the Date column the index, so all calls from the same date aggregate into one count per date (df.groupby('Date').count().head() shows the first rows). I then plotted the daily counts:

df.groupby('Date').count()['lat'].plot(figsize=(14,3))
plt.tight_layout()

to get some information from the plot:

image

We can notice some significant spikes in early and late 2018, and another in 2020. I made the same plot for each Reason:
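
A sketch of one per-Reason plot; Traffic is shown, and Fire and EMS just swap the filter value:

```python
df[df['Reason'] == 'Traffic'].groupby('Date').count()['lat'].plot(figsize=(14,3))
plt.title('Traffic')
plt.tight_layout()
```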

image image image

Next is creating a heat map. First, I grouped the data into a matrix, stored in the variable dayHour:

dayHour = df.groupby(by=['Day of Week','Hour']).count()['Reason'].unstack()

image

After that, I created the heat map using:

plt.figure(figsize=(18,6))
sns.heatmap(dayHour, cmap='viridis')

image image

Then, I built the same kind of matrix using Day of Week and Month to get this information:
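
A sketch of the month-based version (the variable name dayMonth and the figure size are my own):

```python
# Same pivot as dayHour, but with months as columns instead of hours
dayMonth = df.groupby(by=['Day of Week','Month']).count()['Reason'].unstack()
plt.figure(figsize=(12,6))
sns.heatmap(dayMonth, cmap='viridis')
```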

image image

REFERENCE

  1. https://www.kaggle.com/datasets/mchirico/montcoalert
