---
title: "Data Cleaning"
author: "Jason Hammett"
date: "November 22, 2017"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#Install function courtesy of CSCI 349
my.install <- function(pkg){
  if (!(pkg %in% installed.packages()[,1])){
    install.packages(pkg, repos = "https://cran.revolutionanalytics.com/")
  }
  return (require(pkg, character.only=TRUE))
}
#Knitr
my.install("knitr")
#Decision Tree Libs
my.install("rpart")
my.install("rpart.plot")
#Plotting Libs
my.install("ggplot2")
my.install("ggmap")
my.install("leaflet")
#colorblind settings
# The palette with grey:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# The palette with black:
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# To use for fills, add
#scale_fill_manual(values=cbPalette)
# To use for line and point colors, add
#scale_colour_manual(values=cbPalette)
```
## Baltimore Crime Analysis using Open Source Intelligence
### Initial Exploration
We have the following datasets: calls, arrests, and victims.
Let's begin by loading the "Victims" dataset. Calls is a very large file, and arrests has pretty messy data, so we'll do the EDA on those in a separate R Markdown file.
"Victims" is a dataset of victim-based crime. These are physical crimes such as robbery, homicide, or arson.
```{r importing data}
victims <- read.csv("Data/BPD_Part_1_Victim_Based_Crime_Data.csv")
```
Let's look at the structure of the file.
```{r structure}
str(victims)
```
Now a summary.
```{r summary}
summary(victims)
```
I know I'll be replacing many factor levels during the cleaning process, so I'll write a function that does it for me.
```{r replace function}
#replaceWith
#Params: vec - vector housing the data (i.e. a column from a dataframe), original, replacement
#        asFactor - if TRUE, vec is converted to character for the replacement and returned as a factor
#Returns a vector
replaceWith <- function(vec, original, replacement, asFactor = TRUE){
  if (asFactor){
    vec <- as.character(vec)
  }
  vec[which(vec == original)] <- replacement
  if (asFactor){
    vec <- factor(vec)
  }
  vec
}
#Blanks to NAs, automatically convert "" to NA
#Param: vec - vector, i.e. a column of a dataframe
blanksToNA <- function(vec){
  vec[which(vec == "")] <- NA
  vec
}
```
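A quick check of both helpers on a toy factor. The values below are made up for illustration and are not taken from the dataset; the two functions are copied from the chunk above so the sketch runs on its own.

```r
# Helpers copied from the chunk above so this sketch is self-contained
replaceWith <- function(vec, original, replacement, asFactor = TRUE){
  if (asFactor){
    vec <- as.character(vec)
  }
  vec[which(vec == original)] <- replacement
  if (asFactor){
    vec <- factor(vec)
  }
  vec
}
blanksToNA <- function(vec){
  vec[which(vec == "")] <- NA
  vec
}

# Toy data, invented for illustration
toy <- factor(c("I", "O", "", "Inside", "I"))

toy <- replaceWith(toy, "I", "Inside")
toy <- replaceWith(toy, "O", "Outside")
toy <- blanksToNA(toy)
# blanksToNA leaves "" behind as an unused level; factor() drops it
toy <- factor(toy)

levels(toy)
table(toy, useNA = "ifany")
```

Note that `blanksToNA()` only changes the values, which is why each cleaning chunk below calls `factor()` again afterwards.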
We've learned that this is a victim-based crime dataset.
First, here are the variables that make up the victims dataset.
* CrimeDate - MM/DD/YYYY
* CrimeTime - HH:mm:ss
* CrimeCode
* Location
* Description
* Inside.Outside - can be transformed to a logical
* Weapon
* Post
* District
* Neighborhood
* Longitude
* Latitude
* Location.1 - A character version of the latitude/longitude pair
* Premise

Now it's time to clean the data.
#### Inside or Outside
First, I will make every value of Inside.Outside either "Inside" or "Outside".
```{r inside-outside}
insideOutsideVector <- victims$Inside.Outside
#Convert single letter to full name
insideOutsideVector <- replaceWith(insideOutsideVector, "O", "Outside")
insideOutsideVector <- replaceWith(insideOutsideVector, "I", "Inside")
#Change blanks to NAs
insideOutsideVector <- blanksToNA(vec = insideOutsideVector)
#Remove unnecessary levels
insideOutsideVector <- factor(insideOutsideVector)
#Add it back into the data.frame
victims$Inside.Outside <- insideOutsideVector
```
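The variable list notes that Inside.Outside can be transformed to a logical. A minimal sketch of that step on toy values (`insideOutside` and `isInside` are hypothetical names; the real column is left as a factor above):

```r
# Toy values standing in for victims$Inside.Outside after cleaning
insideOutside <- factor(c("Inside", "Outside", NA, "Inside"))

# TRUE for "Inside", FALSE for "Outside"; missing values stay NA
isInside <- insideOutside == "Inside"

isInside
```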
#### Premise
Looking at the levels of the Premise attribute reveals inconsistencies in the data.
```{r premise}
levels(victims$Premise)
```
The inconsistency goes beyond capitalization. Phrases are cut off, and there are no uniform descriptors for similar types of premises.
For instance, there is "BAR" as well as "TAVERN/NIG". I am going to presume the latter is "NIGHTCLUB" cut off.
There are also both "Church" and "RELIGOUS", so I'll need to edit these manually.
```{r premises}
premises <- victims$Premise
#First, change blanks to NAs
premises <- blanksToNA(premises)
#Now, I'll capitalize every element and reform the levels
premises <- toupper(premises)
#Recalculate the levels
premises <- factor(premises)
#Place it back in dataframe
victims$Premise <- premises
#Clean up
rm(premises)
```
To keep more detail, I'll leave the original Premise column in place and store a cleaned-up version in a new column, PremiseDetailed. I'll also create a column called PremiseCategory that groups premises into broader categories such as RESIDENTIAL, BUSINESS, GOVERNMENT, INDUSTRIAL, PARK, TRANSPORTATION, PUBLICSERVICE, NIGHTLIFE, STREET, VACANT, and RECREATION.
```{r premises detail}
victims$PremiseDetailed <- victims$Premise
victims$PremiseCategory <- victims$Premise
#It'll be useful to compare the summary of premise before and after we make changes
premise <- victims$Premise
category <- as.character(premise)
summary(premise)
#I'll begin by combining food levels into simply restaurant
premise <- replaceWith(premise, "FAST FOOD","RESTAURANT")
premise <- replaceWith(premise, "CARRY OUT", "RESTAURANT")
premise <- replaceWith(premise, "CHAIN FOOD", "RESTAURANT")
premise <- replaceWith(premise, "PIZZA/OTHE", "RESTAURANT")
premise <- replaceWith(premise, "BAKERY", "RESTAURANT")
#Nightlife
premise <- replaceWith(vec = premise, original = "TAVERN/NIG", replacement = "BAR")
#I'll move on to residential
premise <- replaceWith(premise, "APT/CONDO", "APARTMENT")
premise <- replaceWith(premise, "APT. LOCKE", "APARTMENT")
#Houses, this is more complicated
premise <- replaceWith(premise, "ROW/TOWNHO", "HOME")
premise <- replaceWith(premise, "SINGLE HOU", "HOME")
premise <- replaceWith(premise, "DWELLING", "HOME")
premise <- replaceWith(premise, "PORCH/DECK", "HOME")
premise <- replaceWith(premise, "YARD", "HOME")
premise <- replaceWith(premise, "PUBLIC HOU", "PUBLIC HOUSING")
premise <- replaceWith(premise, "MOBILE HOM", "HOME")
premise <- replaceWith(premise, "OTHER/RESI", "OTHER RESIDENTIAL")
#An important element of crime, vacant property
premise <- replaceWith(premise, "VACANT LOT", "VACANT PROPERTY")
premise <- replaceWith(premise, "VACANT BUI", "VACANT PROPERTY")
premise <- replaceWith(premise, "VACANT DWE", "VACANT PROPERTY")
#Schools
premise <- replaceWith(premise, "PUBLIC SCH", "SCHOOL")
premise <- replaceWith(premise, "PRIVATE SC", "SCHOOL")
premise <- replaceWith(premise, "SCHOOL PLA", "SCHOOL")
#Industrial
premise <- replaceWith(premise, "MANUFACTUR", "MANUFACTURING")
premise <- replaceWith(premise, "UTILITIES-", "UTILITIES")
premise <- replaceWith(premise, "SHED/GARAG", "SHED/GARAGE")
#Transport
premise <- replaceWith(premise, "PARKING LO", "PARKING LOT")
premise <- replaceWith(premise, "BRIDGE-PIE", "BRIDGE-PIER")
premise <- replaceWith(premise, "TRUCKING &", "TRUCKING")
premise <- replaceWith(premise, "TRACTOR TR", "TRACTOR TRAILER")
#Health
premise <- replaceWith(premise, "HOSP/NURS.", "HOSPITAL")
premise <- replaceWith(premise, "DAY CARE F", "DAY CARE")
#Other businesses
premise <- replaceWith(premise, "BANK/FINAN", "BANK")
premise <- replaceWith(premise, "LIQUOR STO", "LIQUOR STORE")
premise <- replaceWith(premise, "CONVENIENC", "CONVENIENCE STORE")
premise <- replaceWith(premise, "OFFICE BUI", "OFFICE BUILDING")
premise <- replaceWith(premise, "GROCERY/CO", "GROCERY")
premise <- replaceWith(premise, "DOCTORS OF", "DOCTORS OFFICE")
premise <- replaceWith(premise, "HOTEL/MOTE", "HOTEL")
premise <- replaceWith(premise, "JEWELRY ST", "JEWELRY STORE")
premise <- replaceWith(premise, "PHOTO STUD", "PHOTO STUDIO")
premise <- replaceWith(premise, "GAS STATIO", "GAS STATION")
#Verify this spelling against levels(premise); the outdoor counterpart appears as "OTHER - OU"
premise <- replaceWith(premise, "OTHERS - IN", "OTHER INDOORS")
premise <- replaceWith(premise, "SHOPPING M", "SHOPPING MALL")
premise <- replaceWith(premise, "MINI STORA", "MINI STORAGE")
premise <- replaceWith(premise, "RETAIL/SMA", "RETAIL")
premise <- replaceWith(premise, "CAR REPAI", "CAR REPAIR")
premise <- replaceWith(premise, "SALESMAN/C", "SALESMAN")
#premise <- replaceWith(premise, "RENTAL/VID", "") #Don't know enough to answer this, rental/video store?
premise <- replaceWith(premise, "BUS. PARK", "BUSINESS PARK")
premise <- replaceWith(premise, "WHOLESALE/", "WHOLESALE")
premise <- replaceWith(premise, "LAUNDRY/CL", "LAUNDRY")
premise <- replaceWith(premise, "BARBER/BEA", "SALON")
#Outdoors
premise <- replaceWith(premise, "INNER HARB", "INNER HARBOR")
premise <- replaceWith(premise, "OTHER - OU", "OTHER OUTDOORS")
premise <- replaceWith(premise, "PUBLIC ARE", "PUBLIC AREA")
premise <- replaceWith(premise, "PUBLIC BUI", "PUBLIC BUILDING")
premise <- replaceWith(premise, "CAR LOT-NE", "CAR LOT")
#Government Entities
premise <- replaceWith(premise, "POLICE DEP", "POLICE")
premise <- replaceWith(premise, "COURT HOUS", "COURT")
premise <- replaceWith(premise, "FIRE DEPAR", "FIRE")
premise <- replaceWith(premise, "PENITENTIA", "PENITENTIARY")
#Construction
premise <- replaceWith(premise, "CONSTRUCTIO", "CONSTRUCTION")
premise <- replaceWith(premise, "CONSTRUCTI", "CONSTRUCTION")
# premise <- replaceWith(premise, "BLDG UNDER", "CONSTRUCTION") # Not sure what BLDG UNDER means, could be construction
```
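The long run of `replaceWith()` calls could also be expressed as one lookup table, which keeps every mapping in a single place. A minimal sketch, assuming a named character vector; `premiseLookup`, `recodeWith`, and `toyPremise` are hypothetical names, and only a handful of the mappings above are reproduced:

```r
# Hypothetical lookup: names are the raw levels, values are the cleaned labels
premiseLookup <- c(
  "FAST FOOD"  = "RESTAURANT",
  "CARRY OUT"  = "RESTAURANT",
  "TAVERN/NIG" = "BAR",
  "APT/CONDO"  = "APARTMENT",
  "ROW/TOWNHO" = "HOME"
)

# Recode every value found in the lookup; leave everything else untouched
recodeWith <- function(vec, lookup){
  vec <- as.character(vec)
  hit <- vec %in% names(lookup)
  vec[hit] <- unname(lookup[vec[hit]])
  factor(vec)
}

# Toy input, invented for illustration
toyPremise <- factor(c("FAST FOOD", "STREET", "TAVERN/NIG", NA, "ROW/TOWNHO"))
recoded <- recodeWith(toyPremise, premiseLookup)
recoded
```

Values that are missing from the lookup, including `NA`, pass through unchanged, which matches how `replaceWith()` behaves.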
Next, I'll create a more general premise factor that groups the detailed premises into broader categories, which are easier to summarize.
```{r premise category}
#Create the category, more general
category <- as.character(premise)
#Residential
category <- replaceWith(category, "APARTMENT", "RESIDENTIAL", asFactor = FALSE)
category <- replaceWith(category, "HOME", "RESIDENTIAL", asFactor = FALSE)
category <- replaceWith(category, "PUBLIC HOUSING", "RESIDENTIAL", asFactor = FALSE)
category <- replaceWith(category, "OTHER RESIDENTIAL", "RESIDENTIAL", asFactor = FALSE)
#Nightlife
category <- replaceWith(category, "BAR", "NIGHTLIFE", asFactor = FALSE)
#Business categories
category <- replaceWith(category, "RESTAURANT", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "SCHOOL", "SCHOOL", asFactor = FALSE)
category <- replaceWith(category, "BANK", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "LIQUOR STORE", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "DRUG STORE", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "JEWELRY STORE", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "PAWN SHOP", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "CONVENIENCE STORE", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "GAS STATION", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "DOCTORS OFFICE", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "HOTEL", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "OFFICE BUILDING", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "BANK", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "GROCERY", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "PHOTO STUDIO", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "SHOPPING MALL", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "CARRY OUT", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "FINANCE/LO", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "WHOLESALE", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "SALON", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "LAUNDRY", "BUSINESS", asFactor = FALSE)
category <- replaceWith(category, "SALESMAN", "BUSINESS", asFactor = FALSE)
#Abandoned Property
category <- replaceWith(category, "VACANT PROPERTY", "VACANT", asFactor = FALSE)
#Recreation
category <- replaceWith(category, "PARK", "RECREATION", asFactor = FALSE)
category <- replaceWith(category, "PLAYGROUND", "RECREATION", asFactor = FALSE)
category <- replaceWith(category, "RACE TRACK", "RECREATION", asFactor = FALSE)
#Industrial
category <- replaceWith(category, "MANUFACTURING", "INDUSTRIAL", asFactor = FALSE)
category <- replaceWith(category, "WAREHOUSE", "INDUSTRIAL", asFactor = FALSE)
category <- replaceWith(category, "UTILITIES", "INDUSTRIAL", asFactor = FALSE)
#Transportation
category <- replaceWith(category, "CAB", "TRANSPORTATION", asFactor = FALSE)
category <- replaceWith(category, "BOAT/SHIP", "TRANSPORTATION", asFactor = FALSE)
category <- replaceWith(category, "BUS/AUTO", "TRANSPORTATION", asFactor = FALSE)
category <- replaceWith(category, "LIGHT RAIL", "TRANSPORTATION", asFactor = FALSE)
category <- replaceWith(category, "TRACTOR TRAILER", "TRANSPORTATION", asFactor = FALSE)
#Add PremiseCategory Back into the dataframe, removing unused levels
victims$PremiseCategory <- factor(category)
#Add PremiseDetailed back into the dataframe, removing unused levels
victims$PremiseDetailed <- factor(premise)
```
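One way to sanity-check a recoding like this is to list the detailed premises whose category label is still identical to the detailed label, since those were never mapped to a broader group. A minimal sketch on made-up vectors (`detailed`, `toyCategory`, and `unmapped` are hypothetical names):

```r
# Toy vectors standing in for the cleaned premise and its category
detailed    <- c("APARTMENT", "HOME", "BAR", "VACANT PROPERTY", "STREET", NA)
toyCategory <- c("RESIDENTIAL", "RESIDENTIAL", "NIGHTLIFE", "VACANT PROPERTY", "STREET", NA)

# Values whose category equals the detailed label were not recoded
unmapped <- unique(detailed[!is.na(detailed) & detailed == toyCategory])
unmapped
```

Some entries in such a list may be intentional, for example a premise that is already its own category, so the output is a prompt for review rather than a list of errors.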
#### Neighborhoods
Neighborhoods are a much more consistent vector.
```{r neighborhoods}
#Replace blanks with NAs
neighborhoods <- victims$Neighborhood
neighborhoods <- blanksToNA(neighborhoods)
neighborhoods <- factor(neighborhoods)
victims$Neighborhood <- neighborhoods
#Clean up
rm(neighborhoods)
```
#### Location
Location has many, many levels. Note that the column Location.1 is the lat/lon pair.
```{r locations}
location <- victims$Location
#Convert blanks to NA
location <- blanksToNA(location)
#Reset levels
location <- factor(location)
#Reassign
victims$Location <- location
#Clean up
rm(location)
```
Location.1 is the lat/lon pair. I'll clean this and rename it LatLon.
```{r latlon}
latLon <- victims$Location.1
#Change blanks to NAs
latLon <- blanksToNA(latLon)
victims$LatLon <- factor(latLon)
victims$Location.1 <- NULL
#Clean up
rm(latLon)
```
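If the pair ever needs to be cross-checked against the separate Latitude and Longitude columns, the text can be split back into numbers. A rough sketch, assuming the strings look like `"(lat, lon)"`; that layout is an assumption and should be confirmed against the actual values first. The coordinates and the names `latLonText`, `lat`, and `lon` are invented for illustration.

```r
# Toy strings; the "(lat, lon)" layout is an assumption about Location.1
latLonText <- c("(39.2941, -76.6204)", NA, "(39.3102, -76.5990)")

# Strip the parentheses, then split each string on the comma
stripped <- gsub("[()]", "", latLonText)
parts <- strsplit(stripped, ",")

# First piece is latitude, second is longitude; NA inputs stay NA
lat <- as.numeric(sapply(parts, function(p) trimws(p[1])))
lon <- as.numeric(sapply(parts, function(p) trimws(p[2])))

data.frame(lat = lat, lon = lon)
```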
#### Weapons
```{r}
#Grab em
weapon <- victims$Weapon
#Replace blank with NA
weapon <- blanksToNA(weapon)
#Reassign levels
weapon <- factor(weapon)
#Pop back into data frame
victims$Weapon <- weapon
```
Now it's time for a more complicated task: dates.
There are two factor variables representing date and time. I'll combine them into a single date-time (POSIXct) column.
```{r}
#Append Time to Date
victims$CrimeDateTime <- paste(victims$CrimeDate, victims$CrimeTime)
#Let's look at it
str(victims$CrimeDateTime)
#Now let's convert these to date objects
victims$CrimeDateTime <- as.POSIXct(victims$CrimeDateTime, format = "%m/%d/%Y %H:%M:%S")
#Let's look at the new version
str(victims$CrimeDateTime)
#Let's wipe the unneeded Date and Time factors
victims$CrimeDate <- NULL
victims$CrimeTime <- NULL
```
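Two details of this conversion are easy to miss: `paste()` turns a missing value into the literal text "NA", and `as.POSIXct()` returns `NA` for any string that does not match the format instead of raising an error. A minimal sketch on toy values (`stamps` and `parsed` are hypothetical names). The chunk above does not pass `tz`, so it uses the session's time zone; `tz = "UTC"` here is only an assumption to keep the sketch reproducible.

```r
# Toy date and time pieces; the second pair is missing
stamps <- paste(c("11/22/2017", NA), c("14:30:00", NA))
stamps  # the missing pair has become the text "NA NA"

# Same format string as above; tz = "UTC" is an assumption for reproducibility
parsed <- as.POSIXct(stamps, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Strings that do not match the format come back as NA
is.na(parsed)
```

Counting `is.na()` on the real CrimeDateTime column after conversion is a cheap way to see how many rows failed to parse.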
Finally, let's save the workspace image so we can pick up from here.
```{r saving image}
save.image(file = "cleanedData.RData")
```
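To pick up in a later session, the saved image can be read back with `load()`. A minimal sketch of the round trip using a toy object and a temporary file (`toyFrame`, `imagePath`, and `restored` are hypothetical names); in practice the file would be `cleanedData.RData`.

```r
# Toy object and a temporary file, both only for illustration
toyFrame <- data.frame(x = 1:3)
imagePath <- tempfile(fileext = ".RData")
save(toyFrame, file = imagePath)

# Load into a fresh environment so nothing in the current session is overwritten
restored <- new.env()
load(imagePath, envir = restored)
ls(restored)
```

Loading into a separate environment is optional; calling `load()` without `envir` restores the objects directly into the current workspace.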