p8105_hw2_zdz2101

Zelos Zhu 10/3/2018

Load Packages

library(tidyverse)
library(readxl)
library(p8105.datasets)

Problem 1

#Read in the data and then tidy it
nyc_transit_df <- 
  read_csv("Data Files/NYC_Transit_Subway_Entrance_And_Exit_Data.csv") %>%
  janitor::clean_names() %>%
  select(., line, station_name, station_latitude, station_longitude, route1, route2, route3, route4, route5, route6,
                route7, route8, route9, route10, route11, entrance_type, entry, vending, ada) %>% 
  mutate(entry = recode(entry, "YES" = TRUE, "NO" = FALSE))

#Distinct Stations -- unite just line name and station name
distinct_stations1 <- unique(paste(nyc_transit_df$line, nyc_transit_df$station_name, sep = "/"))
head(distinct_stations1)

## [1] "4 Avenue/25th St" "4 Avenue/36th St" "4 Avenue/45th St"
## [4] "4 Avenue/53rd St" "4 Avenue/59th St" "4 Avenue/77th St"

length(distinct_stations1)

## [1] 465

#Distinct Stations -- if you include Route number, in theory should come out to be the same length
#I expand on the discrepancy in my written solution
distinct_stations2 <- unique(gsub("/NA", "", apply(nyc_transit_df[,names(nyc_transit_df)[c(1,2,5:15)]], 1, paste , collapse = "/" )))
head(distinct_stations2)

## [1] "4 Avenue/25th St/R"   "4 Avenue/36th St/N/R" "4 Avenue/45th St/R"  
## [4] "4 Avenue/53rd St/R"   "4 Avenue/59th St/N/R" "4 Avenue/77th St/R"

length(distinct_stations2)

## [1] 467

#Table of ADA compliance (TRUEs and FALSEs)
table(nyc_transit_df$ada)

## 
## FALSE  TRUE 
##  1400   468

#2-way table of vending and entrances, easy way to illustrate what we're looking for
table(data.frame(entry=nyc_transit_df$entry, vending=nyc_transit_df$vending))

##        vending
## entry     NO  YES
##   FALSE  114    1
##   TRUE    69 1684

Write a short paragraph about this dataset – explain briefly what variables the dataset contains, describe your data cleaning steps so far, and give the dimension (rows x columns) of the resulting dataset. Are these data tidy?

The dataset contains information about subway stops: name of the stop, where the stop is, what lines runs through it, the type of entrance at the specific stop, whether or not it's wheelchair friendly and whether or not vendors exist at that station.

In terms of cleaning we've done, we selected the columns that was requested of us: line, station, name, station latitude / longitude, routes served, entry, vending, entrance type, and ADA compliance. Then we changed the entry variable to be TRUE/FALSE vs YES/NO. After all this we have a result of 1868 observations and 19 variables. I would consider this data tidy.

Answer the following questions using these data:

How many distinct stations are there? Note that stations are identified both by name and by line (e.g. 125th St A/B/C/D; 125st 1; 125st 4/5); the distinct function may be useful here.
How many stations are ADA compliant?
What proportion of station entrances / exits without vending allow entrance?

Solution

There are 465 distinct stations. However if you dig a little deeper, you see that if you were to combine route into the line/station names you end up with 467 distinct stations, a discrepancy of 2 stations. It's important to shine a light on these stations so the first one is: "Queens Boulevard/Forest Hills-71st Av" has a station that serves the "e", a miscoded letter for "E" (row 1628 of the original data). The other station is "6 Avenue/47-50th Sts Rockefeller Center/B/D/F/M/7" (row 109 of the original data) where the 7 does not actually serve this stop; I doubled check with the MTA official stops, another miscode in the dataset.
Among the total 1868 stations, there are 468 ADA compliant stations.
Among the 183 stations that don't have vending, 0.6052632 of those stations allow entrance.

Reformat data so that route number and route name are distinct variables. How many distinct stations serve the A train? Of the stations that serve the A train, how many are ADA compliant?

#Reformat data so that route number and route name are distinct variables
nyc_transit_df_reformated <-  nyc_transit_df %>% 
  gather( key = route_num, value = route_name, route1:route11) %>%
  mutate(route_name = tolower(route_name))
  
head(nyc_transit_df_reformated)

## # A tibble: 6 x 10
##   line  station_name station_latitude station_longitu… entrance_type entry
##   <chr> <chr>                   <dbl>            <dbl> <chr>         <lgl>
## 1 4 Av… 25th St                  40.7            -74.0 Stair         TRUE 
## 2 4 Av… 25th St                  40.7            -74.0 Stair         TRUE 
## 3 4 Av… 36th St                  40.7            -74.0 Stair         TRUE 
## 4 4 Av… 36th St                  40.7            -74.0 Stair         TRUE 
## 5 4 Av… 36th St                  40.7            -74.0 Stair         TRUE 
## 6 4 Av… 45th St                  40.6            -74.0 Stair         TRUE 
## # ... with 4 more variables: vending <chr>, ada <lgl>, route_num <chr>,
## #   route_name <chr>

#Filter by A stations using this gather method
A_stations <- nyc_transit_df_reformated %>%
  filter(route_name == "a")

distinct_A_stations <- unique(A_stations$station_name)

A_stations_ada_comp <- A_stations$station_name[which(A_stations$ada == TRUE)]

There are 56 distinct stations that serve the A train.
Among the total 273 A stations, 107 are ADA compliant.

Problem 2

This problem uses the Mr. Trash Wheel dataset, available as an Excel file on the course website.

Read and clean the Mr. Trash Wheel sheet: -specify the sheet in the Excel file and to omit columns containing notes (using the range argument and cell_cols() function) -use reasonable variable names -omit rows that do not include dumpster-specific data -rounds the number of sports balls to the nearest integer and converts the result to an integer variable (using as.integer).

Read and clean precipitation data for 2016 and 2017. For each, omit rows without precipitation data and add a variable year. Next, combine datasets and convert month to a character variable (the variable month.name is built into R and should be useful).

mr_trash_wheel <- read_excel("Data Files/HealthyHarborWaterWheelTotals2018-7-28.xlsx", sheet = "Mr. Trash Wheel", range = cell_cols("A:N")) %>%
  janitor::clean_names() %>%
  filter(!grepl("Total", month) & !is.na(dumpster) ) %>%
  mutate(sports_balls = as.integer(round(sports_balls))) %>%
  mutate(weekday_collected = factor(weekdays(date), levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")))

precip_2016_df <- read_excel("Data Files/HealthyHarborWaterWheelTotals2018-7-28.xlsx", sheet = "2016 Precipitation", range = "A2:B14") %>%
  janitor::clean_names() %>%
  mutate(year = rep(2016, nrow(.)))

precip_2017_df <- read_excel("Data Files/HealthyHarborWaterWheelTotals2018-7-28.xlsx", sheet = "2017 Precipitation", range = "A2:B14") %>%
  janitor::clean_names() %>%
  mutate(year = rep(2017, nrow(.)))

precip_df <- rbind(precip_2016_df, precip_2017_df) %>% 
  mutate(month = month.name[month]) %>%
  select(month, year, total)

Write a paragraph about these data; you are encouraged to use inline R. Be sure to note the number of observations in both resulting datasets, and give examples of key variables. For available data, what was the total precipitation in 2017? What was the median number of sports balls in a dumpster in 2016?

The Mr. Trash Wheel dataset contains 285 observations with 15 variables. The precipitation data contains 24 observations with 3 variables.Some fun key examples:

Dumpster IDs 6, 22, 23, 29, 44, 64, 65, 67, 68, 69, 70, 71, 87, 89, 90, 93, 94, 96, 97, 101, 105, 117, 123, 126, 127, 135, 138, 151, 161, 164, 202, 216, 220, 227, 231, 232, 233, 234, 239, 240, 241, 242, 265, 266, 267, 268, 269, 270, 275, 276, 277, 278 reported their data from Tuesdays.
Dumpster IDs 61, 62, 67, 72, 74, 75, 82, 87, 96, 97, 99, 102, 107, 113 collected more than 4 tons in 2015.
Between May 2015 and August 2017, a total of 561180 plastic bottles was collected.
Dumpster IDs collected 18 or more cubic yards in 2016.

Total precipitation in 2017 was 32.93.

In 2016, median number of sports balls in a dumpster was 26 balls.

Problem 3

This problem uses the BRFSS data. DO NOT include this dataset in your local data directory; instead, load the data from the p8105.datasets package.

For this question: -format the data to use appropriate variable names; -focus on the “Overall Health” topic -exclude variables for class, topic, question, sample size, and everything from lower confidence limit to GeoLocation -structure data so that responses (excellent to poor) are variables taking the value of Data_value (need to go OH) -create a new variable showing the proportion of responses that were “Excellent” or “Very Good”

data("brfss_smart2010")
brfss_smart2010 <- janitor::clean_names(brfss_smart2010) %>%
  filter(topic == "Overall Health") %>%
  select(year, locationabbr, locationdesc, response, data_value) 

brfss_smart2010_spread <- spread(brfss_smart2010, response, data_value) %>%
  janitor::clean_names() %>%
  mutate(exc_vgood = excellent + very_good) %>%
  select(year, locationabbr, locationdesc, poor, fair, good, very_good, excellent, exc_vgood)

Using this dataset, do or answer the following:

1) How many unique locations are included in the dataset? Is every state represented? What state is observed the most?

#How many unique locations are included in the dataset?  
length(unique(brfss_smart2010$locationdesc))

## [1] 404

#Is every state represented?
length(unique(brfss_smart2010$locationabbr)) #more than 50 b/c we counting DC

## [1] 51

#What state is observed the most?
sort(table(brfss_smart2010$locationabbr), decreasing = TRUE) #New Jersey

## 
##  NJ  FL  NC  WA  MD  MA  TX  NY  SC  CO  OH  PA  NE  CA  UT  NH  VT  CT 
## 730 610 575 485 450 395 355 325 315 295 295 295 265 260 250 240 240 235 
##  LA  NM  OK  KS  RI  MI  MN  OR  AZ  ID  HI  ME  DE  GA  TN  IL  MO  MS 
## 225 215 200 190 190 170 165 165 160 160 155 155 135 135 130 125 125 115 
##  WY  AR  IN  AL  MT  ND  NV  SD  IA  AK  DC  KY  WI  WV  VA 
## 110 105 105  90  90  90  90  90  70  55  45  45  45  45  20

There are 404 unique locations.
All 50 states are represented but if you include Washington DC as a "state" then we have 51.
New Jersey was observed the most.

2) In 2002, what is the median of the “Excellent” response value?

median_exc_2002 <- median(brfss_smart2010_spread$excellent[which(brfss_smart2010_spread$year == 2002 )], na.rm = TRUE)

The median of Excellent response value in 2002 was 23.6.

3) Make a histogram of “Excellent” response values in the year 2002.

excellent_histo <- brfss_smart2010_spread %>% filter(year == 2002)

(ggplot(excellent_histo, aes(x = excellent))
 + geom_histogram(alpha = 0.5, color = "blue", fill = "yellow")
 + theme_minimal())

## Warning: Removed 2 rows containing non-finite values (stat_bin).

4) Make a scatterplot showing the proportion of “Excellent” response values in New York County and Queens County (both in NY State) in each year from 2002 to 2010.

ny_excellent <- filter(brfss_smart2010_spread, locationdesc %in% c("NY - New York County","NY - Queens County"))

(ggplot(ny_excellent, aes(x=year, y=excellent))
  + geom_point(aes(color=locationdesc))
  + theme_minimal())

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

p8105_hw2_zdz2101.md

p8105_hw2_zdz2101.md

p8105_hw2_zdz2101

Load Packages

Problem 1

Write a short paragraph about this dataset – explain briefly what variables the dataset contains, describe your data cleaning steps so far, and give the dimension (rows x columns) of the resulting dataset. Are these data tidy?

Answer the following questions using these data:

Solution

Reformat data so that route number and route name are distinct variables. How many distinct stations serve the A train? Of the stations that serve the A train, how many are ADA compliant?

Problem 2

Problem 3

Using this dataset, do or answer the following:

1) How many unique locations are included in the dataset? Is every state represented? What state is observed the most?

2) In 2002, what is the median of the “Excellent” response value?

3) Make a histogram of “Excellent” response values in the year 2002.

4) Make a scatterplot showing the proportion of “Excellent” response values in New York County and Queens County (both in NY State) in each year from 2002 to 2010.

Files

p8105_hw2_zdz2101.md

Latest commit

History

p8105_hw2_zdz2101.md

File metadata and controls

p8105_hw2_zdz2101

Load Packages

Problem 1

Write a short paragraph about this dataset – explain briefly what variables the dataset contains, describe your data cleaning steps so far, and give the dimension (rows x columns) of the resulting dataset. Are these data tidy?

Answer the following questions using these data:

Solution

Reformat data so that route number and route name are distinct variables. How many distinct stations serve the A train? Of the stations that serve the A train, how many are ADA compliant?

Problem 2

Problem 3

Using this dataset, do or answer the following:

1) How many unique locations are included in the dataset? Is every state represented? What state is observed the most?

2) In 2002, what is the median of the “Excellent” response value?

3) Make a histogram of “Excellent” response values in the year 2002.

4) Make a scatterplot showing the proportion of “Excellent” response values in New York County and Queens County (both in NY State) in each year from 2002 to 2010.