Refactor Standard Survey Creation #12

jhelsel11 · 2018-10-08T20:48:35Z

No description provided.

jhelsel11 · 2018-10-08T20:49:13Z

Q1) Is there a reason to keep the distributed Rproject files rather than consolidating into a master project?

jhelsel11 · 2018-10-08T21:03:41Z

Q2) Is there a reason to use factors rather than options(stringsAsFactors = FALSE)?

jhelsel11 · 2018-10-08T21:12:32Z

Q3) Is there a reason to keep so many copies of dataframes? (e.g. the progression for survey.BAR. ...)

jhelsel11 · 2018-10-08T21:53:16Z

Q4) In the Caltrain survey there are survey IDs with letters, so the as.numeric coercion creates NA values. I'm leaving them as characters for now, but am unsure how you want this handled. (Roughly same problem for Muni.)

jhelsel11 · 2018-10-08T22:59:05Z

Q5) Why did we track survey_tech if it gets dropped in survey_to_flat?

jhelsel11 · 2018-10-08T23:35:39Z

Q6) How would you like me to handle the "Bespoke survey technology"? Put this into a csv? Leave it alone?

DavidOry · 2018-10-09T15:27:24Z

Q1) Is there a reason to keep the distributed Rproject files rather than consolidating into a master project?

What do you propose as the scope? One Rproject for the entire repository?

DavidOry · 2018-10-09T15:28:12Z

Q2) Is there a reason to use factors rather than options(stringsAsFactors = FALSE)?

Are you suggesting as a global setting? Or specific to each data read?

jhelsel11 · 2018-10-09T15:53:32Z

Q1) Is there a reason to keep the distributed Rproject files rather than consolidating into a master project?

What do you propose as the scope? One Rproject for the entire repository?

I'm not sure I'd go that far. I just wanted to better understand why you selected the scope of the project you did.

jhelsel11 · 2018-10-09T15:55:10Z

Q2) Is there a reason to use factors rather than options(stringsAsFactors = FALSE)?

Are you suggesting as a global setting? Or specific to each data read?

As a global option. (I've implemented that for now, but can remove it.) So far I haven't seen anything implementing a modeling approach that would use factors and I've found that factors are much more of a headache in overhead than just treating everything as a string.

DavidOry · 2018-10-09T18:02:00Z

make-uniform/production/Build Standard Database.Rmd


 #### _ISSUES_
 1.  Work carefully through first few datasets, as the code is a bit coarse in places

 #### _TODO_
+1. New users should add their username and relative file path to 


@jhelsel11: Given that this database will be rather small, may it be easier just to list the user names and dirs here and build the dataframe rather than using a separate CSV?

DavidOry · 2018-10-09T18:04:09Z

Q1) Is there a reason to keep the distributed Rproject files rather than consolidating into a master project? What do you propose as the scope? One Rproject for the entire repository? I'm not sure I'd go that far. I just wanted to better understand why you selected the scope of the project you did.
--

The nature of working at an MPO. Where you can do a little bit for a few weeks, then stop, then start up six months later. So it was developed piecemeal, i.e., nothing intentional or thoughtful about the current design. I do encourage you to update the Rproject files so that they contain a logical scope, i.e., navigating around in RStudio makes sense, the working dir makes sense, etc.

DavidOry · 2018-10-09T18:05:53Z

      Q3) Is there a reason to keep so many copies of dataframes? (e.g. the progression for survey.BAR. ...)

No good reason. I was likely getting familiar with dplyr at the time and was exploring ways to use it. It does allow for easier debugging, which is less important moving forward (i.e., the code has been used a lot more now).

DavidOry · 2018-10-09T18:07:40Z

      Q4) In the Caltrain survey there are survey IDs with letters, so the as.numeric coercion creates NA values. I'm leaving them as characters for now, but am unsure how you want this handled. (Roughly same problem for Muni.)

Hmm. Not sure. But agree we should move ahead with characters. Likely just a bug.

DavidOry · 2018-10-09T18:08:51Z

make-uniform/production/Build Standard Database.Rmd

+  filter(user == me) %>%
+  .$path_switch
+
+f_spatial_to_be_geocoded_path = paste0(dir_path, "Data and Reports/_geocoding Standardized/spatial_places_to_be_geocoded.csv")


Given that you have Data and Reports written over and over again, may it be better (i.e., easier to read) to assume this part of the path lives in dir_path?

DavidOry · 2018-10-09T18:11:06Z

make-uniform/production/Build Standard Database.Rmd

-survey.VTA.select <- survey.VTA[ ,variables.VTA]
+survey_VTA <- read.csv(f_vta_survey_path, header = TRUE)
+
+# # JWH: I don't understand why this code is needed_


When the data is being read in, it's, right now, a somewhat iterative and tedious process to make sure the dictionary includes all the variables you want it to include. So we should add a note here that we need some type of helper method like this when reading in a new survey. I'll create an issue for this and we can address it when we start using the code to set up new operators. For now, this code snippet can be removed.

DavidOry · 2018-10-09T18:18:10Z

      Q5) Why did we track survey_tech if it gets dropped in survey_to_flat?

It gets added back in:

survey_adds <- survey_combine %>%
  mutate(survey_year = as.character(survey_year)) %>%
  group_by(ID, operator, survey_year, survey_tech) %>%
  summarise(count = n()) %>%
  select(-count)

survey_flat <- survey_flat %>%
  left_join(survey_adds, by = c("ID", "operator", "survey_year"))

DavidOry · 2018-10-09T18:21:36Z

make-uniform/production/Build Standard Database.Rmd

-  mutate(survey_tech = ifelse(operator == "VTA" & str_count(route, pattern = fixed("902", ignore_case = TRUE)) > 0L, "light rail", survey_tech))
+# There was a flag in the original conversion text for VTA route 185, but no 
+# records had a route 185, so it has been dropped from the csv.
+survey_tech_update <- read.csv("bespoke_survey_tech_replacement.csv")


I'd recommend explicitly reading this in as characters (colClasses = 'character') given the mix of data types you're dealing with in the database.

jhelsel11 · 2018-10-09T21:46:35Z

Q7) The Standardize year born step (creation of survey_standard) implicitly changes Missing codes to NA. Do you want to continue with that implicit conversion or should that step be explicit?

jhelsel11 · 2018-10-09T22:40:54Z

Q8) @DavidOry I think the recode key variables from NA to 'missing' step should be handled in 2 mutate_at steps, but I can't get the sytnax to work.

Something like the following for the first step:
survey_standard <- survey_standard %>%
mutate_at(c("orig_purp", "dest_purp", "work_status", "student_status", "approximate_age"),
~ifelse(is.na(.), "missing", .)

R doesn't like the ifelse inside mutate_at, but I don't really understand why.

jhelsel11 · 2018-10-09T23:47:24Z

Q9) I'm deep into reformatting the Build standard variables chunk and trying to keep lines under 80 characters. That's leading to its own weirdness. I think it would be helpful if @DavidOry took a look to see if that's a formatting you like or whether I should roll back those changes.

DavidOry · 2018-10-10T11:59:25Z

(1) the 80 chars width should be broken if it improves readability.
(2) i'll take a look at the mutate_at later today.

jhelsel11 · 2018-10-10T16:06:12Z

(1) the 80 chars width should be broken if it improves readability.
(2) i'll take a look at the mutate_at later today.

I guess the question is whether it degrades readability. I like having each clause on its own line, but it does require you to scroll more.

jhelsel11 · 2018-10-10T16:23:50Z

Q7) The Standardize year born step (creation of survey_standard) implicitly changes Missing codes to NA. Do you want to continue with that implicit conversion or should that step be explicit?

As a more general question, should I take extra steps to ensure that there are no implicit NA conversions?

DavidOry · 2018-10-16T19:05:39Z

make-uniform/production/Build Standard Database.Rmd

@@ -108,19 +107,19 @@ f_vta_survey_path <- paste0(dir_path,
  "VTA/As CSV/VTA_DRAFTFINAL_20171114 NO POUND OR SINGLE QUOTE.csv")

 f_output_rdata_path <- paste0(dir_path, 
-  "_data Standardized/survey_standard.RData")
+  "_data Standardized/survey_standard.RDS")


@jhelsel11: When using saveRDS, we should use the RDS extension, so users know that it's an RDS file and not an Rdata file.

DavidOry · 2018-10-16T19:07:18Z

make-uniform/production/compare_against_previous.R


-table(thin_df$var_name)
+# update the Caltrain IDs and do again
+update_current_df <- current_df %>%


@jhelsel11: you are correct re: Caltrain bug. When the IDs are updated to remove the leading "S", the previous and current data sets match.

# Conflicts: # decomposition/DecompositionAnalysis_SI.r # make-uniform/production/Build Standard Database.Rmd

jhelsel11 · 2018-10-31T21:11:35Z

@shimonisrael
Would you consider the AC Transit Transbay buses express buses or local buses? Dave has asked me to make sure the AC Transit information is running properly and I'm taking a look at the bespoke survey tech file. My thought was to treat all lettered buses as express (since they cross the bay)?

shimonisrael · 2018-10-31T22:30:09Z

@shimonisrael
Would you consider the AC Transit Transbay buses express buses or local buses? Dave has asked me to make sure the AC Transit information is running properly and I'm taking a look at the bespoke survey tech file. My thought was to treat all lettered buses as express (since they cross the bay)?

@jhelsel11 Yes, please. Those should all be express. If you could, please check for the free Broadway Shuttle in Oakland, which may be an exception as a lettered route. Otherwise all the lettered buses are express. Also, if I've not responded elsewhere to any questions, I'm a little behind so always feel free to email if I've overlooked anything. Thanks.

jhelsel11 · 2018-10-31T22:34:50Z

@shimonisrael The broadway shuttle is coded as local.

jhelsel11 · 2018-11-01T18:53:32Z

@DavidOry The vehicle and worker dictionaries seem unnecessary. 1) They include lots of values that aren't used in the dictionary (e.g. 'twelve') and 2) couldn't this just be done in the dictionary crosswalk?

Hopefully I'm not misreading this, but I'll make the change after lunch.

jhelsel11 · 2018-11-01T20:13:47Z

@DavidOry The vehicle and worker dictionaries seem unnecessary. 1) They include lots of values that aren't used in the dictionary (e.g. 'twelve') and 2) couldn't this just be done in the dictionary crosswalk?

Hopefully I'm not misreading this, but I'll make the change after lunch.

Never mind. After looking at the auto sufficiency code, I've decided to leave this alone for now. Still, it would be helpful to understand why the codes go up to "Six or more" and the numerical categories for the division only go up to 4.

jhelsel11 · 2018-11-01T20:37:53Z

@DavidOry Why are the "transfer to and from" and "first boarding and last alighting technology" codes asymmetric? The transfer code looks as far back prior to boarding the surveyed route. The technology code looks as far to the end after the surveyed route. I don't believe there's anything wrong, I just don't understand why.

DavidOry · 2018-11-01T20:42:53Z

@DavidOry Why are the "transfer to and from" and "first boarding and last alighting technology" codes asymmetric? The transfer code looks as far back prior to boarding the surveyed route. The technology code looks as far to the end after the surveyed route. I don't believe there's anything wrong, I just don't understand why.

I do not know. Can you point me to the code segment that you have concerns with?

jhelsel11 · 2018-11-01T20:43:48Z

# Step 6:  Travel Model One path details
# -------------------------------------------------------------------------------------

# Transfer to and from
survey_standard <- survey_standard %>%
  mutate(transfer_from = as.character(first_before_operator)) %>%
  mutate(transfer_from = ifelse(second_before_operator != 'None', 
                                as.character(second_before_operator), 
                                transfer_from)) %>%
  
  mutate(transfer_from = ifelse(third_before_operator  != 'None', 
                                as.character(third_before_operator),
                                transfer_from)) %>%
  
  mutate(transfer_to = as.character(first_after_operator))

# First boarding and last alighting technology
survey_standard <- survey_standard %>%
  mutate(first_board_tech = as.character(survey_tech)) %>%
  mutate(first_board_tech = ifelse(!is.na(first_before_technology), 
                                   first_before_technology, 
                                   first_board_tech)) %>%
  
  mutate(last_alight_tech = as.character(survey_tech)) %>%
  mutate(last_alight_tech = ifelse(!is.na(first_after_technology),
                                   first_after_technology,
                                   last_alight_tech)) %>%
  
  mutate(last_alight_tech = ifelse(!is.na(second_after_technology), 
                                   second_after_technology, 
                                   last_alight_tech)) %>%
  
  mutate(last_alight_tech = ifelse(!is.na(third_after_technology),
                                   third_after_technology,
                                   last_alight_tech))

…r ac_transit

DavidOry · 2018-11-01T22:56:01Z

Oh, I see. So the idea is that the "first boarding" and "last alighting" is important because it speaks to the way people access and egress transit, i.e., you walk to your first boarding and from your last alighting. So the purpose here is to understand you're first and last encounters with transit.

The transfer to/from is to understand the sequence of technologies, which speaks to the path label we give the movement in mode choice models, as well as the number of transfers.

Does that make sense?

jhelsel11 · 2018-11-01T23:04:07Z

Yes, I believe that makes sense, but won't this operation cause mismatches? E.g. suppose I am surveyed on Muni, I transfer next to BART and finally to a Tiburon ferry where I get off transit. The final record will be transfer_to = BART and last_alight_tech = "ferry", which doesn't make any sense.

DavidOry · 2018-11-02T12:14:06Z

This makes sense to me. The respondent did "transfer to" BART after being surveyed and the last technology they used in their path before egressing from transit was ferry.

The transfer_to and transfer_from variables provide an interesting summary of the other operators the surveyed operator engages with. And the first_board and last_alight variables provide insight into the way in which the traveler accessed/egressed transit (the surveyed operator in this case is not important).

Thoughts?

jhelsel11 · 2018-11-02T15:56:26Z

I think the way it's framed is fine. I just wanted to be sure that that was indeed what was intended when the code was written. When I worked through the code I expected the reported data to be symmetric and thought I'd see what purpose the code served.

jhelsel11 · 2018-11-02T16:15:15Z

I've finished my review for Build Standard Database for AC Transit. I pushed a couple of corrections in the crosswalk where the specified crosswalk values didn’t correspond with the values I was seeing in R. Other than our prior conversation on the transfer and tech conversation above I think it looks good now. I'm going to turn to coding the BART and Caltrain routes.

jhelsel11 · 2019-01-08T23:45:03Z

This branch is defunct.

Begin refactor of standard survey data creation script

d633275

Refactor standard db creation to method library for standardization.

a3f9458

DavidOry reviewed Oct 9, 2018

View reviewed changes

Update bespoke survey process

f433934

DavidOry reviewed Oct 9, 2018

View reviewed changes

Update to standard database code

5696fcf

Update build standard database

2fbb38d

Completed first pass through Build Standard Database

dc5c04a

jhelsel11 and others added 2 commits October 15, 2018 08:35

update standard database.

43a1822

renamed RDS output, fleshed out compare

497e8f4

DavidOry reviewed Oct 16, 2018

View reviewed changes

DavidOry and others added 4 commits October 24, 2018 15:32

start on ac transit to standard

b34f4ca

initial pass through dictionary

48e635f

Merge branch 'add-ac-transit-standard-REBASED' into refactor_survey

c14d8d7

# Conflicts: # decomposition/DecompositionAnalysis_SI.r # make-uniform/production/Build Standard Database.Rmd

put file reads in alphabetical order.

a0d9a39

jhelsel11 added 2 commits October 31, 2018 15:27

add AC transit to survey tech csv

12c1415

minor adjustments to order of code.

2f742f3

Fixed AC Transit worker code.

bdda105

jhelsel11 added 3 commits November 1, 2018 15:31

Update number_transfers_orig_board and number_transfer_alight_dest fo…

a33460f

…r ac_transit

remove shuttle from bespoke survey tech

b7f67d7

small changes to build standard database

5fe891a

jhelsel11 closed this Jan 8, 2019

jhelsel11 deleted the refactor_survey branch January 8, 2019 23:47

Refactor Standard Survey Creation #12

Refactor Standard Survey Creation #12

Conversation

jhelsel11 commented Oct 8, 2018

jhelsel11 commented Oct 8, 2018

jhelsel11 commented Oct 8, 2018

jhelsel11 commented Oct 8, 2018

jhelsel11 commented Oct 8, 2018

jhelsel11 commented Oct 8, 2018

jhelsel11 commented Oct 8, 2018

DavidOry commented Oct 9, 2018

DavidOry commented Oct 9, 2018

jhelsel11 commented Oct 9, 2018

jhelsel11 commented Oct 9, 2018

DavidOry Oct 9, 2018

Choose a reason for hiding this comment

DavidOry commented Oct 9, 2018 • edited Loading

DavidOry commented Oct 9, 2018

DavidOry commented Oct 9, 2018

DavidOry Oct 9, 2018

Choose a reason for hiding this comment

DavidOry Oct 9, 2018

Choose a reason for hiding this comment

DavidOry Oct 9, 2018

Choose a reason for hiding this comment

DavidOry commented Oct 9, 2018 • edited Loading

DavidOry Oct 9, 2018

Choose a reason for hiding this comment

jhelsel11 commented Oct 9, 2018

jhelsel11 commented Oct 9, 2018

jhelsel11 commented Oct 9, 2018

DavidOry commented Oct 10, 2018

jhelsel11 commented Oct 10, 2018

jhelsel11 commented Oct 10, 2018

DavidOry Oct 16, 2018

Choose a reason for hiding this comment

DavidOry Oct 16, 2018

Choose a reason for hiding this comment

jhelsel11 commented Oct 31, 2018

shimonisrael commented Oct 31, 2018

jhelsel11 commented Oct 31, 2018

jhelsel11 commented Nov 1, 2018

jhelsel11 commented Nov 1, 2018

jhelsel11 commented Nov 1, 2018

DavidOry commented Nov 1, 2018

jhelsel11 commented Nov 1, 2018 • edited Loading

DavidOry commented Nov 1, 2018

jhelsel11 commented Nov 1, 2018

DavidOry commented Nov 2, 2018 • edited Loading

jhelsel11 commented Nov 2, 2018

jhelsel11 commented Nov 2, 2018

jhelsel11 commented Jan 8, 2019

DavidOry commented Oct 9, 2018 •

edited

Loading

DavidOry commented Oct 9, 2018 •

edited

Loading

jhelsel11 commented Nov 1, 2018 •

edited

Loading

DavidOry commented Nov 2, 2018 •

edited

Loading