-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Standard Survey Creation #12
Conversation
Q1) Is there a reason to keep the distributed Rproject files rather than consolidating into a master project? |
Q2) Is there a reason to use factors rather than options(stringsAsFactors = FALSE)? |
Q3) Is there a reason to keep so many copies of dataframes? (e.g. the progression for survey.BAR. ...) |
Q4) In the Caltrain survey there are survey IDs with letters, so the as.numeric coercion creates NA values. I'm leaving them as characters for now, but am unsure how you want this handled. (Roughly same problem for Muni.) |
Q5) Why did we track survey_tech if it gets dropped in survey_to_flat? |
Q6) How would you like me to handle the "Bespoke survey technology"? Put this into a csv? Leave it alone? |
What do you propose as the scope? One Rproject for the entire repository? |
Are you suggesting as a global setting? Or specific to each data read? |
I'm not sure I'd go that far. I just wanted to better understand why you selected the scope of the project you did. |
As a global option. (I've implemented that for now, but can remove it.) So far I haven't seen anything implementing a modeling approach that would use factors and I've found that factors are much more of a headache in overhead than just treating everything as a string. |
|
||
#### _ISSUES_ | ||
1. Work carefully through first few datasets, as the code is a bit coarse in places | ||
|
||
#### _TODO_ | ||
1. New users should add their username and relative file path to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jhelsel11: Given that this database will be rather small, may it be easier just to list the user names and dirs here and build the dataframe rather than using a separate CSV?
The nature of working at an MPO. Where you can do a little bit for a few weeks, then stop, then start up six months later. So it was developed piecemeal, i.e., nothing intentional or thoughtful about the current design. I do encourage you to update the |
No good reason. I was likely getting familiar with |
Hmm. Not sure. But agree we should move ahead with |
filter(user == me) %>% | ||
.$path_switch | ||
|
||
f_spatial_to_be_geocoded_path = paste0(dir_path, "Data and Reports/_geocoding Standardized/spatial_places_to_be_geocoded.csv") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that you have Data and Reports
written over and over again, may it be better (i.e., easier to read) to assume this part of the path lives in dir_path
?
survey.VTA.select <- survey.VTA[ ,variables.VTA] | ||
survey_VTA <- read.csv(f_vta_survey_path, header = TRUE) | ||
|
||
# # JWH: I don't understand why this code is needed_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the data is being read in, it's, right now, a somewhat iterative and tedious process to make sure the dictionary includes all the variables you want it to include. So we should add a note here that we need some type of helper method like this when reading in a new survey. I'll create an issue for this and we can address it when we start using the code to set up new operators. For now, this code snippet can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue #13
It gets added back in:
|
mutate(survey_tech = ifelse(operator == "VTA" & str_count(route, pattern = fixed("902", ignore_case = TRUE)) > 0L, "light rail", survey_tech)) | ||
# There was a flag in the original conversion text for VTA route 185, but no | ||
# records had a route 185, so it has been dropped from the csv. | ||
survey_tech_update <- read.csv("bespoke_survey_tech_replacement.csv") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend explicitly reading this in as characters (colClasses = 'character'
) given the mix of data types you're dealing with in the database.
Q7) The Standardize year born step (creation of survey_standard) implicitly changes Missing codes to NA. Do you want to continue with that implicit conversion or should that step be explicit? |
Q8) @DavidOry I think the recode key variables from NA to 'missing' step should be handled in 2 mutate_at steps, but I can't get the sytnax to work. Something like the following for the first step: R doesn't like the ifelse inside mutate_at, but I don't really understand why. |
Q9) I'm deep into reformatting the Build standard variables chunk and trying to keep lines under 80 characters. That's leading to its own weirdness. I think it would be helpful if @DavidOry took a look to see if that's a formatting you like or whether I should roll back those changes. |
(1) the 80 chars width should be broken if it improves readability. |
I guess the question is whether it degrades readability. I like having each clause on its own line, but it does require you to scroll more. |
As a more general question, should I take extra steps to ensure that there are no implicit NA conversions? |
@@ -108,19 +107,19 @@ f_vta_survey_path <- paste0(dir_path, | |||
"VTA/As CSV/VTA_DRAFTFINAL_20171114 NO POUND OR SINGLE QUOTE.csv") | |||
|
|||
f_output_rdata_path <- paste0(dir_path, | |||
"_data Standardized/survey_standard.RData") | |||
"_data Standardized/survey_standard.RDS") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jhelsel11: When using saveRDS, we should use the RDS extension, so users know that it's an RDS
file and not an Rdata
file.
|
||
table(thin_df$var_name) | ||
# update the Caltrain IDs and do again | ||
update_current_df <- current_df %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jhelsel11: you are correct re: Caltrain bug. When the ID
s are updated to remove the leading "S", the previous and current data sets match.
# Conflicts: # decomposition/DecompositionAnalysis_SI.r # make-uniform/production/Build Standard Database.Rmd
@shimonisrael |
@jhelsel11 Yes, please. Those should all be express. If you could, please check for the free Broadway Shuttle in Oakland, which may be an exception as a lettered route. Otherwise all the lettered buses are express. Also, if I've not responded elsewhere to any questions, I'm a little behind so always feel free to email if I've overlooked anything. Thanks. |
@shimonisrael The broadway shuttle is coded as local. |
@DavidOry The vehicle and worker dictionaries seem unnecessary. 1) They include lots of values that aren't used in the dictionary (e.g. 'twelve') and 2) couldn't this just be done in the dictionary crosswalk? Hopefully I'm not misreading this, but I'll make the change after lunch. |
Never mind. After looking at the auto sufficiency code, I've decided to leave this alone for now. Still, it would be helpful to understand why the codes go up to "Six or more" and the numerical categories for the division only go up to 4. |
@DavidOry Why are the "transfer to and from" and "first boarding and last alighting technology" codes asymmetric? The transfer code looks as far back prior to boarding the surveyed route. The technology code looks as far to the end after the surveyed route. I don't believe there's anything wrong, I just don't understand why. |
I do not know. Can you point me to the code segment that you have concerns with? |
|
Oh, I see. So the idea is that the "first boarding" and "last alighting" is important because it speaks to the way people access and egress transit, i.e., you walk to your first boarding and from your last alighting. So the purpose here is to understand you're first and last encounters with transit. The transfer to/from is to understand the sequence of technologies, which speaks to the path label we give the movement in mode choice models, as well as the number of transfers. Does that make sense? |
Yes, I believe that makes sense, but won't this operation cause mismatches? E.g. suppose I am surveyed on Muni, I transfer next to BART and finally to a Tiburon ferry where I get off transit. The final record will be transfer_to = BART and last_alight_tech = "ferry", which doesn't make any sense. |
This makes sense to me. The respondent did "transfer to" BART after being surveyed and the last technology they used in their path before egressing from transit was ferry. The Thoughts? |
I think the way it's framed is fine. I just wanted to be sure that that was indeed what was intended when the code was written. When I worked through the code I expected the reported data to be symmetric and thought I'd see what purpose the code served. |
I've finished my review for Build Standard Database for AC Transit. I pushed a couple of corrections in the crosswalk where the specified crosswalk values didn’t correspond with the values I was seeing in R. Other than our prior conversation on the transfer and tech conversation above I think it looks good now. I'm going to turn to coding the BART and Caltrain routes. |
This branch is defunct. |
No description provided.