Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Standard Survey Creation #12

Closed
wants to merge 30 commits into from
Closed

Conversation

jhelsel11
Copy link
Collaborator

No description provided.

@jhelsel11
Copy link
Collaborator Author

Q1) Is there a reason to keep the distributed Rproject files rather than consolidating into a master project?

@jhelsel11
Copy link
Collaborator Author

Q2) Is there a reason to use factors rather than options(stringsAsFactors = FALSE)?

@jhelsel11
Copy link
Collaborator Author

Q3) Is there a reason to keep so many copies of dataframes? (e.g. the progression for survey.BAR. ...)

@jhelsel11
Copy link
Collaborator Author

Q4) In the Caltrain survey there are survey IDs with letters, so the as.numeric coercion creates NA values. I'm leaving them as characters for now, but am unsure how you want this handled. (Roughly same problem for Muni.)

@jhelsel11
Copy link
Collaborator Author

Q5) Why did we track survey_tech if it gets dropped in survey_to_flat?

@jhelsel11
Copy link
Collaborator Author

Q6) How would you like me to handle the "Bespoke survey technology"? Put this into a csv? Leave it alone?

@DavidOry
Copy link
Collaborator

DavidOry commented Oct 9, 2018

Q1) Is there a reason to keep the distributed Rproject files rather than consolidating into a master project?

What do you propose as the scope? One Rproject for the entire repository?

@DavidOry
Copy link
Collaborator

DavidOry commented Oct 9, 2018

Q2) Is there a reason to use factors rather than options(stringsAsFactors = FALSE)?

Are you suggesting as a global setting? Or specific to each data read?

@jhelsel11
Copy link
Collaborator Author

Q1) Is there a reason to keep the distributed Rproject files rather than consolidating into a master project?

What do you propose as the scope? One Rproject for the entire repository?

I'm not sure I'd go that far. I just wanted to better understand why you selected the scope of the project you did.

@jhelsel11
Copy link
Collaborator Author

Q2) Is there a reason to use factors rather than options(stringsAsFactors = FALSE)?

Are you suggesting as a global setting? Or specific to each data read?

As a global option. (I've implemented that for now, but can remove it.) So far I haven't seen anything implementing a modeling approach that would use factors and I've found that factors are much more of a headache in overhead than just treating everything as a string.


#### _ISSUES_
1. Work carefully through first few datasets, as the code is a bit coarse in places

#### _TODO_
1. New users should add their username and relative file path to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jhelsel11: Given that this database will be rather small, may it be easier just to list the user names and dirs here and build the dataframe rather than using a separate CSV?

@DavidOry
Copy link
Collaborator

DavidOry commented Oct 9, 2018

Q1) Is there a reason to keep the distributed Rproject files rather than consolidating into a master project? What do you propose as the scope? One Rproject for the entire repository? I'm not sure I'd go that far. I just wanted to better understand why you selected the scope of the project you did.
--

The nature of working at an MPO. Where you can do a little bit for a few weeks, then stop, then start up six months later. So it was developed piecemeal, i.e., nothing intentional or thoughtful about the current design. I do encourage you to update the Rproject files so that they contain a logical scope, i.e., navigating around in RStudio makes sense, the working dir makes sense, etc.

@DavidOry
Copy link
Collaborator

DavidOry commented Oct 9, 2018

      Q3) Is there a reason to keep so many copies of dataframes? (e.g. the progression for survey.BAR. ...)

No good reason. I was likely getting familiar with dplyr at the time and was exploring ways to use it. It does allow for easier debugging, which is less important moving forward (i.e., the code has been used a lot more now).

@DavidOry
Copy link
Collaborator

DavidOry commented Oct 9, 2018

      Q4) In the Caltrain survey there are survey IDs with letters, so the as.numeric coercion creates NA values. I'm leaving them as characters for now, but am unsure how you want this handled. (Roughly same problem for Muni.)

Hmm. Not sure. But agree we should move ahead with characters. Likely just a bug.

filter(user == me) %>%
.$path_switch

f_spatial_to_be_geocoded_path = paste0(dir_path, "Data and Reports/_geocoding Standardized/spatial_places_to_be_geocoded.csv")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that you have Data and Reports written over and over again, may it be better (i.e., easier to read) to assume this part of the path lives in dir_path?

survey.VTA.select <- survey.VTA[ ,variables.VTA]
survey_VTA <- read.csv(f_vta_survey_path, header = TRUE)

# # JWH: I don't understand why this code is needed_
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the data is being read in, it's, right now, a somewhat iterative and tedious process to make sure the dictionary includes all the variables you want it to include. So we should add a note here that we need some type of helper method like this when reading in a new survey. I'll create an issue for this and we can address it when we start using the code to set up new operators. For now, this code snippet can be removed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue #13

@DavidOry
Copy link
Collaborator

DavidOry commented Oct 9, 2018

      Q5) Why did we track survey_tech if it gets dropped in survey_to_flat?

It gets added back in:

survey_adds <- survey_combine %>%
  mutate(survey_year = as.character(survey_year)) %>%
  group_by(ID, operator, survey_year, survey_tech) %>%
  summarise(count = n()) %>%
  select(-count)

survey_flat <- survey_flat %>%
  left_join(survey_adds, by = c("ID", "operator", "survey_year"))

mutate(survey_tech = ifelse(operator == "VTA" & str_count(route, pattern = fixed("902", ignore_case = TRUE)) > 0L, "light rail", survey_tech))
# There was a flag in the original conversion text for VTA route 185, but no
# records had a route 185, so it has been dropped from the csv.
survey_tech_update <- read.csv("bespoke_survey_tech_replacement.csv")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend explicitly reading this in as characters (colClasses = 'character') given the mix of data types you're dealing with in the database.

@jhelsel11
Copy link
Collaborator Author

Q7) The Standardize year born step (creation of survey_standard) implicitly changes Missing codes to NA. Do you want to continue with that implicit conversion or should that step be explicit?

@jhelsel11
Copy link
Collaborator Author

Q8) @DavidOry I think the recode key variables from NA to 'missing' step should be handled in 2 mutate_at steps, but I can't get the sytnax to work.

Something like the following for the first step:
survey_standard <- survey_standard %>%
mutate_at(c("orig_purp", "dest_purp", "work_status", "student_status", "approximate_age"),
~ifelse(is.na(.), "missing", .)

R doesn't like the ifelse inside mutate_at, but I don't really understand why.

@jhelsel11
Copy link
Collaborator Author

Q9) I'm deep into reformatting the Build standard variables chunk and trying to keep lines under 80 characters. That's leading to its own weirdness. I think it would be helpful if @DavidOry took a look to see if that's a formatting you like or whether I should roll back those changes.

@DavidOry
Copy link
Collaborator

(1) the 80 chars width should be broken if it improves readability.
(2) i'll take a look at the mutate_at later today.

@jhelsel11
Copy link
Collaborator Author

(1) the 80 chars width should be broken if it improves readability.
(2) i'll take a look at the mutate_at later today.

I guess the question is whether it degrades readability. I like having each clause on its own line, but it does require you to scroll more.

@jhelsel11
Copy link
Collaborator Author

Q7) The Standardize year born step (creation of survey_standard) implicitly changes Missing codes to NA. Do you want to continue with that implicit conversion or should that step be explicit?

As a more general question, should I take extra steps to ensure that there are no implicit NA conversions?

@@ -108,19 +107,19 @@ f_vta_survey_path <- paste0(dir_path,
"VTA/As CSV/VTA_DRAFTFINAL_20171114 NO POUND OR SINGLE QUOTE.csv")

f_output_rdata_path <- paste0(dir_path,
"_data Standardized/survey_standard.RData")
"_data Standardized/survey_standard.RDS")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jhelsel11: When using saveRDS, we should use the RDS extension, so users know that it's an RDS file and not an Rdata file.


table(thin_df$var_name)
# update the Caltrain IDs and do again
update_current_df <- current_df %>%
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jhelsel11: you are correct re: Caltrain bug. When the IDs are updated to remove the leading "S", the previous and current data sets match.

@jhelsel11
Copy link
Collaborator Author

@shimonisrael
Would you consider the AC Transit Transbay buses express buses or local buses? Dave has asked me to make sure the AC Transit information is running properly and I'm taking a look at the bespoke survey tech file. My thought was to treat all lettered buses as express (since they cross the bay)?

@shimonisrael
Copy link
Contributor

@shimonisrael
Would you consider the AC Transit Transbay buses express buses or local buses? Dave has asked me to make sure the AC Transit information is running properly and I'm taking a look at the bespoke survey tech file. My thought was to treat all lettered buses as express (since they cross the bay)?

@jhelsel11 Yes, please. Those should all be express. If you could, please check for the free Broadway Shuttle in Oakland, which may be an exception as a lettered route. Otherwise all the lettered buses are express. Also, if I've not responded elsewhere to any questions, I'm a little behind so always feel free to email if I've overlooked anything. Thanks.

@jhelsel11
Copy link
Collaborator Author

@shimonisrael The broadway shuttle is coded as local.

@jhelsel11
Copy link
Collaborator Author

@DavidOry The vehicle and worker dictionaries seem unnecessary. 1) They include lots of values that aren't used in the dictionary (e.g. 'twelve') and 2) couldn't this just be done in the dictionary crosswalk?

Hopefully I'm not misreading this, but I'll make the change after lunch.

@jhelsel11
Copy link
Collaborator Author

@DavidOry The vehicle and worker dictionaries seem unnecessary. 1) They include lots of values that aren't used in the dictionary (e.g. 'twelve') and 2) couldn't this just be done in the dictionary crosswalk?

Hopefully I'm not misreading this, but I'll make the change after lunch.

Never mind. After looking at the auto sufficiency code, I've decided to leave this alone for now. Still, it would be helpful to understand why the codes go up to "Six or more" and the numerical categories for the division only go up to 4.

@jhelsel11
Copy link
Collaborator Author

@DavidOry Why are the "transfer to and from" and "first boarding and last alighting technology" codes asymmetric? The transfer code looks as far back prior to boarding the surveyed route. The technology code looks as far to the end after the surveyed route. I don't believe there's anything wrong, I just don't understand why.

@DavidOry
Copy link
Collaborator

DavidOry commented Nov 1, 2018

@DavidOry Why are the "transfer to and from" and "first boarding and last alighting technology" codes asymmetric? The transfer code looks as far back prior to boarding the surveyed route. The technology code looks as far to the end after the surveyed route. I don't believe there's anything wrong, I just don't understand why.

I do not know. Can you point me to the code segment that you have concerns with?

@jhelsel11
Copy link
Collaborator Author

jhelsel11 commented Nov 1, 2018

# Step 6:  Travel Model One path details
# -------------------------------------------------------------------------------------

# Transfer to and from
survey_standard <- survey_standard %>%
  mutate(transfer_from = as.character(first_before_operator)) %>%
  mutate(transfer_from = ifelse(second_before_operator != 'None', 
                                as.character(second_before_operator), 
                                transfer_from)) %>%
  
  mutate(transfer_from = ifelse(third_before_operator  != 'None', 
                                as.character(third_before_operator),
                                transfer_from)) %>%
  
  mutate(transfer_to = as.character(first_after_operator))

# First boarding and last alighting technology
survey_standard <- survey_standard %>%
  mutate(first_board_tech = as.character(survey_tech)) %>%
  mutate(first_board_tech = ifelse(!is.na(first_before_technology), 
                                   first_before_technology, 
                                   first_board_tech)) %>%
  
  mutate(last_alight_tech = as.character(survey_tech)) %>%
  mutate(last_alight_tech = ifelse(!is.na(first_after_technology),
                                   first_after_technology,
                                   last_alight_tech)) %>%
  
  mutate(last_alight_tech = ifelse(!is.na(second_after_technology), 
                                   second_after_technology, 
                                   last_alight_tech)) %>%
  
  mutate(last_alight_tech = ifelse(!is.na(third_after_technology),
                                   third_after_technology,
                                   last_alight_tech))

@DavidOry
Copy link
Collaborator

DavidOry commented Nov 1, 2018

Oh, I see. So the idea is that the "first boarding" and "last alighting" is important because it speaks to the way people access and egress transit, i.e., you walk to your first boarding and from your last alighting. So the purpose here is to understand you're first and last encounters with transit.

The transfer to/from is to understand the sequence of technologies, which speaks to the path label we give the movement in mode choice models, as well as the number of transfers.

Does that make sense?

@jhelsel11
Copy link
Collaborator Author

Yes, I believe that makes sense, but won't this operation cause mismatches? E.g. suppose I am surveyed on Muni, I transfer next to BART and finally to a Tiburon ferry where I get off transit. The final record will be transfer_to = BART and last_alight_tech = "ferry", which doesn't make any sense.

@DavidOry
Copy link
Collaborator

DavidOry commented Nov 2, 2018

This makes sense to me. The respondent did "transfer to" BART after being surveyed and the last technology they used in their path before egressing from transit was ferry.

The transfer_to and transfer_from variables provide an interesting summary of the other operators the surveyed operator engages with. And the first_board and last_alight variables provide insight into the way in which the traveler accessed/egressed transit (the surveyed operator in this case is not important).

Thoughts?

@jhelsel11
Copy link
Collaborator Author

I think the way it's framed is fine. I just wanted to be sure that that was indeed what was intended when the code was written. When I worked through the code I expected the reported data to be symmetric and thought I'd see what purpose the code served.

@jhelsel11
Copy link
Collaborator Author

I've finished my review for Build Standard Database for AC Transit. I pushed a couple of corrections in the crosswalk where the specified crosswalk values didn’t correspond with the values I was seeing in R. Other than our prior conversation on the transfer and tech conversation above I think it looks good now. I'm going to turn to coding the BART and Caltrain routes.

@jhelsel11
Copy link
Collaborator Author

This branch is defunct.

@jhelsel11 jhelsel11 closed this Jan 8, 2019
@jhelsel11 jhelsel11 deleted the refactor_survey branch January 8, 2019 23:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants