Helper needed when setting up standard survey entry #13

Open
DavidOry opened this issue Oct 9, 2018 · 13 comments
@DavidOry
Collaborator

DavidOry commented Oct 9, 2018

When adding an operator to the Standard Database Builder, a helper method or two is needed to make sure the dictionary file is complete. The previous approach was iterative and hand-checked using code snippets.

@DavidOry
Collaborator Author

DavidOry commented Oct 11, 2018

This code block should be removed when helper code is established.

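# Count the rows for each ID / operator / year / technology / variable combination
ref_count <- survey_cat %>%
  group_by(ID, operator, survey_year, survey_tech, survey_variable) %>%
  summarise(count = n())

# Any combination with more than one row is a duplicate reference
mult_ref_count <- ref_count %>%
  filter(count > 1)

# Halt the run if any duplicates exist
stopifnot(nrow(mult_ref_count) == 0)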
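
One possible shape for the helper that would replace this snippet is sketched below. It is only a sketch: the function name `find_duplicate_keys` is hypothetical, the toy `survey_cat` values are made up, and it assumes `survey_cat` carries the five grouping columns used above. Instead of stopping the run, it returns the offending combinations so they can be inspected.

```r
library(dplyr)

# Hypothetical helper: return every key combination that occurs more than once.
find_duplicate_keys <- function(survey_cat) {
  survey_cat %>%
    count(ID, operator, survey_year, survey_tech, survey_variable,
          name = "count") %>%
    filter(count > 1)
}

# Toy data (made up for illustration): ID 1 is referenced twice.
survey_cat <- tibble(
  ID              = c(1, 1, 2),
  operator        = "BART",
  survey_year     = 2015,
  survey_tech     = "tablet",
  survey_variable = "fare_category"
)

duplicates <- find_duplicate_keys(survey_cat)
print(duplicates)
```

A caller that still wants the hard stop can wrap the result, e.g. `stopifnot(nrow(find_duplicate_keys(survey_cat)) == 0)`.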

@jhelsel11
Collaborator

jhelsel11 commented Oct 12, 2018

@DavidOry, should the helper function assume that every variable in the survey needs to be in the standard database or should I only check that, if at least one level of a variable has been included in the Standard Database, then all levels are included?

I'm going to start with the second assumption (since there are so many unused variables) and can modify from there.

@jhelsel11
Collaborator

In starting to code the second approach, I believe the current code is missing numerous variable levels.

For instance, looking just at BART, there are 183 distinct combinations of survey_variable and survey_response in Dictionary for Standard Database.csv that are noncategorical. However, if we filter the BART survey to only the survey_variables in the dictionary, there are still 290 combinations.

The data dictionary is missing generic conversions of 92 actual survey responses. (e.g. 3 for BART_TICKET_CODE, 4 for HH_INCOME_CODE, etc.)

Thoughts?

@jhelsel11
Collaborator

Look at the function check_dropped_variables for the source.

@DavidOry
Collaborator Author

DavidOry commented Oct 16, 2018

> In starting to code the second approach, I believe the current code is missing numerous variable levels.
> For instance, looking just at BART, there are 183 distinct combinations of survey_variable and survey_response in Dictionary for Standard Database.csv that are noncategorical. However, if we filter the BART survey to only the survey_variables in the dictionary, there are still 290 combinations.
> The data dictionary is missing generic conversions of 92 actual survey responses. (e.g. 3 for BART_TICKET_CODE, 4 for HH_INCOME_CODE, etc.)
> Thoughts?

I'm not sure I fully understand. Can you please flesh out a full example?

@DavidOry
Collaborator Author

Okay. So what I think you're saying is that, for example, the dictionary has the following entries for BART_TICKET_CODE --> fare_category crosswalk:

BART | BART_TICKET_CODE | 1 | fare_category | adult
BART | BART_TICKET_CODE | 2 | fare_category | adult
BART | BART_TICKET_CODE | 3 | fare_category | senior
BART | BART_TICKET_CODE | 4 | fare_category | disabled
BART | BART_TICKET_CODE | 5 | fare_category | youth
BART | BART_TICKET_CODE | 6 | fare_category | student
BART | BART_TICKET_CODE | 7 | fare_category | adult

And the BART survey data also includes an entry for BART_TICKET_CODE = 102. So when we do the crosswalk, we are coding the fare_category for records with BART_TICKET_CODE = 102 as NA. My memory, which is fuzzy, is that this is intentional. By trying to simplify the data, there are going to be categories that we leave behind. For example, BART employees ride free on BART (as on most transit systems) and the BART survey may have had an employee fare category. When we translate the BART data to the standard data, this response is effectively recoded as NA. That seems right, as it allows the standard survey to remain tidy and, for regional analysis, not much information is lost. The details will remain in the BART survey itself. @shimonisrael: please let us know your thoughts on this if you disagree.
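
To make the behavior described above concrete, here is a minimal sketch of a left-join crosswalk on toy data. The column names (`survey_response`, `generic_variable`, `generic_response`) and the long survey layout are assumptions for illustration and may not match the actual dictionary file or builder code.

```r
library(dplyr)

# Toy dictionary mirroring the seven BART_TICKET_CODE rows listed above.
dictionary <- tibble(
  operator         = "BART",
  survey_variable  = "BART_TICKET_CODE",
  survey_response  = c("1", "2", "3", "4", "5", "6", "7"),
  generic_variable = "fare_category",
  generic_response = c("adult", "adult", "senior", "disabled",
                       "youth", "student", "adult")
)

# Toy survey records (made up), one of which uses the unmapped code 102.
survey <- tibble(
  ID              = 1:3,
  operator        = "BART",
  survey_variable = "BART_TICKET_CODE",
  survey_response = c("1", "3", "102")
)

# A left join keeps every survey record; unmapped codes get NA.
crosswalked <- survey %>%
  left_join(dictionary,
            by = c("operator", "survey_variable", "survey_response"))

print(crosswalked)
```

In this sketch the record with code 102 stays in the result with `generic_response` set to NA, which is the "recoded as NA" outcome described above.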

I do think your helper method is useful. But instead of stopping the run, it would be better to have the method return something useful, like the missing_variables dataframe.
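
A rough sketch of a helper that returns the missing levels rather than halting is below. The function name `find_missing_levels`, the long-format `survey_long` input, and the column names are all hypothetical; it follows the second assumption from earlier in the thread, checking only variables that already have at least one level in the dictionary.

```r
library(dplyr)

# Hypothetical helper: list survey responses with no dictionary entry,
# limited to variables the dictionary already covers at least once.
find_missing_levels <- function(survey_long, dictionary) {
  survey_long %>%
    distinct(operator, survey_variable, survey_response) %>%
    # keep only variables that appear somewhere in the dictionary
    semi_join(dictionary, by = c("operator", "survey_variable")) %>%
    # keep only responses that the dictionary does not map
    anti_join(dictionary,
              by = c("operator", "survey_variable", "survey_response"))
}

# Toy inputs (made up for illustration).
dictionary <- tibble(
  operator        = "BART",
  survey_variable = "BART_TICKET_CODE",
  survey_response = c("1", "2")
)
survey_long <- tibble(
  operator        = "BART",
  survey_variable = c("BART_TICKET_CODE", "BART_TICKET_CODE", "UNUSED_VAR"),
  survey_response = c("1", "102", "9")
)

missing_levels <- find_missing_levels(survey_long, dictionary)
print(missing_levels)
```

Here `UNUSED_VAR` is ignored because the dictionary never references it, and only the unmapped `BART_TICKET_CODE` response is returned.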

I think we can wait on this, though, as we'll eventually have you enter an operator and at that time you'll be able to craft a handful of helper methods that will help you create the dictionary file.

@jhelsel11
Collaborator

It's not that they are coded as missing. They are currently removed entirely from the resulting dataframe.

@shimonisrael
Contributor

@DavidOry, I think my preference might be something like "Other", with NA reserved for missing variables. It would be helpful in quantifying variables that have responses, though unused ones, versus non-response values. Would there be a problem with this approach?

@DavidOry
Collaborator Author

DavidOry commented Oct 17, 2018

> It's not that they are coded as missing. They are currently removed entirely from the resulting dataframe.

What do you mean removed entirely?

@DavidOry
Collaborator Author

> @DavidOry, I think my preference might be something like "Other", with NA reserved for missing variables. It would be helpful in quantifying variables that have responses, though unused ones, versus non-response values. Would there be a problem with this approach?

That makes sense. This could be accomplished via the dictionary by explicitly mapping all of the unused entries to an "other" category. It would make the dictionary much larger, but would also make it a more faithful record of the translation from the survey to the standard database. I've added an Asana task for this.
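
One way the explicit "other" rows might be generated rather than typed by hand is sketched here. Everything in it is an assumption for illustration: the function name `add_other_entries`, the column names, and the choice to skip NA responses so that NA stays reserved for non-response.

```r
library(dplyr)

# Hypothetical helper: append an "other" row for every observed response
# that the dictionary does not yet map, for variables it already covers.
add_other_entries <- function(dictionary, survey_long) {
  targets <- dictionary %>%
    distinct(operator, survey_variable, generic_variable)

  unmapped <- survey_long %>%
    distinct(operator, survey_variable, survey_response) %>%
    filter(!is.na(survey_response)) %>%  # leave true non-response as NA
    anti_join(dictionary,
              by = c("operator", "survey_variable", "survey_response")) %>%
    inner_join(targets, by = c("operator", "survey_variable")) %>%
    mutate(generic_response = "other")

  bind_rows(dictionary, unmapped)
}

# Toy inputs (made up for illustration).
dictionary <- tibble(
  operator         = "BART",
  survey_variable  = "BART_TICKET_CODE",
  survey_response  = c("1", "3"),
  generic_variable = "fare_category",
  generic_response = c("adult", "senior")
)
survey_long <- tibble(
  operator        = "BART",
  survey_variable = "BART_TICKET_CODE",
  survey_response = c("1", "102", NA)
)

expanded <- add_other_entries(dictionary, survey_long)
print(expanded)
```

The generated rows would still deserve a manual review before being committed to the dictionary file, since some unmapped codes may belong in an existing category instead.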

@jhelsel11
Collaborator

> > It's not that they are coded as missing. They are currently removed entirely from the resulting dataframe.
>
> What do you mean removed entirely?

I was mistaken.

@jhelsel11
Collaborator

@DavidOry Is this issue resolved to your satisfaction?

@DavidOry
Collaborator Author

Let's leave it open. I'd like us to iterate on adding a few surveys and see what other functions we may need.


3 participants