Imputation Standards Reference Guide
The purpose of imputation in the travel survey context is to maximize both the volume and the quality of usable data. In general, reported data, or derived data that is summarized directly from reported data, is preferable to imputed data. Imputation is used to:
- Fill in missing data. Users may neglect certain survey answers, or select a "prefer not to answer" option.
- Correct bad data. Users may make mistakes, for example selecting 'work' as the destination purpose for a work-to-home trip.
- Infer / categorize open-ended text responses.
The purpose of this document is to provide a working reference guide for task orders for consultant deliverables on imputation and data delivery.
Imputation should focus on fields that are important to the client agency's needs. Two common use cases are:
- Travel demand model estimation and calibration
- Travel behavior analysis and statistics
The vendor and client should identify required fields during survey design. They should identify which fields must be reported directly (or summarized from reported data) and which fields may be appropriate to fill in or correct through imputation.
This example uses the 2023 Bay Area Travel Survey expected to be used by SFCTA in model estimation and some common data analyses. Tables 1 and 2 together identify all of the fields of interest. This is provided as an example, not a definitive list, as needs may vary by agency and survey.
Table 1 identifies variables for which imputation may be appropriate. These variables have been imputed in past surveys, are similar to other variables that have been imputed, or seem relatively feasible to impute due to strong relationships to other survey variables.
Table 1: Travel Survey Fields from the 2023 BATS used by SFCTA for which imputation may be appropriate
table | field | used in model estimation | used in analysis / travel statistics | purpose | justification |
---|---|---|---|---|---|
hh | income_aggregate | 1 | 1 | fill in missing data | imputed in past survey |
hh | income_detailed | 1 | 1 | fill in missing data | imputed in past survey |
person | gender | 1 | 1 | fill in missing data | imputed in past survey |
person | age | 1 | 1 | fill in missing data | simple categorical with likely strong relationships with other person-level variables |
person | can_drive | 1 | 1 | fill in missing data | simple binary prediction |
person | transit_pass | 1 | 1 | fill in missing data | transit pass ownership sometimes modeled in travel demand models |
person | race_eth | | 1 | fill in missing data | imputed in past survey |
trip | o_purpose_category | 1 | 1 | fill in missing data / error correction | imputed in past survey |
trip | d_purpose_category | 1 | 1 | fill in missing data / error correction | imputed in past survey |
trip | mode_type | 1 | 1 | fill in missing data / error correction | imputed in past survey |
trip | depart_time | 1 | 1 | correct for pickup time delay | imputed in past survey |
Table 2 identifies the fields from the BATS 2023 survey for which imputation may not be appropriate. For fields that cannot be reasonably imputed, all due care should be taken to make sure they are reported by the participant.
Table 2: Travel Survey Fields from the 2023 BATS used by SFCTA for which imputation may NOT be appropriate
table | field | used in model estimation | used in analysis / travel statistics | justification |
---|---|---|---|---|
hh | num_vehicles | 1 | 1 | |
hh | num_adults | 1 | 1 | Should be derivable from person-level data. |
hh | num_children | 1 | 1 | Should be derivable from person-level data |
hh | num_people | 1 | 1 | Should be derivable from person-level data |
hh | num_workers | 1 | 1 | Should be derivable from person-level data |
hh | home_lat | 1 | 1 | Difficult to distinguish primary from secondary home |
hh | home_lon | 1 | 1 | Difficult to distinguish primary from secondary home |
person | employment | 1 | 1 | No clear applicable imputation method. |
person | student | 1 | 1 | No clear applicable imputation method. |
person | telework_freq | 1 | 1 | No clear applicable imputation method. |
person | work_lat | 1 | 1 | Primary work location may be hard to identify for a person who goes to multiple locations for work. Confounded by increased work-from-home. |
person | work_lon | 1 | 1 | Primary work location may be hard to identify for a person who goes to multiple locations for work. Confounded by increased work-from-home. |
person | school_lat | 1 | 1 | Primary school location may be hard to identify for a person who goes to multiple locations for school. Confounded by increased remote learning. |
person | school_lon | 1 | 1 | Primary school location may be hard to identify for a person who goes to multiple locations for school. Confounded by increased remote learning. |
person | has_proxy | 1 | | Should be derived from pipeline processing. Whether a person is proxy-reported should be known due to survey method. |
person | work_park | 1 | 1 | No clear applicable imputation method. |
day | telecommute_time | 1 | 1 | No clear applicable imputation method. |
trip | o_lat | 1 | 1 | No clear applicable imputation method. |
trip | o_lon | 1 | 1 | No clear applicable imputation method. |
trip | d_lat | 1 | 1 | No clear applicable imputation method. |
trip | d_lon | 1 | 1 | No clear applicable imputation method. |
trip | arrive_time | 1 | 1 | Unlike depart_time, doesn't suffer from pick-up delay so it should ideally be directly measured |
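Several household fields in Table 2 are described as derivable from person-level data. A minimal sketch of that derivation follows. The person-record field names (`hh_id`, `age_years`, `is_worker`) and the adult age threshold of 18 are assumptions made for illustration, not definitions taken from the survey codebook.

```python
from collections import defaultdict

# Hypothetical person records. The field names below are illustrative
# assumptions, not confirmed survey fields.
persons = [
    {"hh_id": 1, "age_years": 42, "is_worker": True},
    {"hh_id": 1, "age_years": 40, "is_worker": False},
    {"hh_id": 1, "age_years": 9, "is_worker": False},
    {"hh_id": 2, "age_years": 67, "is_worker": False},
]

ADULT_AGE = 18  # assumed threshold; the agency's definition may differ


def derive_household_counts(persons, adult_age=ADULT_AGE):
    """Summarize person records into household-level counts."""
    counts = defaultdict(
        lambda: {"num_people": 0, "num_adults": 0, "num_children": 0, "num_workers": 0}
    )
    for person in persons:
        hh = counts[person["hh_id"]]
        hh["num_people"] += 1
        if person["age_years"] >= adult_age:
            hh["num_adults"] += 1
        else:
            hh["num_children"] += 1
        if person["is_worker"]:
            hh["num_workers"] += 1
    return dict(counts)


household_counts = derive_household_counts(persons)
```

Because these counts are summarized directly from reported person data, they are derived values rather than imputed ones.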
A field may have both a reported and an imputed value, and it is important to distinguish between the two. To prevent overwriting of reported data and to keep methods transparent, imputed data fields should be labeled with the suffix `_imputed`. For example, if the trip mode is imputed, then the data file should contain a field called `mode` with the user-reported mode and a field called `mode_imputed` with the imputed mode. Optionally, a third field with the suffix `_final` may be included to contain the preferred value between the reported and imputed values.
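The naming convention can be sketched as follows. The sample records, the helper `add_final_field`, and the preference rule (use the imputed value when one exists, otherwise the reported value) are assumptions for illustration. The vendor and client would need to agree on the actual rule for choosing the preferred value.

```python
# Hypothetical trip records. Here `mode_imputed` is None when no imputation
# was performed for that trip, which is an assumption of this sketch.
trips = [
    {"trip_id": 1, "mode": "walk", "mode_imputed": None},        # reported only
    {"trip_id": 2, "mode": None, "mode_imputed": "transit"},     # missing, filled in
    {"trip_id": 3, "mode": "drive", "mode_imputed": "transit"},  # reported, corrected
]


def add_final_field(record, field):
    """Return a copy of record with `<field>_final` added.

    The reported and imputed fields are left untouched.
    """
    imputed = record.get(f"{field}_imputed")
    out = dict(record)
    # Assumed preference rule: imputed value if present, else reported value.
    out[f"{field}_final"] = imputed if imputed is not None else record.get(field)
    return out


trips_final = [add_final_field(trip, "mode") for trip in trips]
```

Keeping all three fields side by side means the reported value for trip 3 (`drive`) remains available even though the preferred value differs.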
The vendor and client should agree on documentation standards for imputation. This should include documentation of the methods used, imputation results, and requirements, if any, for the provision of scripts, code, or programs used to carry out imputation.
Methods may range from simple heuristic rules to complex probabilistic models. The documentation should describe the methods used in enough detail that staff could reasonably replicate the imputation.
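As an illustration of the simple end of that range, the sketch below applies one heuristic rule: if a trip ends close to the household's home location, the destination purpose is imputed as home. The 100-meter radius, the `"home"` category label, and the function names are assumptions made for this example, not rules taken from any survey.

```python
import math

EARTH_RADIUS_M = 6_371_000
HOME_RADIUS_M = 100  # assumed distance threshold, in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def impute_d_purpose(trip, home_lat, home_lon, radius_m=HOME_RADIUS_M):
    """Return "home" if the trip ends within radius_m of home, else None.

    None means the rule did not fire and no imputed value is produced.
    """
    dist = haversine_m(trip["d_lat"], trip["d_lon"], home_lat, home_lon)
    return "home" if dist <= radius_m else None


# Hypothetical trip reported with destination purpose "work" but ending
# about 11 meters from the home location.
trip = {"d_lat": 37.7750, "d_lon": -122.4194, "d_purpose_category": "work"}
d_purpose_category_imputed = impute_d_purpose(trip, home_lat=37.7749, home_lon=-122.4194)
```

Documentation for a rule like this would state the threshold, the fields used, and how ties with other rules are resolved, so that staff could replicate it.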
Imputation reports should document the outcomes of the imputation process for each imputed field. These reports should include, at a minimum:
- Number and percent of values that were imputed.
- Number and percent of imputed values that differ from reported values.
- Reclassification matrix of reported to imputed values.
- Imputation confidence or probability, where appropriate.
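The first three report items can be computed along the following lines. This is a sketch under stated assumptions: the imputed field is None where no imputation was performed, "differ" counts only imputed values that have a non-missing reported value to compare against, and the percent that differ uses the number of imputed values as its denominator. The function name `imputation_report` is hypothetical. Imputation confidence is omitted because it depends on the method used.

```python
from collections import Counter


def imputation_report(records, field):
    """Summarize imputation outcomes for one field."""
    imputed_key = f"{field}_imputed"
    imputed = [r for r in records if r.get(imputed_key) is not None]
    differing = [
        r for r in imputed
        if r.get(field) is not None and r[imputed_key] != r[field]
    ]
    # Reclassification matrix as counts keyed by (reported, imputed).
    # A reported value of None marks a filled-in missing value.
    matrix = Counter((r.get(field), r[imputed_key]) for r in imputed)
    n_records, n_imputed = len(records), len(imputed)
    return {
        "n_records": n_records,
        "n_imputed": n_imputed,
        "pct_imputed": 100.0 * n_imputed / n_records if n_records else 0.0,
        "n_differ": len(differing),
        "pct_differ": 100.0 * len(differing) / n_imputed if n_imputed else 0.0,
        "reclassification": dict(matrix),
    }


# Hypothetical trip records for illustration only.
trips = [
    {"mode": "walk", "mode_imputed": None},
    {"mode": None, "mode_imputed": "transit"},
    {"mode": "drive", "mode_imputed": "transit"},
    {"mode": "bike", "mode_imputed": "bike"},
]
report = imputation_report(trips, "mode")
```

In this sample, three of four values are imputed, and one of those three differs from a reported value.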
If scripts or code are required, then the vendor and client should agree on the preferred method of transmission (e.g., GitHub), the preferred programming language (e.g., Python or R), and whether these will be available to other agencies, stakeholders, or the public.