Imputation Standards Reference Guide

Drew C edited this page Aug 5, 2024 · 3 revisions

Purpose / Need

The purpose of imputation in the travel survey context is to maximize both the volume and the quality of usable data. In general, reported data, or data derived directly from reported data, is preferable to imputed data. Imputation is used to:

  1. Fill in missing data. Users may skip certain survey questions, or select a "prefer not to answer" option.
  2. Correct bad data. Users may make mistakes, for example selecting 'work' as the destination purpose for a work-to-home trip.
  3. Infer / categorize open-ended text responses.

The purpose of this document is to provide a working reference guide for task orders for consultant deliverables on imputation and data delivery.

Data items and use cases

Imputation should focus on fields that are important to the client agency's needs. Two common use cases are:

  1. Travel demand model estimation and calibration
  2. Travel behavior analysis and statistics

The vendor and client should identify required fields during survey design. They should identify which fields must be reported directly (or summarized from reported data) and which fields may be appropriate to fill in or correct through imputation.

Example

This example uses the 2023 Bay Area Travel Survey (BATS), which is expected to be used by SFCTA in model estimation and some common data analyses. Tables 1 and 2 together identify all of the fields of interest. This is provided as an example, not a definitive list, as needs may vary by agency and survey.

Table 1 identifies variables for which imputation may be appropriate. These variables have been imputed in past surveys, are similar to other variables that have been imputed, or seem relatively feasible to impute due to strong relationships to other survey variables.

Table 1: Travel Survey Fields from the 2023 BATS used by SFCTA for which imputation may be appropriate

table | field | used in model estimation | used in analysis / travel statistics | purpose | justification
hh | income_aggregate | 1 | 1 | fill in missing data | imputed in past survey
hh | income_detailed | 1 | 1 | fill in missing data | imputed in past survey
person | gender | 1 | 1 | fill in missing data | imputed in past survey
person | age | 1 | 1 | fill in missing data | simple categorical with likely strong relationships with other person-level variables
person | can_drive | 1 | 1 | fill in missing data | simple binary prediction
person | transit_pass | 1 | 1 | fill in missing data | transit pass ownership sometimes modeled in travel demand models
person | race_eth |   | 1 | fill in missing data | imputed in past survey
trip | o_purpose_category | 1 | 1 | fill in missing data / error correction | imputed in past survey
trip | d_purpose_category | 1 | 1 | fill in missing data / error correction | imputed in past survey
trip | mode_type | 1 | 1 | fill in missing data / error correction | imputed in past survey
trip | depart_time | 1 | 1 | correct for pickup time delay | imputed in past survey

Table 2 identifies the fields from the BATS 2023 survey for which imputation may not be appropriate. For fields that cannot be reasonably imputed, all due care should be taken to make sure they are reported by the participant.

Table 2: Travel Survey Fields from the 2023 BATS used by SFCTA for which imputation may NOT be appropriate

table | field | used in model estimation | used in analysis / travel statistics | justification
hh | num_vehicles | 1 | 1 |
hh | num_adults | 1 | 1 | Should be derivable from person-level data.
hh | num_children | 1 | 1 | Should be derivable from person-level data.
hh | num_people | 1 | 1 | Should be derivable from person-level data.
hh | num_workers | 1 | 1 | Should be derivable from person-level data.
hh | home_lat | 1 | 1 | Difficult to distinguish primary from secondary home.
hh | home_lon | 1 | 1 | Difficult to distinguish primary from secondary home.
person | employment | 1 | 1 | No clear applicable imputation method.
person | student | 1 | 1 | No clear applicable imputation method.
person | telework_freq | 1 | 1 | No clear applicable imputation method.
person | work_lat | 1 | 1 | Primary work location may be hard to identify for a person who goes to multiple locations for work. Confounded by increased work-from-home.
person | work_lon | 1 | 1 | Primary work location may be hard to identify for a person who goes to multiple locations for work. Confounded by increased work-from-home.
person | school_lat | 1 | 1 | Primary school location may be hard to identify for a person who goes to multiple locations for school. Confounded by increased remote learning.
person | school_lon | 1 | 1 | Primary school location may be hard to identify for a person who goes to multiple locations for school. Confounded by increased remote learning.
person | has_proxy | 1 |   | Should be derived from pipeline processing. Whether a person is proxy-reported should be known due to survey method.
person | work_park | 1 | 1 | No clear applicable imputation method.
day | telecommute_time | 1 | 1 | No clear applicable imputation method.
trip | o_lat | 1 | 1 | No clear applicable imputation method.
trip | o_lon | 1 | 1 | No clear applicable imputation method.
trip | d_lat | 1 | 1 | No clear applicable imputation method.
trip | d_lon | 1 | 1 | No clear applicable imputation method.
trip | arrive_time | 1 | 1 | Unlike depart_time, doesn't suffer from pick-up delay, so it should ideally be directly measured.
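
Table 2 notes that several household fields should be derivable from person-level data rather than imputed. The following is a minimal sketch of such a derivation, not the survey pipeline's actual method. The person-level names `hh_id`, `is_adult`, and `is_worker` are hypothetical and chosen only for illustration; the real survey may define adults and workers differently.

```python
from collections import defaultdict

# Hypothetical person records; `hh_id`, `is_adult`, and `is_worker`
# are assumed names, not fields documented in this guide.
persons = [
    {"hh_id": "A", "is_adult": True, "is_worker": True},
    {"hh_id": "A", "is_adult": True, "is_worker": False},
    {"hh_id": "A", "is_adult": False, "is_worker": False},
    {"hh_id": "B", "is_adult": True, "is_worker": True},
]


def household_counts(person_records):
    """Aggregate person records into per-household count fields."""
    counts = defaultdict(
        lambda: {"num_people": 0, "num_adults": 0, "num_children": 0, "num_workers": 0}
    )
    for p in person_records:
        hh = counts[p["hh_id"]]
        hh["num_people"] += 1
        hh["num_adults"] += int(p["is_adult"])
        hh["num_children"] += int(not p["is_adult"])
        hh["num_workers"] += int(p["is_worker"])
    return dict(counts)


hh_counts = household_counts(persons)
```

Because these counts are summarized directly from reported person records, they count as derived data rather than imputed data.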

Data structure

A field may have both a reported and an imputed value, and it is important to distinguish between the two. To prevent overwriting of reported data and to keep methods transparent, imputed data fields should be labeled with the suffix _imputed. For example, if the trip mode is imputed, then the data file should contain a field called mode with the user-reported mode and a field called mode_imputed with the imputed mode. Optionally, a third field with the suffix _final may be included to hold the preferred value between the reported and imputed values.
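
The suffix convention can be sketched as follows. This is an illustration under stated assumptions, not a required implementation: the `trip_id` key and the mode values are hypothetical, and the rule for choosing the `_final` value (prefer the imputed value when one exists) is an assumption that the vendor and client would need to agree on.

```python
# Hypothetical trip records. `mode` holds the reported value (None = missing);
# `mode_imputed` holds the imputation result (None = not imputed).
trips = [
    {"trip_id": 1, "mode": "walk", "mode_imputed": None},
    {"trip_id": 2, "mode": None, "mode_imputed": "transit"},
    {"trip_id": 3, "mode": "drive", "mode_imputed": "transit"},
]


def add_final(records, field):
    """Return copies of the records with a `<field>_final` value added.

    Assumed rule: use the imputed value when one exists, otherwise the
    reported value. The reported field itself is never modified.
    """
    out = []
    for rec in records:
        imputed = rec.get(f"{field}_imputed")
        final = imputed if imputed is not None else rec.get(field)
        out.append({**rec, f"{field}_final": final})
    return out


trips_final = add_final(trips, "mode")
```

Keeping all three values side by side means a data user can always recover what the respondent reported, even when the preferred value comes from imputation.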

Documentation

The vendor and client should agree on documentation standards for imputation. This should include documentation of the methods used, imputation results, and requirements, if any, for the provision of scripts, code, or programs used to carry out imputation.

Methods

Methods may range from simple heuristic rules to complex probabilistic models. The documentation should describe the methods used in enough detail that staff could reasonably replicate the imputation.
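
As an illustration of the simple end of that range, the sketch below encodes one heuristic rule for the work-to-home error described under Purpose / Need. It is an example only, not a recommended method. The field names `o_purpose_category`, `d_purpose_category`, `d_lat`, and `d_lon` follow the tables above, but the category values "work" and "home" and the 100-metre tolerance are assumptions.

```python
import math

HOME_RADIUS_M = 100.0  # assumed tolerance, not a value from this guide


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


def impute_d_purpose(trip, home_lat, home_lon):
    """Return an imputed destination purpose, or None if the rule does not fire.

    Rule (illustrative): a trip that starts at work, is reported as ending
    at work, but actually ends near the home location is treated as a
    work-to-home trip.
    """
    near_home = haversine_m(
        trip["d_lat"], trip["d_lon"], home_lat, home_lon
    ) <= HOME_RADIUS_M
    if (near_home
            and trip["o_purpose_category"] == "work"
            and trip["d_purpose_category"] == "work"):
        return "home"
    return None
```

A rule like this is easy to document and replicate, which is the standard the documentation should meet regardless of how complex the method is.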

Imputation Reports

Imputation reports should document the outcomes of the imputation process for each imputed field. These reports should include, at a minimum:

  1. Number and percent of values that were imputed.
  2. Number and percent of imputed values that differ from reported values.
  3. Reclassification matrix of reported to imputed values.
  4. Imputation confidence or probability, where appropriate.
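
The first three report items can be computed from the reported and _imputed fields directly. The sketch below is one possible approach, assuming missing values are stored as None. It also assumes that the "percent that differ" is taken relative to the number of imputed values, and that an imputed value filling a missing reported value does not count as differing; both choices should be confirmed between vendor and client.

```python
from collections import Counter


def imputation_report(records, field):
    """Summarize imputation outcomes for one field and its `<field>_imputed` pair."""
    pairs = [(r.get(field), r.get(f"{field}_imputed")) for r in records]
    # Records for which an imputed value was produced.
    imputed = [(rep, imp) for rep, imp in pairs if imp is not None]
    # Imputed values that replace a different, non-missing reported value.
    changed = [(rep, imp) for rep, imp in imputed if rep is not None and rep != imp]
    n_total, n_imputed = len(pairs), len(imputed)
    return {
        "n_imputed": n_imputed,
        "pct_imputed": 100.0 * n_imputed / n_total if n_total else 0.0,
        "n_changed": len(changed),
        "pct_changed": 100.0 * len(changed) / n_imputed if n_imputed else 0.0,
        # Reclassification matrix as counts keyed by (reported, imputed);
        # a missing reported value appears as None.
        "reclassification": Counter(imputed),
    }


# Hypothetical records for illustration.
trips = [
    {"mode": "walk", "mode_imputed": None},
    {"mode": None, "mode_imputed": "transit"},
    {"mode": "drive", "mode_imputed": "transit"},
    {"mode": "drive", "mode_imputed": "drive"},
]
report = imputation_report(trips, "mode")
```

For these four sample records, three values are imputed (75 percent) and one of those differs from the reported value. Imputation confidence (item 4) depends on the method used and is not covered by this sketch.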

Scripts

If scripts or code are required, then the vendor and client should agree on the preferred method of transmission (e.g., GitHub), the preferred programming language (e.g., Python or R), and whether these will be available to other agencies, stakeholders, or the public.