02_tripgen.Rmd

# Trip Generation {#chap-tripgen}

The purpose of the trip generation model is to turn socioeconomic data into a
certain number of trips. Every trip has two ends: an *origin* and a *destination*.
But because we do not yet know where trips are going, at this stage of the model
we instead forecast trip **productions** and trip **attractions**.

Imagine a basic day where you travel from home to work, and then back to home.
You made two trips. The first trip originated at your home and was destined for
work: your second trip originated at work and was destined for home. But your
home *produced* two trips, and your workplace *attracted* two trips.

This distinction is critical. In a trip production model, we use the attributes 
of households to figure out how many trips each zone produces. In a trip attraction
model, we use socioeconomic data to determine how many trips are attracted to the
zone.   It would be oversimplifying to say that all trip *production* happens at
households: businesses produce commercial trips, and households can also attract
trips from other households. But in this class, we are mostly going to concern
ourselves with household trip production.

Trip generation models are separated by **trip purpose**; trip purposes are
defined by the kind of activity occurring at the production and attraction 
ends of a trip. Common trip purposes include:

  - *Home-based work* (HBW): trips produced at a home and attracted to a work place.
  - *Home-based school / college* (HBSchool): trips produced at a home and attracted
  to a school or college.
  - *Home-based shopping / recreational* (HBShop): trips produced at a home and
  attracted to a shopping or recreational activity.
  - *Home-based other* (HBO): trips produced at a home and attracted to any
  activity not-defined above.
  - *Non Home-based* (NHB): trips produced somewhere other than a home and
  attracted to anywhere.
  
The specific purposes included in a model are a function of the data available
and the needs of the region. If there is no university or college in a region,
then the HBSchool purpose might be sufficient; otherwise, there might need
to be a HBSchool and a HBCollege. Or, if there is no data on trips by students,
this purpose might just get folded into HBO.

The next two sections give details of how to construct a cross-classification
model of trip production as well as a regression model of trip attractions.
More details of trip Generation models are given in Section 4.4 of 
*NCHRP Report 716*.


## Trip Production

The trip production model is a cross-classification model. What this means
is that we will classify households into different bins based on their 
household size, income, vehicle ownership, etc. We will then calculate the
average number of trips made by households in each group.

We will need to get the data for households and trips. Use the records for
households in MSA size `02`, that completed the survey on a weekday, and then
filter the trips to include only those records. We can also select only the data
columns that have the information we will use to classify the model.  We might
need to create some or modify variables that we need to use to cross-classify;
for instance we should cap the household size category at 4 people and the 
vehicles at 3.

```{r tp_load, message=FALSE, warning=FALSE}
library(tidyverse)
library(nhts2017)

hh <- nhts_households %>% 
  # filter to MSA size 2, travel on weekday
  filter(msasize == "02", !travday %in% c("01", "07")) %>%
  # select the columns we care about.
  select(houseid, wthhfin, hhsize, hhvehcnt, numadlt, hhfaminc, wrkcount) %>%
  mutate(
    hhsize = ifelse(hhsize > 4, 4, hhsize),
    hhvehcnt = ifelse(hhvehcnt > 3, 3, hhvehcnt),
    wrkcount = ifelse(wrkcount > 2, 2, wrkcount)
  )
hh
```

The next step is we need to calculate how many trips the members of each
household in the data took. To do this, we can use `summarise` to count the number of
trip rows for each household. Then, we can `pivot_wider` to spread the trips
out by purpose.

```{r tp_trip}
trips <- nhts_trips %>% 
  # filter to households in the data
  filter(houseid %in% hh$houseid) %>%
  group_by(houseid, trippurp) %>%
  # count up how many trips each household took
  summarise(trips = n()) %>%
  # "spread" the data, filling zero if no trips were taken
  pivot_wider(id_cols = houseid, names_from = trippurp, 
              values_from = trips, values_fill = 0)
trips
```

Now, we will `join` the trips data frame to the households data frame so that
everything is in one place. Note that when we do this, there will be some households
that never made any trips; we need to change their trip counts from `NA` to `0`.


```{r tp_hh_file}
# function to change NA to 0
nato0 <- function(x) {ifelse(is.na(x), 0, x)}

tripprod <- hh %>% 
  # join tables by id field
  left_join(trips, by = "houseid") %>%
  # change all NA values in columns from the trips data to 0.a
  mutate_at(vars(names(trips)), nato0)
tripprod
```

Now we can count up the number of trips by grouping the variables we care 
about and taking the average. For instance, we can get the average HBO trip rate
for households by size and vehicle count. Remember to weight!

```{r hbo_tripprod}
hbo_tripprod <- tripprod %>%
  group_by(hhsize, hhvehcnt) %>%
  summarise(
    n = n(), # number of households in category
    HBO = weighted.mean(HBO, wthhfin), # average HBO trips per hh
  )
hbo_tripprod
```


## Trip Attraction

Trip attraction models estimate how many trips will be attracted to a particular
zone. This is a function of how many jobs of different kinds are in a zone, in
addition to other elements of a zone. Trip attraction models are often a linear
regression model.

The NHTS cannot be used to estimate trip attraction models because we do not 
know how many trips went to each TAZ; we only see the household side of the
survey. So we will use a file that I have prepared from the Puget Sound Regional 
Council (PSRC, Seattle) household travel survey. This file is available [on Box](https://byu.box.com/shared/static/7ci8vomip719bdno7xl5ftjj940dausm.csv). You
can download it and read it directly into an R session with the `read_csv()` function,

```{r load_attractions}
psrc_attractions <- read_csv("https://byu.box.com/shared/static/7ci8vomip719bdno7xl5ftjj940dausm.csv")

psrc_attractions
```

This file has, for every tract in the Seattle metro region, how many trips were
attracted to the tract by purpose as well as the households and jobs by type in
that tract. Let's look at the relationship between HBW trips and total
employment:

```{r totemp-hbw}
ggplot(psrc_attractions, aes(x = totemp, y = HBW)) + 
  geom_point() + stat_smooth(method = "lm")
```

We can estimate a linear regression model with the `lm` function. In this function
we specify the model as `y ~ x + ...`.

```{r lm-totemp-hbw}
lm_hbw_totemp <- lm(HBW ~ totemp, data = psrc_attractions)
summary(lm_hbw_totemp)
```

Let's estimate a more complex for HBO trips with many predictors.

```{r hbo_rates}
hbo_rates <- lm(HBO ~ tothh + retl + manu + offi + gved + othr, 
                data = psrc_attractions)
summary(hbo_rates)
```

The $R^2$ statistic is not particularly good, but it really never will be with
this kind of data. More important is the relationship with the residuals. It's
also not very good; there are a few outliers and quite a bit of
heteroskedasticity. But there may be things we can try to make it better.

```{r fitted}
plot(hbo_rates, which = 1)
```


## Homework {-#hw-generation}

1. Calculate the trip rates for each purpose by household size, and by income
group. Do the rates make sense? Why or why not? 

1. Calculate the trip rates for each purpose by the number of household workers
and the vehicle availability. Do the rates make sense? Why or why not?

1. Calculate the *variance* or *standard deviation* in work trip rates by
household size / vehicles and by number of workers / vehicles. Which
classification should be used for work trips?

1.  Calculate the number of households in each classification (size /
vehicles), (workers / vehicles). What information does this give you about the
estimated trip rates?

1. Estimate trip rate attraction models for all the trip purposes.
Present models with only significant or influential factors (try a few different
specifications until you are satisfied with your models' performance)

1. Explain your attraction rate models; do the rates make sense? Which
models have the best fit in terms of $R^2$ value? Why?

1. Remove the intercept from your model estimations. In R, you can do
this by adding a `-1` to the formula, as in `lm(y ~ x - 1)`. Do the rates
change? By how much? Should you keep the intercept in or remove it?

## Lab  {-#lab-generation}

In this lab you will implement and calibrate trip generation rates for the RVTPO
model.


### Trip Production

The trip production rates are stored in the `params/trip_prod/` folder, with a
dbf file for each trip purpose in the model. We are going to calibrate the
following trip purposes:

  - `HBW`: cross-classification model of workers and vehicles available
  - `HBO`: cross-classification model of household persons and vehicles available
  - `HBShop`: cross-classification model of household persons and vehicles available
  
The household travel survey for the RVTPO region reported the following total
weighted trips in these trip purposes:

| Purpose    |   Weighted Survey Trips   |
|:-----------|--------------------------:|
|HBW         |          118,653          |
|HBO         |          267,987          |
|HBShop      |          129,614          |

Begin by running the RVTPO model through the Trip Generation
submodel. The trip productions for each trip purpose are recorded in
`Base/outputs/HH_PROD.dbf`. Using this file, create a report that sums the trips
produced in each of these three purposes. You can read this file in R using
the `read.dbf()` function in the `foreign` library, and then sum all columns in
this table using the `summarize_all` function.

```{r hhprod}
# read roanoke household trip productions
rvtpo_productions <- foreign::read.dbf("data/HH_PROD.DBF") %>%
  as_tibble()
# show first 10 rows
rvtpo_productions
rvtpo_productions %>%
  summarize_all( sum )
```

Obviously the bare model is not very close to the targets. Replace the bare
model rates with rates you estimated from the NHTS during your homework
assignment. Run your tabulation report again, and compare the total productions
to the regional targets.

Adjust the trip rates so that they replicate the regional survey targets within
an acceptable margin of error. Ensure that the comparative relationship between
the trip rates is maintained (i.e. households with more workers must make at
least as many work trips).

### Trip Attraction

The trip attraction rates enter into the destination choice model, but it is
helpful to compare the total forecasted attractions using the rates you
estimated in your regression models. Apply the rates to the zonal socioeconomic data file and calculate
the total number of attracted trips for each of the three purposes. Adjust the
rates so that the total attractions for your three estimated purposes match the
regional targets above.

### Report

Prepare a technical report describing the process by which you estimated
household trip production and attraction rates, and calibrated the trip rates to
reproduce regional totals. Calculate the margin of error from the targets for each purpose. Discuss how well the calibrated trip
generation/attraction models replicate the survey targets, and justify your residual
error. Use formatted tables to display results instead of screenshots of output.

You will be graded on the overall readability, flow, formatting and grammar in
addition to how clearly you articulate the process of your work.