# Linear Regression {#linear-regression}
```{r ch7-setup, include=FALSE}
# Set global R options
options(scipen = 999)
# Set the graphical theme
ggplot2::theme_set(ggplot2::theme_light())
# Set global knitr chunk options
knitr::opts_chunk$set(
cache = FALSE,
warning = FALSE,
message = FALSE,
collapse = TRUE,
fig.align = "center"
)
```
Linear regression, a staple of statistical modeling from the precomputer age of statistics, is one of the simplest algorithms for supervised learning. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches described in later chapters, linear regression is still a useful and widely applied statistical learning method. Moreover, it serves as a good jumping-off point for newer approaches; as we will see in later chapters, many fancy statistical learning approaches can be seen as generalizations of or extensions to linear regression. Consequently, it is important to have a good understanding of linear regression before studying more complex learning methods. This chapter introduces linear regression with an emphasis on prediction, rather than inference. An excellent and comprehensive overview of linear regression is provided in @kutner-2005-applied. See @faraway-2016-linear for a discussion of linear regression in R (the book's website also provides Python scripts).
## Prerequisites
In this chapter, we'll make use of the following packages:
```{r 07-pkgs, message=FALSE}
library(caret) # for cross-validation and model training functions
library(dplyr) # for data manipulation
library(ggplot2) # for awesome graphics
library(rsample) # for data splitting
library(vip) # for variable importance plots
library(modelr)  # for additional modeling helper functions
```
To illustrate linear regression concepts, we'll continue with the Ames housing data, where our intent is to predict `Sale_Price` (and `log(Sale_Price)`). As discussed in Section \@ref(reg-perf-split) (_Data splitting_), we'll set aside part of our data for training and use the remainder to assess generalization error.
```{r 07-data-import}
# Construct train (70%) and test (30%) sets for the Ames housing data.
ames <- AmesHousing::make_ames() # Ames housing data
set.seed(123) # for reproducibility
ames_split <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
trn <- training(ames_split) # training data
tst <- testing(ames_split) # test data
```
## Simple linear regression
In Section \@ref(correlation) we discussed the use of Pearson's correlation coefficient to quantify the strength of the linear association between two continuous variables. In this section, we seek to fully characterize that linear relationship. _Simple linear regression_ (SLR) assumes that the statistical relationship between two continuous variables is (at least approximately) linear:
\begin{equation}
(\#eq:lm)
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad \text{for } i = 1, 2, \dots, n,
\end{equation}
where $Y_i$ represents the _i_-th response value, $X_i$ represents the _i_-th feature value, $\beta_0$ and $\beta_1$ are fixed, but unknown, constants (commonly referred to as coefficients or parameters) that represent the intercept and slope of the regression line, respectively, and $\epsilon_i$ represents noise or random error. In this chapter, we'll assume that the errors are normally distributed with mean zero and constant variance $\sigma^2$, denoted $\epsilon_i \stackrel{iid}{\sim} N\left(0, \sigma^2\right)$. Since the random errors are centered around zero (i.e., $E\left(\epsilon\right) = 0$), linear regression is really a problem of estimating a _conditional mean_:
\begin{equation}
E\left(Y_i | X_i\right) = \beta_0 + \beta_1 X_i.
\end{equation}
For brevity, we often drop the conditional piece and write $E\left(Y | X\right) = E\left(Y\right)$. Consequently, the interpretation of the coefficients is in terms of the average, or mean, response. For example, the intercept $\beta_0$ represents the average response value when $X = 0$ (it is often not meaningful or of interest and is sometimes referred to as a _bias term_). The slope $\beta_1$ represents the increase in the average response per one-unit increase in $X$ (i.e., it is a _rate of change_).
### Estimation
Ideally, we want estimates of $\beta_0$ and $\beta_1$ that give us the "best fitting" line. But what is meant by "best fitting"? The most common approach is to use the method of _least squares_ (LS) estimation; this form of linear regression is often referred to as _ordinary least squares_ (OLS) regression. There are multiple ways to measure "best fitting", but the LS criterion finds the "best fitting" line by minimizing the _residual sum of squares_ (RSS):
\begin{equation}
(\#eq:least-squares-simple)
RSS\left(\beta_0, \beta_1\right) = \sum_{i=1}^n\left[Y_i - \left(\beta_0 + \beta_1 X_i\right)\right]^2 = \sum_{i=1}^n\left(Y_i - \beta_0 - \beta_1 X_i\right)^2.
\end{equation}
The LS estimates of $\beta_0$ and $\beta_1$ are denoted $\widehat{\beta}_0$ and $\widehat{\beta}_1$, respectively. Once obtained, we can generate predicted values, say at $X = X_{new}$, using the estimated regression equation:
\begin{equation}
\widehat{Y}_{new} = \widehat{\beta}_0 + \widehat{\beta}_1 X_{new},
\end{equation}
where $\widehat{Y}_{new} = \widehat{E\left(Y_{new} | X = X_{new}\right)}$ is the estimated mean response at $X = X_{new}$.
With the Ames housing data, suppose we wanted to model a linear relationship between above ground living space (`Gr_Liv_Area`) and sale price (`Sale_Price`). To fit an OLS regression model in R we can use the `lm()` function:
```{r 07-model1}
model1 <- lm(Sale_Price ~ Gr_Liv_Area, data = trn)
```
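Once the model is fit, predicted mean sale prices at new values of `Gr_Liv_Area` can be obtained with `predict()`. The square footage values below are arbitrary and purely illustrative:
```{r 07-model1-predict}
# Predicted mean sale price at two illustrative amounts of living space
new_homes <- data.frame(Gr_Liv_Area = c(1000, 2000))
predict(model1, newdata = new_homes)
```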
The fitted model (`model1`) is displayed in the left plot in Figure \@ref(fig:07-visualize-model1) where the points represent the values of `Sale_Price` in the training data. In the right plot of Figure \@ref(fig:07-visualize-model1), the vertical lines represent the individual errors, called _residuals_, associated with each observation. The OLS criterion in Equation \@ref(eq:least-squares-simple) identifies the "best fitting" line that minimizes the sum of squares of these residuals.
```{r 07-visualize-model1, eval=TRUE, fig.width=10, fig.height=3.5, echo=FALSE, fig.cap="The least squares fit from regressing `Sale_Price` on `Gr_Liv_Area` for the Ames housing data. _Left_: Fitted regression line. _Right_: Fitted regression line with vertical grey bars representing the residuals."}
# Fitted regression line (full training data)
p1 <- model1 %>%
broom::augment() %>%
ggplot(aes(Gr_Liv_Area, Sale_Price)) +
geom_point(size = 1, alpha = 0.3) +
geom_smooth(se = FALSE, method = "lm") +
scale_y_continuous(labels = scales::dollar) +
ggtitle("Fitted regression line")
# Fitted regression line (with residuals)
p2 <- model1 %>%
broom::augment() %>%
ggplot(aes(Gr_Liv_Area, Sale_Price)) +
geom_segment(aes(x = Gr_Liv_Area, y = Sale_Price,
xend = Gr_Liv_Area, yend = .fitted),
alpha = 0.3) +
geom_point(size = 1, alpha = 0.3) +
geom_smooth(se = FALSE, method = "lm") +
scale_y_continuous(labels = scales::dollar) +
ggtitle("Fitted regression line (with residuals)")
# Side-by-side plots
gridExtra::grid.arrange(p1, p2, nrow = 1)
```
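As a quick sanity check of the LS criterion in Equation \@ref(eq:least-squares-simple), the RSS evaluated at the LS estimates is smaller than the RSS at a nearby, perturbed line (a small illustrative sketch):
```{r 07-rss-check}
# RSS as a function of the intercept and slope
rss <- function(b0, b1) {
  sum((trn$Sale_Price - (b0 + b1 * trn$Gr_Liv_Area))^2)
}
rss(coef(model1)[1L], coef(model1)[2L])       # at the LS estimates
rss(coef(model1)[1L], coef(model1)[2L] + 10)  # at a slightly perturbed slope
```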
The `coef()` function extracts the estimated coefficients from the model. We can also use `summary()` to get a more detailed report of the model results.
```{r 07-model1-summary}
summary(model1)
```
The estimated coefficients from our model are $\widehat{\beta}_0 =$ `r round(coef(model1)[1L], digits = 2)` and $\widehat{\beta}_1 =$ `r round(coef(model1)[2L], digits = 2)`. To interpret, we estimate that the mean selling price increases by `r round(coef(model1)[2L], digits = 2)` dollars for each additional square foot of above ground living space. This simple description of the relationship between the sale price and square footage using a single number (i.e., the slope) is what makes linear regression such an intuitive and popular modeling tool.
One drawback of the LS procedure in linear regression is that it only provides estimates of the coefficients; it does not provide an estimate of the error variance $\sigma^2$! Note that LS makes no assumptions about the random errors. These assumptions are important for inference and for estimating the error variance, which we're assuming is constant with a value of $\sigma^2$. One way to estimate $\sigma^2$ (which is required for characterizing the variability of our fitted model) is to use the method of _maximum likelihood_ (ML) estimation [see @kutner-2005-applied, Section 1.7, for details]. The ML procedure requires that we assume a particular distribution for the random errors. Most often, we assume the errors to be normally distributed. In practice, under the usual assumptions stated above, an unbiased estimate of the error variance is given as the sum of the squared residuals divided by $n - p$ (where $p$ is the number of regression coefficients or parameters in the model):
\begin{equation}
\widehat{\sigma}^2 = \frac{1}{n - p}\sum_{i = 1} ^ n r_i ^ 2,
\end{equation}
where $r_i = Y_i - \widehat{Y}_i$ is referred to as the _i_-th residual (i.e., the difference between the _i_-th observed and predicted response value). The quantity $\widehat{\sigma}^2$ is also referred to as the _mean square error_ (MSE) and is often used for comparing regression models (typically, the MSEs are computed on a separate validation set or using cross-validation). Its square root, denoted RMSE (for root mean square error), is also useful as it is in the same units as the response variable. In R, the RMSE of a linear model can be extracted using the `sigma()` function:
```{r model1-sigma}
sigma(model1) # RMSE
sigma(model1)^2 # MSE
```
Note that the RMSE is also reported as the `Residual standard error` in the output from `summary()`.
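We can verify this value by computing it directly from the residuals:
```{r 07-sigma-check}
# RMSE computed from the residuals: sqrt(RSS / (n - p)); matches sigma(model1)
n <- nobs(model1)
p <- length(coef(model1))
sqrt(sum(residuals(model1)^2) / (n - p))
```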
### Inference
How accurate are the LS estimates of $\beta_0$ and $\beta_1$? Point estimates by themselves are not very useful; it is often desirable to associate some measure of an estimate's variability. The variability of an estimate is often measured by its _standard error_ (SE)---the square root of its variance. If we assume that the errors in the linear regression model are $\stackrel{iid}{\sim} N\left(0, \sigma^2\right)$, then simple expressions for the SEs of the estimated coefficients exist and are displayed in the column labeled `Std. Error` in the output from `summary()`. From this, we can also derive simple $t$-tests to assess whether the individual coefficients are significantly different from zero. The _t_-statistics for such tests are nothing more than the estimated coefficients divided by their corresponding estimated standard errors (i.e., in the output from `summary()`, `t value` = `Estimate` / `Std. Error`). The reported _t_-statistics measure the number of standard errors each coefficient is away from 0. Thus, large _t_-statistics (greater than two in absolute value, say) roughly indicate statistical significance at the $\alpha = 0.05$ level. The _p_-values for these tests are also reported by `summary()` in the column labeled `Pr(>|t|)`.
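For example, dividing the estimated coefficients by their estimated standard errors reproduces the `t value` column reported by `summary()`:
```{r 07-tstat-check}
# t value = Estimate / Std. Error
coefs <- summary(model1)$coefficients
coefs[, "Estimate"] / coefs[, "Std. Error"]
```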
Under the same assumptions, we can also derive confidence intervals for the coefficients. The formula for the traditional $100\left(1 - \alpha\right)$% confidence interval for $\beta_j$ is
\begin{equation}
\widehat{\beta}_j \pm t_{1 - \alpha / 2, n - p} \widehat{SE}\left(\widehat{\beta}_j\right),
(\#eq:conf-int)
\end{equation}
where $t_{1 - \alpha / 2, n - p}$ is the $1 - \alpha/2$ quantile of a $t$-distribution with $n - p$ degrees of freedom. In R, we can construct such (one-at-a-time) confidence intervals for each coefficient using `confint()`. For example, 95% confidence intervals for the coefficients in our SLR example can be computed using
```{r}
confint(model1, level = 0.95)
```
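Applying Equation \@ref(eq:conf-int) directly to the slope reproduces the corresponding row of the `confint()` output (a quick check):
```{r 07-confint-check}
# 95% CI for the slope computed "by hand"
est <- coef(summary(model1))["Gr_Liv_Area", ]
est["Estimate"] + c(-1, 1) * qt(0.975, df = df.residual(model1)) * est["Std. Error"]
```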
To interpret, we estimate with 95% confidence that the mean selling price increases between `r round(confint(model1)[2L, 1L], digits = 2)` and `r round(confint(model1)[2L, 2L], digits = 2)` dollars for each additional square foot of above ground living space. We can also conclude that the slope $\beta_1$ is significantly different from zero (or any other pre-specified value not included in the interval) at the $\alpha = 0.05$ level. This is also supported by the output from `summary()`.
```{block2, type="note"}
Most statistical software, including R, will include estimated standard errors, _t_-statistics, etc. as part of its regression output. However, it is important to remember that such quantities depend on three major assumptions of the linear regression model:
1. Independent observations
2. The random errors have mean zero and constant variance
3. The random errors are normally distributed
If any or all of these assumptions are violated, then remedial measures need to be taken. For instance, _weighted least squares_ (and other procedures) can be used when the constant variance assumption is violated. Transformations (of both the response and features) can also help to correct departures from these assumptions.
```
## Multiple linear regression {#multi-lm}
### TODO:
1. Qualitative variables/factors
2. Interactions
3. Residual analysis
In practice, we often have more than one predictor. For example, with the Ames housing data, we may wish to understand if above ground square footage (`Gr_Liv_Area`) and the year the house was built (`Year_Built`) are (linearly) related to sale price (`Sale_Price`). We can extend the SLR model so that it can directly accommodate multiple predictors; this is referred to as a _multiple linear regression_ (MLR) model. With two predictors, the MLR model becomes:
\begin{equation}
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon,
\end{equation}
where $X_1$ and $X_2$ are features of interest. In our Ames housing example, $X_1$ represents `Gr_Liv_Area` and $X_2$ represents `Year_Built`.
In R, multiple linear regression models can be fit by separating all the features of interest with a `+`:
```{r model2}
(model2 <- lm(Sale_Price ~ Gr_Liv_Area + Year_Built, data = trn))
```
Alternatively, we can use `update()` to update the model formula used in `model1`. The new formula can use a `.` as shorthand for keeping everything on either the left or right hand side of the formula, and a `+` or `-` can be used to add or remove terms from the original model, respectively. In the case of adding `Year_Built` to `model1`, we could've used:
```{r model2-using-update}
(model2 <- update(model1, . ~ . + Year_Built))
```
The LS estimates of the regression coefficients are $\widehat{\beta}_1 =$ `r round(coef(model2)[2L], digits = 3)` and $\widehat{\beta}_2 =$ `r round(coef(model2)[3L], digits = 3)` (the estimated intercept is `r round(coef(model2)[1L], digits = 3)`). In other words, every one square foot increase in above ground square footage is associated with an additional `r scales::dollar(coef(model2)[2L])` in **mean selling price** when holding the year the house was built constant. Likewise, every one year increase in `Year_Built` (i.e., a newer home) is associated with an additional `r scales::dollar(coef(model2)[3L])` in mean selling price when holding the above ground square footage constant.
A contour plot of the fitted regression surface is displayed in the left side of Figure \@ref(fig:07-mlr-fit) below. Note how the fitted regression surface is flat (i.e., it does not twist or bend). This is true for all linear models that include only *main effects* (i.e., terms involving only a single predictor). One way to model curvature is to include *interaction effects*. An interaction occurs when the effect of one predictor on the response depends on the values of other predictors. In linear regression, interactions are captured via products of features (i.e., $X_1 \times X_2$). A model with two main effects can also include a two-way interaction. For example, to include an interaction between $X_1 =$ `Gr_Liv_Area` and $X_2 =$ `Year_Built`, we would include an additional product term:
\begin{equation}
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon.
\end{equation}
Note that in R, we use the `:` operator to include an interaction (technically, we could use `*` as well, but `x1 * x2` is shorthand for `x1 + x2 + x1:x2` so it is slightly redundant):
```{r model2-w-interaction}
lm(Sale_Price ~ Gr_Liv_Area + Year_Built + Gr_Liv_Area : Year_Built, data = trn)
```
A contour plot of the fitted regression surface with interaction is displayed in the right side of Figure \@ref(fig:07-mlr-fit). Note the curvature in the contour lines.
```{block, type="note"}
Interaction effects are quite prevalent in predictive modeling. Since linear models are an example of parametric modeling, it is up to the analyst to decide if and when to include interaction effects. In later chapters, we'll discuss algorithms that can automatically detect and incorporate interaction effects (albeit in different ways). It is also important to understand a concept called the *hierarchy principle*---which demands that all lower-order terms corresponding to an interaction be retained in the model---when considering interaction effects in linear regression models.
```
```{r 07-mlr-fit, echo=FALSE, fig.width=10, fig.height=4.5, fig.cap="In a three-dimensional setting, with two predictors and one response, the least squares fit becomes a plane that minimizes the sum of squared errors between the actual and predicted sale prices. These contour plots show the fitted regression surface with main effects only (left) and with a two-way interaction between `Gr_Liv_Area` and `Year_Built` (right)."}
# Fitted models
fit1 <- lm(Sale_Price ~ Gr_Liv_Area + Year_Built, data = trn)
fit2 <- lm(Sale_Price ~ Gr_Liv_Area * Year_Built, data = trn)
# Regression plane data
plot_grid <- expand.grid(
Gr_Liv_Area = seq(from = min(trn$Gr_Liv_Area), to = max(trn$Gr_Liv_Area),
length = 100),
Year_Built = seq(from = min(trn$Year_Built), to = max(trn$Year_Built),
length = 100)
)
plot_grid$y1 <- predict(fit1, newdata = plot_grid)
plot_grid$y2 <- predict(fit2, newdata = plot_grid)
# Level plots
p1 <- ggplot(plot_grid, aes(x = Gr_Liv_Area, y = Year_Built,
z = y1, fill = y1)) +
geom_tile() +
geom_contour(color = "white") +
viridis::scale_fill_viridis(name = "Predicted\nvalue", option = "inferno") +
theme_bw() +
ggtitle("Main effects only")
p2 <- ggplot(plot_grid, aes(x = Gr_Liv_Area, y = Year_Built,
                            z = y2, fill = y2)) +
geom_tile() +
geom_contour(color = "white") +
viridis::scale_fill_viridis(name = "Predicted\nvalue", option = "inferno") +
theme_bw() +
ggtitle("Main effects with two-way interaction")
gridExtra::grid.arrange(p1, p2, nrow = 1)
```
In general, we can include as many predictors as we want, as long as we have more rows than parameters! The general multiple linear regression model with *p* distinct predictors is
\begin{equation}
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon,
\end{equation}
where $X_i$ for $i = 1, 2, \dots, p$ are the predictors of interest. Note some of these may represent interactions (e.g., $X_3 = X_1 \times X_2$) between or transformations^[Transformations of the features serve a number of purposes (e.g., modeling nonlinear relationships or alleviating departures from common regression assumptions). See @kutner-2005-applied for details.] (e.g., $X_4 = \sqrt{X_1}$) of the original features. Unfortunately, visualizing beyond three dimensions is not practical as our best-fit plane becomes a hyperplane. However, the motivation remains the same: the best-fit hyperplane is identified by minimizing the RSS. The code below creates a third model where we use all features in our data set to predict `Sale_Price`.
```{block, type="tip"}
When fitting an MLR model in R with many terms, we can make use of the `.` notation for convenience. For example, the dot in `lm(Sale_Price ~ ., data = trn)` is shorthand for regressing `Sale_Price` onto all other columns in `trn`, whereas `lm(Sale_Price ~ .^2, data = trn)` will regress `Sale_Price` on all other columns in `trn` while also including all possible two-way interaction effects. The use of `.` within `lm()` is illustrated below for the Ames housing example where we now include all possible main effects.
```
```{r model3}
model3 <- lm(Sale_Price ~ ., data = trn) # include all possible main effects
broom::tidy(model3) # print estimated coefficients in a tidy data frame
```
## Assessing model accuracy
We've fit three main effects models to the Ames housing data: one with a single predictor, one with two predictors, and one with all possible predictors. But the question remains, which model is "best"? To answer this question we have to define what we mean by "best". In our case, we'll use the RMSE metric and cross-validation (Section \@ref(cv)) to determine the "best" model. We can use the `caret::train()` function to train a linear model (i.e., `method = "lm"`) using cross-validation (or a variety of other validation methods). In practice, a number of factors should be considered in determining a "best" model (e.g., time constraints, model production cost, predictive accuracy, etc.). The benefit of __caret__ is that it provides built-in cross-validation capabilities, whereas the `lm()` function does not^[Although general cross-validation is not available in `lm()` alone, a simple metric called the _PRESS_ statistic, for __PRE__dictive __S__um of __S__quares (equivalent to a _leave-one-out_ cross-validated RMSE), can be computed by summing the squared PRESS residuals, which are available via `rstandard(<lm-model-name>, type = "predictive")`. See `?rstandard` for details.]. The following code chunk uses `caret::train()` to refit `model1` using 10-fold cross-validation:
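For instance, the LOOCV RMSE mentioned in the footnote can be computed for `model1` from its PRESS residuals (a quick sketch):
```{r 07-press-loocv}
# LOOCV RMSE for model1 via the PRESS (predictive) residuals
press_resid <- rstandard(model1, type = "predictive")
sqrt(mean(press_resid^2))
```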
```{r model1-accuracy}
# Use caret package to train model using 10-fold cross-validation
set.seed(123) # for reproducibility
(cv_model1 <- train(
form = Sale_Price ~ Gr_Liv_Area,
data = trn,
method = "lm",
trControl = trainControl(method = "cv", number = 10)
))
```
The resulting cross-validated RMSE is `r scales::dollar(cv_model1$results$RMSE)` (this is the average RMSE across the 10 cross-validation folds). How should we interpret this? When applied to unseen data, the predictions this model makes are, on average, about `r scales::dollar(cv_model1$results$RMSE)` off from the actual sale price.
We can perform cross-validation on the other two models in a similar fashion, which we do in the code chunk below.
```{r mult-models}
# 10-fold CV for the two-predictor model
set.seed(123)
cv_model2 <- train(
Sale_Price ~ Gr_Liv_Area + Year_Built,
data = trn,
method = "lm",
trControl = trainControl(method = "cv", number = 10)
)
# 10-fold CV for the model with all main effects
set.seed(123)
cv_model3 <- train(
Sale_Price ~ .,
data = trn,
method = "lm",
trControl = trainControl(method = "cv", number = 10)
)
# Extract out of sample performance measures
summary(resamples(list(
model1 = cv_model1,
model2 = cv_model2,
model3 = cv_model3
)))
```
Extracting the results for each model, we see that adding more information via more predictors improves the out-of-sample cross-validation performance metrics. Specifically, our average prediction RMSE reduces from `r scales::dollar(cv_model2$results$RMSE)` (the model with two predictors) down to `r scales::dollar(cv_model3$results$RMSE)` (for our full model). In this case, the model with all possible main effects performs the "best" (compared with the other two).
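If desired, the mean cross-validated RMSE for each fit can also be pulled directly from the `results` element of each `train` object (a quick sketch):
```{r 07-cv-rmse-extract}
# Mean CV RMSE for each of the three models
c(model1 = cv_model1$results$RMSE,
  model2 = cv_model2$results$RMSE,
  model3 = cv_model3$results$RMSE)
```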
## Model concerns
As previously stated, linear regression has been a popular modeling tool due to the ease of interpreting the coefficients. However, linear regression makes several strong assumptions that are often violated as we include more predictors in our model. Violation of these assumptions can lead to flawed interpretation of the coefficients and prediction results.
__1. Linear relationship:__ Linear regression assumes a linear relationship between the predictor and the response variable. When a linear relationship does not hold, the estimated coefficient misleadingly implies a constant rate of change across the predictor's range. However, as discussed in Chapter \@ref(descriptive), non-linear relationships can be made linear (or near-linear) by applying power transformations to the response and/or predictor. For example, Figure \@ref(fig:07-linear-relationship) illustrates the relationship between sale price and the year a home was built. The left plot illustrates the non-linear relationship that exists. However, we can achieve a near-linear relationship by log transforming sale price, although some non-linearity still exists for older homes.
```{r 07-linear-relationship, fig.align='center', fig.width=8, fig.height=3.5, fig.cap="Linear regression assumes a linear relationship between the predictor(s) and the response variable; however, non-linear relationships can often be altered to be near-linear by applying a transformation to the variable(s)."}
p1 <- ggplot(trn, aes(Year_Built, Sale_Price)) +
geom_point(size = 1, alpha = .4) +
geom_smooth(se = FALSE) +
scale_y_continuous("Sale price", labels = scales::dollar) +
xlab("Year built") +
ggtitle("Non-transformed variables with a \nnon-linear relationship.")
p2 <- ggplot(trn, aes(Year_Built, Sale_Price)) +
geom_point(size = 1, alpha = .4) +
geom_smooth(method = "lm", se = FALSE) +
scale_y_log10("Sale price", labels = scales::dollar, breaks = seq(0, 400000, by = 100000)) +
xlab("Year built") +
ggtitle("Transforming variables can provide a \nnear-linear relationship.")
gridExtra::grid.arrange(p1, p2, nrow = 1)
```
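If we wanted to model on the log scale directly, the transformation can be applied right in the model formula; a minimal sketch is shown below (we continue working with the untransformed response for the remainder of this chapter):
```{r 07-log-response-sketch}
# Fit the simple model with a log-transformed response (illustration only)
model1_log <- lm(log(Sale_Price) ~ Year_Built, data = trn)
coef(model1_log)
```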
__2. Constant variance among residuals:__ Linear regression assumes the variance among error terms ($\epsilon_1, \epsilon_2, \dots, \epsilon_n$) is constant (also referred to as homoscedasticity). When the error variance is not constant, the _p_-values and confidence intervals of the coefficients are invalid, as are the resulting prediction estimates and intervals. Similar to the linear relationship assumption, non-constant variance can often be resolved with variable transformations or by including additional predictors. For example, Figure \@ref(fig:07-homoskedasticity) shows residuals across predicted values for our linear regression models `model1` and `model3`. `model1` displays a classic violation of constant variance with cone-shaped residuals. However, `model3` appears to have near-constant variance.
```{block, type="tip"}
The `broom::augment()` function is an easy way to add model results to each observation (e.g., predicted values and residuals).
```
```{r 07-homoskedasticity, fig.align='center', fig.width=8, fig.height=3.5, fig.cap="Linear regression assumes constant variance among the residuals. `model1` (left) shows definitive signs of heteroskedasticity whereas `model3` (right) appears to have constant variance."}
df1 <- broom::augment(cv_model1$finalModel, data = trn)
p1 <- ggplot(df1, aes(.fitted, .resid)) +
geom_point(size = 1, alpha = .4) +
xlab("Predicted values") +
ylab("Residuals") +
ggtitle("Model 1",
subtitle = "Sale_Price ~ Gr_Liv_Area")
df2 <- broom::augment(cv_model3$finalModel, data = trn)
p2 <- ggplot(df2, aes(.fitted, .resid)) +
geom_point(size = 1, alpha = .4) +
xlab("Predicted values") +
ylab("Residuals") +
ggtitle("Model 3",
subtitle = "Sale_Price ~ .")
gridExtra::grid.arrange(p1, p2, nrow = 1)
```
__3. No autocorrelation:__ Linear regression assumes the error terms are independent and uncorrelated. If, in fact, there is correlation among the error terms, then the estimated standard errors of the coefficients will be biased, leading to prediction intervals that are narrower than they should be. For example, the left plot in Figure \@ref(fig:07-autocorrelation) displays the residuals (y-axis) against the observation ID (x-axis) for `model1`. A clear pattern exists, suggesting that information about $\epsilon_1$ provides information about $\epsilon_2$.
This pattern is a result of the data being ordered by neighborhood, which we have not accounted for in this model. Consequently, the residuals for homes in the same neighborhood are correlated (homes within a neighborhood are typically similar in size and often contain similar features). Since the `Neighborhood` predictor is included in `model3` (right plot), the errors are no longer correlated.
```{r 07-autocorrelation, fig.align='center', fig.width=8, fig.height=3.5, fig.cap="Linear regression assumes uncorrelated errors. The residuals in `model1` (left) have a distinct pattern suggesting that information about $\\epsilon_1$ provides information about $\\epsilon_2$, whereas the residuals in `model3` (right) show no signs of autocorrelation."}
df1 <- mutate(df1, id = row_number())
df2 <- mutate(df2, id = row_number())
p1 <- ggplot(df1, aes(id, .resid)) +
geom_point(size = 1, alpha = .4) +
xlab("Row ID") +
ylab("Residuals") +
ggtitle("Model 1",
subtitle = "Correlated residuals.")
p2 <- ggplot(df2, aes(id, .resid)) +
geom_point(size = 1, alpha = .4) +
xlab("Row ID") +
ylab("Residuals") +
ggtitle("Model 3",
subtitle = "Uncorrelated residuals.")
gridExtra::grid.arrange(p1, p2, nrow = 1)
```
__4. More observations than predictors:__ Although not an issue with the Ames housing data, when the number of features exceeds the number of observations ($p > n$), the OLS solution matrix is not invertible. This causes significant issues because it means the least-squares estimates are not unique; in fact, there is an infinite set of solutions, so we lose our ability to meaningfully interpret the coefficients. To resolve this issue an analyst can remove variables until $p < n$ and then fit an OLS regression model. Although an analyst can use pre-processing tools to guide this manual approach [@apm, 43-47], it can be cumbersome and prone to errors. Alternatively, Chapter \@ref(regularize) introduces regularized regression, which provides an alternative linear regression technique for the $p > n$ setting.
__5. No or little multicollinearity:__ _Collinearity_ refers to the situation in which two or more predictor variables are closely related to one another. The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response. In fact, collinearity can cause predictor variables to appear as statistically insignificant when in fact they are significant. This leads to inaccurate interpretation of the coefficients and makes it difficult to identify influential predictors.
For example, in our data, `Garage_Area` and `Garage_Cars` are two variables that have a correlation of `r round(cor(trn$Garage_Area, trn$Garage_Cars), 2)`, and both variables are strongly correlated with our response variable (`Sale_Price`). Looking at our full model where both of these variables are included, we see that `Garage_Area` is found to be statistically significant but `Garage_Cars` is not.
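We can verify this correlation directly:
```{r 07-garage-cor}
# Correlation between the two garage-related predictors
cor(trn$Garage_Area, trn$Garage_Cars)
```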
```{r, collinearity-1}
# fit with two strongly correlated variables
summary(cv_model3) %>%
  broom::tidy() %>%
filter(term %in% c("Garage_Area", "Garage_Cars"))
```
However, if we refit the full model without `Garage_Area`, the coefficient estimate for `Garage_Cars` increases threefold and becomes statistically significant.
```{r collinearity-2}
# model without Garage_Area
set.seed(123)
mod_wo_Garage_Area <- train(
Sale_Price ~ .,
data = select(trn, -Garage_Area),
method = "lm",
trControl = trainControl(method = "cv", number = 10)
)
summary(mod_wo_Garage_Area) %>%
  broom::tidy() %>%
filter(term == "Garage_Cars")
```
This reflects the instability in the linear regression model caused by the between-predictor relationships, and this instability gets propagated directly to the model predictions. Considering 16 of our 34 numeric predictors have medium to strong correlation (Section \@ref(pca)), the biased coefficients of these predictors are likely restricting the predictive accuracy of our model. How can we control for this problem? One option is to manually remove one of the offending predictors. However, when the number of predictors is large, as in our case, this becomes difficult. Moreover, relationships between predictors can become complex and involve many predictors. In these cases, manual removal of specific predictors may not be possible. Consequently, the following sections offer two simple extensions of linear regression where dimension reduction is applied prior to performing linear regression. Chapter \@ref(regularize) offers a modified regression approach that helps to deal with the problem, and later chapters provide alternative methods that are not affected by multicollinearity.
## Principal component regression
As discussed in Chapter \@ref(unsupervised), principal components analysis can be used to represent correlated variables in a lower dimension and the resulting components can be used as predictors in the linear regression model. This two-step process is known as __principal component regression__ (PCR) [@massy1965principal].
Performing PCR with __caret__ is an easy extension from our previous model. We simply change the `method` argument to `"pcr"` within `train()` to perform PCA on all our numeric predictors prior to fitting the multiple regression. Often, we can greatly improve performance by only using a small subset of all principal components as predictors. Consequently, you can think of the number of principal components as a tuning parameter (see Section \@ref(tune)). The following performs cross-validated PCR with $1, 2, \dots, 20$ principal components, and Figure \@ref(fig:pcr-regression) illustrates the cross-validated RMSE. You can see a significant drop in prediction error using just five principal components followed by a gradual decrease. Using 17 principal components provided the lowest RMSE of \$35,769.99 (see `cv_model_pcr` for a comparison of the cross-validated results).
```{block, type="warning"}
Per Section \@ref(pca), don't forget to center and scale your predictors, which you can do by incorporating the `preProcess` argument.
```
```{r pcr-regression, fig.height=3.5, fig.width=6, fig.cap="The 10-fold cross-validation RMSE obtained using PCR with 1-20 principal components."}
# perform 10-fold cross validation on a PCR model tuning the number of
# principal components to use as predictors from 1-20
set.seed(123)
cv_model_pcr <- train(
Sale_Price ~ .,
data = trn,
method = "pcr",
trControl = trainControl(method = "cv", number = 10),
preProcess = c("zv", "center", "scale"),
tuneLength = 20
)
# model with lowest RMSE
cv_model_pcr$bestTune
# plot cross-validated RMSE
plot(cv_model_pcr)
```
By controlling for multicollinearity with PCR, we saw a significant improvement in our predictive accuracy (reducing the out-of-sample RMSE from about \$41,438 down to \$35,770). However, since PCR is a two-step process, the PCA step does not consider any aspects of the response when it selects the components. Consequently, the new predictors produced by the PCA step are not designed to maximize the relationship with the response; instead, PCA simply seeks to reduce the variability present throughout the predictor space. If that variability happens to be related to the response variability, then PCR has a good chance to identify a predictive relationship, as in our case. If, however, the variability in the predictor space is not related to the variability of the response, then PCR can have difficulty identifying a predictive relationship when one actually exists (i.e., we may even experience a decrease in our predictive accuracy). Thus, an alternative approach to reduce the impact of multicollinearity is partial least squares.
## Partial least squares
_Partial least squares_ (PLS) can be viewed as a supervised dimension reduction procedure [@apm]. Similar to PCR, this technique also constructs a set of linear combinations of the inputs for regression, but unlike PCR it uses the response variable to aid the construction of the components. Thus, we can think of PLS as a supervised dimension reduction procedure that finds new features that not only approximate the old features well, but are also related to the response.
__TODO: IMAGE of PCR vs PLS__
Referring back to Equation \@ref(eq:pca1), PLS will compute the first component ($z_1$) by setting each $\phi_{j1}$ to the coefficient from an SLR model of $y$ onto that respective $x_j$. One can show that this coefficient is proportional to the correlation between $y$ and $x_j$. Hence, in computing $z_1 = \sum^p_{j=1} \phi_{j1}x_j$, PLS places the highest weight on the variables that are most strongly related to the response.
To compute the second component ($z_2$), we first regress each variable on $z_1$. The residuals from these regressions capture the remaining signal that has not been explained by the first component. We substitute these residual values for the predictor values in Equation \@ref(eq:pca2). This process continues until all $m$ components have been computed, and then we use OLS to regress the response on $z_1, \dots, z_m$.
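The description above can be illustrated numerically. The sketch below computes the first PLS component from a handful of standardized numeric predictors; the choice of predictors is arbitrary and for illustration only:
```{r 07-pls-sketch}
# First PLS component: weight each (standardized) predictor by its SLR slope
x <- scale(as.matrix(trn[, c("Gr_Liv_Area", "Year_Built", "Garage_Area")]))
y <- trn$Sale_Price
phi1 <- apply(x, 2, function(xj) coef(lm(y ~ xj))[2L])  # SLR coefficients
z1 <- drop(x %*% phi1)  # first PLS component (up to scaling)
head(z1)
```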
```{block, type="note"}
See @friedman2001elements and @geladi1986partial for a thorough discussion of PLS.
```
Similar to PCR, we can easily fit a PLS model by changing the `method` argument in `caret::train()`. As with PCR, the number of components to use is a tuning parameter that is chosen to maximize predictive accuracy (minimize the RMSE in this case). The following performs cross-validated PLS with $1, 2, \dots, 20$ components, and Figure \@ref(fig:pls-regression) illustrates the cross-validated RMSE. You can see a greater drop in prediction error compared to PCR. Using PLS with $m = 10$ components provided the lowest RMSE of \$31,522.47.
```{r pls-regression, fig.height=3.5, fig.width=6, fig.cap="The 10-fold cross-validation RMSE obtained using PLS with 1-20 components."}
# perform 10-fold cross validation on a PLS model tuning the number of
# principal components to use as predictors from 1-20
set.seed(123)
cv_model_pls <- train(
Sale_Price ~ .,
data = trn,
method = "pls",
trControl = trainControl(method = "cv", number = 10),
preProcess = c("zv", "center", "scale"),
tuneLength = 20
)
# model with lowest RMSE
cv_model_pls$bestTune
# plot cross-validated RMSE
plot(cv_model_pls)
```
## Feature interpretation {#lm-model-interp}
Once we've found the model that maximizes predictive accuracy, our next goal is to interpret the model structure. Linear regression models provide a very intuitive model structure as they assume a ___monotonic linear relationship___ between the predictor variables and the response. The _linear_ part of that statement means that, for a given predictor variable, every one unit change in the predictor is associated with a constant change in the response. As discussed earlier in the chapter, this constant change is given by the predictor's coefficient. The _monotonic_ part means that a given predictor variable always has either a positive or a negative relationship with the response. But how do we determine the most influential variables?
Variable importance seeks to identify those variables that are most influential in our model. For linear regression models, this is most often measured by the absolute value of the _t_-statistic for each model parameter. For a PLS model, variable importance is based on weighted sums of the absolute regression coefficients. The weights are a function of the reduction of the RSS across the number of PLS components and are computed separately for each outcome. Therefore, the contribution of the coefficients is weighted proportionally to the reduction in the RSS.
We can use `vip::vip()` to extract and plot the most important variables. The importance measure is normalized from 100 (most important) to 0 (least important). Figure \@ref(fig:pls-vip) illustrates that the top four most important variables are `Gr_Liv_Area`, `First_Flr_SF`, `Garage_Area`, and `Garage_Cars`, respectively.
```{r pls-vip, fig.cap="Top 20 most important variables for the PLS model."}
vip(cv_model_pls, num_features = 20, method = "model")
```
As stated earlier, linear regression models assume a monotonic linear relationship. To illustrate this, we can apply partial dependence plots (PDPs). PDPs plot the change in the average predicted value ($\hat y$) as specified feature(s) vary over their marginal distribution. As you will see in later chapters, PDPs become more useful when non-linear relationships are present. However, PDPs of linear models help illustrate how a fixed change in $x_i$ relates to a fixed linear change in $\hat y_i$.
```{block, type="tip"}
The __pdp__ package [@pkg-pdp] provides convenient functions for computing and plotting PDPs. For example, the following code chunk would plot the PDP for the `Gr_Liv_Area` predictor.
`pdp::partial(cv_model_pls, pred.var = "Gr_Liv_Area", grid.resolution = 20) %>% autoplot()`
```
All four of the most important predictors have a positive relationship with sale price; however, we see that the slope ($\beta_i$) is steepest for the most important predictor and gradually decreases for less important variables.
```{r, echo=FALSE, fig.height=5, fig.width=7, fig.cap="Partial dependence plots for the first four most important variables."}
p1 <- pdp::partial(cv_model_pls, pred.var = "Gr_Liv_Area", grid.resolution = 20) %>%
autoplot() +
scale_y_continuous(limits = c(0, 300000), labels = scales::dollar)
p2 <- pdp::partial(cv_model_pls, pred.var = "First_Flr_SF", grid.resolution = 20) %>%
autoplot() +
scale_y_continuous(limits = c(0, 300000), labels = scales::dollar)
p3 <- pdp::partial(cv_model_pls, pred.var = "Garage_Area", grid.resolution = 20) %>%
autoplot() +
scale_y_continuous(limits = c(0, 300000), labels = scales::dollar)
p4 <- pdp::partial(cv_model_pls, pred.var = "Garage_Cars", grid.resolution = 4) %>%
autoplot() +
scale_y_continuous(limits = c(0, 300000), labels = scales::dollar)
gridExtra::grid.arrange(p1, p2, p3, p4, nrow = 2)
```
## Final thoughts
Linear regression is a great starting point for learning more advanced predictive analytic approaches because, in its simplest form, it is very intuitive and easy to interpret. Training a linear regression model is also easy and computationally efficient. However, due to the many assumptions required, the disadvantages of linear regression often outweigh its benefits. In our example, we saw how multicollinearity was interfering with predictive accuracy. By controlling multicollinearity with PCR and PLS we were able to improve predictive accuracy. Later chapters will build on the concepts illustrated in this chapter and will compare cross-validated performance results to identify the best predictive model. The following summarizes some of the advantages and disadvantages discussed regarding linear regression.
__FIXME: refine this section__
__Advantages__:
* Normal linear regression has no hyperparameters to tune, and PCR and PLS have only one hyperparameter each, making these methods very simple to train.
* Computationally efficient - relatively fast compared to other algorithms in this book and does not require large memory.
* Easy to interpret results.
__Disadvantages__:
* Makes strong assumptions about the data.
* Does not handle missing data - must impute or remove observations with missing values.
* Not robust to outliers as they can still bias the coefficients.
* Assumes the relationships between predictors and the response variable are monotonic and linear (always increasing or decreasing in a linear fashion).
* Typically does not perform as well as more advanced methods that allow non-monotonic and non-linear relationships (e.g., random forests, gradient boosting machines, neural networks).
* Most large data sets violate one or more of the assumptions required for linear regression to hold, which causes instability in the modeling results.
## Learning more
This will get you up and running with linear regression. Keep in mind that there is a lot more you can dig into; the following resources will help you learn more:
- [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/)
- [Applied Predictive Modeling](http://appliedpredictivemodeling.com/)
- [Elements of Statistical Learning](https://statweb.stanford.edu/~tibs/ElemStatLearn/)