From cd5a0e3719f3394d80568efdc0f12ff6618c566d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mine=20=C3=87etinkaya-Rundel?= Date: Mon, 25 Sep 2023 16:11:32 -0400 Subject: [PATCH] Update freeze --- _freeze/07-model-slr/execute-results/html.json | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_freeze/07-model-slr/execute-results/html.json b/_freeze/07-model-slr/execute-results/html.json index d83c154b..4a529532 100644 --- a/_freeze/07-model-slr/execute-results/html.json +++ b/_freeze/07-model-slr/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "2fb8b1860d00b64100626e980a788fc0", + "hash": "4c1f8a9b60fb03172f92dbb79d1bc468", "result": { - "markdown": "---\noutput: html_document\neditor_options: \n chunk_output_type: console\n---\n\n\n\n\n# Linear regression with a single predictor {#sec-model-slr}\n\n::: {.chapterintro data-latex=\"\"}\nLinear regression is a very powerful statistical technique.\nMany people have some familiarity with regression models just from reading the news, where straight lines are overlaid on scatterplots.\nLinear models can be used for prediction or to evaluate whether there is a linear relationship between a numerical variable on the horizontal axis and the average of the numerical variable on the vertical axis.\n:::\n\n## Fitting a line, residuals, and correlation {#fit-line-res-cor}\n\nWhen considering linear regression, it's helpful to think deeply about the line fitting process.\nIn this section, we define the form of a linear model, explore criteria for what makes a good fit, and introduce a new statistic called *correlation*.\n\n### Fitting a line to data\n\n@fig-perfLinearModel shows two variables whose relationship can be modeled perfectly with a straight line.\nThe equation for the line is $y = 5 + 64.96 x.$ Consider what a perfect linear relationship means: we know the exact value of $y$ just by knowing the value of $x.$ A perfect linear relationship is unrealistic in almost any natural process.\nFor example, if we took family income ($x$), this value would provide some useful information about how much financial support a college may offer a prospective student ($y$).\nHowever, the prediction would be far from perfect, since other factors play a role in financial support beyond a family's finances.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Requests from twelve separate buyers were simultaneously placed with a trading\ncompany to purchase Target Corporation stock (ticker TGT, December 28th, 2018),\nand the total cost of the shares was reported. 
Because the cost is computed using\na linear formula, the linear fit is perfect.](07-model-slr_files/figure-html/fig-perfLinearModel-1.png){#fig-perfLinearModel width=90%}\n:::\n:::\n\n\nLinear regression is the statistical method for fitting a line to data where the relationship between two variables, $x$ and $y,$ can be modeled by a straight line with some error:\n\n$$\ny = b_0 + b_1 \\ x + e\n$$\n\nThe values $b_0$ and $b_1$ represent the model's intercept and slope, respectively, and the error is represented by $e$.\nThese values are calculated based on the data, i.e., they are sample statistics.\nIf the observed data are a random sample from a target population that we are interested in making inferences about, these values are considered to be point estimates for the population parameters $\\beta_0$ and $\\beta_1$.\nWe will discuss how to make inferences about parameters of a linear model based on sample statistics in @sec-inf-model-slr.\n\n::: {.pronunciation data-latex=\"\"}\nThe Greek letter $\\beta$ is pronounced *beta*; listen to the pronunciation [here](https://youtu.be/PStgY5AcEIw?t=7).\n:::\n\nWhen we use $x$ to predict $y,$ we usually call $x$ the **predictor** variable and we call $y$ the **outcome**.\nWe also often drop the $e$ term when writing down the model since our main focus is on predicting the average outcome.\n\n\n\n\n\nIt is rare for all of the data to fall perfectly on a straight line.\nInstead, it's more common for data to appear as a *cloud of points*, such as those examples shown in @fig-imperfLinearModel.\nIn each case, the data fall around a straight line, even if none of the observations fall exactly on the line.\nThe first plot shows a relatively strong downward linear trend, where the remaining variability in the data around the line is minor relative to the strength of the relationship between $x$ and $y.$ The second plot shows an upward trend that, while evident, is not as strong as the first.\nThe last plot shows a very weak downward trend in the data, so slight we can hardly notice it.\nIn each of these examples, we will have some uncertainty regarding our estimates of the model parameters, $\\beta_0$ and $\\beta_1.$ For instance, we might wonder, should we move the line up or down a little, or should we tilt it more or less?\nAs we move forward in this chapter, we will learn about criteria for line-fitting, and we will also learn about the uncertainty associated with estimates of model parameters.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Three datasets where a linear model may be useful even though the data do\nnot all fall exactly on the line.\n](07-model-slr_files/figure-html/fig-imperfLinearModel-1.png){#fig-imperfLinearModel width=100%}\n:::\n:::\n\n\nThere are also cases where fitting a straight line to the data, even if there is a clear relationship between the variables, is not helpful.\nOne such case is shown in @fig-notGoodAtAllForALinearModel, where there is a very clear relationship between the variables even though the trend is not linear.\nWe discuss nonlinear trends in this chapter and the next, but details of fitting nonlinear models are saved for a later course.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The best fitting line for these data is flat, which is not a useful way to\ndescribe the nonlinear relationship. 
These data are from a physics experiment.](07-model-slr_files/figure-html/fig-notGoodAtAllForALinearModel-1.png){#fig-notGoodAtAllForALinearModel width=90%}\n:::\n:::\n\n\n### Using linear regression to predict possum head lengths\n\nBrushtail possums are marsupials that live in Australia, and a photo of one is shown in @fig-brushtail-possum.\nResearchers captured 104 of these animals and took body measurements before releasing the animals back into the wild.\nWe consider two of these measurements: the total length of each possum, from head to tail, and the length of each possum's head.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The common brushtail possum of Australia. Photo by Greg Schecter,\n[flic.kr/p/9BAFbR](https://flic.kr/p/9BAFbR), CC BY 2.0 license.\n](images/brushtail-possum/brushtail-possum.jpg){#fig-brushtail-possum fig-alt='Photograph of a common brushtail possum of Australia.' width=50%}\n:::\n:::\n\n\n::: {.data data-latex=\"\"}\nThe [`possum`](http://openintrostat.github.io/openintro/reference/possum.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n@fig-scattHeadLTotalL shows a scatterplot for the head length (mm) and total length (cm) of the possums.\nEach point represents a single possum from the data.\nThe head and total length variables are associated: possums with an above average total length also tend to have above average head lengths.\nWhile the relationship is not perfectly linear, it could be helpful to partially explain the connection between these variables with a straight line.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A scatterplot showing head length against total length for 104 brushtail\npossums. A point representing a possum with head length 86.7 mm and total\nlength 84 cm is highlighted.](07-model-slr_files/figure-html/fig-scattHeadLTotalL-1.png){#fig-scattHeadLTotalL width=90%}\n:::\n:::\n\n\nWe want to describe the relationship between the head length and total length variables in the possum dataset using a line.\nIn this example, we will use the total length as the predictor variable, $x,$ to predict a possum's head length, $y.$ We could fit the linear relationship by eye, as in @fig-scattHeadLTotalLLine.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A reasonable linear model was fit to represent the relationship between\nhead length and total length.](07-model-slr_files/figure-html/fig-scattHeadLTotalLLine-1.png){#fig-scattHeadLTotalLLine width=90%}\n:::\n:::\n\n\nThe equation for this line is\n\n$$\n\\hat{y} = 41 + 0.59x\n$$\n\nA \"hat\" on $y$ is used to signify that this is an estimate.\nWe can use this line to discuss properties of possums.\nFor instance, the equation predicts a possum with a total length of 80 cm will have a head length of\n\n$$\n\\hat{y} = 41 + 0.59 \\times 80 = 88.2\n$$\n\nThe estimate may be viewed as an average: the equation predicts that possums with a total length of 80 cm will have an average head length of 88.2 mm.\nAbsent further information about an 80 cm possum, the prediction for head length that uses the average is a reasonable estimate.\n\nThere may be other variables that could help us predict the head length of a possum besides its length.\nPerhaps the relationship would be a little different for male possums than female possums, or perhaps it would differ for possums from one region of Australia versus another region.\nPlot A in @fig-scattHeadLTotalL-sex-age shows the relationship between total length and head length of brushtail possums, 
taking into consideration their sex.\nMale possums (represented by blue triangles) seem to be larger in terms of total length and head length than female possums (represented by red circles).\nPlot B in @fig-scattHeadLTotalL-sex-age shows the same relationship, taking into consideration their age.\nIt's harder to tell if age changes the relationship between total length and head length for these possums.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Relationship between total length and head length of brushtail possums,\ntaking into consideration their sex (Plot A) or age (Plot B).\n](07-model-slr_files/figure-html/fig-scattHeadLTotalL-sex-age-1.png){#fig-scattHeadLTotalL-sex-age width=90%}\n:::\n:::\n\n\nIn @sec-model-mlr, we'll learn about how we can include more than one predictor in our model.\nBefore we get there, we first need to better understand how to best build a linear model with one predictor.\n\n### Residuals {#resids}\n\n**Residuals** are the leftover variation in the data after accounting for the model fit:\n\n$$\n\\text{Data} = \\text{Fit} + \\text{Residual}\n$$\n\nEach observation will have a residual, and three of the residuals for the linear model we fit for the possum data are shown in @fig-scattHeadLTotalLLine-highlighted.\nIf an observation is above the regression line, then its residual, the vertical distance from the observation to the line, is positive.\nObservations below the line have negative residuals.\nOne goal in picking the right linear model is for these residuals to be as small as possible.\n\n\n\n\n\n@fig-scattHeadLTotalLLine-highlighted is almost a replica of @fig-scattHeadLTotalLLine, with three points from the data highlighted.\nThe observation marked by a red circle has a small, negative residual of about -1; the observation marked by a gray diamond has a large positive residual of about +7; and the observation marked by a pink triangle has a moderate negative residual of about -4.\nThe size of a residual is usually discussed in terms of its absolute value.\nFor example, the residual for the observation marked by a pink triangle is larger than that of the observation marked by a red circle because $|-4|$ is larger than $|-1|.$\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A reasonable linear model was fit to represent the relationship between\nhead length and total length, with three points highlighted.](07-model-slr_files/figure-html/fig-scattHeadLTotalLLine-highlighted-1.png){#fig-scattHeadLTotalLLine-highlighted width=90%}\n:::\n:::\n\n\n::: {.important data-latex=\"\"}\n**Residual: Difference between observed and expected.**\n\nThe residual of the $i^{th}$ observation $(x_i, y_i)$ is the difference of the observed outcome ($y_i$) and the outcome we would predict based on the model fit ($\\hat{y}_i$):\n\n$$\ne_i = y_i - \\hat{y}_i\n$$\n\nWe typically identify $\\hat{y}_i$ by plugging $x_i$ into the model.\n:::\n\n::: {.workedexample data-latex=\"\"}\nThe linear fit shown in @fig-scattHeadLTotalLLine-highlighted is given as $\\hat{y} = 41 + 0.59x.$ Based on this line, formally compute the residual of the observation $(76.0, 85.1).$ This observation is marked by a red circle in @fig-scattHeadLTotalLLine-highlighted.\nCheck it against the earlier visual estimate, -1.\n\n------------------------------------------------------------------------\n\nWe first compute the predicted value of the observation marked by a red circle based on the model:\n\n$$\n\\hat{y} = 41+0.59x = 41+0.59\\times 76.0 = 85.84\n$$\n\nNext we compute the difference of the 
actual head length and the predicted head length:\n\n$$\ne = y - \\hat{y} = 85.1 - 85.84 = -0.74\n$$\n\nThe model's error is $e = -0.74$ mm, which is very close to the visual estimate of -1 mm.\nThe negative residual indicates that the linear model overpredicted head length for this particular possum.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nIf a model underestimates an observation, will the residual be positive or negative?\nWhat about if it overestimates the observation?[^07-model-slr-1]\n:::\n\n[^07-model-slr-1]: If a model underestimates an observation, then the model estimate is below the actual.\n    The residual, which is the actual observation value minus the model estimate, must then be positive.\n    The opposite is true when the model overestimates the observation: the residual is negative.\n\n::: {.guidedpractice data-latex=\"\"}\nCompute the residuals for the observation marked by a gray diamond, $(85.0, 98.6),$ and the observation marked by a pink triangle, $(95.5, 94.0),$ in the figure using the linear relationship $\\hat{y} = 41 + 0.59x.$[^07-model-slr-2]\n:::\n\n[^07-model-slr-2]: Gray diamond: $\\hat{y} = 41+0.59x = 41+0.59\\times 85.0 = 91.15 \\rightarrow e = y - \\hat{y} = 98.6-91.15=7.45.$ This is close to the earlier estimate of 7.\n    Pink triangle: $\\hat{y} = 41+0.59x = 97.3 \\rightarrow e = -3.3.$ This is also close to the estimate of -4.\n\nResiduals are helpful in evaluating how well a linear model fits a dataset.\nWe often display them in a scatterplot such as the one shown in @fig-scattHeadLTotalLResidualPlot for the regression line in @fig-scattHeadLTotalLLine-highlighted.\nThe residuals are plotted with the predicted outcome value as the horizontal coordinate and the residual as the vertical coordinate.\nFor instance, the point $(85.0, 98.6)$ (marked by the gray diamond) had a predicted value of 91.15 mm and had a residual of 7.45 mm, so in the residual plot it is placed at $(91.15, 7.45).$ Creating a residual plot is sort of like tipping the scatterplot over so the regression line is horizontal, as indicated by the dashed line.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Residual plot for the model predicting head length from total length for\nbrushtail possums.](07-model-slr_files/figure-html/fig-scattHeadLTotalLResidualPlot-1.png){#fig-scattHeadLTotalLResidualPlot width=90%}\n:::\n:::\n
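If you would like to recreate a residual plot like this one yourself, the following is a minimal R sketch. It assumes the **openintro** package is installed and that the `possum` data contain the columns `total_l` and `head_l`; note that `lm()`'s estimates will differ slightly from the rounded line (41 + 0.59x) used in the text.

```r
# A minimal sketch of a residual plot, assuming the openintro package
library(openintro)

fit <- lm(head_l ~ total_l, data = possum)  # least squares fit

plot(predict(fit), resid(fit),
     xlab = "Predicted head length (mm)", ylab = "Residual")
abline(h = 0, lty = 2)  # dashed reference line at zero, as in the figure
```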
::: {.workedexample data-latex=\"\"}\nOne purpose of residual plots is to identify characteristics or patterns still apparent in data after fitting a model.\n@fig-sampleLinesAndResPlots shows three scatterplots with linear models in the first row and residual plots in the second row.\nCan you identify any patterns remaining in the residuals?\n\n------------------------------------------------------------------------\n\nIn the first dataset (first column), the residuals show no obvious patterns.\nThe residuals appear to be scattered randomly around the dashed line that represents 0.\n\nThe second dataset shows a pattern in the residuals.\nThere is some curvature in the scatterplot, which is more obvious in the residual plot.\nWe should not use a straight line to model these data.\nInstead, a more advanced technique should be used to model the curved relationship, such as the variable transformations discussed in @sec-transforming-data.\n\nThe last plot shows very little upward trend, and the residuals also show no obvious patterns.\nIt is reasonable to try to fit a linear model to the data.\nHowever, it is unclear whether there is evidence that the slope parameter is different from zero.\nThe point estimate of the slope parameter, labeled $b_1,$ is not zero, but we might wonder if this could just be due to chance.\nWe will address this sort of scenario in @sec-inf-model-slr.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Sample data with their best fitting lines (top row) and their corresponding\nresidual plots (bottom row).](07-model-slr_files/figure-html/fig-sampleLinesAndResPlots-1.png){#fig-sampleLinesAndResPlots width=90%}\n:::\n:::\n\n\n\\clearpage\n\n### Describing linear relationships with correlation\n\nWe've seen plots with strong linear relationships and others with very weak linear relationships.\nIt would be useful if we could quantify the strength of these linear relationships with a statistic.\n\n::: {.important data-latex=\"\"}\n**Correlation: strength of a linear relationship.**\n\n**Correlation**, which always takes values between -1 and 1, describes the strength and direction of the linear relationship between two variables.\nWe denote the correlation by $r.$\n\nThe correlation value has no units and will not be affected by a linear change in the units (e.g., going from inches to centimeters).\n:::\n\n\n\n\n\nWe can compute the correlation using a formula, just as we did with the sample mean and standard deviation.\nThe formula for correlation, however, is rather complex[^07-model-slr-3], and as with other statistics, we generally perform the calculations on a computer or calculator.\n\n[^07-model-slr-3]: Formally, we can compute the correlation for observations $(x_1, y_1),$ $(x_2, y_2),$ ..., $(x_n, y_n)$ using the formula\n\n$$\nr = \\frac{1}{n-1} \\sum_{i=1}^{n} \\frac{x_i-\\bar{x}}{s_x}\\frac{y_i-\\bar{y}}{s_y}\n$$\n\nwhere $\\bar{x},$ $\\bar{y},$ $s_x,$ and $s_y$ are the sample means and standard deviations for each variable.\n\n@fig-posNegCorPlots shows eight plots and their corresponding correlations.\nOnly when the relationship is perfectly linear is the correlation either -1 or 1.\nIf the relationship is strong and positive, the correlation will be near +1.\nIf it is strong and negative, it will be near -1.\nIf there is no apparent linear relationship between the variables, then the correlation will be near zero.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Sample scatterplots and their correlations. The first row shows variables\nwith a positive relationship, represented by the trend up and to the right.\nThe second row shows variables with a negative trend, where a large value\nin one variable is associated with a lower value in the other.\n](07-model-slr_files/figure-html/fig-posNegCorPlots-1.png){#fig-posNegCorPlots width=100%}\n:::\n:::\n
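The footnote formula can be checked directly against R's built-in `cor()`. The sketch below assumes the **openintro** package and reuses the possum measurements from earlier in the chapter; the two computed values should agree.

```r
# A minimal sketch of the correlation formula from the footnote,
# compared with the built-in cor()
library(openintro)

x <- possum$total_l
y <- possum$head_l

r_manual  <- sum((x - mean(x)) / sd(x) * (y - mean(y)) / sd(y)) / (length(x) - 1)
r_builtin <- cor(x, y)

c(manual = r_manual, builtin = r_builtin)  # the two values should match
```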
However, because the relationship is\nnot linear, the correlation is relatively weak.\n](07-model-slr_files/figure-html/fig-corForNonLinearPlots-1.png){#fig-corForNonLinearPlots width=100%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nNo straight line is a good fit for any of the datasets represented in Figure @fig-corForNonLinearPlots.\nTry drawing nonlinear curves on each plot.\nOnce you create a curve for each, describe what is important in your fit.[^07-model-slr-4]\n:::\n\n[^07-model-slr-4]: We'll leave it to you to draw the lines.\n In general, the lines you draw should be close to most points and reflect overall trends in the data.\n\n::: {.workedexample data-latex=\"\"}\n@fig-crop-yields-af displays the relationships between various crop yields in countries.\nIn the plots, each point represents a different country.\nThe x and y variables represent the proportion of total yield in the last 50 years which is due to that crop type.\n\nOrder the six scatterplots from strongest negative to strongest positive linear relationship.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Relationships between various crop yields in over 200 countries.\n](07-model-slr_files/figure-html/fig-crop-yields-af-1.png){#fig-crop-yields-af width=100%}\n:::\n:::\n\n\n------------------------------------------------------------------------\n\nThe order of most negative correlation to most positive correlation is:\n\n$$\nA \\rightarrow D \\rightarrow B \\rightarrow C \\rightarrow E \\rightarrow F\n$$\n\n- Plot A - bananas vs. potatoes: -0.62\n- Plot B - cassava vs. soybeans: -0.21\n- Plot C - cassava vs. maize: -0.26\n- Plot D - cocoa vs. bananas: 0.22\n- Plot E - peas vs. barley: 0.31\n- Plot F - wheat vs. barley: 0.21\n:::\n\nOne important aspect of the correlation is that it's *unitless*.\nThat is, unlike a measurement of the slope of a line (see the next section) which provides an increase in the y-coordinate for a one unit increase in the x-coordinate (in units of the x and y variable), there are no units associated with the correlation of x and y.\n@fig-bdims-units shows the relationship between weights and heights of 507 physically active individuals.\nIn Plot A, weight is measured in kilograms (kg) and height in centimeters (cm).\nIn Plot B, weight has been converted to pounds (lbs) and height to inches (in).\nThe correlation coefficient ($r = 0.72$) is also noted on both plots.\nWe can see that the shape of the relationship has not changed, and neither has the correlation coefficient.\nThe only visual change to the plot is the axis *labeling* of the points.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Two scatterplots, both displaying the relationship between weights and\nheights of 507 physically healthy adults. In Plot A, the units are\nkilograms and centimeters. 
## Least squares regression {#sec-least-squares-regression}\n\nFitting linear models by eye is open to criticism since it is based on an individual's preference.\nIn this section, we use *least squares regression* as a more rigorous approach to fitting a line to a scatterplot.\n\n### Gift aid for freshmen at Elmhurst College\n\nThis section considers family income and gift aid data from a random sample of fifty students in the freshman class of Elmhurst College in Illinois.\nGift aid is financial aid that does not need to be paid back, as opposed to a loan.\nA scatterplot of these data is shown in @fig-elmhurstScatterWLine along with a linear fit.\nThe line follows a negative trend in the data; students with higher family incomes tended to receive less gift aid from the university.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Gift aid and family income for a random sample of 50 freshman students from\nElmhurst College.](07-model-slr_files/figure-html/fig-elmhurstScatterWLine-1.png){#fig-elmhurstScatterWLine width=90%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nIs the correlation positive or negative in @fig-elmhurstScatterWLine?[^07-model-slr-5]\n:::\n\n[^07-model-slr-5]: Larger family incomes are associated with lower amounts of aid, so the correlation will be negative.\n    Using a computer, the correlation can be computed: -0.499.\n\n### An objective measure for finding the best line\n\nWe begin by thinking about what we mean by the \"best\" line.\nMathematically, we want a line that has small residuals.\nBut beyond the mathematical reasons, hopefully it also makes sense intuitively that whatever line we fit, the residuals should be small (i.e., the points should be close to the line).\nThe first option that may come to mind is to minimize the sum of the residual magnitudes:\n\n$$\n|e_1| + |e_2| + \\dots + |e_n|\n$$\n\nwhich we could accomplish with a computer program.\nThe resulting dashed line shown in @fig-elmhurstScatterW2Lines demonstrates this fit can be quite reasonable.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Gift aid and family income for a random sample of 50 freshman students from\nElmhurst College. The dashed line represents the line that minimizes the sum of\nthe absolute value of residuals, and the solid line represents the line that minimizes\nthe sum of squared residuals, i.e., the least squares line.](07-model-slr_files/figure-html/fig-elmhurstScatterW2Lines-1.png){#fig-elmhurstScatterW2Lines width=90%}\n:::\n:::\n\n\nHowever, a more common practice is to choose the line that minimizes the sum of the squared residuals:\n\n$$\ne_{1}^2 + e_{2}^2 + \\dots + e_{n}^2\n$$\n\nThe line that minimizes this least squares criterion is represented as the solid line in @fig-elmhurstScatterW2Lines and is commonly called the **least squares line**.\nThe following are four possible reasons to choose the least squares option instead of trying to minimize the sum of residual magnitudes without any squaring:\n\n\n\n\n\n1. It is the most commonly used method.\n2. Computing the least squares line is widely supported in statistical software.\n3. In many applications, a residual twice as large as another residual is more than twice as bad. For example, being off by 4 is usually more than twice as bad as being off by 2. Squaring the residuals accounts for this discrepancy.\n4. The analyses that link the model to inference about a population are most straightforward when the line is fit through least squares.\n\nThe first two reasons are largely for tradition and convenience; the third and fourth reasons explain why the least squares criterion is typically most helpful when working with real data.[^07-model-slr-6]\n\n[^07-model-slr-6]: There are applications where the sum of residual magnitudes may be more useful, and there are plenty of other criteria we might consider.\n    However, this book only applies the least squares criterion.
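To make the contrast between the two criteria concrete, here is a rough R sketch that fits both lines to the Elmhurst data with a general-purpose optimizer. This is an illustration under stated assumptions, not how software normally fits these models: it assumes the **openintro** package, and `optim()`'s numerical results will only approximate the two lines in the figure.

```r
# A rough sketch: minimize each criterion from the text with optim()
library(openintro)

sum_abs <- function(b) sum(abs(elmhurst$gift_aid - b[1] - b[2] * elmhurst$family_income))
sum_sq  <- function(b) sum((elmhurst$gift_aid - b[1] - b[2] * elmhurst$family_income)^2)

optim(c(0, 0), sum_abs)$par  # minimizes sum of |residuals| (the dashed line)
optim(c(0, 0), sum_sq)$par   # minimizes sum of squared residuals (the solid line)
```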
### Finding and interpreting the least squares line\n\nFor the Elmhurst data, we could write the equation of the least squares regression line as\n\n$$\n\\widehat{\\texttt{aid}} = \\beta_0 + \\beta_{1}\\times \\texttt{family_income}\n$$\n\nHere the equation is set up to predict gift aid based on a student's family income, which would be useful to students considering Elmhurst.\nThese two values, $\\beta_0$ and $\\beta_1,$ are the parameters of the regression line.\n\nThe parameters are estimated using the observed data.\nIn practice, this estimation is done using a computer in the same way that other estimates, like a sample mean, can be estimated using a computer or calculator.\n\nThe dataset where these data are stored is called `elmhurst`.\nThe first 5 rows of this dataset are given in @tbl-elmhurst-data.\n\n\n::: {#tbl-elmhurst-data .cell tbl-cap='First five rows of the `elmhurst` dataset.'}\n::: {.cell-output-display}
| family_income | gift_aid | price_paid |
|--------------:|---------:|-----------:|
|         92.92 |     21.7 |      14.28 |
|          0.25 |     27.5 |       8.53 |
|         53.09 |     27.8 |      14.25 |
|         50.20 |     27.2 |       8.78 |
|        137.61 |     18.0 |      24.00 |
:::\n:::\n\n\nWe can see that family income is recorded in a variable called `family_income` and gift aid from the university is recorded in a variable called `gift_aid`.\nFor now, we won't worry about the `price_paid` variable.\nWe should also note that these data are from the 2011-2012 academic year, and all monetary amounts are given in \\$1,000s, i.e., the family income of the first student in the data shown in @tbl-elmhurst-data is \\$92,920 and they received gift aid of \\$21,700.\n(The data source states that all numbers have been rounded to the nearest whole dollar.)\n\nStatistical software is usually used to compute the least squares line, and the typical output generated when fitting regression models looks like the one shown in @tbl-rOutputForIncomeAidLSRLine.\nFor now we will focus on the first column of the output, which lists ${b}_0$ and ${b}_1.$ In @sec-inf-model-slr we will dive deeper into the remaining columns, which tell us how accurately and precisely the intercept and slope calculated from this sample of 50 students estimate the intercept and slope for the population of *all* students.\n\n\n::: {.cell}\n\n:::\n\n::: {#tbl-rOutputForIncomeAidLSRLine .cell tbl-cap='Summary of least squares fit for the Elmhurst data.'}\n::: {.cell-output-display}
| term          | estimate | std.error | statistic | p.value |
|:--------------|---------:|----------:|----------:|--------:|
| (Intercept)   |    24.32 |      1.29 |     18.83 | <0.0001 |
| family_income |    -0.04 |      0.01 |     -3.98 |   2e-04 |
:::\n:::\n\n\nThe model output tells us that the intercept is approximately 24.319 and the slope on `family_income` is approximately -0.043.
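A table like this one can be produced with a short R sketch, assuming the **openintro** and **broom** packages are installed:

```r
# A minimal sketch of fitting the least squares line and summarizing it
library(openintro)
library(broom)

fit <- lm(gift_aid ~ family_income, data = elmhurst)  # least squares fit
tidy(fit)  # columns: term, estimate, std.error, statistic, p.value
```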
But what do these values mean?\nInterpreting parameters in a regression model is often one of the most important steps in the analysis.\n\n::: {.workedexample data-latex=\"\"}\nThe intercept and slope estimates for the Elmhurst data are $b_0$ = 24.319 and $b_1$ = -0.043.\nWhat do these numbers really mean?\n\n------------------------------------------------------------------------\n\nInterpreting the slope parameter is helpful in almost any application.\nFor each additional \\$1,000 of family income, we would expect a student to receive a net difference of 1,000 $\\times$ (-0.0431) = -\\$43.10 in aid on average, i.e., \\$43.10 *less*.\nNote that a higher family income corresponds to less aid because the coefficient of family income is negative in the model.\nWe must be cautious in this interpretation: while there is a real association, we cannot interpret a causal connection between the variables because these data are observational.\nThat is, increasing a particular student's family income may not cause the student's aid to drop.\n(Although it would be reasonable to contact the college and ask if the relationship is causal, i.e., if Elmhurst College's aid decisions are partially based on students' family income.)\n\nThe estimated intercept $b_0$ = 24.319 describes the average aid if a student's family had no income, \\$24,319.\nThe meaning of the intercept is relevant to this application since the family income for some students at Elmhurst is \\$0.\nIn other applications, the intercept may have little or no practical value if there are no observations where $x$ is near zero.\n:::\n\n::: {.important data-latex=\"\"}\n**Interpreting parameters estimated by least squares.**\n\nThe slope describes the estimated difference in the predicted average outcome of $y$ if the predictor variable $x$ happened to be one unit larger.\nThe intercept describes the average outcome of $y$ if $x = 0$ *and* the linear model is valid all the way to $x = 0$ (values of $x = 0$ are not observed or relevant in many applications).\n:::\n\nIf you would like to learn more about using R to fit linear models, see @sec-model-tutorials for the interactive R tutorials.\nAn alternative way of calculating the intercept and slope of a least squares line is by hand, using formulas.\nWhile manual calculations are not commonly used by practicing statisticians and data scientists, it is useful to work through the calculations the first time you're learning about the least squares line and modeling in general.\nCalculating the values by hand leverages two properties of the least squares line:\n\n1. The slope of the least squares line can be estimated by\n\n$$\nb_1 = \\frac{s_y}{s_x} r\n$$\n\nwhere $r$ is the correlation between the two variables, and $s_x$ and $s_y$ are the sample standard deviations of the predictor and outcome, respectively.\n\n2. If $\\bar{x}$ is the sample mean of the predictor variable and $\\bar{y}$ is the sample mean of the outcome variable, then the point $(\\bar{x}, \\bar{y})$ falls on the least squares line.\n\n@tbl-summaryStatsElmhurstRegr shows the sample means for the family income and gift aid as \\$101,780 and \\$19,940, respectively.\nWe could plot the point $(102, 19.9)$ on @fig-elmhurstScatterWLine to verify it falls on the least squares line (the solid line).\n\n\n::: {#tbl-summaryStatsElmhurstRegr .cell tbl-cap='Summary statistics for family income and gift aid.'}\n::: {.cell-output-display}
| Family income, mean | Family income, sd | Gift aid, mean | Gift aid, sd |      r |
|--------------------:|------------------:|---------------:|-------------:|-------:|
|                 102 |              63.2 |           19.9 |         5.46 | -0.499 |
:::\n:::\n\n\nNext, we formally find the point estimates $b_0$ and $b_1$ of the parameters $\\beta_0$ and $\\beta_1.$\n\n::: {.workedexample data-latex=\"\"}\nUsing the summary statistics in @tbl-summaryStatsElmhurstRegr, compute the slope for the regression line of gift aid against family income.\n\n------------------------------------------------------------------------\n\nCompute the slope using the summary statistics from @tbl-summaryStatsElmhurstRegr:\n\n$$\nb_1 = \\frac{s_y}{s_x} r = \\frac{5.46}{63.2}(-0.499) = -0.0431\n$$\n:::\n\nYou might recall the form of a line from math class, which we can use to find the model fit, including the estimate of $b_0.$ Given the slope of a line and a point on the line, $(x_0, y_0),$ the equation for the line can be written as\n\n$$\ny - y_0 = slope\\times (x - x_0)\n$$\n\n::: {.important data-latex=\"\"}\n**Identifying the least squares line from summary statistics.**\n\nTo identify the least squares line from summary statistics:\n\n- Estimate the slope parameter, $b_1 = (s_y / s_x) r.$\n- Note that the point $(\\bar{x}, \\bar{y})$ is on the least squares line; use $x_0 = \\bar{x}$ and $y_0 = \\bar{y}$ with the point-slope equation: $y - \\bar{y} = b_1 (x - \\bar{x}).$\n- Simplifying the equation, we get $y = \\bar{y} - b_1 \\bar{x} + b_1 x,$ which reveals that $b_0 = \\bar{y} - b_1 \\bar{x}.$\n:::\n\n::: {.workedexample data-latex=\"\"}\nUsing the point (102, 19.9) from the sample means and the slope estimate $b_1 = -0.0431,$ find the least-squares line for predicting aid based on family income.\n\n------------------------------------------------------------------------\n\nApply the point-slope equation using $(102, 19.9)$ and the slope $b_1 = -0.0431$:\n\n$$\n\\begin{aligned}\ny - y_0 &= b_1 (x - x_0) \\\\\ny - 19.9 &= -0.0431 (x - 102)\n\\end{aligned}\n$$\n\nExpanding the right side and then adding 19.9 to each side, the equation simplifies:\n\n$$\n\\widehat{\\texttt{aid}} = 24.3 - 0.0431 \\times \\texttt{family_income}\n$$\n\nHere we have replaced $y$ with $\\widehat{\\texttt{aid}}$ and $x$ with $\\texttt{family_income}$ to put the equation in context.\nThe final least squares equation should always include a \"hat\" on the variable being predicted, whether it is a generic $``y\"$ or a named variable like $``aid\"$.\n:::
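The same calculation is easy to script. A minimal sketch using only the summary statistics from the table above (no `lm()` needed):

```r
# Recover the least squares line from summary statistics alone
x_bar <- 102;  s_x <- 63.2    # family income, in $1000s
y_bar <- 19.9; s_y <- 5.46    # gift aid, in $1000s
r     <- -0.499               # correlation between the two variables

b1 <- r * s_y / s_x           # slope
b0 <- y_bar - b1 * x_bar      # intercept, since (x_bar, y_bar) is on the line
c(b0 = b0, b1 = b1)           # approximately 24.3 and -0.0431
```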
::: {.workedexample data-latex=\"\"}\nSuppose a high school senior is considering Elmhurst College.\nCan they simply use the linear equation that we have estimated to calculate their financial aid from the university?\n\n------------------------------------------------------------------------\n\nThey may use it as an estimate, though some qualifiers on this approach are important.\nFirst, the data all come from one freshman class, and the way aid is determined by the university may change from year to year.\nSecond, the equation will provide an imperfect estimate.\nWhile the linear equation is good at capturing the trend in the data, no individual student's aid will be perfectly predicted (as can be seen from the individual data points in the cloud around the line).\n:::\n\n### Extrapolation is treacherous\n\n> *When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6 it was 10 degrees.* *Today it hit almost 80.* *At this rate, by August it will be 220 degrees.* *So clearly folks the climate debate rages on.*[^07-model-slr-7]\n>\n> Stephen Colbert, April 6th, 2010\n\n[^07-model-slr-7]: \n\nLinear models can be used to approximate the relationship between two variables.\nHowever, like any model, they have real limitations.\nLinear regression is simply a modeling framework.\nThe truth is almost always much more complex than a simple line.\nFor example, we do not know how the data outside of our limited window will behave.\n\n::: {.workedexample data-latex=\"\"}\nUse the model $\\widehat{\\texttt{aid}} = 24.3 - 0.0431 \\times \\texttt{family_income}$ to estimate the aid of another freshman student whose family had an income of \\$1 million.\n\n------------------------------------------------------------------------\n\nWe want to calculate the aid for a family with \\$1 million income.\nNote that in our model this will be represented as 1,000 since the data are in \\$1,000s.\n\n$$\n24.3 - 0.0431 \\times 1000 = -18.8\n$$\n\nThe model predicts this student will have -\\$18,800 in aid (!).\nHowever, Elmhurst College does not offer *negative aid* where they select some students to pay extra on top of tuition to attend.\n:::\n\nApplying a model estimate to values outside of the realm of the original data is called **extrapolation**.\nGenerally, a linear model is only an approximation of the real relationship between two variables.\nIf we extrapolate, we are making an unreliable bet that the approximate linear relationship will be valid in places where it has not been analyzed.\n\n\n\n\n
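Note that R itself will not warn you about extrapolation. A minimal sketch, assuming the **openintro** package: `predict()` happily returns an estimate far outside the range of the observed incomes.

```r
# predict() does not flag extrapolation; the analyst must
library(openintro)

fit <- lm(gift_aid ~ family_income, data = elmhurst)
predict(fit, newdata = data.frame(family_income = 1000))  # $1M income: about -18.8, i.e., -$18,800
```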
### Describing the strength of a fit {#sec-r-squared}\n\nWe evaluated the strength of the linear relationship between two variables earlier using the correlation, $r.$ However, it is more common to explain the strength of a linear fit using $R^2,$ called **R-squared**.\nIf provided with a linear model, we might like to describe how closely the data cluster around the linear fit.\n\n\n\n\n\nThe $R^2$ of a linear model describes the amount of variation in the outcome variable that is explained by the least squares line.\nFor example, consider the Elmhurst data, shown in @fig-elmhurstScatterWLine.\nThe variance of the outcome variable, aid received, is $s_{aid}^2 \\approx 29.8$ million (calculated from the data, some of which is shown in @tbl-elmhurst-data).\nHowever, if we apply our least squares line, then this model reduces our uncertainty in predicting aid using a student's family income.\nThe variability in the residuals describes how much variation remains after using the model: $s_{_{RES}}^2 \\approx 22.4$ million.\nIn short, there was a reduction of\n\n$$\n\\frac{s_{aid}^2 - s_{_{RES}}^2}{s_{aid}^2}\n  = \\frac{29.8 - 22.4}{29.8}\n  = \\frac{7.4}{29.8}\n  \\approx 0.25,\n$$\n\nor about 25%, in the variation of the outcome variable when using family income to predict aid with a linear model.\nIt turns out that $R^2$ corresponds exactly to the squared value of the correlation:\n\n$$\nr = -0.499 \\rightarrow R^2 = 0.25\n$$\n\n::: {.guidedpractice data-latex=\"\"}\nIf a linear model has a very strong negative relationship with a correlation of -0.97, how much of the variation in the outcome is explained by the predictor?[^07-model-slr-8]\n:::\n\n[^07-model-slr-8]: About $R^2 = (-0.97)^2 = 0.94$ or 94% of the variation in the outcome variable is explained by the linear model.\n\n$R^2$ is also called the **coefficient of determination**.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Coefficient of determination: proportion of variability in the outcome variable explained by the model.**\n\nSince $r$ is always between -1 and 1, $R^2$ will always be between 0 and 1.\nThis statistic is called the **coefficient of determination**, and it measures the proportion of variation in the outcome variable, $y,$ that can be explained by the linear model with predictor $x.$\n:::\n\nMore generally, $R^2$ can be calculated from two quantities: a measure of the variability left around the line and a measure of the total variability.\n\n::: {.important data-latex=\"\"}\n**Sums of squares to measure variability in** $y.$\n\nWe can measure the variability in the $y$ values by how far they tend to fall from their mean, $\\bar{y}.$ We define this value as the **total sum of squares**, calculated using the formula below, where $y_i$ represents each $y$ value in the sample, and $\\bar{y}$ represents the mean of the $y$ values in the sample.\n\n$$\nSST = (y_1 - \\bar{y})^2 + (y_2 - \\bar{y})^2 + \\cdots + (y_n - \\bar{y})^2.\n$$\n\nLeft-over variability in the $y$ values if we know $x$ can be measured by the **sum of squared errors**, or sum of squared residuals, calculated using the formula below, where $\\hat{y}_i$ represents the predicted value of $y_i$ based on the least squares regression.[^07-model-slr-9]\n\n$$\n\\begin{aligned}\nSSE &= (y_1 - \\hat{y}_1)^2 + (y_2 - \\hat{y}_2)^2 + \\cdots + (y_n - \\hat{y}_n)^2\\\\\n&= e_{1}^2 + e_{2}^2 + \\dots + e_{n}^2\n\\end{aligned}\n$$\n\nThe coefficient of determination can then be calculated as\n\n$$\nR^2 = \\frac{SST - SSE}{SST} = 1 - \\frac{SSE}{SST}\n$$\n:::\n\n[^07-model-slr-9]: The difference $SST - SSE$ is called the **regression sum of squares**, $SSR,$ and can also be calculated as $SSR = (\\hat{y}_1 - \\bar{y})^2 + (\\hat{y}_2 - \\bar{y})^2 + \\cdots + (\\hat{y}_n - \\bar{y})^2.$ $SSR$ represents the variation in $y$ that was accounted for in our model.\n\n\n\n\n\n::: {.workedexample data-latex=\"\"}\nAmong 50 students in the `elmhurst` dataset, the total variability in gift aid is $SST = 1461$.[^07-model-slr-10]\nThe sum of squared residuals is $SSE = 1098.$ Find $R^2.$\n\n------------------------------------------------------------------------\n\nSince we know $SSE$ and $SST,$ we can calculate $R^2$ as\n\n$$\nR^2 = 1 - \\frac{SSE}{SST} = 1 - \\frac{1098}{1461} = 0.25,\n$$\n\nthe same value we found when we squared the correlation: $R^2 = (-0.499)^2 = 0.25.$\n:::\n\n[^07-model-slr-10]: $SST$ can be calculated by finding the sample variance of the outcome variable, $s^2,$ and multiplying by $n-1.$
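The sums-of-squares calculation can be checked in a few lines of R. A minimal sketch, assuming the **openintro** package; both quantities should come out around 0.25, matching the worked example.

```r
# A sketch of the R^2 calculation: 1 - SSE/SST versus the squared correlation
library(openintro)

fit <- lm(gift_aid ~ family_income, data = elmhurst)

sst <- sum((elmhurst$gift_aid - mean(elmhurst$gift_aid))^2)  # total sum of squares
sse <- sum(resid(fit)^2)                                     # sum of squared residuals

c(R2 = 1 - sse / sst,
  r2 = cor(elmhurst$family_income, elmhurst$gift_aid)^2)     # both about 0.25
```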
### Categorical predictors with two levels {#sec-categorical-predictor-two-levels}\n\nCategorical variables are also useful in predicting outcomes.\nHere we consider a categorical predictor with two levels (recall that a *level* is the same as a *category*).\nWe'll consider eBay auctions for a video game, *Mario Kart* for the Nintendo Wii, where both the total price of the auction and the condition of the game were recorded.\nHere we want to predict total price based on game condition, which takes values `used` and `new`.\n\n::: {.data data-latex=\"\"}\nThe [`mariokart`](http://openintrostat.github.io/openintro/reference/mariokart.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n\n::: {.cell}\n\n:::\n\n\nA plot of the auction data is shown in @fig-marioKartNewUsed.\nNote that the original dataset contains some Mario Kart games being sold at prices above \\$100, but for this analysis we have limited our focus to the 141 Mario Kart games that were sold below \\$100.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Total auction prices for the video game Mario Kart, divided into used\n($x = 0$) and new ($x = 1$) condition games. The least squares regression\nline is also shown.](07-model-slr_files/figure-html/fig-marioKartNewUsed-1.png){#fig-marioKartNewUsed width=90%}\n:::\n:::\n\n\nTo incorporate the game condition variable into a regression equation, we must convert the categories into a numerical form.\nWe will do so using an **indicator variable** called `condnew`, which takes value 1 when the game is new and 0 when the game is used.\nUsing this indicator variable, the linear model may be written as\n\n$$\n\\widehat{\\texttt{price}} = b_0 + b_1 \\times \\texttt{condnew}\n$$\n\nThe parameter estimates are given in @tbl-marioKartNewUsedRegrSummary.\n\n\n\n\n::: {#tbl-marioKartNewUsedRegrSummary .cell tbl-cap='Least squares regression summary for the final auction price against the\ncondition of the game.'}\n::: {.cell-output-display}
| term        | estimate | std.error | statistic | p.value |
|:------------|---------:|----------:|----------:|--------:|
| (Intercept) |     42.9 |      0.81 |     52.67 | <0.0001 |
| condnew     |     10.9 |      1.26 |      8.66 | <0.0001 |
:::\n:::\n\n\nUsing values from @tbl-marioKartNewUsedRegrSummary, the model equation can be summarized as\n\n$$\n\\widehat{\\texttt{price}} = 42.87 + 10.9 \\times \\texttt{condnew}\n$$\n\n::: {.workedexample data-latex=\"\"}\nInterpret the two parameters estimated in the model for the price of Mario Kart in eBay auctions.\n\n------------------------------------------------------------------------\n\nThe intercept is the estimated price when `condnew` has a value 0, i.e., when the game is in used condition.\nThat is, the average selling price of a used version of the game is \\$42.87.\nThe slope indicates that, on average, new games sell for about \\$10.90 more than used games.\n:::\n\n::: {.important data-latex=\"\"}\n**Interpreting model estimates for categorical predictors.**\n\nThe estimated intercept is the value of the outcome variable for the first category (i.e., the category corresponding to an indicator value of 0).\nThe estimated slope is the average change in the outcome variable between the two categories.\n:::
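A minimal sketch of this model, assuming the **openintro** package's `mariokart` data (columns `total_pr` and `cond`): we build the `condnew` indicator by hand, exactly as described above, and refit.

```r
# Build the indicator variable by hand and refit the model from the table
library(openintro)

mk <- subset(mariokart, total_pr < 100)        # the 141 auctions under $100
mk$condnew <- ifelse(mk$cond == "new", 1, 0)   # indicator: 1 = new, 0 = used

fit <- lm(total_pr ~ condnew, data = mk)
coef(fit)  # intercept ~ 42.87 (average used price); slope ~ 10.90 (new minus used)
```

Constructing the indicator explicitly mirrors the text; R's `lm()` would otherwise pick its own reference level for the factor `cond`.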
Note that, fundamentally, the intercept and slope interpretations do not change when modeling categorical variables with two levels.\nHowever, when the predictor variable is binary, the coefficient estimates ($b_0$ and $b_1$) are directly interpretable with respect to the dataset at hand.\n\nWe'll elaborate further on modeling categorical predictors in @sec-model-mlr, where we examine the influence of many predictor variables simultaneously using multiple regression.\n\n## Outliers in linear regression {#outliers-in-regression}\n\nIn this section, we identify criteria for determining which outliers are important and influential.\nOutliers in regression are observations that fall far from the cloud of points.\nThese points are especially important because they can have a strong influence on the least squares line.\n\n::: {.workedexample data-latex=\"\"}\nThere are three plots shown in @fig-outlier-plots-1 along with the corresponding least squares line and residual plots.\nFor each scatterplot and residual plot pair, identify the outliers and note how they influence the least squares line.\nRecall that an outlier is any point that does not appear to belong with the vast majority of the other points.\n\n------------------------------------------------------------------------\n\n- A: There is one outlier far from the other points, though it only appears to slightly influence the line.\n\n- B: There is one outlier on the right, though it is quite close to the least squares line, which suggests it wasn't very influential.\n\n- C: There is one point far away from the cloud, and this outlier appears to pull the least squares line up on the right; examine how the line around the primary cloud does not appear to fit very well.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Three plots, each with a least squares line and corresponding residual plot.\nEach dataset has at least one outlier.\n](07-model-slr_files/figure-html/fig-outlier-plots-1-1.png){#fig-outlier-plots-1 width=100%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nThere are three plots shown in @fig-outlier-plots-2 along with the least squares line and residual plots.\nAs you did in the previous exercise, for each scatterplot and residual plot pair, identify the outliers and note how they influence the least squares line.\nRecall that an outlier is any point that does not appear to belong with the vast majority of the other points.\n\n------------------------------------------------------------------------\n\n- D: There is a primary cloud and then a small secondary cloud of four outliers.\n    The secondary cloud appears to be influencing the line somewhat strongly, making the least squares line fit poorly almost everywhere.\n    There might be an interesting explanation for the dual clouds, which is something that could be investigated.\n\n- E: There is no obvious trend in the main cloud of points, and the outlier on the right appears to largely (and problematically) control the slope of the least squares line.\n\n- F: There is one outlier far from the cloud.\n    However, it falls quite close to the least squares line and does not appear to be very influential.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Three plots, each with a least squares line and residual plot.\nAll datasets have at least one outlier.\n](07-model-slr_files/figure-html/fig-outlier-plots-2-1.png){#fig-outlier-plots-2 width=100%}\n:::\n:::\n\n\nExamine the residual plots in @fig-outlier-plots-1 and @fig-outlier-plots-2.\nIn Plots C, D, and E, you will probably find that there are a few observations which are both away from the remaining points along the x-axis and not in the trajectory of the trend in the rest of the data.\nIn these cases, the outliers influenced the slope of the least squares lines.\nIn Plot E, the bulk of the data show no clear trend, but if we fit a line to these data, we impose a trend where there isn't really one.\n\n::: {.important data-latex=\"\"}\n**Leverage.**\n\nPoints that fall horizontally away from the center of the cloud tend to pull harder on the line, so we call them points with **high leverage** or **leverage points**.\n:::\n\nPoints that fall horizontally far from the center of the cloud are points of high leverage; these points can strongly influence the slope of the least squares line.\nIf one of these high leverage points does appear to actually exert its influence on the slope of the line -- as in Plots C, D, and E of @fig-outlier-plots-1 and @fig-outlier-plots-2 -- then we call it an **influential point**.\nUsually we can say a point is influential if, had we fitted the line without it, the influential point would have been unusually far from the least squares line.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Types of outliers.**\n\nA point (or a group of points) that stands out from the rest of the data is called an outlier.\nOutliers that fall horizontally away from the center of the cloud of points are called leverage points.\nOutliers that influence the slope of the line are called influential points.\n:::\n\nIt is tempting to remove outliers.\nDon't do this without a very good reason.\nModels that ignore exceptional (and interesting) cases often perform poorly.\nFor instance, if a financial firm ignored the largest market swings -- the \"outliers\" -- it would soon go bankrupt by making poorly thought-out investments.
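One rough way to check the definition of an influential point in practice is to refit the model without the suspect observation and compare the estimates. The sketch below assumes the **openintro** package; the dropped row index here is purely hypothetical, a stand-in for an observation you have already flagged.

```r
# A rough sketch of checking influence: refit without a flagged point
library(openintro)

fit_all  <- lm(gift_aid ~ family_income, data = elmhurst)
fit_drop <- lm(gift_aid ~ family_income, data = elmhurst[-1, ])  # drop one (hypothetical) point

coef(fit_all)
coef(fit_drop)  # a large change in the slope suggests the point is influential
```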
\\clearpage\n\n## Chapter review {#chp7-review}\n\n### Summary\n\nThroughout this chapter, the nuances of the linear model have been described.\nYou have learned how to create a linear model with explanatory variables that are numerical (e.g., total possum length) and those that are categorical (e.g., whether a video game was new).\nThe residuals in a linear model are an important metric used to understand how well a model fits; high leverage points, influential points, and other types of outliers can impact the fit of a model.\nCorrelation is a measure of the strength and direction of the linear relationship between two variables, without specifying which variable is explanatory and which is the outcome.\nFuture chapters will focus on generalizing the linear model from the sample of data to claims about the population of interest.\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}
|                              |                    |                       |
|:-----------------------------|:-------------------|:----------------------|
| coefficient of determination | influential point  | predictor             |
| correlation                  | least squares line | R-squared             |
| extrapolation                | leverage point     | residuals             |
| high leverage                | outcome            | sum of squared error  |
| indicator variable           | outlier            | total sum of squares  |
:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp7-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-07].\n\n::: {.exercises data-latex=\"\"}\n1. **Visualize the residuals.** \nThe scatterplots shown below each have a superimposed regression line. If we were to construct a residual plot (residuals versus $x$) for each, describe in words what those plots would look like.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-38-1.png){width=90%}\n    :::\n    :::\n    \n1. **Trends in the residuals.** \nShown below are two plots of residuals remaining after fitting a linear model to two different sets of data. \nFor each plot, describe important features and determine if a linear model would be appropriate for these data. \nExplain your reasoning.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-39-1.png){width=90%}\n    :::\n    :::\n    \n1. **Identify relationships, I.** \nFor each of the six plots, identify the strength of the relationship (e.g., weak, moderate, or strong) in the data and whether fitting a linear model would be reasonable.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-40-1.png){width=90%}\n    :::\n    :::\n\n1. **Identify relationships, II.** \nFor each of the six plots, identify the strength of the relationship (e.g., weak, moderate, or strong) in the data and whether fitting a linear model would be reasonable.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-41-1.png){width=90%}\n    :::\n    :::\n\n1. **Midterms and final.** \nThe two scatterplots below show the relationship between the overall course average and two midterm exams (Exam 1 and Exam 2) recorded for 233 students during several years for a statistics course at a university.^[The [`exam_grades`](http://openintrostat.github.io/openintro/reference/exam_grades.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-42-1.png){width=90%}\n    :::\n    :::\n\n    a. Based on these graphs, which of the two exams has the strongest correlation with the course grade? Explain.\n\n    b. Can you think of a reason why the correlation between the exam you chose in part (a) and the course grade is higher?\n    \n    \\clearpage\n\n1. **Partners' ages and heights.** \nThe Great Britain Office of Population Census and Surveys collected data on a random sample of 170 married couples in Britain, recording the age (in years) and heights (converted here to inches) of the partners. The scatterplot on the left shows the heights of the partners plotted against each other and the plot on the right shows the ages of the partners plotted against each other.^[The [`husbands_wives`](http://openintrostat.github.io/openintro/reference/husbands_wives.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-43-1.png){width=90%}\n    :::\n    :::\n\n    a. Describe the relationship between partners' ages.\n\n    b. Describe the relationship between partners' heights.\n\n    c. Which plot shows a stronger correlation? Explain your reasoning.\n\n    d. Data on heights were originally collected in centimeters, and then converted to inches. 
Does this conversion affect the correlation between partners' heights?\n\n1. **Match the correlation, I.** \nMatch each correlation to the corresponding scatterplot.^[The [`corr_match`](http://openintrostat.github.io/openintro/reference/corr_match.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-44-1.png){width=90%}\n    :::\n    :::\n\n    a. $r = -0.7$\n\n    b. $r = 0.45$\n\n    c. $r = 0.06$\n\n    d. $r = 0.92$\n\n1. **Match the correlation, II.** \nMatch each correlation to the corresponding scatterplot.^[The [`corr_match`](http://openintrostat.github.io/openintro/reference/corr_match.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-45-1.png){width=90%}\n    :::\n    :::\n\n    a. $r = 0.49$\n\n    b. $r = -0.48$\n\n    c. $r = -0.03$\n\n    d. $r = -0.85$\n\n1. **Body measurements, correlation.** \nResearchers studying anthropometry collected body and skeletal diameter measurements, as well as age, weight, height and sex for 507 physically active individuals. \nThe scatterplot below shows the relationship between height and shoulder girth (circumference of shoulders measured over deltoid muscles), both measured in centimeters.^[The [`bdims`](http://openintrostat.github.io/openintro/reference/bdims.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Heinz:2003]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-46-1.png){width=90%}\n    :::\n    :::\n\n    a. Describe the relationship between shoulder girth and height.\n\n    b. How would the relationship change if shoulder girth was measured in inches while the units of height remained in centimeters?\n\n1. **Compare correlations.** \nEduardo and Rosie are both collecting data on number of rainy days in a year and the total rainfall for the year. Eduardo records rainfall in inches and Rosie in centimeters. How will their correlation coefficients compare?\n\n1. **The Coast Starlight, correlation.** \nThe Coast Starlight Amtrak train runs from Seattle to Los Angeles. \nThe scatterplot below displays the distance between each stop (in miles) and the amount of time it takes to travel from one stop to another (in minutes).^[The [`coast_starlight`](http://openintrostat.github.io/openintro/reference/coast_starlight.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-47-1.png){width=70%}\n    :::\n    :::\n\n    a. Describe the relationship between distance and travel time.\n\n    b. How would the relationship change if travel time was instead measured in hours, and distance was instead measured in kilometers?\n\n    c. The correlation between travel time (in minutes) and distance (in miles) is $r = 0.636$. What is the correlation between travel time (in hours) and distance (in kilometers)?\n\n1. **Crawling babies, correlation.** \nA study conducted at the University of Denver investigated whether babies take longer to learn to crawl in cold months, when they are often bundled in clothes that restrict their movement, than in warmer months. 
\nInfants born during the study year were split into twelve groups, one for each birth month. \nWe consider the average crawling age of babies in each group against the average temperature when the babies are six months old (that's when babies often begin trying to crawl). \nTemperature is measured in degrees Fahrenheit (F) and age is measured in weeks.^[The [`babies_crawl`](http://openintrostat.github.io/openintro/reference/babies_crawl.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Benson:1993]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-48-1.png){width=70%}\n    :::\n    :::\n\n    a. Describe the relationship between temperature and crawling age.\n\n    b. How would the relationship change if temperature was measured in degrees Celsius (C) and age was measured in months?\n\n    c. The correlation between temperature in F and age in weeks was $r=-0.70$. If we converted the temperature to C and age to months, what would the correlation be?\n\n1. **Partners' ages.** \nWhat would be the correlation between the ages of partners if people always dated others who are \n\n    a. 3 years younger than themselves?\n\n    b. 2 years older than themselves?\n\n    c. half as old as themselves?\n\n1. **Graduate degrees and salaries.** \nWhat would be the correlation between the annual salaries of people with and without a graduate degree at a company if for a certain type of position someone with a graduate degree always made \n\n    a. \\$5,000 more than those without a graduate degree?\n\n    b. 25% more than those without a graduate degree?\n\n    c. 15% less than those without a graduate degree?\n\n1. **Units of regression.** \nConsider a regression predicting the number of calories (cal) from width (cm) for a sample of square-shaped chocolate brownies. What are the units of the correlation coefficient, the intercept, and the slope?\n\n1. **Which is higher?**\nDetermine if (I) or (II) is higher or if they are equal: *\"For a regression line, the uncertainty associated with the slope estimate, $b_1$, is higher when (I) there is a lot of scatter around the regression line or (II) there is very little scatter around the regression line.\"* Explain your reasoning.\n\n1. **Over-under, I.** \nSuppose we fit a regression line to predict the shelf life of an apple based on its weight. For a particular apple, we predict the shelf life to be 4.6 days. The apple's residual is -0.6 days. Did we over- or underestimate the shelf life of the apple? Explain your reasoning.\n\n1. **Over-under, II.** \nSuppose we fit a regression line to predict the number of incidents of skin cancer per 1,000 people from the number of sunny days in a year. \nFor a particular year, we predict the incidence of skin cancer to be 1.5 per 1,000 people, and the residual for this year is 0.5. \nDid we over- or underestimate the incidence of skin cancer? Explain your reasoning.\n\n1. **Starbucks, calories, and protein.** \nThe scatterplot below shows the relationship between the number of calories and amount of protein (in grams) Starbucks food menu items contain. 
Since Starbucks only lists the number of calories on the display items, we might be interested in predicting the amount of protein a menu item has based on its calorie content.^[The [`starbucks`](http://openintrostat.github.io/openintro/reference/starbucks.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-49-1.png){width=90%}\n :::\n :::\n\n a. Describe the relationship between number of calories and amount of protein (in grams) that Starbucks food menu items contain.\n\n b. In this scenario, what are the predictor and outcome variables?\n\n c. Why might we want to fit a regression line to these data?\n\n d. What does the residuals vs. predicted plot tell us about the variability in our prediction errors based on this model for items with lower vs. higher predicted protein?\n\n1. **Starbucks, calories, and carbs.** \nThe scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we might be interested in predicting the amount of carbs a menu item has based on its calorie content.^[The [`starbucks`](http://openintrostat.github.io/openintro/reference/starbucks.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-50-1.png){width=90%}\n :::\n :::\n\n a. Describe the relationship between number of calories and amount of carbohydrates (in grams) that Starbucks food menu items contain.\n\n b. In this scenario, what are the predictor and outcome variables?\n\n c. Why might we want to fit a regression line to these data?\n\n d. What does the residuals vs. predicted plot tell us about the variability in our prediction errors based on this model for items with lower vs. higher predicted carbs?\n\n1. **The Coast Starlight, regression.** \nThe Coast Starlight Amtrak train runs from Seattle to Los Angeles. \nThe scatterplot below displays the distance between each stop (in miles) and the amount of time it takes to travel from one stop to another (in minutes).\nThe mean travel time from one stop to the next on the Coast Starlight is 129 mins, with a standard deviation of 113 minutes. The mean distance traveled from one stop to the next is 108 miles with a standard deviation of 99 miles. \nThe correlation between travel time and distance is 0.636.^[The [`coast_starlight`](http://openintrostat.github.io/openintro/reference/coast_starlight.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-51-1.png){width=90%}\n :::\n :::\n\n a. Write the equation of the regression line for predicting travel time.\n\n b. Interpret the slope and the intercept in this context.\n\n c. Calculate $R^2$ of the regression line for predicting travel time from distance traveled for the Coast Starlight, and interpret $R^2$ in the context of the application.\n\n d. The distance between Santa Barbara and Los Angeles is 103 miles. Use the model to estimate the time it takes for the Starlight to travel between these two cities.\n\n e. 
It actually takes the Coast Starlight about 168 minutes to travel from Santa Barbara to Los Angeles. Calculate the residual and explain the meaning of this residual value.\n\n    f. Suppose Amtrak is considering adding a stop to the Coast Starlight 500 miles away from Los Angeles. Would it be appropriate to use this linear model to predict the travel time from Los Angeles to this point?\n    \n    \\clearpage\n\n1. **Body measurements, regression.** \nResearchers studying anthropometry collected body and skeletal diameter measurements, as well as age, weight, height and sex for 507 physically active individuals. \nThe scatterplot below shows the relationship between height and shoulder girth (circumference of shoulders measured over deltoid muscles), both measured in centimeters.\nThe mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. \nThe mean height is 171.14 cm with a standard deviation of 9.41 cm. \nThe correlation between height and shoulder girth is 0.67.^[The [`bdims`](http://openintrostat.github.io/openintro/reference/bdims.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Heinz:2003]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-52-1.png){width=90%}\n    :::\n    :::\n\n    a. Write the equation of the regression line for predicting height.\n\n    b. Interpret the slope and the intercept in this context.\n\n    c. Calculate $R^2$ of the regression line for predicting height from shoulder girth, and interpret it in the context of the application.\n\n    d. A randomly selected student from your class has a shoulder girth of 100 cm. Predict the height of this student using the model.\n\n    e. The student from part (d) is 160 cm tall. Calculate the residual, and explain what this residual means.\n\n    f. A one-year-old has a shoulder girth of 56 cm. Would it be appropriate to use this linear model to predict the height of this child?\n    \n    \\clearpage\n\n1. **Poverty and unemployment.** \nThe following scatterplot shows the relationship between the percent of the population below the poverty level (`poverty`) and the unemployment rate among those ages 20-64 (`unemployment_rate`) in counties in the US, as provided by data from the 2019 American Community Survey. \nThe regression output for the model for predicting `poverty` from `unemployment_rate` is also provided.^[The [`county_2019`](http://openintrostat.github.io/usdata/reference/county_2019.html) data used in this exercise can be found in the [**usdata**](http://openintrostat.github.io/usdata) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-53-1.png){width=90%}\n    :::\n    \n    ::: {.cell-output-display}\n    `````{=html}\n
    <table>\n    <thead>\n    <tr><th>term</th><th>estimate</th><th>std.error</th><th>statistic</th><th>p.value</th></tr>\n    </thead>\n    <tbody>\n    <tr><td>(Intercept)</td><td>4.60</td><td>0.349</td><td>13.2</td><td>&lt;0.0001</td></tr>\n    <tr><td>unemployment_rate</td><td>2.05</td><td>0.062</td><td>33.1</td><td>&lt;0.0001</td></tr>\n    </tbody>\n    </table>
\n \n `````\n :::\n :::\n\n a. Write out the linear model.\n\n b. Interpret the intercept.\n\n c. Interpret the slope.\n\n d. For this model $R^2$ is 46%. Interpret this value.\n\n e. Calculate the correlation coefficient.\n \n \\clearpage\n\n1. **Cats weights.** \nThe following regression output is for predicting the heart weight (`Hwt`, in g) of cats from their body weight (`Bwt`, in kg). The coefficients are estimated using a dataset of 144 domestic cats.^[The [`cats`](https://cran.r-project.org/web/packages/MASS/MASS.pdf) data used in this exercise can be found in the [**MASS**](https://cran.r-project.org/web/packages/MASS/index.html) R package.]\n\n ::: {.cell}\n ::: {.cell-output-display}\n ![](07-model-slr_files/figure-html/unnamed-chunk-54-1.png){width=90%}\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
    <table>\n    <thead>\n    <tr><th>term</th><th>estimate</th><th>std.error</th><th>statistic</th><th>p.value</th></tr>\n    </thead>\n    <tbody>\n    <tr><td>(Intercept)</td><td>-0.357</td><td>0.692</td><td>-0.515</td><td>0.6072</td></tr>\n    <tr><td>Bwt</td><td>4.034</td><td>0.250</td><td>16.119</td><td>&lt;0.0001</td></tr>\n    </tbody>\n    </table>
\n    \n    `````\n    :::\n    :::\n\n    a. Write out the linear model.\n\n    b. Interpret the intercept.\n\n    c. Interpret the slope.\n\n    d. The $R^2$ of this model is 65%. Interpret $R^2$.\n\n    e. Calculate the correlation coefficient.\n\n1. **Outliers, I.** \nIdentify the outliers in the scatterplots shown below, and determine what type of outliers they are. Explain your reasoning.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-55-1.png){width=100%}\n    :::\n    :::\n    \n    \\clearpage\n\n1. **Outliers, II.** \nIdentify the outliers in the scatterplots shown below and determine what type of outliers they are. Explain your reasoning.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-56-1.png){width=100%}\n    :::\n    :::\n\n1. **Urban homeowners, outliers.** \nThe scatterplot below shows the percent of families who own their home vs. the percent of the population living in urban areas. \nThere are 52 observations: the 50 US states, plus Puerto Rico and the District of Columbia.^[The [`urban_owner`](http://openintrostat.github.io/usdata/reference/urban_owner.html) data used in this exercise can be found in the [**usdata**](http://openintrostat.github.io/usdata) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-57-1.png){width=90%}\n    :::\n    :::\n\n    a. Describe the relationship between the percent of families who own their home and the percent of the population living in urban areas.\n\n    b. The outlier at the bottom right corner is District of Columbia, where 100% of the population is considered urban. What type of outlier is this observation?\n    \n    \\pagebreak\n\n1. **Crawling babies, outliers.**\nA study conducted at the University of Denver investigated whether babies take longer to learn to crawl in cold months, when they are often bundled in clothes that restrict their movement, than in warmer months. \nThe plot below shows the relationship between average crawling age of babies born in each month and the average temperature in the month when the babies are six months old.\nThe plot reveals a potential outlying month when the average temperature is about 53F and average crawling age is about 28.5 weeks. \nDoes this point have high leverage? Is it an influential point?^[The [`babies_crawl`](http://openintrostat.github.io/openintro/reference/babies_crawl.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Benson:1993]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-58-1.png){width=90%}\n    :::\n    :::\n\n1. **True / False.** \nDetermine if the following statements are true or false. \nIf false, explain why.\n\n    a. A correlation coefficient of -0.90 indicates a stronger linear relationship than a correlation of 0.5.\n\n    b. Correlation is a measure of the association between any two variables.\n\n1. **Cherry trees.** \nThe scatterplots below show the relationship between height, diameter, and volume of timber in 31 felled black cherry trees. 
\nThe diameter of the tree is measured 4.5 feet above the ground.^[The [`trees`](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/trees.html) data used in this exercise can be found in the [**datasets**](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-59-1.png){width=90%}\n    :::\n    :::\n\n    a. Describe the relationship between volume and height of these trees.\n\n    b. Describe the relationship between volume and diameter of these trees.\n\n    c. Suppose you have height and diameter measurements for another black cherry tree. Which of these variables would be preferable to use to predict the volume of timber in this tree using a simple linear regression model? Explain your reasoning.\n\n1. **Match the correlation, III.**\nMatch each correlation to the corresponding scatterplot.^[The [`corr_match`](http://openintrostat.github.io/openintro/reference/corr_match.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-60-1.png){width=100%}\n    :::\n    :::\n    \n    a. $r = 0.69$\n\n    b. $r = 0.09$\n\n    c. $r = -0.91$\n\n    d. $r = 0.97$\n\n1. **Helmets and lunches.** \nThe scatterplot shows the relationship between socioeconomic status measured as the percentage of children in a neighborhood receiving reduced-fee lunches at school (`lunch`) and the percentage of bike riders in the neighborhood wearing helmets (`helmet`). \nThe average percentage of children receiving reduced-fee lunches is 30.833% with a standard deviation of 26.724% and the average percentage of bike riders wearing helmets is 38.883% with a standard deviation of 16.948%.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-61-1.png){width=90%}\n    :::\n    :::\n\n    a. If the $R^2$ for the least-squares regression line for these data is 72%, what is the correlation between `lunch` and `helmet`?\n\n    b. Calculate the slope and intercept for the least-squares regression line for these data.\n\n    c. Interpret the intercept of the least-squares regression line in the context of the application.\n\n    d. Interpret the slope of the least-squares regression line in the context of the application.\n\n    e. What would the value of the residual be for a neighborhood where 40% of the children receive reduced-fee lunches and 40% of the bike riders wear helmets? 
Interpret the meaning of this residual in the context of the application.\n\n\n:::\n", + "markdown": "---\noutput: html_document\neditor_options: \n chunk_output_type: console\n---\n\n\n\n\n# Linear regression with a single predictor {#sec-model-slr}\n\n::: {.chapterintro data-latex=\"\"}\nLinear regression is a very powerful statistical technique.\nMany people have some familiarity with regression models just from reading the news, where straight lines are overlaid on scatterplots.\nLinear models can be used for prediction or to evaluate whether there is a linear relationship between a numerical variable on the horizontal axis and the average of the numerical variable on the vertical axis.\n:::\n\n## Fitting a line, residuals, and correlation {#fit-line-res-cor}\n\nWhen considering linear regression, it's helpful to think deeply about the line fitting process.\nIn this section, we define the form of a linear model, explore criteria for what makes a good fit, and introduce a new statistic called *correlation*.\n\n### Fitting a line to data\n\n@fig-perfLinearModel shows two variables whose relationship can be modeled perfectly with a straight line.\nThe equation for the line is $y = 5 + 64.96 x.$ Consider what a perfect linear relationship means: we know the exact value of $y$ just by knowing the value of $x.$ A perfect linear relationship is unrealistic in almost any natural process.\nFor example, if we took family income ($x$), this value would provide some useful information about how much financial support a college may offer a prospective student ($y$).\nHowever, the prediction would be far from perfect, since other factors play a role in financial support beyond a family's finances.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Requests from twelve separate buyers were simultaneously placed with a trading\ncompany to purchase Target Corporation stock (ticker TGT, December 28th, 2018),\nand the total cost of the shares were reported. 
Because the cost is computed using\na linear formula, the linear fit is perfect.](07-model-slr_files/figure-html/fig-perfLinearModel-1.png){#fig-perfLinearModel width=90%}\n:::\n:::\n\n\nLinear regression is the statistical method for fitting a line to data where the relationship between two variables, $x$ and $y,$ can be modeled by a straight line with some error:\n\n$$\ny = b_0 + b_1 \\ x + e\n$$\n\nThe values $b_0$ and $b_1$ represent the model's intercept and slope, respectively, and the error is represented by $e$.\nThese values are calculated based on the data, i.e., they are sample statistics.\nIf the observed data is a random sample from a target population that we are interested in making inferences about, these values are considered to be point estimates for the population parameters $\\beta_0$ and $\\beta_1$.\nWe will discuss how to make inferences about parameters of a linear model based on sample statistics in @sec-inf-model-slr.\n\n::: {.pronunciation data-latex=\"\"}\nThe Greek letter $\\beta$ is pronounced *beta*, listen to the pronunciation [here](https://youtu.be/PStgY5AcEIw?t=7).\n:::\n\nWhen we use $x$ to predict $y,$ we usually call $x$ the **predictor** variable and we call $y$ the **outcome**.\nWe also often drop the $e$ term when writing down the model since our main focus is often on the prediction of the average outcome.\n\n\n\n\n\nIt is rare for all of the data to fall perfectly on a straight line.\nInstead, it's more common for data to appear as a *cloud of points*, such as those examples shown in @fig-imperfLinearModel.\nIn each case, the data fall around a straight line, even if none of the observations fall exactly on the line.\nThe first plot shows a relatively strong downward linear trend, where the remaining variability in the data around the line is minor relative to the strength of the relationship between $x$ and $y.$ The second plot shows an upward trend that, while evident, is not as strong as the first.\nThe last plot shows a very weak downward trend in the data, so slight we can hardly notice it.\nIn each of these examples, we will have some uncertainty regarding our estimates of the model parameters, $\\beta_0$ and $\\beta_1.$ For instance, we might wonder, should we move the line up or down a little, or should we tilt it more or less?\nAs we move forward in this chapter, we will learn about criteria for line-fitting, and we will also learn about the uncertainty associated with estimates of model parameters.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Three datasets where a linear model may be useful even though the data do\nnot all fall exactly on the line.\n](07-model-slr_files/figure-html/fig-imperfLinearModel-1.png){#fig-imperfLinearModel width=100%}\n:::\n:::\n\n\nThere are also cases where fitting a straight line to the data, even if there is a clear relationship between the variables, is not helpful.\nOne such case is shown in @fig-notGoodAtAllForALinearModel where there is a very clear relationship between the variables even though the trend is not linear.\nWe discuss nonlinear trends in this chapter and the next, but details of fitting nonlinear models are saved for a later course.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The best fitting line for these data is flat, which is not a useful way to\ndescribe the non-linear relationship. 
These data are from a physics experiment.](07-model-slr_files/figure-html/fig-notGoodAtAllForALinearModel-1.png){#fig-notGoodAtAllForALinearModel width=90%}\n:::\n:::\n\n\n### Using linear regression to predict possum head lengths\n\nBrushtail possums are marsupials that live in Australia, and a photo of one is shown in @fig-brushtail-possum.\nResearchers captured 104 of these animals and took body measurements before releasing the animals back into the wild.\nWe consider two of these measurements: the total length of each possum, from head to tail, and the length of each possum's head.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The common brushtail possum of Australia. Photo by Greg Schecter,\n[flic.kr/p/9BAFbR](https://flic.kr/p/9BAFbR), CC BY 2.0 license.\n](images/brushtail-possum/brushtail-possum.jpg){#fig-brushtail-possum fig-alt='Photograph of a common brushtail possum of Australia.' width=50%}\n:::\n:::\n\n\n::: {.data data-latex=\"\"}\nThe [`possum`](http://openintrostat.github.io/openintro/reference/possum.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n@fig-scattHeadLTotalL shows a scatterplot for the head length (mm) and total length (cm) of the possums.\nEach point represents a single possum from the data.\nThe head and total length variables are associated: possums with an above average total length also tend to have above average head lengths.\nWhile the relationship is not perfectly linear, it could be helpful to partially explain the connection between these variables with a straight line.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A scatterplot showing head length against total length for 104 brushtail\npossums. A point representing a possum with head length 86.7 mm and total\nlength 84 cm is highlighted.](07-model-slr_files/figure-html/fig-scattHeadLTotalL-1.png){#fig-scattHeadLTotalL width=90%}\n:::\n:::\n\n\nWe want to describe the relationship between the head length and total length variables in the possum dataset using a line.\nIn this example, we will use the total length as the predictor variable, $x,$ to predict a possum's head length, $y.$ We could fit the linear relationship by eye, as in @fig-scattHeadLTotalLLine.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A reasonable linear model was fit to represent the relationship between\nhead length and total length.](07-model-slr_files/figure-html/fig-scattHeadLTotalLLine-1.png){#fig-scattHeadLTotalLLine width=90%}\n:::\n:::\n\n\nThe equation for this line is\n\n$$\n\\hat{y} = 41 + 0.59x\n$$\n\nA \"hat\" on $y$ is used to signify that this is an estimate.\nWe can use this line to discuss properties of possums.\nFor instance, the equation predicts a possum with a total length of 80 cm will have a head length of\n\n$$\n\\hat{y} = 41 + 0.59 \\times 80 = 88.2\n$$\n\nThe estimate may be viewed as an average: the equation predicts that possums with a total length of 80 cm will have an average head length of 88.2 mm.\nAbsent further information about an 80 cm possum, the prediction for head length that uses the average is a reasonable estimate.\n\nThere may be other variables that could help us predict the head length of a possum besides its length.\nPerhaps the relationship would be a little different for male possums than female possums, or perhaps it would differ for possums from one region of Australia versus another region.\nPlot A in @fig-scattHeadLTotalL-sex-age shows the relationship between total length and head length of brushtail possums, 
taking into consideration their sex.\nMale possums (represented by blue triangles) seem to be larger in terms of total length and head length than female possums (represented by red circles).\nPlot B in @fig-scattHeadLTotalL-sex-age shows the same relationship, taking into consideration their age.\nIt's harder to tell if age changes the relationship between total length and head length for these possums.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Relationship between total length and head length of brushtail possums,\ntaking into consideration their sex (Plot A) or age (Plot B).\n](07-model-slr_files/figure-html/fig-scattHeadLTotalL-sex-age-1.png){#fig-scattHeadLTotalL-sex-age width=90%}\n:::\n:::\n\n\nIn @sec-model-mlr, we'll learn about how we can include more than one predictor in our model.\nBefore we get there, we first need to better understand how to best build a linear model with one predictor.\n\n### Residuals {#resids}\n\n**Residuals** are the leftover variation in the data after accounting for the model fit:\n\n$$\n\\text{Data} = \\text{Fit} + \\text{Residual}\n$$\n\nEach observation will have a residual, and three of the residuals for the linear model we fit for the possum data are shown in @fig-scattHeadLTotalLLine-highlighted.\nIf an observation is above the regression line, then its residual, the vertical distance from the observation to the line, is positive.\nObservations below the line have negative residuals.\nOne goal in picking the right linear model is for these residuals to be as small as possible.\n\n\n\n\n\n@fig-scattHeadLTotalLLine-highlighted is almost a replica of @fig-scattHeadLTotalLLine, with three points from the data highlighted.\nThe observation marked by a red circle has a small, negative residual of about -1; the observation marked by a gray diamond has a large positive residual of about +7; and the observation marked by a pink triangle has a moderate negative residual of about -4.\nThe size of a residual is usually discussed in terms of its absolute value.\nFor example, the residual for the observation marked by a pink triangle is larger than that of the observation marked by a red circle because $|-4|$ is larger than $|-1|.$\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A reasonable linear model was fit to represent the relationship between\nhead length and total length, with three points highlighted.](07-model-slr_files/figure-html/fig-scattHeadLTotalLLine-highlighted-1.png){#fig-scattHeadLTotalLLine-highlighted width=90%}\n:::\n:::\n\n\n::: {.important data-latex=\"\"}\n**Residual: Difference between observed and expected.**\n\nThe residual of the $i^{th}$ observation $(x_i, y_i)$ is the difference of the observed outcome ($y_i$) and the outcome we would predict based on the model fit ($\\hat{y}_i$):\n\n$$\ne_i = y_i - \\hat{y}_i\n$$\n\nWe typically identify $\\hat{y}_i$ by plugging $x_i$ into the model.\n:::\n\n::: {.workedexample data-latex=\"\"}\nThe linear fit shown in @fig-scattHeadLTotalLLine-highlighted is given as $\\hat{y} = 41 + 0.59x.$ Based on this line, formally compute the residual of the observation $(76.0, 85.1).$ This observation is marked by a red circle in @fig-scattHeadLTotalLLine-highlighted.\nCheck it against the earlier visual estimate, -1.\n\n------------------------------------------------------------------------\n\nWe first compute the predicted value of the observation marked by a red circle based on the model:\n\n$$\n\\hat{y} = 41+0.59x = 41+0.59\\times 76.0 = 85.84\n$$\n\nNext we compute the difference of the 
actual head length and the predicted head length:\n\n$$\ne = y - \\hat{y} = 85.1 - 85.84 = -0.74\n$$\n\nThe model's error is $e = -0.74$ mm, which is very close to the visual estimate of -1 mm.\nThe negative residual indicates that the linear model overpredicted head length for this particular possum.\n:::\n\n::: {.guidedpractice data-latex=\"\"}\nIf a model underestimates an observation, will the residual be positive or negative?\nWhat about if it overestimates the observation?[^07-model-slr-1]\n:::\n\n[^07-model-slr-1]: If a model underestimates an observation, then the model estimate is below the actual.\n    The residual, which is the actual observation value minus the model estimate, must then be positive.\n    The opposite is true when the model overestimates the observation: the residual is negative.\n\n::: {.guidedpractice data-latex=\"\"}\nCompute the residuals for the observation marked by a gray diamond, $(85.0, 98.6),$ and the observation marked by a pink triangle, $(95.5, 94.0),$ in the figure using the linear relationship $\\hat{y} = 41 + 0.59x.$[^07-model-slr-2]\n:::\n\n[^07-model-slr-2]: Gray diamond: $\\hat{y} = 41+0.59x = 41+0.59\\times 85.0 = 91.15 \\rightarrow e = y - \\hat{y} = 98.6-91.15=7.45.$ This is close to the earlier estimate of 7.\n    Pink triangle: $\\hat{y} = 41+0.59x = 97.3 \\rightarrow e = -3.3.$ This is also close to the estimate of -4.
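\n\nThese calculations can also be scripted. Below is a minimal sketch in R, assuming the `possum` data from the **openintro** package with columns `total_l` and `head_l`, and reusing the line fit by eye, $\\hat{y} = 41 + 0.59x$:\n\n```r\n# install.packages(\"openintro\")  # data package used throughout, if needed\nlibrary(openintro)\n\n# Predicted head lengths from the line fit by eye: y-hat = 41 + 0.59 x\npredicted <- 41 + 0.59 * possum$total_l\n\n# Residual = observed - predicted, one per possum\nresid_by_eye <- possum$head_l - predicted\n\n# Check the highlighted observations by hand\n41 + 0.59 * 76.0  # 85.84,  so e = 85.1 - 85.84  = -0.74\n41 + 0.59 * 85.0  # 91.15,  so e = 98.6 - 91.15  =  7.45\n41 + 0.59 * 95.5  # 97.345, so e = 94.0 - 97.345 = -3.345\n```\n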
\n\nResiduals are helpful in evaluating how well a linear model fits a dataset.\nWe often display them in a scatterplot such as the one shown in @fig-scattHeadLTotalLResidualPlot for the regression line in @fig-scattHeadLTotalLLine-highlighted.\nThe residuals are plotted with their predicted outcome variable value as the horizontal coordinate, and the vertical coordinate as the residual.\nFor instance, the point $(85.0, 98.6)$ (marked by the gray diamond) had a predicted value of 91.15 mm and had a residual of 7.45 mm, so in the residual plot it is placed at $(91.15, 7.45).$ Creating a residual plot is sort of like tipping the scatterplot over so the regression line is horizontal, as indicated by the dashed line.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Residual plot for the model predicting head length from total length for\nbrushtail possums.](07-model-slr_files/figure-html/fig-scattHeadLTotalLResidualPlot-1.png){#fig-scattHeadLTotalLResidualPlot width=90%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nOne purpose of residual plots is to identify characteristics or patterns still apparent in data after fitting a model.\n@fig-sampleLinesAndResPlots shows three scatterplots with linear models in the first row and residual plots in the second row.\nCan you identify any patterns remaining in the residuals?\n\n------------------------------------------------------------------------\n\nIn the first dataset (first column), the residuals show no obvious patterns.\nThe residuals appear to be scattered randomly around the dashed line that represents 0.\n\nThe second dataset shows a pattern in the residuals.\nThere is some curvature in the scatterplot, which is more obvious in the residual plot.\nWe should not use a straight line to model these data.\nInstead, a more advanced technique should be used to model the curved relationship, such as the variable transformations discussed in @sec-transforming-data.\n\nThe last plot shows very little upwards trend, and the residuals also show no obvious patterns.\nIt is reasonable to try to fit a linear model to the data.\nHowever, it is unclear whether there is evidence that the slope parameter is different from zero.\nThe point estimate of the slope parameter, labeled $b_1,$ is not zero, but we might wonder if this could just be due to chance.\nWe will address this sort of scenario in @sec-inf-model-slr.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Sample data with their best fitting lines (top row) and their corresponding\nresidual plots (bottom row).](07-model-slr_files/figure-html/fig-sampleLinesAndResPlots-1.png){#fig-sampleLinesAndResPlots width=90%}\n:::\n:::\n\n\n\\clearpage\n\n### Describing linear relationships with correlation\n\nWe've seen plots with strong linear relationships and others with very weak linear relationships.\nIt would be useful if we could quantify the strength of these linear relationships with a statistic.\n\n::: {.important data-latex=\"\"}\n**Correlation: strength of a linear relationship.**\n\n**Correlation**, which always takes values between -1 and 1, describes the strength and direction of the linear relationship between two variables.\nWe denote the correlation by $r.$\n\nThe correlation value has no units and will not be affected by a linear change in the units (e.g., going from inches to centimeters).\n:::\n\n\n\n\n\nWe can compute the correlation using a formula, just as we did with the sample mean and standard deviation.\nThe formula for correlation, however, is rather complex[^07-model-slr-3], and as with other statistics, we generally perform the calculations on a computer or calculator.\n\n[^07-model-slr-3]: Formally, we can compute the correlation for observations $(x_1, y_1),$ $(x_2, y_2),$ ..., $(x_n, y_n)$ using the formula\n\n$$\nr = \\frac{1}{n-1} \\sum_{i=1}^{n} \\frac{x_i-\\bar{x}}{s_x}\\frac{y_i-\\bar{y}}{s_y}\n$$\n\nwhere $\\bar{x},$ $\\bar{y},$ $s_x,$ and $s_y$ are the sample means and standard deviations for each variable.\n\n@fig-posNegCorPlots shows eight plots and their corresponding correlations.\nOnly when the relationship is perfectly linear is the correlation either -1 or 1.\nIf the relationship is strong and positive, the correlation will be near +1.\nIf it is strong and negative, it will be near -1.\nIf there is no apparent linear relationship between the variables, then the correlation will be near zero.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Sample scatterplots and their correlations. The first row shows variables\nwith a positive relationship, represented by the trend up and to the right.\nThe second row shows variables with a negative trend, where a large value\nin one variable is associated with a lower value in the other.\n](07-model-slr_files/figure-html/fig-posNegCorPlots-1.png){#fig-posNegCorPlots width=100%}\n:::\n:::\n\n\nThe correlation is intended to quantify the strength of a linear trend.\nNonlinear trends, even when strong, sometimes produce correlations that do not reflect the strength of the relationship; see three such examples in @fig-corForNonLinearPlots.\n
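\nIn practice we rarely use the formula directly. A minimal sketch in R, again assuming the `possum` data from the **openintro** package, shows the built-in `cor()` alongside the footnote formula:\n\n```r\nlibrary(openintro)\n\n# Built-in correlation\ncor(possum$total_l, possum$head_l)\n\n# The same value from the formula in the footnote\nx <- possum$total_l\ny <- possum$head_l\nn <- length(x)\nsum((x - mean(x)) / sd(x) * (y - mean(y)) / sd(y)) / (n - 1)\n```\n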
\n\n::: {.cell}\n::: {.cell-output-display}\n![Sample scatterplots and their correlations. In each case, there is a strong\nrelationship between the variables. However, because the relationship is\nnot linear, the correlation is relatively weak.\n](07-model-slr_files/figure-html/fig-corForNonLinearPlots-1.png){#fig-corForNonLinearPlots width=100%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nNo straight line is a good fit for any of the datasets represented in @fig-corForNonLinearPlots.\nTry drawing nonlinear curves on each plot.\nOnce you create a curve for each, describe what is important in your fit.[^07-model-slr-4]\n:::\n\n[^07-model-slr-4]: We'll leave it to you to draw the lines.\n    In general, the lines you draw should be close to most points and reflect overall trends in the data.\n\n::: {.workedexample data-latex=\"\"}\n@fig-crop-yields-af displays the relationships between various crop yields across countries.\nIn the plots, each point represents a different country.\nThe x and y variables represent the proportion of total yield in the last 50 years that is due to that crop type.\n\nOrder the six scatterplots from strongest negative to strongest positive linear relationship.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Relationships between various crop yields in over 200 countries.\n](07-model-slr_files/figure-html/fig-crop-yields-af-1.png){#fig-crop-yields-af width=100%}\n:::\n:::\n\n\n------------------------------------------------------------------------\n\nThe order from most negative correlation to most positive correlation is:\n\n$$\nA \\rightarrow C \\rightarrow B \\rightarrow F \\rightarrow D \\rightarrow E\n$$\n\n- Plot A - bananas vs. potatoes: -0.62\n- Plot B - cassava vs. soybeans: -0.21\n- Plot C - cassava vs. maize: -0.26\n- Plot D - cocoa vs. bananas: 0.22\n- Plot E - peas vs. barley: 0.31\n- Plot F - wheat vs. barley: 0.21\n:::\n\nOne important aspect of the correlation is that it's *unitless*.\nThat is, unlike the slope of a line (see the next section), which measures the change in the y-coordinate for a one unit increase in the x-coordinate (in the units of the x and y variables), the correlation of x and y has no units.\n@fig-bdims-units shows the relationship between weights and heights of 507 physically active individuals.\nIn Plot A, weight is measured in kilograms (kg) and height in centimeters (cm).\nIn Plot B, weight has been converted to pounds (lbs) and height to inches (in).\nThe correlation coefficient ($r = 0.72$) is also noted on both plots.\nWe can see that the shape of the relationship has not changed, and neither has the correlation coefficient.\nThe only visual change to the plot is the axis *labeling* of the points.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Two scatterplots, both displaying the relationship between weights and\nheights of 507 physically healthy adults. In Plot A, the units are\nkilograms and centimeters. In Plot B, the units are pounds and inches.\nAlso noted on both plots is the correlation coefficient, $r = 0.72$.\n](07-model-slr_files/figure-html/fig-bdims-units-1.png){#fig-bdims-units width=90%}\n:::\n:::\n
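\n\nThis unit invariance is easy to check in code. A minimal sketch, assuming the `bdims` data from the **openintro** package with weight `wgt` in kilograms and height `hgt` in centimeters:\n\n```r\nlibrary(openintro)\n\n# Correlation in metric units\ncor(bdims$wgt, bdims$hgt)\n\n# Convert units: kg to lbs and cm to in are both linear changes of units\nwgt_lbs <- bdims$wgt * 2.205\nhgt_in  <- bdims$hgt / 2.54\n\n# The correlation is unchanged\ncor(wgt_lbs, hgt_in)\n```\n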
\n\n\n## Least squares regression {#sec-least-squares-regression}\n\nFitting linear models by eye is open to criticism since it is based on an individual's preference.\nIn this section, we use *least squares regression* as a more rigorous approach to fitting a line to a scatterplot.\n\n### Gift aid for freshmen at Elmhurst College\n\nThis section considers family income and gift aid data from a random sample of fifty students in the freshman class of Elmhurst College in Illinois.\nGift aid is financial aid that does not need to be paid back, as opposed to a loan.\nA scatterplot of these data is shown in @fig-elmhurstScatterWLine along with a linear fit.\nThe line follows a negative trend in the data; students who have higher family incomes tended to have lower gift aid from the university.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Gift aid and family income for a random sample of 50 freshman students from\nElmhurst College.](07-model-slr_files/figure-html/fig-elmhurstScatterWLine-1.png){#fig-elmhurstScatterWLine width=90%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nIs the correlation positive or negative in @fig-elmhurstScatterWLine?[^07-model-slr-5]\n:::\n\n[^07-model-slr-5]: Larger family incomes are associated with lower amounts of aid, so the correlation will be negative.\n    Using a computer, the correlation can be computed: -0.499.\n\n### An objective measure for finding the best line\n\nWe begin by thinking about what we mean by the \"best\" line.\nMathematically, we want a line that has small residuals.\nBut beyond the mathematical reasons, hopefully it also makes sense intuitively that whatever line we fit, the residuals should be small (i.e., the points should be close to the line).\nThe first option that may come to mind is to minimize the sum of the residual magnitudes:\n\n$$\n|e_1| + |e_2| + \\dots + |e_n|\n$$\n\nwhich we could accomplish with a computer program.\nThe resulting dashed line shown in @fig-elmhurstScatterW2Lines demonstrates this fit can be quite reasonable.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Gift aid and family income for a random sample of 50 freshman students from\nElmhurst College. The dashed line represents the line that minimizes the sum of\nthe absolute value of residuals, the solid line represents the line that minimizes\nthe sum of squared residuals, i.e., the least squares line.](07-model-slr_files/figure-html/fig-elmhurstScatterW2Lines-1.png){#fig-elmhurstScatterW2Lines width=90%}\n:::\n:::\n\n\nHowever, a more common practice is to choose the line that minimizes the sum of the squared residuals:\n\n$$\ne_{1}^2 + e_{2}^2 + \\dots + e_{n}^2\n$$\n\nThe line that minimizes this least squares criterion is represented as the solid line in @fig-elmhurstScatterW2Lines and is commonly called the **least squares line**.\nThe following are four possible reasons to choose the least squares option instead of trying to minimize the sum of residual magnitudes without any squaring (a sketch comparing the two fits follows the list):\n\n\n\n\n\n1. It is the most commonly used method.\n2. Computing the least squares line is widely supported in statistical software.\n3. In many applications, a residual twice as large as another residual is more than twice as bad. For example, being off by 4 is usually more than twice as bad as being off by 2. Squaring the residuals accounts for this discrepancy.\n4. The analyses that link the model to inference about a population are most straightforward when the line is fit through least squares.\n\nThe first two reasons are largely for tradition and convenience; the third and fourth reasons explain why the least squares criterion is typically most helpful when working with real data.[^07-model-slr-6]\n\n[^07-model-slr-6]: There are applications where the sum of residual magnitudes may be more useful, and there are plenty of other criteria we might consider.\n    However, this book only applies the least squares criterion.
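\n\nThe comparison mentioned above can be sketched in a few lines of R, assuming the `elmhurst` data from the **openintro** package: `lm()` produces the least squares line, while the line minimizing the sum of absolute residuals is found here by direct numerical minimization with `optim()`:\n\n```r\nlibrary(openintro)\n\nx <- elmhurst$family_income\ny <- elmhurst$gift_aid\n\n# Least squares line (solid line): minimizes the sum of squared residuals\nls_fit <- coef(lm(y ~ x))\nls_fit\n\n# Sum of absolute residuals for a candidate (intercept, slope) pair\nsad <- function(par) sum(abs(y - par[1] - par[2] * x))\n\n# Line minimizing the sum of absolute residuals (dashed line),\n# starting the search from the least squares estimates\noptim(ls_fit, sad)$par\n```\n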
\n\n### Finding and interpreting the least squares line\n\nFor the Elmhurst data, we could write the equation of the least squares regression line as\n\n$$\n\\widehat{\\texttt{aid}} = \\beta_0 + \\beta_{1}\\times \\texttt{family\\_income}\n$$\n\nHere the equation is set up to predict gift aid based on a student's family income, which would be useful to students considering Elmhurst.\nThese two values, $\\beta_0$ and $\\beta_1,$ are the parameters of the regression line.\n\nThe parameters are estimated using the observed data.\nIn practice, this estimation is done using a computer in the same way that other estimates, like a sample mean, can be estimated using a computer or calculator.\n\nThe dataset where these data are stored is called `elmhurst`.\nThe first 5 rows of this dataset are given in @tbl-elmhurst-data.\n\n\n::: {#tbl-elmhurst-data .cell tbl-cap='First five rows of the `elmhurst` dataset.'}\n::: {.cell-output-display}\n`````{=html}\n
<table>\n<thead>\n<tr><th>family_income</th><th>gift_aid</th><th>price_paid</th></tr>\n</thead>\n<tbody>\n<tr><td>92.92</td><td>21.7</td><td>14.28</td></tr>\n<tr><td>0.25</td><td>27.5</td><td>8.53</td></tr>\n<tr><td>53.09</td><td>27.8</td><td>14.25</td></tr>\n<tr><td>50.20</td><td>27.2</td><td>8.78</td></tr>\n<tr><td>137.61</td><td>18.0</td><td>24.00</td></tr>\n</tbody>\n</table>
\n\n`````\n:::\n:::\n\n\nWe can see that family income is recorded in a variable called `family_income` and gift aid from the university is recorded in a variable called `gift_aid`.\nFor now, we won't worry about the `price_paid` variable.\nWe should also note that these data are from the 2011-2012 academic year, and all monetary amounts are given in \\$1,000s, i.e., the family income of the first student in the data shown in @tbl-elmhurst-data is \\$92,920 and they received gift aid of \\$21,700.\n(The data source states that all numbers have been rounded to the nearest whole dollar.)\n\nStatistical software is usually used to compute the least squares line and the typical output generated as a result of fitting regression models looks like the one shown in @tbl-rOutputForIncomeAidLSRLine.\nFor now we will focus on the first column of the output, which lists ${b}_0$ and ${b}_1.$ In @sec-inf-model-slr we will dive deeper into the remaining columns, which tell us how accurately and precisely the intercept and slope calculated from this sample of 50 students estimate the intercept and slope for *all* students.\n\n\n::: {.cell}\n\n:::\n
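\nThe fit itself is a single function call in R; a minimal sketch, again assuming the `elmhurst` data from the **openintro** package:\n\n```r\nlibrary(openintro)\n\nfit <- lm(gift_aid ~ family_income, data = elmhurst)\n\n# Intercept and slope (the estimate column of the summary table)\ncoef(fit)\n\n# Full coefficient table: estimates, standard errors, statistics, p-values\nsummary(fit)$coefficients\n```\n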
\n::: {#tbl-rOutputForIncomeAidLSRLine .cell tbl-cap='Summary of least squares fit for the Elmhurst data.'}\n::: {.cell-output-display}\n`````{=html}\n<table>\n<thead>\n<tr><th>term</th><th>estimate</th><th>std.error</th><th>statistic</th><th>p.value</th></tr>\n</thead>\n<tbody>\n<tr><td>(Intercept)</td><td>24.32</td><td>1.29</td><td>18.83</td><td>&lt;0.0001</td></tr>\n<tr><td>family_income</td><td>-0.04</td><td>0.01</td><td>-3.98</td><td>2e-04</td></tr>\n</tbody>\n</table>
\n\n`````\n:::\n:::\n\n\nThe model output tells us that the intercept is approximately 24.319 and the slope on `family_income` is approximately -0.043.\n\nBut what do these values mean?\nInterpreting parameters in a regression model is often one of the most important steps in the analysis.\n\n::: {.workedexample data-latex=\"\"}\nThe intercept and slope estimates for the Elmhurst data are $b_0$ = 24.319 and $b_1$ = -0.043.\nWhat do these numbers really mean?\n\n------------------------------------------------------------------------\n\nInterpreting the slope parameter is helpful in almost any application.\nFor each additional \\$1,000 of family income, we would expect a student to receive a net difference of 1,000 $\\times$ (-0.0431) = -\\$43.10 in aid on average, i.e., \\$43.10 *less*.\nNote that a higher family income corresponds to less aid because the coefficient of family income is negative in the model.\nWe must be cautious in this interpretation: while there is a real association, we cannot interpret a causal connection between the variables because these data are observational.\nThat is, increasing a particular student's family income may not cause the student's aid to drop.\n(Although it would be reasonable to contact the college and ask if the relationship is causal, i.e., if Elmhurst College's aid decisions are partially based on students' family income.)\n\nThe estimated intercept $b_0$ = 24.319 describes the average aid if a student's family had no income, \\$24,319.\nThe meaning of the intercept is relevant to this application since the family income for some students at Elmhurst is \\$0.\nIn other applications, the intercept may have little or no practical value if there are no observations where $x$ is near zero.\n:::\n\n::: {.important data-latex=\"\"}\n**Interpreting parameters estimated by least squares.**\n\nThe slope describes the estimated difference in the predicted average outcome of $y$ if the predictor variable $x$ happened to be one unit larger.\nThe intercept describes the average outcome of $y$ if $x = 0$ *and* the linear model is valid all the way to $x = 0$ (values of $x = 0$ are not observed or relevant in many applications).\n:::\n\nIf you would like to learn more about using R to fit linear models, see @sec-model-tutorials for the interactive R tutorials.\nAn alternative way to calculate the intercept and slope of a least squares line is by hand, using formulas.\nWhile manual calculations are not commonly used by practicing statisticians and data scientists, it is useful to work through them the first time you're learning about the least squares line and modeling in general.\nCalculating the values by hand leverages two properties of the least squares line:\n\n1. The slope of the least squares line can be estimated by\n\n$$\nb_1 = \\frac{s_y}{s_x} r\n$$\n\nwhere $r$ is the correlation between the two variables, and $s_x$ and $s_y$ are the sample standard deviations of the predictor and outcome, respectively.\n\n2. 
If $\\bar{x}$ is the sample mean of the predictor variable and $\\bar{y}$ is the sample mean of the outcome variable, then the point $(\\bar{x}, \\bar{y})$ falls on the least squares line.\n\n@tbl-summaryStatsElmhurstRegr shows the sample means for family income and gift aid as \\$101,780 and \\$19,940, respectively.\nWe could plot the point $(102, 19.9)$ on @fig-elmhurstScatterWLine to verify it falls on the least squares line (the solid line).\n\n\n::: {#tbl-summaryStatsElmhurstRegr .cell tbl-cap='Summary statistics for family income and gift aid.'}\n::: {.cell-output-display}\n`````{=html}\n
<table>\n<thead>\n<tr><th colspan=\"2\">Family income, x</th><th colspan=\"2\">Gift aid, y</th><th></th></tr>\n<tr><th>mean</th><th>sd</th><th>mean</th><th>sd</th><th>r</th></tr>\n</thead>\n<tbody>\n<tr><td>102</td><td>63.2</td><td>19.9</td><td>5.46</td><td>-0.499</td></tr>\n</tbody>\n</table>
\n\n`````\n:::\n:::\n\n\nNext, we formally find the point estimates $b_0$ and $b_1$ of the parameters $\\beta_0$ and $\\beta_1.$\n\n::: {.workedexample data-latex=\"\"}\nUsing the summary statistics in @tbl-summaryStatsElmhurstRegr, compute the slope for the regression line of gift aid against family income.\n\n------------------------------------------------------------------------\n\nCompute the slope using the summary statistics from @tbl-summaryStatsElmhurstRegr:\n\n$$\nb_1 = \\frac{s_y}{s_x} r = \\frac{5.46}{63.2}(-0.499) = -0.0431\n$$\n:::\n\nYou might recall the form of a line from math class, which we can use to find the model fit, including the estimate of $b_0.$ Given the slope of a line and a point on the line, $(x_0, y_0),$ the equation for the line can be written as\n\n$$\ny - y_0 = slope\\times (x - x_0)\n$$\n\n::: {.important data-latex=\"\"}\n**Identifying the least squares line from summary statistics.**\n\nTo identify the least squares line from summary statistics:\n\n- Estimate the slope parameter, $b_1 = (s_y / s_x) r.$\n- Noting that the point $(\\bar{x}, \\bar{y})$ is on the least squares line, use $x_0 = \\bar{x}$ and $y_0 = \\bar{y}$ with the point-slope equation: $y - \\bar{y} = b_1 (x - \\bar{x}).$\n- Simplifying the equation, we get $y = \\bar{y} - b_1 \\bar{x} + b_1 x,$ which reveals that $b_0 = \\bar{y} - b_1 \\bar{x}.$\n:::\n\n::: {.workedexample data-latex=\"\"}\nUsing the point (102, 19.9) from the sample means and the slope estimate $b_1 = -0.0431,$ find the least-squares line for predicting aid based on family income.\n\n------------------------------------------------------------------------\n\nApply the point-slope equation using $(102, 19.9)$ and the slope $b_1 = -0.0431$:\n\n$$\n\\begin{aligned}\ny - y_0 &= b_1 (x - x_0) \\\\\ny - 19.9 &= -0.0431 (x - 102)\n\\end{aligned}\n$$\n\nExpanding the right side and then adding 19.9 to each side, the equation simplifies:\n\n$$\n\\widehat{\\texttt{aid}} = 24.3 - 0.0431 \\times \\texttt{family\\_income}\n$$\n\nHere we have replaced $y$ with $\\widehat{\\texttt{aid}}$ and $x$ with $\\texttt{family\\_income}$ to put the equation in context.\nThe final least squares equation should always include a \"hat\" on the variable being predicted, whether it is a generic $``y\"$ or a named variable like $``aid\"$.\n:::\n\n::: {.workedexample data-latex=\"\"}\nSuppose a high school senior is considering Elmhurst College.\nCan they simply use the linear equation that we have estimated to calculate their financial aid from the university?\n\n------------------------------------------------------------------------\n\nThey may use it as an estimate, though some qualifiers on this approach are important.\nFirst, the data all come from one freshman class, and the way aid is determined by the university may change from year to year.\nSecond, the equation will provide an imperfect estimate.\nWhile the linear equation is good at capturing the trend in the data, no individual student's aid will be perfectly predicted (as can be seen from the individual data points in the cloud around the line).\n:::\n
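\nThe arithmetic above is easy to mirror in code; a minimal sketch using only the summary statistics from @tbl-summaryStatsElmhurstRegr:\n\n```r\n# Summary statistics for the Elmhurst data (amounts in $1,000s)\nx_bar <- 102     # mean family income\ny_bar <- 19.9    # mean gift aid\ns_x   <- 63.2    # sd of family income\ns_y   <- 5.46    # sd of gift aid\nr     <- -0.499  # correlation\n\n# Slope: b1 = (s_y / s_x) r\nb1 <- (s_y / s_x) * r\n\n# Intercept: b0 = y_bar - b1 * x_bar, since (x_bar, y_bar) is on the line\nb0 <- y_bar - b1 * x_bar\n\nc(b0 = b0, b1 = b1)  # approximately 24.3 and -0.0431\n```\n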
\n### Extrapolation is treacherous\n\n> *When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6 it was 10 degrees.* *Today it hit almost 80.* *At this rate, by August it will be 220 degrees.* *So clearly folks the climate debate rages on.*[^07-model-slr-7]\n>\n> Stephen Colbert April 6th, 2010\n\n[^07-model-slr-7]: \n\nLinear models can be used to approximate the relationship between two variables.\nHowever, like any model, they have real limitations.\nLinear regression is simply a modeling framework.\nThe truth is almost always much more complex than a simple line.\nFor example, we do not know how the data outside of our limited window will behave.\n\n::: {.workedexample data-latex=\"\"}\nUse the model $\\widehat{\\texttt{aid}} = 24.3 - 0.0431 \\times \\texttt{family\\_income}$ to estimate the aid of another freshman student whose family had income of \\$1 million.\n\n------------------------------------------------------------------------\n\nWe want to calculate the aid for a family with \\$1 million income.\nNote that in our model this will be represented as 1,000 since the data are in \\$1,000s.\n\n$$\n24.3 - 0.0431 \\times 1000 = -18.8\n$$\n\nThe model predicts this student will have -\\$18,800 in aid (!).\nHowever, Elmhurst College does not offer *negative aid*, i.e., it does not select some students to pay extra on top of tuition to attend.\n:::\n\nApplying a model estimate to values outside of the realm of the original data is called **extrapolation**.\nGenerally, a linear model is only an approximation of the real relationship between two variables.\nIf we extrapolate, we are making an unreliable bet that the approximate linear relationship will be valid in places where it has not been analyzed.\n\n\n\n\n\n### Describing the strength of a fit {#sec-r-squared}\n\nWe evaluated the strength of the linear relationship between two variables earlier using the correlation, $r.$ However, it is more common to explain the strength of a linear fit using $R^2,$ called **R-squared**.\nIf provided with a linear model, we might like to describe how closely the data cluster around the linear fit.\n\n\n\n\n\nThe $R^2$ of a linear model describes the amount of variation in the outcome variable that is explained by the least squares line.\nFor example, consider the Elmhurst data, shown in @fig-elmhurstScatterWLine.\nThe variance of the outcome variable, aid received, is about $s_{aid}^2 \\approx 29.8$ million (calculated from the data, some of which is shown in @tbl-elmhurst-data).\nHowever, if we apply our least squares line, then this model reduces our uncertainty in predicting aid using a student's family income.\nThe variability in the residuals describes how much variation remains after using the model: $s_{_{RES}}^2 \\approx 22.4$ million.\nIn short, there was a reduction of\n\n$$\n\\frac{s_{aid}^2 - s_{_{RES}}^2}{s_{aid}^2}\n  = \\frac{29800 - 22400}{29800}\n  = \\frac{7500}{29800}\n  \\approx 0.25,\n$$\n\nor about 25%, of the outcome variable's variation by using information about family income for predicting aid using a linear model.\nIt turns out that $R^2$ corresponds exactly to the squared value of the correlation:\n\n$$\nr = -0.499 \\rightarrow R^2 = 0.25\n$$\n\n::: {.guidedpractice data-latex=\"\"}\nIf a linear model has a very strong negative relationship with a correlation of -0.97, how much of the variation in the outcome is explained by the predictor?[^07-model-slr-8]\n:::\n\n[^07-model-slr-8]: About $R^2 = (-0.97)^2 = 0.94$ or 94% of the variation in the outcome variable is explained by the linear model.\n\n$R^2$ is also called the **coefficient of 
determination**.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Coefficient of determination: proportion of variability in the outcome variable explained by the model.**\n\nSince $r$ is always between -1 and 1, $R^2$ will always be between 0 and 1.\nThis statistic is called the **coefficient of determination**, and it measures the proportion of variation in the outcome variable, $y,$ that can be explained by the linear model with predictor $x.$\n:::\n\nMore generally, $R^2$ can be calculated using a ratio of a measure of variability around the line to a measure of total variability.\n\n::: {.important data-latex=\"\"}\n**Sums of squares to measure variability in** $y.$\n\nWe can measure the variability in the $y$ values by how far they tend to fall from their mean, $\\bar{y}.$ We define this value as the **total sum of squares**, calculated using the formula below, where $y_i$ represents each $y$ value in the sample, and $\\bar{y}$ represents the mean of the $y$ values in the sample.\n\n$$\nSST = (y_1 - \\bar{y})^2 + (y_2 - \\bar{y})^2 + \\cdots + (y_n - \\bar{y})^2.\n$$\n\nLeft-over variability in the $y$ values if we know $x$ can be measured by the **sum of squared errors**, or sum of squared residuals, calculated using the formula below, where $\\hat{y}_i$ represents the predicted value of $y_i$ based on the least squares regression.[^07-model-slr-9]\n\n$$\n\\begin{aligned}\nSSE &= (y_1 - \\hat{y}_1)^2 + (y_2 - \\hat{y}_2)^2 + \\cdots + (y_n - \\hat{y}_n)^2\\\\\n&= e_{1}^2 + e_{2}^2 + \\dots + e_{n}^2\n\\end{aligned}\n$$\n\nThe coefficient of determination can then be calculated as\n\n$$\nR^2 = \\frac{SST - SSE}{SST} = 1 - \\frac{SSE}{SST}\n$$\n:::\n\n[^07-model-slr-9]: The difference $SST - SSE$ is called the **regression sum of squares**, $SSR,$ and can also be calculated as $SSR = (\\hat{y}_1 - \\bar{y})^2 + (\\hat{y}_2 - \\bar{y})^2 + \\cdots + (\\hat{y}_n - \\bar{y})^2.$ $SSR$ represents the variation in $y$ that was accounted for in our model.\n\n\n\n\n\n::: {.workedexample data-latex=\"\"}\nAmong 50 students in the `elmhurst` dataset, the total variability in gift aid is $SST = 1461$.[^07-model-slr-10]\nThe sum of squared residuals is $SSE = 1098.$ Find $R^2.$\n\n------------------------------------------------------------------------\n\nSince we know $SSE$ and $SST,$ we can calculate $R^2$ as\n\n$$\nR^2 = 1 - \\frac{SSE}{SST} = 1 - \\frac{1098}{1461} = 0.25,\n$$\n\nthe same value we found when we squared the correlation: $R^2 = (-0.499)^2 = 0.25.$\n:::\n\n[^07-model-slr-10]: $SST$ can be calculated by finding the sample variance of the outcome variable, $s^2,$ and multiplying by $n-1.$
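\n\nA minimal sketch of the same calculation in R, again assuming the `elmhurst` data from the **openintro** package:\n\n```r\nlibrary(openintro)\n\nfit <- lm(gift_aid ~ family_income, data = elmhurst)\n\n# Total sum of squares: variability of the outcome around its mean\nsst <- sum((elmhurst$gift_aid - mean(elmhurst$gift_aid))^2)\n\n# Sum of squared errors: variability left over after fitting the line\nsse <- sum(residuals(fit)^2)\n\n# Coefficient of determination, computed two ways\n1 - sse / sst\nsummary(fit)$r.squared\n```\n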
\n### Categorical predictors with two levels {#sec-categorical-predictor-two-levels}\n\nCategorical variables are also useful in predicting outcomes.\nHere we consider a categorical predictor with two levels (recall that a *level* is the same as a *category*).\nWe'll consider eBay auctions for a video game, *Mario Kart* for the Nintendo Wii, where both the total price of the auction and the condition of the game were recorded.\nHere we want to predict total price based on game condition, which takes values `used` and `new`.\n\n::: {.data data-latex=\"\"}\nThe [`mariokart`](http://openintrostat.github.io/openintro/reference/mariokart.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n\n::: {.cell}\n\n:::\n\n\nA plot of the auction data is shown in @fig-marioKartNewUsed.\nNote that the original dataset contains some Mario Kart games being sold at prices above \\$100, but for this analysis we have limited our focus to the 141 Mario Kart games that were sold below \\$100.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Total auction prices for the video game Mario Kart, divided into used\n($x = 0$) and new ($x = 1$) condition games. The least squares regression\nline is also shown.](07-model-slr_files/figure-html/fig-marioKartNewUsed-1.png){#fig-marioKartNewUsed width=90%}\n:::\n:::\n\n\nTo incorporate the game condition variable into a regression equation, we must convert the categories into a numerical form.\nWe will do so using an **indicator variable** called `condnew`, which takes value 1 when the game is new and 0 when the game is used.\nUsing this indicator variable, the linear model may be written as\n\n$$\n\\widehat{\\texttt{price}} = b_0 + b_1 \\times \\texttt{condnew}\n$$
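\nSuch an indicator can be constructed explicitly in R. Below is a minimal sketch, assuming the `mariokart` data frame has a `cond` column with levels `new` and `used` and a `total_pr` column for the total price; note that `lm()` would also create an equivalent coding automatically for a factor predictor.\n\n```r\nlibrary(openintro)  # assumed source of the mariokart data\n\n# Keep the auctions that sold below $100, as in the analysis above\nmk <- subset(mariokart, total_pr < 100)\n\n# Indicator variable: 1 for new games, 0 for used games\nmk$condnew <- ifelse(mk$cond == \"new\", 1, 0)\n```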
\nThe parameter estimates are given in @tbl-marioKartNewUsedRegrSummary.\n\n\n\n\n::: {#tbl-marioKartNewUsedRegrSummary .cell tbl-cap='Least squares regression summary for the final auction price against the condition of the game.'}\n::: {.cell-output-display}\n`````{=html}\n<table>\n<thead>\n<tr><th>term</th><th>estimate</th><th>std.error</th><th>statistic</th><th>p.value</th></tr>\n</thead>\n<tbody>\n<tr><td>(Intercept)</td><td>42.9</td><td>0.81</td><td>52.67</td><td>&lt;0.0001</td></tr>\n<tr><td>condnew</td><td>10.9</td><td>1.26</td><td>8.66</td><td>&lt;0.0001</td></tr>\n</tbody>\n</table>
\n`````\n:::\n:::\n\n\nUsing values from @tbl-marioKartNewUsedRegrSummary, the model equation can be summarized as\n\n$$\n\\widehat{\\texttt{price}} = 42.87 + 10.90 \\times \\texttt{condnew}\n$$\n\n::: {.workedexample data-latex=\"\"}\nInterpret the two parameters estimated in the model for the price of Mario Kart in eBay auctions.\n\n------------------------------------------------------------------------\n\nThe intercept is the estimated price when `condnew` has a value of 0, i.e., when the game is in used condition.\nThat is, the average selling price of a used version of the game is \\$42.87.\nThe slope indicates that, on average, new games sell for about \\$10.90 more than used games.\n:::\n\n::: {.important data-latex=\"\"}\n**Interpreting model estimates for categorical predictors.**\n\nThe estimated intercept is the average value of the outcome variable for the first category (i.e., the category corresponding to an indicator value of 0).\nThe estimated slope is the average change in the outcome variable between the two categories.\n:::\n\nNote that, fundamentally, the intercept and slope interpretations do not change when modeling categorical variables with two levels.\nHowever, when the predictor variable is binary, the coefficient estimates ($b_0$ and $b_1$) are directly interpretable with respect to the dataset at hand.\n\nWe'll elaborate further on modeling categorical predictors in @sec-model-mlr, where we examine the influence of many predictor variables simultaneously using multiple regression.
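\nEstimates like those in @tbl-marioKartNewUsedRegrSummary come from an ordinary least squares fit. Below is a minimal sketch, continuing with the hypothetical `mk` data frame from the sketch above.\n\n```r\n# Fit the simple linear regression with the indicator as the predictor\nfit <- lm(total_pr ~ condnew, data = mk)\n\ncoef(fit)     # intercept near 42.9, slope near 10.9\nsummary(fit)  # adds standard errors, t statistics, and p-values\n```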
\n## Outliers in linear regression {#outliers-in-regression}\n\nIn this section, we identify criteria for determining which outliers are important and influential.\nOutliers in regression are observations that fall far from the cloud of points.\nThese points are especially important because they can have a strong influence on the least squares line.\n\n::: {.workedexample data-latex=\"\"}\nThere are three plots shown in @fig-outlier-plots-1 along with the corresponding least squares line and residual plots.\nFor each scatterplot and residual plot pair, identify the outliers and note how they influence the least squares line.\nRecall that an outlier is any point that does not appear to belong with the vast majority of the other points.\n\n------------------------------------------------------------------------\n\n- A: There is one outlier far from the other points, though it only appears to slightly influence the line.\n\n- B: There is one outlier on the right, though it is quite close to the least squares line, which suggests it wasn't very influential.\n\n- C: There is one point far away from the cloud, and this outlier appears to pull the least squares line up on the right; notice how the line around the primary cloud does not appear to fit very well.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Three plots, each with a least squares line and corresponding residual plot.\nEach dataset has at least one outlier.\n](07-model-slr_files/figure-html/fig-outlier-plots-1-1.png){#fig-outlier-plots-1 width=100%}\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nThere are three plots shown in @fig-outlier-plots-2 along with the least squares line and residual plots.\nAs you did in the previous exercise, for each scatterplot and residual plot pair, identify the outliers and note how they influence the least squares line.\nRecall that an outlier is any point that does not appear to belong with the vast majority of the other points.\n\n------------------------------------------------------------------------\n\n- D: There is a primary cloud and then a small secondary cloud of four outliers.\n The secondary cloud appears to be influencing the line somewhat strongly, making the least squares line fit poorly almost everywhere.\n There might be an interesting explanation for the dual clouds, which is something that could be investigated.\n\n- E: There is no obvious trend in the main cloud of points and the outlier on the right appears to largely (and problematically) control the slope of the least squares line.\n\n- F: There is one outlier far from the cloud.\n However, it falls quite close to the least squares line and does not appear to be very influential.\n:::\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Three plots, each with a least squares line and residual plot.\nAll datasets have at least one outlier.\n](07-model-slr_files/figure-html/fig-outlier-plots-2-1.png){#fig-outlier-plots-2 width=100%}\n:::\n:::\n\n\nExamine the residual plots in @fig-outlier-plots-1 and @fig-outlier-plots-2.\nIn Plots C, D, and E, you will probably find that there are a few observations which are both away from the remaining points along the x-axis and not in the trajectory of the trend in the rest of the data.\nIn these cases, the outliers influenced the slope of the least squares lines.\nIn Plot E, the bulk of the data show no clear trend, but if we fit a line to these data, we impose a trend where there isn't really one.\n\n::: {.important data-latex=\"\"}\n**Leverage.**\n\nPoints that fall horizontally away from the center of the cloud tend to pull harder on the line, so we call them points with **high leverage** or **leverage points**.\n:::\n\nPoints that fall horizontally far from the center of the cloud can strongly influence the slope of the least squares line.\nIf one of these high leverage points does appear to actually exert its influence on the slope of the line -- as in Plots C, D, and E of @fig-outlier-plots-1 and @fig-outlier-plots-2 -- then we call it an **influential point**.\nUsually we can say a point is influential if, had we fitted the line without it, the influential point would have been unusually far from the least squares line.\n\n\n\n\n\n::: {.important data-latex=\"\"}\n**Types of outliers.**\n\nA point (or a group of points) that stands out from the rest of the data is called an outlier.\nOutliers that fall horizontally away from the center of the cloud of points are called leverage points.\nOutliers that influence the slope of the line are called influential points.\n:::\n\nIt is tempting to remove outliers.\nDon't do this without a very good reason.\nModels that ignore exceptional (and interesting) cases often perform poorly.\nFor instance, if a financial firm ignored the largest market swings -- the \"outliers\" -- they would soon go bankrupt by making poorly thought-out investments.
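\nA small simulation can make the idea of an influential point concrete. Below is a minimal sketch with made-up data (not from the text): a line is fit to a cloud of points, then refit after adding a single high-leverage point that does not follow the trend.\n\n```r\nset.seed(1)\nx <- runif(30, 0, 10)\ny <- 2 + 0.5 * x + rnorm(30)\n\n# One high-leverage point, far to the right and off the trend\nx2 <- c(x, 30)\ny2 <- c(y, 0)\n\ncoef(lm(y ~ x))[2]    # slope near the true value of 0.5\ncoef(lm(y2 ~ x2))[2]  # slope pulled down by the influential point\n```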
\n\\clearpage\n\n## Chapter review {#chp7-review}\n\n### Summary\n\nThroughout this chapter, the nuances of the linear model have been described.\nYou have learned how to create a linear model with explanatory variables that are numerical (e.g., total possum length) and those that are categorical (e.g., whether a video game was new).\nThe residuals in a linear model are an important metric used to understand how well a model fits; high leverage points, influential points, and other types of outliers can impact the fit of a model.\nCorrelation is a measure of the strength and direction of the linear relationship between two variables, without specifying which variable is the explanatory variable and which is the outcome.\nFuture chapters will focus on generalizing the linear model from the sample of data to claims about the population of interest.\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n
<table>\n<tbody>\n<tr><td>coefficient of determination</td><td>influential point</td><td>predictor</td></tr>\n<tr><td>correlation</td><td>least squares line</td><td>R-squared</td></tr>\n<tr><td>extrapolation</td><td>leverage point</td><td>residuals</td></tr>\n<tr><td>high leverage</td><td>outcome</td><td>sum of squared error</td></tr>\n<tr><td>indicator variable</td><td>outlier</td><td>total sum of squares</td></tr>\n</tbody>\n</table>
\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#chp7-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-07].\n\n::: {.exercises data-latex=\"\"}\n1. **Visualize the residuals.** \nThe scatterplots shown below each have a superimposed regression line. If we were to construct a residual plot (residuals versus $x$) for each, describe in words what those plots would look like.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-38-1.png){width=90%}\n    :::\n    :::\n    \n1. **Trends in the residuals.** \nShown below are two plots of residuals remaining after fitting a linear model to two different sets of data. \nFor each plot, describe important features and determine if a linear model would be appropriate for these data. \nExplain your reasoning.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-39-1.png){width=90%}\n    :::\n    :::\n    \n1. **Identify relationships, I.** \nFor each of the six plots, identify the strength of the relationship (e.g., weak, moderate, or strong) in the data and whether fitting a linear model would be reasonable.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-40-1.png){width=90%}\n    :::\n    :::\n\n1. **Identify relationships, II.** \nFor each of the six plots, identify the strength of the relationship (e.g., weak, moderate, or strong) in the data and whether fitting a linear model would be reasonable.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-41-1.png){width=90%}\n    :::\n    :::\n\n1. **Midterms and final.** \nThe two scatterplots below show the relationship between the overall course average and two midterm exams (Exam 1 and Exam 2) recorded over several years for 233 students in a statistics course at a university.^[The [`exam_grades`](http://openintrostat.github.io/openintro/reference/exam_grades.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-42-1.png){width=90%}\n    :::\n    :::\n\n    a. Based on these graphs, which of the two exams has the stronger correlation with the course grade? Explain.\n\n    b. Can you think of a reason why the correlation between the exam you chose in part (a) and the course grade is higher?\n    \n    \\clearpage\n\n1. **Partners' ages and heights.** \nThe Great Britain Office of Population Census and Surveys collected data on a random sample of 170 married couples in Britain, recording the age (in years) and heights (converted here to inches) of the partners. The scatterplot on the left shows the heights of the partners plotted against each other and the plot on the right shows the ages of the partners plotted against each other.^[The [`husbands_wives`](http://openintrostat.github.io/openintro/reference/husbands_wives.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-43-1.png){width=90%}\n    :::\n    :::\n\n    a. Describe the relationship between partners' ages.\n\n    b. Describe the relationship between partners' heights.\n\n    c. Which plot shows a stronger correlation? Explain your reasoning.\n\n    d. Data on heights were originally collected in centimeters, and then converted to inches.
Does this conversion affect the correlation between partners' heights?\n\n1. **Match the correlation, I.** \nMatch each correlation to the corresponding scatterplot.^[The [`corr_match`](http://openintrostat.github.io/openintro/reference/corr_match.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-44-1.png){width=90%}\n    :::\n    :::\n\n    a. $r = -0.7$\n\n    b. $r = 0.45$\n\n    c. $r = 0.06$\n\n    d. $r = 0.92$\n\n1. **Match the correlation, II.** \nMatch each correlation to the corresponding scatterplot.^[The [`corr_match`](http://openintrostat.github.io/openintro/reference/corr_match.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-45-1.png){width=90%}\n    :::\n    :::\n\n    a. $r = 0.49$\n\n    b. $r = -0.48$\n\n    c. $r = -0.03$\n\n    d. $r = -0.85$\n\n1. **Body measurements, correlation.** \nResearchers studying anthropometry collected body and skeletal diameter measurements, as well as age, weight, height and sex for 507 physically active individuals. \nThe scatterplot below shows the relationship between height and shoulder girth (circumference of shoulders measured over deltoid muscles), both measured in centimeters.^[The [`bdims`](http://openintrostat.github.io/openintro/reference/bdims.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Heinz:2003]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-46-1.png){width=90%}\n    :::\n    :::\n\n    a. Describe the relationship between shoulder girth and height.\n\n    b. How would the relationship change if shoulder girth was measured in inches while the units of height remained in centimeters?\n\n1. **Compare correlations.** \nEduardo and Rosie are both collecting data on number of rainy days in a year and the total rainfall for the year. Eduardo records rainfall in inches and Rosie in centimeters. How will their correlation coefficients compare?\n\n1. **The Coast Starlight, correlation.** \nThe Coast Starlight Amtrak train runs from Seattle to Los Angeles. \nThe scatterplot below displays the distance between each stop (in miles) and the amount of time it takes to travel from one stop to another (in minutes).^[The [`coast_starlight`](http://openintrostat.github.io/openintro/reference/coast_starlight.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-47-1.png){width=70%}\n    :::\n    :::\n\n    a. Describe the relationship between distance and travel time.\n\n    b. How would the relationship change if travel time was instead measured in hours, and distance was instead measured in kilometers?\n\n    c. The correlation between travel time (in minutes) and distance (in miles) is $r = 0.636$. What is the correlation between travel time (in hours) and distance (in kilometers)?\n\n1. **Crawling babies, correlation.** \nA study conducted at the University of Denver investigated whether babies take longer to learn to crawl in cold months, when they are often bundled in clothes that restrict their movement, than in warmer months.
\nInfants born during the study year were split into twelve groups, one for each birth month. \nWe consider the average crawling age of babies in each group against the average temperature when the babies are six months old (that's when babies often begin trying to crawl). \nTemperature is measured in degrees Fahrenheit (F) and age is measured in weeks.^[The [`babies_crawl`](http://openintrostat.github.io/openintro/reference/babies_crawl.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Benson:1993]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-48-1.png){width=70%}\n    :::\n    :::\n\n    a. Describe the relationship between temperature and crawling age.\n\n    b. How would the relationship change if temperature was measured in degrees Celsius (C) and age was measured in months?\n\n    c. The correlation between temperature in F and age in weeks was $r=-0.70$. If we converted the temperature to C and age to months, what would the correlation be?\n\n1. **Partners' ages.** \nWhat would be the correlation between the ages of partners if people always dated others who are \n\n    a. 3 years younger than themselves?\n\n    b. 2 years older than themselves?\n\n    c. half as old as themselves?\n\n1. **Graduate degrees and salaries.** \nWhat would be the correlation between the annual salaries of people with and without a graduate degree at a company if for a certain type of position someone with a graduate degree always made \n\n    a. \\$5,000 more than those without a graduate degree?\n\n    b. 25% more than those without a graduate degree?\n\n    c. 15% less than those without a graduate degree?\n\n1. **Units of regression.** \nConsider a regression predicting the number of calories (cal) from width (cm) for a sample of square-shaped chocolate brownies. What are the units of the correlation coefficient, the intercept, and the slope?\n\n1. **Which is higher?**\nDetermine if (I) or (II) is higher or if they are equal: *\"For a regression line, the uncertainty associated with the slope estimate, $b_1$, is higher when (I) there is a lot of scatter around the regression line or (II) there is very little scatter around the regression line.\"* Explain your reasoning.\n\n1. **Over-under, I.** \nSuppose we fit a regression line to predict the shelf life of an apple based on its weight. For a particular apple, we predict the shelf life to be 4.6 days. The apple's residual is -0.6 days. Did we overestimate or underestimate the shelf life of the apple? Explain your reasoning.\n\n1. **Over-under, II.** \nSuppose we fit a regression line to predict the number of incidents of skin cancer per 1,000 people from the number of sunny days in a year. \nFor a particular year, we predict the incidence of skin cancer to be 1.5 per 1,000 people, and the residual for this year is 0.5. \nDid we overestimate or underestimate the incidence of skin cancer? Explain your reasoning.\n\n1. **Starbucks, calories, and protein.** \nThe scatterplot below shows the relationship between the number of calories and amount of protein (in grams) Starbucks food menu items contain.
Since Starbucks only lists the number of calories on the display items, we might be interested in predicting the amount of protein a menu item has based on its calorie content.^[The [`starbucks`](http://openintrostat.github.io/openintro/reference/starbucks.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-49-1.png){width=90%}\n    :::\n    :::\n\n    a. Describe the relationship between number of calories and amount of protein (in grams) that Starbucks food menu items contain.\n\n    b. In this scenario, what are the predictor and outcome variables?\n\n    c. Why might we want to fit a regression line to these data?\n\n    d. What does the residuals vs. predicted plot tell us about the variability in our prediction errors based on this model for items with lower vs. higher predicted protein?\n\n1. **Starbucks, calories, and carbs.** \nThe scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we might be interested in predicting the amount of carbs a menu item has based on its calorie content.^[The [`starbucks`](http://openintrostat.github.io/openintro/reference/starbucks.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-50-1.png){width=90%}\n    :::\n    :::\n\n    a. Describe the relationship between number of calories and amount of carbohydrates (in grams) that Starbucks food menu items contain.\n\n    b. In this scenario, what are the predictor and outcome variables?\n\n    c. Why might we want to fit a regression line to these data?\n\n    d. What does the residuals vs. predicted plot tell us about the variability in our prediction errors based on this model for items with lower vs. higher predicted carbs?\n\n1. **The Coast Starlight, regression.** \nThe Coast Starlight Amtrak train runs from Seattle to Los Angeles. \nThe scatterplot below displays the distance between each stop (in miles) and the amount of time it takes to travel from one stop to another (in minutes).\nThe mean travel time from one stop to the next on the Coast Starlight is 129 minutes, with a standard deviation of 113 minutes. The mean distance traveled from one stop to the next is 108 miles with a standard deviation of 99 miles. \nThe correlation between travel time and distance is 0.636.^[The [`coast_starlight`](http://openintrostat.github.io/openintro/reference/coast_starlight.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-51-1.png){width=90%}\n    :::\n    :::\n\n    a. Write the equation of the regression line for predicting travel time.\n\n    b. Interpret the slope and the intercept in this context.\n\n    c. Calculate $R^2$ of the regression line for predicting travel time from distance traveled for the Coast Starlight, and interpret $R^2$ in the context of the application.\n\n    d. The distance between Santa Barbara and Los Angeles is 103 miles. Use the model to estimate the time it takes for the Starlight to travel between these two cities.\n
\n    e. It actually takes the Coast Starlight about 168 minutes to travel from Santa Barbara to Los Angeles. Calculate the residual and explain the meaning of this residual value.\n\n    f. Suppose Amtrak is considering adding a stop to the Coast Starlight 500 miles away from Los Angeles. Would it be appropriate to use this linear model to predict the travel time from Los Angeles to this point?\n    \n    \\clearpage\n\n1. **Body measurements, regression.** \nResearchers studying anthropometry collected body and skeletal diameter measurements, as well as age, weight, height and sex for 507 physically active individuals. \nThe scatterplot below shows the relationship between height and shoulder girth (circumference of shoulders measured over deltoid muscles), both measured in centimeters.\nThe mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. \nThe mean height is 171.14 cm with a standard deviation of 9.41 cm. \nThe correlation between height and shoulder girth is 0.67.^[The [`bdims`](http://openintrostat.github.io/openintro/reference/bdims.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Heinz:2003]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-52-1.png){width=90%}\n    :::\n    :::\n\n    a. Write the equation of the regression line for predicting height.\n\n    b. Interpret the slope and the intercept in this context.\n\n    c. Calculate $R^2$ of the regression line for predicting height from shoulder girth, and interpret it in the context of the application.\n\n    d. A randomly selected student from your class has a shoulder girth of 100 cm. Predict the height of this student using the model.\n\n    e. The student from part (d) is 160 cm tall. Calculate the residual, and explain what this residual means.\n\n    f. A one-year-old has a shoulder girth of 56 cm. Would it be appropriate to use this linear model to predict the height of this child?\n    \n    \\clearpage\n\n1. **Poverty and unemployment.** \nThe following scatterplot shows the relationship between the percent of the population below the poverty level (`poverty`) and the unemployment rate among those ages 20-64 (`unemployment_rate`) in counties in the US, as provided by data from the 2019 American Community Survey. \nThe regression output for the model for predicting `poverty` from `unemployment_rate` is also provided.^[The [`county_2019`](http://openintrostat.github.io/usdata/reference/county_2019.html) data used in this exercise can be found in the [**usdata**](http://openintrostat.github.io/usdata) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-53-1.png){width=90%}\n    :::\n    \n    ::: {.cell-output-display}\n    `````{=html}\n
    <table>\n    <thead>\n    <tr><th>term</th><th>estimate</th><th>std.error</th><th>statistic</th><th>p.value</th></tr>\n    </thead>\n    <tbody>\n    <tr><td>(Intercept)</td><td>4.60</td><td>0.349</td><td>13.2</td><td>&lt;0.0001</td></tr>\n    <tr><td>unemployment_rate</td><td>2.05</td><td>0.062</td><td>33.1</td><td>&lt;0.0001</td></tr>\n    </tbody>\n    </table>
\n    `````\n    :::\n    :::\n\n    a. Write out the linear model.\n\n    b. Interpret the intercept.\n\n    c. Interpret the slope.\n\n    d. For this model $R^2$ is 46%. Interpret this value.\n\n    e. Calculate the correlation coefficient.\n    \n    \\clearpage\n\n1. **Cats' weights.** \nThe following regression output is for predicting the heart weight (`Hwt`, in g) of cats from their body weight (`Bwt`, in kg). The coefficients are estimated using a dataset of 144 domestic cats.^[The [`cats`](https://cran.r-project.org/web/packages/MASS/MASS.pdf) data used in this exercise can be found in the [**MASS**](https://cran.r-project.org/web/packages/MASS/index.html) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-54-1.png){width=90%}\n    :::\n    \n    ::: {.cell-output-display}\n    `````{=html}\n
    <table>\n    <thead>\n    <tr><th>term</th><th>estimate</th><th>std.error</th><th>statistic</th><th>p.value</th></tr>\n    </thead>\n    <tbody>\n    <tr><td>(Intercept)</td><td>-0.357</td><td>0.692</td><td>-0.515</td><td>0.6072</td></tr>\n    <tr><td>Bwt</td><td>4.034</td><td>0.250</td><td>16.119</td><td>&lt;0.0001</td></tr>\n    </tbody>\n    </table>
\n    `````\n    :::\n    :::\n\n    a. Write out the linear model.\n\n    b. Interpret the intercept.\n\n    c. Interpret the slope.\n\n    d. The $R^2$ of this model is 65%. Interpret $R^2$.\n\n    e. Calculate the correlation coefficient.\n\n1. **Outliers, I.** \nIdentify the outliers in the scatterplots shown below, and determine what type of outliers they are. Explain your reasoning.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-55-1.png){width=100%}\n    :::\n    :::\n    \n    \\clearpage\n\n1. **Outliers, II.** \nIdentify the outliers in the scatterplots shown below and determine what type of outliers they are. Explain your reasoning.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-56-1.png){width=100%}\n    :::\n    :::\n\n1. **Urban homeowners, outliers.** \nThe scatterplot below shows the percent of families who own their home vs. the percent of the population living in urban areas. \nThere are 52 observations; each corresponds to one of the 50 US states, the District of Columbia, or Puerto Rico.^[The [`urban_owner`](http://openintrostat.github.io/openintro/reference/urban_owner.html) data used in this exercise can be found in the [**usdata**](http://openintrostat.github.io/usdata) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-57-1.png){width=90%}\n    :::\n    :::\n\n    a. Describe the relationship between the percent of families who own their home and the percent of the population living in urban areas.\n\n    b. The outlier at the bottom right corner is the District of Columbia, where 100% of the population is considered urban. What type of outlier is this observation?\n    \n    \\pagebreak\n\n1. **Crawling babies, outliers.**\nA study conducted at the University of Denver investigated whether babies take longer to learn to crawl in cold months, when they are often bundled in clothes that restrict their movement, than in warmer months. \nThe plot below shows the relationship between average crawling age of babies born in each month and the average temperature in the month when the babies are six months old.\nThe plot reveals a potential outlying month when the average temperature is about 53F and average crawling age is about 28.5 weeks. \nDoes this point have high leverage? Is it an influential point?^[The [`babies_crawl`](http://openintrostat.github.io/openintro/reference/babies_crawl.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Benson:1993]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-58-1.png){width=90%}\n    :::\n    :::\n\n1. **True / False.** \nDetermine if the following statements are true or false. \nIf false, explain why.\n\n    a. A correlation coefficient of -0.90 indicates a stronger linear relationship than a correlation of 0.5.\n\n    b. Correlation is a measure of the association between any two variables.\n\n1. **Cherry trees.** \nThe scatterplots below show the relationship between height, diameter, and volume of timber in 31 felled black cherry trees.
\nThe diameter of the tree is measured 4.5 feet above the ground.^[The [`trees`](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/trees.html) data used in this exercise can be found in the [**datasets**](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-59-1.png){width=90%}\n    :::\n    :::\n\n    a. Describe the relationship between volume and height of these trees.\n\n    b. Describe the relationship between volume and diameter of these trees.\n\n    c. Suppose you have height and diameter measurements for another black cherry tree. Which of these variables would be preferable to use to predict the volume of timber in this tree using a simple linear regression model? Explain your reasoning.\n\n1. **Match the correlation, III.**\nMatch each correlation to the corresponding scatterplot.^[The [`corr_match`](http://openintrostat.github.io/openintro/reference/corr_match.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.]\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-60-1.png){width=100%}\n    :::\n    :::\n    \n    a. $r = 0.69$\n\n    b. $r = 0.09$\n\n    c. $r = -0.91$\n\n    d. $r = 0.97$\n\n1. **Helmets and lunches.** \nThe scatterplot shows the relationship between socioeconomic status measured as the percentage of children in a neighborhood receiving reduced-fee lunches at school (`lunch`) and the percentage of bike riders in the neighborhood wearing helmets (`helmet`). \nThe average percentage of children receiving reduced-fee lunches is 30.833% with a standard deviation of 26.724% and the average percentage of bike riders wearing helmets is 30.883% with a standard deviation of 16.948%.\n\n    ::: {.cell}\n    ::: {.cell-output-display}\n    ![](07-model-slr_files/figure-html/unnamed-chunk-61-1.png){width=90%}\n    :::\n    :::\n\n    a. If the $R^2$ for the least-squares regression line for these data is 72%, what is the correlation between `lunch` and `helmet`?\n\n    b. Calculate the slope and intercept for the least-squares regression line for these data.\n\n    c. Interpret the intercept of the least-squares regression line in the context of the application.\n\n    d. Interpret the slope of the least-squares regression line in the context of the application.\n\n    e. What would the value of the residual be for a neighborhood where 40% of the children receive reduced-fee lunches and 40% of the bike riders wear helmets? Interpret the meaning of this residual in the context of the application.\n\n\n:::\n", "supporting": [ "07-model-slr_files" ],