diff --git a/lectures/simple_linear_regression.md b/lectures/simple_linear_regression.md
index 88a11967..137d4539 100644
--- a/lectures/simple_linear_regression.md
+++ b/lectures/simple_linear_regression.md
@@ -57,12 +57,18 @@ df
We can use a scatter plot of the data to see the relationship between $y_i$ (ice-cream sales in dollars (\$\'s)) and $x_i$ (degrees Celsius).
```{code-cell} ipython3
+---
+mystnb:
+ figure:
+ caption: "Scatter plot"
+ name: sales-v-temp1
+---
ax = df.plot(
x='X',
y='Y',
kind='scatter',
- ylabel='Ice-Cream Sales ($\'s)',
- xlabel='Degrees Celcius'
+ ylabel='Ice-cream sales ($\'s)',
+    xlabel='Degrees Celsius'
)
```
@@ -83,9 +89,16 @@ df['Y_hat'] = α + β * df['X']
```
```{code-cell} ipython3
+---
+mystnb:
+ figure:
+ caption: "Scatter plot with a line of fit"
+ name: sales-v-temp2
+---
fig, ax = plt.subplots()
-df.plot(x='X',y='Y', kind='scatter', ax=ax)
-df.plot(x='X',y='Y_hat', kind='line', ax=ax)
+ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
+ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax)
+plt.show()
```
We can see that this model does a poor job of estimating the relationship.
@@ -98,9 +111,16 @@ df['Y_hat'] = α + β * df['X']
```
```{code-cell} ipython3
+---
+mystnb:
+ figure:
+ caption: "Scatter plot with a line of fit #2"
+ name: sales-v-temp3
+---
fig, ax = plt.subplots()
-df.plot(x='X',y='Y', kind='scatter', ax=ax)
-df.plot(x='X',y='Y_hat', kind='line', ax=ax)
+ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
+ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax)
+plt.show()
```
```{code-cell} ipython3
@@ -109,12 +129,19 @@ df['Y_hat'] = α + β * df['X']
```
```{code-cell} ipython3
+---
+mystnb:
+ figure:
+ caption: "Scatter plot with a line of fit #3"
+ name: sales-v-temp4
+---
fig, ax = plt.subplots()
-df.plot(x='X',y='Y', kind='scatter', ax=ax)
-df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
+ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
+ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
+plt.show()
```
-However we need to think about formalising this guessing process by thinking of this problem as an optimization problem.
+However, we need to formalize this guessing process by treating it as an optimization problem.
Let's consider the error $\epsilon_i$ and define the difference between the observed values $y_i$ and the estimated values $\hat{y}_i$, which we will call the residuals
@@ -134,13 +161,20 @@ df
```
```{code-cell} ipython3
+---
+mystnb:
+ figure:
+ caption: "Plot of the residuals"
+ name: plt-residuals
+---
fig, ax = plt.subplots()
-df.plot(x='X',y='Y', kind='scatter', ax=ax)
-df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
-plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r');
+ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
+ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
+plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r')
+plt.show()
```
-The Ordinary Least Squares (OLS) method, as the name suggests, chooses $\alpha$ and $\beta$ in such a way that **minimises** the Sum of the Squared Residuals (SSR).
+The Ordinary Least Squares (OLS) method chooses $\alpha$ and $\beta$ so as to **minimize** the sum of the squared residuals (SSR).
$$
\min_{\alpha,\beta} \sum_{i=1}^{N}{\hat{e}_i^2} = \min_{\alpha,\beta} \sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)^2}
@@ -152,7 +186,7 @@ $$
C = \sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)^2}
$$
-that we would like to minimise with parameters $\alpha$ and $\beta$.
+that we would like to minimize with parameters $\alpha$ and $\beta$.
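+
+Before deriving the optimal values, it helps to see this cost as a plain Python function. The sketch below is illustrative only: the helper name `ssr` is ours, and it assumes the `df` defined above is still in scope.
+
+```{code-cell} ipython3
+def ssr(α, β):
+    # Sum of squared residuals for candidate parameters (illustrative helper)
+    return ((df['Y'] - α - β * df['X'])**2).sum()
+
+ssr(α=5, β=10)  # evaluate the cost at one candidate pair
+```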
## How does the error change with respect to $\alpha$ and $\beta$?
@@ -173,9 +207,15 @@ for β in np.arange(20,100,0.5):
errors[β] = abs((α_optimal + β * df['X']) - df['Y']).sum()
```
-Ploting the error
+Plotting the error
```{code-cell} ipython3
+---
+mystnb:
+ figure:
+    caption: "Error as β varies"
+ name: plt-errors
+---
ax = pd.Series(errors).plot(xlabel='β', ylabel='error')
plt.axvline(β_optimal, color='r');
```
@@ -188,9 +228,15 @@ for α in np.arange(-500,500,5):
errors[α] = abs((α + β_optimal * df['X']) - df['Y']).sum()
```
-Ploting the error
+Plotting the error
```{code-cell} ipython3
+---
+mystnb:
+ figure:
+    caption: "Error as α varies"
+ name: plt-errors-2
+---
ax = pd.Series(errors).plot(xlabel='α', ylabel='error')
plt.axvline(α_optimal, color='r');
```
@@ -322,22 +368,21 @@ print(α)
Now we can plot the OLS solution
```{code-cell} ipython3
+---
+mystnb:
+ figure:
+ caption: "OLS line of best fit"
+ name: plt-ols
+---
df['Y_hat'] = α + β * df['X']
df['error'] = df['Y_hat'] - df['Y']
fig, ax = plt.subplots()
-df.plot(x='X',y='Y', kind='scatter', ax=ax)
-df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
+ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
+ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r');
```
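+
+With the estimates in hand, a natural use of the fitted line is prediction. A minimal sketch (the temperature below is an arbitrary value chosen for illustration, not taken from the data):
+
+```{code-cell} ipython3
+x_new = 30                # degrees Celsius (arbitrary illustrative value)
+y_pred = α + β * x_new    # point prediction from the fitted line
+y_pred
+```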
-:::{admonition} Why use OLS?
-TODO
-
-1. Discuss mathematical properties for why we have chosen OLS
-:::
-
-
:::{exercise}
:label: slr-ex1
@@ -347,7 +392,7 @@ Let's consider two economic variables GDP per capita and Life Expectancy.
1. What do you think their relationship would be?
2. Gather some data [from Our World in Data](https://ourworldindata.org)
-3. Use `pandas` to import the `csv` formated data and plot a few different countries of interest
+3. Use `pandas` to import the `csv` formatted data and plot a few different countries of interest
4. Use {eq}`eq:optimal-alpha` and {eq}`eq:optimal-beta` to compute optimal values for $\alpha$ and $\beta$
5. Plot the line of best fit found using OLS
6. Interpret the coefficients and write a summary sentence of the relationship between GDP per capita and Life Expectancy
@@ -363,13 +408,13 @@ Let's consider two economic variables GDP per capita and Life Expectancy.
:::
-You can download {download}`a copy of the data here <_static/lecture_specific/simple_linear_regression/life-expectancy-vs-gdp-per-capita.csv>` if you get stuck
+You can download {download}`a copy of the data here <_static/lecture_specific/simple_linear_regression/life-expectancy-vs-gdp-per-capita.csv>` if you get stuck
**Q3:** Use `pandas` to import the `csv` formatted data and plot a few different countries of interest
```{code-cell} ipython3
-fl = "_static/lecture_specific/simple_linear_regression/life-expectancy-vs-gdp-per-capita.csv" # TODO: Replace with GitHub link
-df = pd.read_csv(fl, nrows=10)
+data_url = "https://github.com/QuantEcon/lecture-python-intro/raw/main/lectures/_static/lecture_specific/simple_linear_regression/life-expectancy-vs-gdp-per-capita.csv"
+df = pd.read_csv(data_url, nrows=10)
```
```{code-cell} ipython3
@@ -386,7 +431,7 @@ So let's built a list of the columns we want to import
```{code-cell} ipython3
cols = ['Code', 'Year', 'Life expectancy at birth (historical)', 'GDP per capita']
-df = pd.read_csv(fl, usecols=cols)
+df = pd.read_csv(data_url, usecols=cols)
df
```
@@ -453,7 +498,7 @@ df = df[df.year == 2018].reset_index(drop=True).copy()
```
```{code-cell} ipython3
-df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="GDP per capita", ylabel="Life Expectancy (Years)",);
+df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="GDP per capita", ylabel="Life expectancy (years)");
```
This data shows a couple of interesting relationships.
@@ -461,16 +506,12 @@ This data shows a couple of interesting relationships.
1. there are a number of countries with similar GDP per capita levels but a wide range in life expectancy
2. there appears to be a positive relationship between GDP per capita and life expectancy. Countries with higher GDP per capita tend to have higher life expectancy outcomes
-Even though OLS is solving linear equations -- one option we have is to transform the variables, such as through a log transform, and then use OLS to estimate the transformed variables
-
-:::{tip}
-ln -> ln == elasticities
-:::
+Even though OLS fits a linear model, one option is to transform the variables first, for example through a log transform, and then use OLS to estimate the transformed relationship. (When both variables are in logs, the estimated slope can be interpreted as an elasticity.)
By specifying `logx` you can plot the GDP per capita data on a log scale
```{code-cell} ipython3
-df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="GDP per capita", ylabel="Life Expectancy (Years)", logx=True);
+df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="GDP per capita", ylabel="Life expectancy (years)", logx=True);
```
As you can see from this transformation, a linear model fits the shape of the data more closely.
@@ -528,9 +569,9 @@ plt.vlines(data['log_gdppc'], data['life_expectancy_hat'], data['life_expectancy
:::{exercise}
:label: slr-ex2
-Minimising the sum of squares is not the **only** way to generate the line of best fit.
+Minimizing the sum of squares is not the **only** way to generate the line of best fit.
-For example, we could also consider minimising the sum of the **absolute values**, that would give less weight to outliers.
+For example, we could also consider minimizing the sum of the **absolute values**, which would give less weight to outliers.
Solve for $\alpha$ and $\beta$ using the least absolute values
:::
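+
+One way to approach this exercise (a sketch only, reflecting our own choices of helper name, data, and starting point): treat it as a generic optimization problem and minimize the sum of absolute residuals numerically with `scipy.optimize.minimize`.
+
+```{code-cell} ipython3
+from scipy.optimize import minimize
+
+def sum_abs_residuals(params, x, y):
+    # Least absolute deviations objective (illustrative helper)
+    α, β = params
+    return np.abs(y - α - β * x).sum()
+
+# Illustration with the life-expectancy data from above; any (x, y) pair works
+x, y = data['log_gdppc'], data['life_expectancy']
+res = minimize(sum_abs_residuals, x0=(y.mean(), 0.0), args=(x, y),
+               method='Nelder-Mead')
+α_lad, β_lad = res.x
+α_lad, β_lad
+```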