From 5d14566a19ae20ff1f292209e32af7208040cf0b Mon Sep 17 00:00:00 2001
From: JingkunZhao <155940781+SylviaZhaooo@users.noreply.github.com>
Date: Fri, 27 Sep 2024 15:07:13 +1000
Subject: [PATCH] [simple_linear_regression] Translation (#60)
---
lectures/simple_linear_regression.md | 215 ++++++++++++++-------------
1 file changed, 110 insertions(+), 105 deletions(-)
diff --git a/lectures/simple_linear_regression.md b/lectures/simple_linear_regression.md
index 137d453..83a8ec3 100644
--- a/lectures/simple_linear_regression.md
+++ b/lectures/simple_linear_regression.md
@@ -11,25 +11,30 @@ kernelspec:
name: python3
---
-# Simple Linear Regression Model
+# 简单线性回归模型
```{code-cell} ipython3
import numpy as np
import pandas as pd
+import matplotlib as mpl
import matplotlib.pyplot as plt
+
+FONTPATH = "fonts/SourceHanSerifSC-SemiBold.otf"
+mpl.font_manager.fontManager.addfont(FONTPATH)
+plt.rcParams['font.family'] = ['Source Han Serif SC']
```
-The simple regression model estimates the relationship between two variables $x_i$ and $y_i$
+简单回归模型估计两个变量 $x_i$ 和 $y_i$ 之间的关系
$$
y_i = \alpha + \beta x_i + \epsilon_i, i = 1,2,...,N
$$
-where $\epsilon_i$ represents the error between the line of best fit and the sample values for $y_i$ given $x_i$.
+其中 $\epsilon_i$ 表示在给定 $x_i$ 的情况下,最佳拟合线与样本值 $y_i$ 之间的误差。
-Our goal is to choose values for $\alpha$ and $\beta$ to build a line of "best" fit for some data that is available for variables $x_i$ and $y_i$.
+我们的目标是选择 $\alpha$ 和 $\beta$ 的值,为变量 $x_i$ 和 $y_i$ 的现有数据构建一条“最佳”拟合线。
-Let us consider a simple dataset of 10 observations for variables $x_i$ and $y_i$:
+让我们考虑一个具有10个观察值的简单数据集,变量为 $x_i$ 和 $y_i$:
| | $y_i$ | $x_i$ |
|-|---|---|
@@ -44,7 +49,7 @@ Let us consider a simple dataset of 10 observations for variables $x_i$ and $y_i
|9| 1800 | 27 |
|10 | 250 | 2 |
-Let us think about $y_i$ as sales for an ice-cream cart, while $x_i$ is a variable that records the day's temperature in Celsius.
+让我们把 $y_i$ 视为一个冰淇淋车的销售额,而 $x_i$ 是记录当天摄氏度温度的变量。
```{code-cell} ipython3
x = [32, 21, 24, 35, 10, 11, 22, 21, 27, 2]
@@ -54,33 +59,33 @@ df.columns = ['X', 'Y']
df
```
-We can use a scatter plot of the data to see the relationship between $y_i$ (ice-cream sales in dollars (\$\'s)) and $x_i$ (degrees Celsius).
+我们可以通过数据的散点图来观察 $y_i$(冰淇淋销售额,单位为美元)和 $x_i$(摄氏度)之间的关系。
```{code-cell} ipython3
---
mystnb:
figure:
- caption: "Scatter plot"
+ caption: "散点图"
name: sales-v-temp1
---
ax = df.plot(
x='X',
y='Y',
kind='scatter',
- ylabel='Ice-cream sales ($\'s)',
- xlabel='Degrees celcius'
+ ylabel='冰淇淋销售额($)',
+ xlabel='摄氏度'
)
```
-as you can see the data suggests that more ice-cream is typically sold on hotter days.
+如您所见,数据表明在更热的日子里通常会卖出更多的冰淇淋。
-To build a linear model of the data we need to choose values for $\alpha$ and $\beta$ that represents a line of "best" fit such that
+为了建立数据的线性模型,我们需要选择代表“最佳”拟合线的 $\alpha$ 和 $\beta$ 值,使得
$$
\hat{y_i} = \hat{\alpha} + \hat{\beta} x_i
$$
-Let's start with $\alpha = 5$ and $\beta = 10$
+让我们从 $\alpha = 5$ 和 $\beta = 10$ 开始
```{code-cell} ipython3
α = 5
@@ -92,7 +97,7 @@ df['Y_hat'] = α + β * df['X']
---
mystnb:
figure:
- caption: "Scatter plot with a line of fit"
+ caption: "带有拟合线的散点图"
name: sales-v-temp2
---
fig, ax = plt.subplots()
@@ -101,9 +106,9 @@ ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax)
plt.show()
```
-We can see that this model does a poor job of estimating the relationship.
+我们可以看到这个模型在估计关系上做得很差。
-We can continue to guess and iterate towards a line of "best" fit by adjusting the parameters
+我们可以继续通过调整参数来试图迭代并逼近“最佳”拟合线。
```{code-cell} ipython3
β = 100
@@ -114,7 +119,7 @@ df['Y_hat'] = α + β * df['X']
---
mystnb:
figure:
- caption: "Scatter plot with a line of fit #2"
+ caption: "带拟合线的散点图 #2"
name: sales-v-temp3
---
fig, ax = plt.subplots()
@@ -132,7 +137,7 @@ df['Y_hat'] = α + β * df['X']
---
mystnb:
figure:
- caption: "Scatter plot with a line of fit #3"
+ caption: "带拟合线的散点图 #3"
name: sales-v-temp4
---
fig, ax = plt.subplots()
@@ -141,9 +146,9 @@ ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
plt.show()
```
-However we need to think about formalizing this guessing process by thinking of this problem as an optimization problem.
+但是我们需要将这个猜测过程形式化,把这个问题看作一个优化问题。
-Let's consider the error $\epsilon_i$ and define the difference between the observed values $y_i$ and the estimated values $\hat{y}_i$ which we will call the residuals
+让我们考虑误差 $\epsilon_i$ 并定义观测值 $y_i$ 与估计值 $\hat{y}_i$ 之间的差异,我们将其称为残差
$$
\begin{aligned}
@@ -164,7 +169,7 @@ df
---
mystnb:
figure:
- caption: "Plot of the residuals"
+ caption: "残差图"
name: plt-residuals
---
fig, ax = plt.subplots()
@@ -174,32 +179,32 @@ plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r')
plt.show()
```
-The Ordinary Least Squares (OLS) method chooses $\alpha$ and $\beta$ in such a way that **minimizes** the sum of the squared residuals (SSR).
+普通最小二乘方法 (OLS) 选择 $\alpha$ 和 $\beta$,以使残差平方和 (SSR) **最小化**。
$$
\min_{\alpha,\beta} \sum_{i=1}^{N}{\hat{e}_i^2} = \min_{\alpha,\beta} \sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)^2}
$$
-Let's call this a cost function
+我们称之为成本函数
$$
C = \sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)^2}
$$
-that we would like to minimize with parameters $\alpha$ and $\beta$.
+我们希望通过参数 $\alpha$ 和 $\beta$ 来最小化这个成本函数。
-## How does error change with respect to $\alpha$ and $\beta$
+## 误差如何随 $\alpha$ 和 $\beta$ 变化
-Let us first look at how the total error changes with respect to $\beta$ (holding the intercept $\alpha$ constant)
+首先让我们看看总误差相对于 $\beta$ 的变化(保持截距 $\alpha$ 不变)
-We know from [the next section](slr:optimal-values) the optimal values for $\alpha$ and $\beta$ are:
+我们从[下一节](slr:optimal-values)知道 $\alpha$ 和 $\beta$ 的最优值是:
```{code-cell} ipython3
β_optimal = 64.38
α_optimal = -14.72
```
-We can then calculate the error for a range of $\beta$ values
+然后我们可以计算一系列 $\beta$ 值对应的误差
```{code-cell} ipython3
errors = {}
@@ -207,20 +212,20 @@ for β in np.arange(20,100,0.5):
errors[β] = abs((α_optimal + β * df['X']) - df['Y']).sum()
```
-Plotting the error
+绘制误差图
```{code-cell} ipython3
---
mystnb:
figure:
- caption: "Plotting the error"
+ caption: "绘制误差图"
name: plt-errors
---
ax = pd.Series(errors).plot(xlabel='β', ylabel='error')
plt.axvline(β_optimal, color='r');
```
-Now let us vary $\alpha$ (holding $\beta$ constant)
+现在我们改变 $\alpha$ (保持 $\beta$ 不变)
```{code-cell} ipython3
errors = {}
@@ -228,13 +233,13 @@ for α in np.arange(-500,500,5):
errors[α] = abs((α + β_optimal * df['X']) - df['Y']).sum()
```
-Plotting the error
+绘制误差图
```{code-cell} ipython3
---
mystnb:
figure:
- caption: "Plotting the error (2)"
+ caption: "绘制误差图 (2)"
name: plt-errors-2
---
ax = pd.Series(errors).plot(xlabel='α', ylabel='error')
@@ -242,136 +247,136 @@ plt.axvline(α_optimal, color='r');
```
(slr:optimal-values)=
-## Calculating optimal values
+## 计算最优值
-Now let us use calculus to solve the optimization problem and compute the optimal values for $\alpha$ and $\beta$ to find the ordinary least squares solution.
+现在让我们使用微积分来解决优化问题,并计算出 $\alpha$ 和 $\beta$ 的最优值,以找到普通最小二乘解。
-First taking the partial derivative with respect to $\alpha$
+首先对 $\alpha$ 取偏导
$$
\frac{\partial C}{\partial \alpha}[\sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)^2}]
$$
-and setting it equal to $0$
+并将其设为 $0$
$$
0 = \sum_{i=1}^{N}{-2(y_i - \alpha - \beta x_i)}
$$
-we can remove the constant $-2$ from the summation by dividing both sides by $-2$
+我们可以通过两边除以 $-2$ 来移除求和中的常数 $-2$
$$
0 = \sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)}
$$
-Now we can split this equation up into the components
+现在我们可以将这个方程分解为各个组成部分
$$
0 = \sum_{i=1}^{N}{y_i} - \sum_{i=1}^{N}{\alpha} - \beta \sum_{i=1}^{N}{x_i}
$$
-The middle term is a straight forward sum from $i=1,...N$ by a constant $\alpha$
+中间项是从 $i=1,...N$ 对常数 $\alpha$ 进行简单求和
$$
0 = \sum_{i=1}^{N}{y_i} - N*\alpha - \beta \sum_{i=1}^{N}{x_i}
$$
-and rearranging terms
+并重新排列各项
$$
\alpha = \frac{\sum_{i=1}^{N}{y_i} - \beta \sum_{i=1}^{N}{x_i}}{N}
$$
-We observe that both fractions resolve to the means $\bar{y_i}$ and $\bar{x_i}$
+我们观察到两个分数分别归结为均值 $\bar{y_i}$ 和 $\bar{x_i}$
$$
\alpha = \bar{y_i} - \beta\bar{x_i}
$$ (eq:optimal-alpha)
-Now let's take the partial derivative of the cost function $C$ with respect to $\beta$
+现在让我们对成本函数 $C$ 关于 $\beta$ 取偏导
$$
\frac{\partial C}{\partial \beta}[\sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)^2}]
$$
-and setting it equal to $0$
+并将其设为 $0$
$$
0 = \sum_{i=1}^{N}{-2 x_i (y_i - \alpha - \beta x_i)}
$$
-we can again take the constant outside of the summation and divide both sides by $-2$
+我们可以再次将常数从求和中取出,并将两边除以 $-2$
$$
0 = \sum_{i=1}^{N}{x_i (y_i - \alpha - \beta x_i)}
$$
-which becomes
+这变成了
$$
0 = \sum_{i=1}^{N}{(x_i y_i - \alpha x_i - \beta x_i^2)}
$$
-now substituting for $\alpha$
+现在代入 $\alpha$
$$
0 = \sum_{i=1}^{N}{(x_i y_i - (\bar{y_i} - \beta \bar{x_i}) x_i - \beta x_i^2)}
$$
-and rearranging terms
+并重新排列各项
$$
0 = \sum_{i=1}^{N}{(x_i y_i - \bar{y_i} x_i - \beta \bar{x_i} x_i - \beta x_i^2)}
$$
-This can be split into two summations
+这可以被分成两个求和
$$
0 = \sum_{i=1}^{N}(x_i y_i - \bar{y_i} x_i) + \beta \sum_{i=1}^{N}(\bar{x_i} x_i - x_i^2)
$$
-and solving for $\beta$ yields
+求解 $\beta$ 得到
$$
\beta = \frac{\sum_{i=1}^{N}(x_i y_i - \bar{y_i} x_i)}{\sum_{i=1}^{N}(x_i^2 - \bar{x_i} x_i)}
$$ (eq:optimal-beta)
-We can now use {eq}`eq:optimal-alpha` and {eq}`eq:optimal-beta` to calculate the optimal values for $\alpha$ and $\beta$
+我们现在可以使用{eq}`eq:optimal-alpha` 和 {eq}`eq:optimal-beta` 来计算$\alpha$和$\beta$的最优值
-Calculating $\beta$
+计算$\beta$
```{code-cell} ipython3
-df = df[['X','Y']].copy() # Original Data
+df = df[['X','Y']].copy() # 原始数据
-# Calculate the sample means
+# 计算样本均值
x_bar = df['X'].mean()
y_bar = df['Y'].mean()
```
-Now computing across the 10 observations and then summing the numerator and denominator
+现在对 10 个观测值逐一计算,然后分别对分子和分母求和
```{code-cell} ipython3
-# Compute the Sums
+# 计算求和
df['num'] = df['X'] * df['Y'] - y_bar * df['X']
df['den'] = pow(df['X'],2) - x_bar * df['X']
β = df['num'].sum() / df['den'].sum()
print(β)
```
-Calculating $\alpha$
+计算$\alpha$
```{code-cell} ipython3
α = y_bar - β * x_bar
print(α)
```
-Now we can plot the OLS solution
+现在我们可以绘制 OLS 解
```{code-cell} ipython3
---
mystnb:
figure:
- caption: "OLS line of best fit"
+ caption: "OLS最佳拟合线"
name: plt-ols
---
df['Y_hat'] = α + β * df['X']
@@ -386,31 +391,31 @@ plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r');
:::{exercise}
:label: slr-ex1
-Now that you know the equations that solve the simple linear regression model using OLS you can now run your own regressions to build a model between $y$ and $x$.
+现在您已经知道了用 OLS 求解简单线性回归模型的方程,可以开始运行自己的回归,构建 $y$ 和 $x$ 之间的模型了。
-Let's consider two economic variables GDP per capita and Life Expectancy.
+让我们考虑两个经济变量,人均GDP和预期寿命。
-1. What do you think their relationship would be?
-2. Gather some data [from our world in data](https://ourworldindata.org)
-3. Use `pandas` to import the `csv` formatted data and plot a few different countries of interest
-4. Use {eq}`eq:optimal-alpha` and {eq}`eq:optimal-beta` to compute optimal values for $\alpha$ and $\beta$
-5. Plot the line of best fit found using OLS
-6. Interpret the coefficients and write a summary sentence of the relationship between GDP per capita and Life Expectancy
+1. 你认为它们之间的关系会是怎样的?
+2. 从 [Our World in Data](https://ourworldindata.org) 搜集一些数据
+3. 使用`pandas`导入`csv`格式的数据,并绘制几个感兴趣的国家的图表
+4. 使用{eq}`eq:optimal-alpha` 和 {eq}`eq:optimal-beta`计算$\alpha$和$\beta$的最优值
+5. 绘制使用 OLS 找到的最佳拟合线
+6. 解释系数并写出人均GDP和预期寿命之间关系的总结句子
:::
:::{solution-start} slr-ex1
:::
-**Q2:** Gather some data [from our world in data](https://ourworldindata.org)
+**Q2:** 从 [Our World in Data](https://ourworldindata.org) 搜集一些数据
:::{raw} html
:::
-You can download {download}`a copy of the data here ` if you get stuck
+如果你遇到困难,可以从这里下载{download}`数据副本 `
-**Q3:** Use `pandas` to import the `csv` formatted data and plot a few different countries of interest
+**Q3:** 使用`pandas`导入`csv`格式的数据,并绘制几个感兴趣的国家的图表
```{code-cell} ipython3
data_url = "https://github.com/QuantEcon/lecture-python-intro/raw/main/lectures/_static/lecture_specific/simple_linear_regression/life-expectancy-vs-gdp-per-capita.csv"
@@ -421,13 +426,13 @@ df = pd.read_csv(data_url, nrows=10)
df
```
-You can see that the data downloaded from Our World in Data has provided a global set of countries with the GDP per capita and Life Expectancy Data.
+您可以看到,从 Our World in Data 下载的数据提供了全球各国的人均GDP和预期寿命数据。
-It is often a good idea to at first import a few lines of data from a csv to understand its structure so that you can then choose the columns that you want to read into your DataFrame.
+通常,先从csv文件中导入几行数据来了解其结构是个好主意,这样您就可以选择要读入DataFrame的列。
-You can observe that there are a bunch of columns we won't need to import such as `Continent`
+您可以观察到有许多我们不需要导入的列,比如`Continent`
-So let's built a list of the columns we want to import
+那么我们来构建一个我们想要导入的列的列表
```{code-cell} ipython3
cols = ['Code', 'Year', 'Life expectancy at birth (historical)', 'GDP per capita']
@@ -435,14 +440,14 @@ df = pd.read_csv(data_url, usecols=cols)
df
```
-Sometimes it can be useful to rename your columns to make it easier to work with in the DataFrame
+有时候重命名列名可以使得在DataFrame中更容易操作
```{code-cell} ipython3
df.columns = ["cntry", "year", "life_expectancy", "gdppc"]
df
```
-We can see there are `NaN` values which represents missing data so let us go ahead and drop those
+我们可以看到存在`NaN`值,这表示缺失数据,所以让我们继续删除这些数据
```{code-cell} ipython3
df.dropna(inplace=True)
@@ -452,69 +457,69 @@ df.dropna(inplace=True)
df
```
-We have now dropped the number of rows in our DataFrame from 62156 to 12445 removing a lot of empty data relationships.
+我们现在已经将我们的DataFrame的行数从62156减少到12445,删除了很多空的数据关系。
-Now we have a dataset containing life expectancy and GDP per capita for a range of years.
+现在我们有一个包含一系列年份的预期寿命和人均GDP的数据集。
-It is always a good idea to spend a bit of time understanding what data you actually have.
+花点时间了解你实际拥有的数据总是一个好主意。
-For example, you may want to explore this data to see if there is consistent reporting for all countries across years
+例如,您可能想要探索这些数据,看看是否所有国家在各年之间的报告都是一致的。
-Let's first look at the Life Expectancy Data
+让我们首先看看预期寿命数据
```{code-cell} ipython3
le_years = df[['cntry', 'year', 'life_expectancy']].set_index(['cntry', 'year']).unstack()['life_expectancy']
le_years
```
-As you can see there are a lot of countries where data is not available for the Year 1543!
+如您所见,有很多国家在1543年的数据是不可用的!
-Which country does report this data?
+哪个国家报告了这些数据?
```{code-cell} ipython3
le_years[~le_years[1543].isna()]
```
-You can see that Great Britain (GBR) is the only one available
+您可以看到,只有大不列颠(GBR)是可用的
-You can also take a closer look at the time series to find that it is also non-continuous, even for GBR.
+您还可以更仔细地观察时间序列,发现即使对于GBR,它也是不连续的。
```{code-cell} ipython3
le_years.loc['GBR'].plot()
```
-In fact we can use pandas to quickly check how many countries are captured in each year
+实际上我们可以使用pandas快速检查每个年份涵盖了多少个国家
```{code-cell} ipython3
le_years.stack().unstack(level=0).count(axis=1).plot(xlabel="Year", ylabel="Number of countries");
```
-So it is clear that if you are doing cross-sectional comparisons then more recent data will include a wider set of countries
+所以很明显,如果你进行横截面比较,那么较新的数据将涵盖更多的国家
-Now let us consider the most recent year in the dataset 2018
+现在让我们考虑数据集中最近的一年:2018年
```{code-cell} ipython3
df = df[df.year == 2018].reset_index(drop=True).copy()
```
```{code-cell} ipython3
-df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="GDP per capita", ylabel="Life expectancy (years)",);
+df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="人均GDP", ylabel="预期寿命(年)",);
```
-This data shows a couple of interesting relationships.
+这些数据显示了一些有趣的关系。
-1. there are a number of countries with similar GDP per capita levels but a wide range in Life Expectancy
-2. there appears to be a positive relationship between GDP per capita and life expectancy. Countries with higher GDP per capita tend to have higher life expectancy outcomes
+1. 许多国家的人均GDP相近,但预期寿命差别很大
+2. 人均GDP与预期寿命之间似乎存在正向关系。人均GDP较高的国家往往拥有更高的预期寿命
-Even though OLS is solving linear equations -- one option we have is to transform the variables, such as through a log transform, and then use OLS to estimate the transformed variables.
+尽管普通最小二乘法(OLS)是用来解线性方程的,但我们可以通过对变量进行转换(例如对数变换),然后使用OLS来估计转换后的变量。
-By specifying `logx` you can plot the GDP per Capita data on a log scale
+通过指定 `logx` 你可以在对数尺度上绘制人均GDP数据
```{code-cell} ipython3
-df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="GDP per capita", ylabel="Life expectancy (years)", logx=True);
+df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="人均GDP", ylabel="预期寿命(年)", logx=True);
```
-As you can see from this transformation -- a linear model fits the shape of the data more closely.
+从这次转换可以看出,线性模型更贴近数据的形状。
```{code-cell} ipython3
df['log_gdppc'] = df['gdppc'].apply(np.log10)
@@ -524,12 +529,12 @@ df['log_gdppc'] = df['gdppc'].apply(np.log10)
df
```
-**Q4:** Use {eq}`eq:optimal-alpha` and {eq}`eq:optimal-beta` to compute optimal values for $\alpha$ and $\beta$
+**Q4:** 使用 {eq}`eq:optimal-alpha` 和 {eq}`eq:optimal-beta` 来计算 $\alpha$ 和 $\beta$ 的最优值
```{code-cell} ipython3
-data = df[['log_gdppc', 'life_expectancy']].copy() # Get Data from DataFrame
+data = df[['log_gdppc', 'life_expectancy']].copy() # 从DataFrame中提取数据
-# Calculate the sample means
+# 计算样本均值
x_bar = data['log_gdppc'].mean()
y_bar = data['life_expectancy'].mean()
```
@@ -539,7 +544,7 @@ data
```
```{code-cell} ipython3
-# Compute the Sums
+# 计算求和
data['num'] = data['log_gdppc'] * data['life_expectancy'] - y_bar * data['log_gdppc']
data['den'] = pow(data['log_gdppc'],2) - x_bar * data['log_gdppc']
β = data['num'].sum() / data['den'].sum()
@@ -551,7 +556,7 @@ print(β)
print(α)
```
-**Q5:** Plot the line of best fit found using OLS
+**Q5:** 绘制使用 OLS 找到的最佳拟合线
```{code-cell} ipython3
data['life_expectancy_hat'] = α + β * df['log_gdppc']
@@ -569,9 +574,9 @@ plt.vlines(data['log_gdppc'], data['life_expectancy_hat'], data['life_expectancy
:::{exercise}
:label: slr-ex2
-Minimizing the sum of squares is not the **only** way to generate the line of best fit.
+最小化平方和并不是生成最佳拟合线的**唯一**方法。
-For example, we could also consider minimizing the sum of the **absolute values**, that would give less weight to outliers.
+例如,我们还可以考虑最小化**绝对值之和**,这样赋予异常值的权重会更小。
-Solve for $\alpha$ and $\beta$ using the least absolute values
-:::
+使用最小绝对值法求解 $\alpha$ 和 $\beta$
+:::
\ No newline at end of file