From 5d14566a19ae20ff1f292209e32af7208040cf0b Mon Sep 17 00:00:00 2001 From: JingkunZhao <155940781+SylviaZhaooo@users.noreply.github.com> Date: Fri, 27 Sep 2024 15:07:13 +1000 Subject: [PATCH] [simple_linear_regression] Translation (#60) --- lectures/simple_linear_regression.md | 215 ++++++++++++++------------- 1 file changed, 110 insertions(+), 105 deletions(-) diff --git a/lectures/simple_linear_regression.md b/lectures/simple_linear_regression.md index 137d453..83a8ec3 100644 --- a/lectures/simple_linear_regression.md +++ b/lectures/simple_linear_regression.md @@ -11,25 +11,30 @@ kernelspec: name: python3 --- -# Simple Linear Regression Model +# 简单线性回归模型 ```{code-cell} ipython3 import numpy as np import pandas as pd +import matplotlib as mpl import matplotlib.pyplot as plt + +FONTPATH = "fonts/SourceHanSerifSC-SemiBold.otf" +mpl.font_manager.fontManager.addfont(FONTPATH) +plt.rcParams['font.family'] = ['Source Han Serif SC'] ``` -The simple regression model estimates the relationship between two variables $x_i$ and $y_i$ +简单回归模型估计两个变量 $x_i$ 和 $y_i$ 之间的关系 $$ y_i = \alpha + \beta x_i + \epsilon_i, i = 1,2,...,N $$ -where $\epsilon_i$ represents the error between the line of best fit and the sample values for $y_i$ given $x_i$. +其中 $\epsilon_i$ 表示在给定 $x_i$ 时,最佳拟合线与样本值 $y_i$ 之间的误差。 -Our goal is to choose values for $\alpha$ and $\beta$ to build a line of "best" fit for some data that is available for variables $x_i$ and $y_i$. +我们的目标是选择 $\alpha$ 和 $\beta$ 的值,为变量 $x_i$ 和 $y_i$ 的现有数据构建一条“最佳”拟合线。 -Let us consider a simple dataset of 10 observations for variables $x_i$ and $y_i$: +让我们考虑一个具有10个观察值的简单数据集,变量为 $x_i$ 和 $y_i$: | | $y_i$ | $x_i$ | |-|---|---| @@ -44,7 +49,7 @@ Let us consider a simple dataset of 10 observations for variables $x_i$ and $y_i |9| 1800 | 27 | |10 | 250 | 2 | -Let us think about $y_i$ as sales for an ice-cream cart, while $x_i$ is a variable that records the day's temperature in Celsius. 
+让我们把 $y_i$ 视为一个冰淇淋车的销售额,而 $x_i$ 是记录当天摄氏度温度的变量。 ```{code-cell} ipython3 x = [32, 21, 24, 35, 10, 11, 22, 21, 27, 2] @@ -54,33 +59,33 @@ df.columns = ['X', 'Y'] df ``` -We can use a scatter plot of the data to see the relationship between $y_i$ (ice-cream sales in dollars (\$\'s)) and $x_i$ (degrees Celsius). +我们可以通过数据的散点图来观察 $y_i$(以美元计的冰淇淋销售额)和 $x_i$(摄氏度)之间的关系。 ```{code-cell} ipython3 --- mystnb: figure: - caption: "Scatter plot" + caption: "散点图" name: sales-v-temp1 --- ax = df.plot( x='X', y='Y', kind='scatter', - ylabel='Ice-cream sales ($\'s)', - xlabel='Degrees celcius' + ylabel='冰淇淋销售额(\$)', + xlabel='摄氏度' ) ``` -as you can see the data suggests that more ice-cream is typically sold on hotter days. +如您所见,数据表明在更热的日子里通常会卖出更多的冰淇淋。 -To build a linear model of the data we need to choose values for $\alpha$ and $\beta$ that represents a line of "best" fit such that +为了建立数据的线性模型,我们需要选择代表“最佳”拟合线的 $\alpha$ 和 $\beta$ 值,使得 $$ \hat{y_i} = \hat{\alpha} + \hat{\beta} x_i $$ -Let's start with $\alpha = 5$ and $\beta = 10$ +让我们从 $\alpha = 5$ 和 $\beta = 10$ 开始 ```{code-cell} ipython3 α = 5 @@ -92,7 +97,7 @@ df['Y_hat'] = α + β * df['X'] --- mystnb: figure: - caption: "Scatter plot with a line of fit" + caption: "带有拟合线的散点图" name: sales-v-temp2 --- fig, ax = plt.subplots() @@ -101,9 +106,9 @@ ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax) plt.show() ``` -We can see that this model does a poor job of estimating the relationship. 
+我们可以看到这个模型在估计关系上做得很差。 -We can continue to guess and iterate towards a line of "best" fit by adjusting the parameters +我们可以继续通过调整参数来试图迭代并逼近“最佳”拟合线。 ```{code-cell} ipython3 β = 100 @@ -114,7 +119,7 @@ df['Y_hat'] = α + β * df['X'] --- mystnb: figure: - caption: "Scatter plot with a line of fit #2" + caption: "带拟合线的散点图 #2" name: sales-v-temp3 --- fig, ax = plt.subplots() @@ -132,7 +137,7 @@ df['Y_hat'] = α + β * df['X'] --- mystnb: figure: - caption: "Scatter plot with a line of fit #3" + caption: "带拟合线的散点图 #3" name: sales-v-temp4 --- fig, ax = plt.subplots() @@ -141,9 +146,9 @@ ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g') plt.show() ``` -However we need to think about formalizing this guessing process by thinking of this problem as an optimization problem. +但是我们需要考虑将这个猜测过程正式化,把这个问题看作是一个优化问题。 -Let's consider the error $\epsilon_i$ and define the difference between the observed values $y_i$ and the estimated values $\hat{y}_i$ which we will call the residuals +让我们考虑误差 $\epsilon_i$ 并定义观测值 $y_i$ 与估计值 $\hat{y}_i$ 之间的差异,我们将其称为残差 $$ \begin{aligned} @@ -164,7 +169,7 @@ df --- mystnb: figure: - caption: "Plot of the residuals" + caption: "残差图" name: plt-residuals --- fig, ax = plt.subplots() @@ -174,32 +179,32 @@ plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r') plt.show() ``` -The Ordinary Least Squares (OLS) method chooses $\alpha$ and $\beta$ in such a way that **minimizes** the sum of the squared residuals (SSR). +普通最小二乘方法 (OLS) 选择 $\alpha$ 和 $\beta$,以使残差平方和 (SSR) **最小化**。 $$ \min_{\alpha,\beta} \sum_{i=1}^{N}{\hat{e}_i^2} = \min_{\alpha,\beta} \sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)^2} $$ -Let's call this a cost function +我们称之为成本函数 $$ C = \sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)^2} $$ -that we would like to minimize with parameters $\alpha$ and $\beta$. 
+我们希望通过参数 $\alpha$ 和 $\beta$ 来最小化这个成本函数。 -## How does error change with respect to $\alpha$ and $\beta$ +## 误差相对于 $\alpha$ 和 $\beta$ 的变化 -Let us first look at how the total error changes with respect to $\beta$ (holding the intercept $\alpha$ constant) +首先让我们看看总误差相对于 $\beta$ 的变化(保持截距 $\alpha$ 不变) -We know from [the next section](slr:optimal-values) the optimal values for $\alpha$ and $\beta$ are: +我们从[下一节](slr:optimal-values)知道 $\alpha$ 和 $\beta$ 的最优值是: ```{code-cell} ipython3 β_optimal = 64.38 α_optimal = -14.72 ``` -We can then calculate the error for a range of $\beta$ values +我们可以计算一系列 $\beta$ 值对应的误差 ```{code-cell} ipython3 errors = {} @@ -207,20 +212,20 @@ for β in np.arange(20,100,0.5): errors[β] = abs((α_optimal + β * df['X']) - df['Y']).sum() ``` -Plotting the error +绘制误差图 ```{code-cell} ipython3 --- mystnb: figure: - caption: "Plotting the error" + caption: "绘制误差图" name: plt-errors --- ax = pd.Series(errors).plot(xlabel='β', ylabel='error') plt.axvline(β_optimal, color='r'); ``` -Now let us vary $\alpha$ (holding $\beta$ constant) +现在我们改变 $\alpha$ (保持 $\beta$ 不变) ```{code-cell} ipython3 errors = {} @@ -228,13 +233,13 @@ for α in np.arange(-500,500,5): errors[α] = abs((α + β_optimal * df['X']) - df['Y']).sum() ``` -Plotting the error +绘制误差图 ```{code-cell} ipython3 --- mystnb: figure: - caption: "Plotting the error (2)" + caption: "绘制误差图 (2)" name: plt-errors-2 --- ax = pd.Series(errors).plot(xlabel='α', ylabel='error') @@ -242,136 +247,136 @@ plt.axvline(α_optimal, color='r'); ``` (slr:optimal-values)= -## Calculating optimal values +## 计算最优值 -Now let us use calculus to solve the optimization problem and compute the optimal values for $\alpha$ and $\beta$ to find the ordinary least squares solution. 
+现在让我们使用微积分来解决优化问题,并计算出 $\alpha$ 和 $\beta$ 的最优值,以找到普通最小二乘解。 -First taking the partial derivative with respect to $\alpha$ +首先对 $\alpha$ 取偏导 $$ \frac{\partial C}{\partial \alpha}[\sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)^2}] $$ -and setting it equal to $0$ +并将其设为 $0$ $$ 0 = \sum_{i=1}^{N}{-2(y_i - \alpha - \beta x_i)} $$ -we can remove the constant $-2$ from the summation by dividing both sides by $-2$ +我们可以通过两边除以 $-2$ 来移除求和中的常数 $-2$ $$ 0 = \sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)} $$ -Now we can split this equation up into the components +现在我们可以将这个方程分解为各个组成部分 $$ 0 = \sum_{i=1}^{N}{y_i} - \sum_{i=1}^{N}{\alpha} - \beta \sum_{i=1}^{N}{x_i} $$ -The middle term is a straight forward sum from $i=1,...N$ by a constant $\alpha$ +中间项是从 $i=1,...N$ 对常数 $\alpha$ 进行简单求和 $$ 0 = \sum_{i=1}^{N}{y_i} - N*\alpha - \beta \sum_{i=1}^{N}{x_i} $$ -and rearranging terms +并重新排列各项 $$ \alpha = \frac{\sum_{i=1}^{N}{y_i} - \beta \sum_{i=1}^{N}{x_i}}{N} $$ -We observe that both fractions resolve to the means $\bar{y_i}$ and $\bar{x_i}$ +我们观察到两个分数分别归结为均值 $\bar{y_i}$ 和 $\bar{x_i}$ $$ \alpha = \bar{y_i} - \beta\bar{x_i} $$ (eq:optimal-alpha) -Now let's take the partial derivative of the cost function $C$ with respect to $\beta$ +现在让我们对成本函数 $C$ 关于 $\beta$ 取偏导 $$ \frac{\partial C}{\partial \beta}[\sum_{i=1}^{N}{(y_i - \alpha - \beta x_i)^2}] $$ -and setting it equal to $0$ +并将其设为 $0$ $$ 0 = \sum_{i=1}^{N}{-2 x_i (y_i - \alpha - \beta x_i)} $$ -we can again take the constant outside of the summation and divide both sides by $-2$ +我们可以再次将常数从求和中取出,并将两边除以 $-2$ $$ 0 = \sum_{i=1}^{N}{x_i (y_i - \alpha - \beta x_i)} $$ -which becomes +这变成了 $$ 0 = \sum_{i=1}^{N}{(x_i y_i - \alpha x_i - \beta x_i^2)} $$ -now substituting for $\alpha$ +现在代入 $\alpha$ $$ 0 = \sum_{i=1}^{N}{(x_i y_i - (\bar{y_i} - \beta \bar{x_i}) x_i - \beta x_i^2)} $$ -and rearranging terms +并重新排列各项 $$ 0 = \sum_{i=1}^{N}{(x_i y_i - \bar{y_i} x_i - \beta \bar{x_i} x_i - \beta x_i^2)} $$ -This can be split into two summations +这可以被分成两个求和 $$ 0 
= \sum_{i=1}^{N}(x_i y_i - \bar{y_i} x_i) + \beta \sum_{i=1}^{N}(\bar{x_i} x_i - x_i^2) $$ -and solving for $\beta$ yields +解$\beta$得到 $$ \beta = \frac{\sum_{i=1}^{N}(x_i y_i - \bar{y_i} x_i)}{\sum_{i=1}^{N}(x_i^2 - \bar{x_i} x_i)} $$ (eq:optimal-beta) -We can now use {eq}`eq:optimal-alpha` and {eq}`eq:optimal-beta` to calculate the optimal values for $\alpha$ and $\beta$ +我们现在可以使用{eq}`eq:optimal-alpha` 和 {eq}`eq:optimal-beta` 来计算$\alpha$和$\beta$的最优值 -Calculating $\beta$ +计算$\beta$ ```{code-cell} ipython3 -df = df[['X','Y']].copy() # Original Data +df = df[['X','Y']].copy() # 原始数据 -# Calculate the sample means +# 计算样本均值 x_bar = df['X'].mean() y_bar = df['Y'].mean() ``` -Now computing across the 10 observations and then summing the numerator and denominator +现在计算10个观察值,然后求和分子和分母 ```{code-cell} ipython3 -# Compute the Sums +# 计算求和 df['num'] = df['X'] * df['Y'] - y_bar * df['X'] df['den'] = pow(df['X'],2) - x_bar * df['X'] β = df['num'].sum() / df['den'].sum() print(β) ``` -Calculating $\alpha$ +计算$\alpha$ ```{code-cell} ipython3 α = y_bar - β * x_bar print(α) ``` -Now we can plot the OLS solution +现在我们可以绘制OLS解决方案 ```{code-cell} ipython3 --- mystnb: figure: - caption: "OLS line of best fit" + caption: "OLS最佳拟合线" name: plt-ols --- df['Y_hat'] = α + β * df['X'] @@ -386,31 +391,31 @@ plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r'); :::{exercise} :label: slr-ex1 -Now that you know the equations that solve the simple linear regression model using OLS you can now run your own regressions to build a model between $y$ and $x$. +现在您已经知道了使用OLS解决简单线性回归模型的方程,您可以开始运行自己的回归以构建$y$和$x$之间的模型了。 -Let's consider two economic variables GDP per capita and Life Expectancy. +让我们考虑两个经济变量,人均GDP和预期寿命。 -1. What do you think their relationship would be? -2. Gather some data [from our world in data](https://ourworldindata.org) -3. Use `pandas` to import the `csv` formatted data and plot a few different countries of interest -4. 
Use {eq}`eq:optimal-alpha` and {eq}`eq:optimal-beta` to compute optimal values for $\alpha$ and $\beta$ -5. Plot the line of best fit found using OLS -6. Interpret the coefficients and write a summary sentence of the relationship between GDP per capita and Life Expectancy +1. 你认为它们之间的关系会是怎样的? +2. 从 [Our World in Data](https://ourworldindata.org) 搜集一些数据 +3. 使用`pandas`导入`csv`格式的数据,并绘制几个感兴趣的国家的图表 +4. 使用{eq}`eq:optimal-alpha` 和 {eq}`eq:optimal-beta`计算$\alpha$和$\beta$的最优值 +5. 使用OLS绘制最佳拟合线 +6. 解释系数并写出人均GDP和预期寿命之间关系的总结句子 ::: :::{solution-start} slr-ex1 ::: -**Q2:** Gather some data [from our world in data](https://ourworldindata.org) +**Q2:** 从 [Our World in Data](https://ourworldindata.org) 搜集一些数据 :::{raw} html ::: -You can download {download}`a copy of the data here ` if you get stuck +如果你遇到困难,可以从这里下载{download}`数据副本 ` -**Q3:** Use `pandas` to import the `csv` formatted data and plot a few different countries of interest +**Q3:** 使用`pandas`导入`csv`格式的数据并绘制几个感兴趣的国家的图表 ```{code-cell} ipython3 data_url = "https://github.com/QuantEcon/lecture-python-intro/raw/main/lectures/_static/lecture_specific/simple_linear_regression/life-expectancy-vs-gdp-per-capita.csv" @@ -421,13 +426,13 @@ df = pd.read_csv(data_url, nrows=10) df ``` -You can see that the data downloaded from Our World in Data has provided a global set of countries with the GDP per capita and Life Expectancy Data. +您可以看到从 Our World in Data 下载的数据为全球各国提供了人均GDP和预期寿命数据。 -It is often a good idea to at first import a few lines of data from a csv to understand its structure so that you can then choose the columns that you want to read into your DataFrame. 
+首先从csv文件中导入几行数据以了解其结构,以便您可以选择要读取到DataFrame中的列,这通常是一个好主意。 -You can observe that there are a bunch of columns we won't need to import such as `Continent` +您可以观察到有许多我们不需要导入的列,比如`Continent` -So let's built a list of the columns we want to import +那么我们来构建一个我们想要导入的列的列表 ```{code-cell} ipython3 cols = ['Code', 'Year', 'Life expectancy at birth (historical)', 'GDP per capita'] @@ -435,14 +440,14 @@ df = pd.read_csv(data_url, usecols=cols) df ``` -Sometimes it can be useful to rename your columns to make it easier to work with in the DataFrame +有时候重命名列名可以使得在DataFrame中更容易操作 ```{code-cell} ipython3 df.columns = ["cntry", "year", "life_expectancy", "gdppc"] df ``` -We can see there are `NaN` values which represents missing data so let us go ahead and drop those +我们可以看到存在`NaN`值,这表示缺失数据,所以让我们继续删除这些数据 ```{code-cell} ipython3 df.dropna(inplace=True) @@ -452,69 +457,69 @@ df.dropna(inplace=True) df ``` -We have now dropped the number of rows in our DataFrame from 62156 to 12445 removing a lot of empty data relationships. +我们现在已经将我们的DataFrame的行数从62156减少到12445,删除了很多空的数据关系。 -Now we have a dataset containing life expectancy and GDP per capita for a range of years. +现在我们有一个包含一系列年份的人均寿命和人均GDP的数据集。 -It is always a good idea to spend a bit of time understanding what data you actually have. +花点时间了解你实际拥有的数据总是一个好主意。 -For example, you may want to explore this data to see if there is consistent reporting for all countries across years +例如,您可能想要探索这些数据,看看是否所有国家在各年之间的报告都是一致的。 -Let's first look at the Life Expectancy Data +让我们首先看看寿命数据 ```{code-cell} ipython3 le_years = df[['cntry', 'year', 'life_expectancy']].set_index(['cntry', 'year']).unstack()['life_expectancy'] le_years ``` -As you can see there are a lot of countries where data is not available for the Year 1543! +如您所见,有很多国家在1543年的数据是不可用的! -Which country does report this data? +哪个国家报告了这些数据? 
```{code-cell} ipython3 le_years[~le_years[1543].isna()] ``` -You can see that Great Britain (GBR) is the only one available +您可以看到,只有大不列颠(GBR)是可用的 -You can also take a closer look at the time series to find that it is also non-continuous, even for GBR. +您还可以更仔细地观察时间序列,发现即使对于GBR,它也是不连续的。 ```{code-cell} ipython3 le_years.loc['GBR'].plot() ``` -In fact we can use pandas to quickly check how many countries are captured in each year +实际上我们可以使用pandas快速检查每个年份涵盖了多少个国家 ```{code-cell} ipython3 le_years.stack().unstack(level=0).count(axis=1).plot(xlabel="Year", ylabel="Number of countries"); ``` -So it is clear that if you are doing cross-sectional comparisons then more recent data will include a wider set of countries +所以很明显,如果你进行横断面比较,那么最近的数据将包括更广泛的国家集合 -Now let us consider the most recent year in the dataset 2018 +现在让我们考虑数据集中最近的一年2018 ```{code-cell} ipython3 df = df[df.year == 2018].reset_index(drop=True).copy() ``` ```{code-cell} ipython3 -df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="GDP per capita", ylabel="Life expectancy (years)",); +df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="GDP per capita", ylabel="Life expectancy (years)",); ``` -This data shows a couple of interesting relationships. +这些数据显示了一些有趣的关系。 -1. there are a number of countries with similar GDP per capita levels but a wide range in Life Expectancy -2. there appears to be a positive relationship between GDP per capita and life expectancy. Countries with higher GDP per capita tend to have higher life expectancy outcomes +1. 许多国家的人均GDP相近,但寿命差别很大 +2. 人均GDP与预期寿命之间似乎存在正向关系。人均GDP较高的国家往往拥有更高的预期寿命 -Even though OLS is solving linear equations -- one option we have is to transform the variables, such as through a log transform, and then use OLS to estimate the transformed variables. 
+尽管普通最小二乘法(OLS)是用来解线性方程的,但我们可以通过对变量进行转换(例如对数变换),然后使用OLS来估计转换后的变量。 -By specifying `logx` you can plot the GDP per Capita data on a log scale +通过指定 `logx` 你可以在对数尺度上绘制人均GDP数据 ```{code-cell} ipython3 -df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="GDP per capita", ylabel="Life expectancy (years)", logx=True); +df.plot(x='gdppc', y='life_expectancy', kind='scatter', xlabel="人均GDP", ylabel="预期寿命(年)", logx=True); ``` -As you can see from this transformation -- a linear model fits the shape of the data more closely. +从这次转换可以看出,线性模型更贴近数据的形状。 ```{code-cell} ipython3 df['log_gdppc'] = df['gdppc'].apply(np.log10) @@ -524,12 +529,12 @@ df['log_gdppc'] = df['gdppc'].apply(np.log10) df ``` -**Q4:** Use {eq}`eq:optimal-alpha` and {eq}`eq:optimal-beta` to compute optimal values for $\alpha$ and $\beta$ +**Q4:** 使用 {eq}`eq:optimal-alpha` 和 {eq}`eq:optimal-beta` 来计算 $\alpha$ 和 $\beta$ 的最优值 ```{code-cell} ipython3 -data = df[['log_gdppc', 'life_expectancy']].copy() # Get Data from DataFrame +data = df[['log_gdppc', 'life_expectancy']].copy() # 从DataFrame中提取数据 -# Calculate the sample means +# 计算样本均值 x_bar = data['log_gdppc'].mean() y_bar = data['life_expectancy'].mean() ``` @@ -539,7 +544,7 @@ data ``` ```{code-cell} ipython3 -# Compute the Sums +# 计算求和 data['num'] = data['log_gdppc'] * data['life_expectancy'] - y_bar * data['log_gdppc'] data['den'] = pow(data['log_gdppc'],2) - x_bar * data['log_gdppc'] β = data['num'].sum() / data['den'].sum() @@ -551,7 +556,7 @@ print(β) print(α) ``` -**Q5:** Plot the line of best fit found using OLS +**Q5:** 绘制使用 OLS 找到的最佳拟合线 ```{code-cell} ipython3 data['life_expectancy_hat'] = α + β * df['log_gdppc'] @@ -569,9 +574,9 @@ plt.vlines(data['log_gdppc'], data['life_expectancy_hat'], data['life_expectancy :::{exercise} :label: slr-ex2 -Minimizing the sum of squares is not the **only** way to generate the line of best fit. 
+最小化平方和并不是生成最佳拟合线的**唯一**方法。 -For example, we could also consider minimizing the sum of the **absolute values**, that would give less weight to outliers. +例如,我们还可以考虑最小化**绝对值之和**,这样赋予异常值的权重会更小。 -Solve for $\alpha$ and $\beta$ using the least absolute values -::: +使用最小绝对值法求解 $\alpha$ 和 $\beta$ +::: \ No newline at end of file
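
Editor's note on exercise `slr-ex2` (not part of the patch): the claim that minimizing absolute values gives less weight to outliers can be checked with a small sketch. It is an illustration under stated assumptions, not the lecture's solution. The data set and the helper names `ols_fit` and `lad_fit` are hypothetical. The least-absolute-deviations fit uses a brute-force search that relies on a known property of the problem: some optimal line passes through at least two data points. This is fine for a handful of observations but scales quadratically.

```python
from itertools import combinations

import numpy as np


def ols_fit(x, y):
    """Closed-form OLS, using the same sample-mean formulas as the lecture."""
    x_bar, y_bar = x.mean(), y.mean()
    beta = (x * y - y_bar * x).sum() / (x**2 - x_bar * x).sum()
    alpha = y_bar - beta * x_bar
    return alpha, beta


def lad_fit(x, y):
    """Least absolute deviations by brute force (hypothetical helper).

    Tries every line through two data points with distinct x values and
    keeps the one with the smallest sum of absolute residuals.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line cannot be written as alpha + beta * x
        beta = (y[j] - y[i]) / (x[j] - x[i])
        alpha = y[i] - beta * x[i]
        cost = np.abs(y - alpha - beta * x).sum()
        if best is None or cost < best[0]:
            best = (cost, alpha, beta)
    return best[1], best[2]


# Hypothetical data: y = 1 + 2x exactly, except for one large outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 1.0 + 2.0 * x
y[-1] = 40.0  # outlier; the value on the line would be 13

alpha_ols, beta_ols = ols_fit(x, y)
alpha_lad, beta_lad = lad_fit(x, y)

print(f"OLS: alpha = {alpha_ols:.2f}, beta = {beta_ols:.2f}")
print(f"LAD: alpha = {alpha_lad:.2f}, beta = {beta_lad:.2f}")
```

On this toy data the OLS slope is pulled well above 2 by the single outlier, while the LAD line stays on the five points that lie exactly on $y = 1 + 2x$.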