diff --git a/environment.yml b/environment.yml
index 52036b80..d834b6a3 100644
--- a/environment.yml
+++ b/environment.yml
@@ -4,7 +4,7 @@ channels:
   - conda-forge
 dependencies:
   - python=3.11
-  - anaconda=2024.02
+  - anaconda=2024.06
   - pip
   - pip:
     - jupyter-book==0.15.1
diff --git a/lectures/prob_dist.md b/lectures/prob_dist.md
index 5bfa6200..2b619724 100644
--- a/lectures/prob_dist.md
+++ b/lectures/prob_dist.md
@@ -4,14 +4,13 @@ jupytext:
     extension: .md
     format_name: myst
     format_version: 0.13
-    jupytext_version: 1.14.5
+    jupytext_version: 1.16.1
 kernelspec:
   display_name: Python 3 (ipykernel)
   language: python
   name: python3
 ---
 
-
 # Distributions and Probabilities
 
 ```{index} single: Distributions and Probabilities
@@ -23,6 +22,7 @@ In this lecture we give a quick introduction to data and probability distributio
 
 ```{code-cell} ipython3
 :tags: [hide-output]
+
 !pip install --upgrade yfinance
 ```
 
@@ -35,7 +35,6 @@ import scipy.stats
 import seaborn as sns
 ```
 
-
 ## Common distributions
 
 In this section we recall the definitions of some well-known distributions and explore how to manipulate them with SciPy.
@@ -99,7 +98,6 @@ n = 10
 u = scipy.stats.randint(1, n+1)
 ```
 
-
 Here's the mean and variance:
 
 ```{code-cell} ipython3
@@ -195,7 +193,6 @@ u.pmf(0)
 u.pmf(1)
 ```
 
-
 #### Binomial distribution
 
 Another useful (and more interesting) distribution is the **binomial distribution** on $S=\{0, \ldots, n\}$, which has PMF:
@@ -232,7 +229,6 @@ Let's see if SciPy gives us the same results:
 u.mean(), u.var()
 ```
 
-
 Here's the PMF:
 
 ```{code-cell} ipython3
@@ -250,7 +246,6 @@ ax.set_ylabel('PMF')
 plt.show()
 ```
 
-
 Here's the CDF:
 
 ```{code-cell} ipython3
@@ -264,7 +259,6 @@ ax.set_ylabel('CDF')
 plt.show()
 ```
 
-
 ```{exercise}
 :label: prob_ex3
 
@@ -334,7 +328,6 @@ ax.set_ylabel('PMF')
 plt.show()
 ```
 
-
 #### Poisson distribution
 
 The Poisson distribution on $S = \{0, 1, \ldots\}$ with parameter $\lambda > 0$ has PMF
@@ -372,7 +365,6 @@ ax.set_ylabel('PMF')
 plt.show()
 ```
 
-
 ### Continuous distributions
 
 
@@ -449,7 +441,6 @@ plt.legend()
 plt.show()
 ```
 
-
 Here's a plot of the CDF:
 
 ```{code-cell} ipython3
@@ -466,7 +457,6 @@ plt.legend()
 plt.show()
 ```
 
-
 #### Lognormal distribution
 
 The **lognormal distribution** is a distribution on $\left(0, \infty\right)$ with density
@@ -646,7 +636,6 @@ plt.legend()
 plt.show()
 ```
 
-
 #### Gamma distribution
 
 The **gamma distribution** is a distribution on $\left(0, \infty\right)$ with density
@@ -730,7 +719,6 @@ df = pd.DataFrame(data, columns=['name', 'income'])
 df
 ```
 
-
 In this situation, we might refer to the set of their incomes as the "income distribution."
 
 The terminology is confusing because this set is not a probability distribution
@@ -761,14 +749,10 @@ $$
 For the income distribution given above, we can calculate these numbers via
 
 ```{code-cell} ipython3
-x = np.asarray(df['income'])  # Pull out income as a NumPy array
-```
-
-```{code-cell} ipython3
+x = df['income']
 x.mean(), x.var()
 ```
 
-
 ```{exercise}
 :label: prob_ex4
 
@@ -792,7 +776,6 @@ We will cover
 We can histogram the income distribution we just constructed as follows
 
 ```{code-cell} ipython3
-x = df['income']
 fig, ax = plt.subplots()
 ax.hist(x, bins=5, density=True, histtype='bar')
 ax.set_xlabel('income')
@@ -800,7 +783,6 @@ ax.set_ylabel('density')
 plt.show()
 ```
 
-
 Let's look at a distribution from real data.
 
 In particular, we will look at the monthly return on Amazon shares between 2000/1/1 and 2024/1/1.
@@ -811,25 +793,21 @@ So we will have one observation for each month.
 
 ```{code-cell} ipython3
 :tags: [hide-output]
-df = yf.download('AMZN', '2000-1-1', '2024-1-1', interval='1mo' )
+
+df = yf.download('AMZN', '2000-1-1', '2024-1-1', interval='1mo')
 prices = df['Adj Close']
-data = prices.pct_change()[1:] * 100
-data.head()
+x_amazon = prices.pct_change()[1:] * 100
+x_amazon.head()
 ```
 
-
 The first observation is the monthly return (percent change) over January 2000, which was
 
 ```{code-cell} ipython3
-data[0]
+x_amazon.iloc[0]
 ```
 
 Let's turn the return observations into an array and histogram it.
 
-```{code-cell} ipython3
-x_amazon = np.asarray(data)
-```
-
 ```{code-cell} ipython3
 fig, ax = plt.subplots()
 ax.hist(x_amazon, bins=20)
@@ -838,7 +816,6 @@ ax.set_ylabel('density')
 plt.show()
 ```
 
-
 #### Kernel density estimates
 
 Kernel density estimates (KDE) provide a simple way to estimate and visualize the density of a distribution.
@@ -893,10 +870,10 @@ For example, let's compare the monthly returns on Amazon shares with the monthly
 
 ```{code-cell} ipython3
 :tags: [hide-output]
-df = yf.download('COST', '2000-1-1', '2024-1-1', interval='1mo' )
+
+df = yf.download('COST', '2000-1-1', '2024-1-1', interval='1mo')
 prices = df['Adj Close']
-data = prices.pct_change()[1:] * 100
-x_costco = np.asarray(data)
+x_costco = prices.pct_change()[1:] * 100
 ```
 
 ```{code-cell} ipython3
@@ -907,7 +884,6 @@ ax.set_xlabel('KDE')
 plt.show()
 ```
 
-
 ### Connection to probability distributions
 
 Let's discuss the connection between observed distributions and probability distributions.
@@ -941,7 +917,6 @@ ax.set_ylabel('density')
 plt.show()
 ```
 
-
 The match between the histogram and the density is not bad but also not very good.
 
 One reason is that the normal distribution is not really a good fit for this observed data --- we will discuss this point again when we talk about {ref}`heavy tailed distributions`.
@@ -967,8 +942,6 @@ ax.set_ylabel('density')
 plt.show()
 ```
 
-
 Note that if you keep increasing $N$, which is the number of observations, the fit will get better and better.
 
 This convergence is a version of the "law of large numbers", which we will discuss {ref}`later`.
-