FIX: Update python code to simplify and resolve FutureWarning (#540)
* Misc edits to prob lecture

* fix variable name and minor formatting update

* add explanation for infinite support

* ENH: update code to simplify and resolve warnings

* remove all asarray

* address missed merge conflict issues

* remove extra x=df['income']

* FIX: set pd option to see if FutureWarning is resolved for inf and na

* revert test by setting pd option

* upgrade anaconda==2024.06

---------

Co-authored-by: John Stachurski <john.stachurski@gmail.com>
mmcky and jstac authored Aug 1, 2024
1 parent 21a6894 commit a58c4af
Showing 2 changed files with 12 additions and 39 deletions.
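
The lecture diff below drops the `np.asarray` conversions and replaces `data[0]` with `x_amazon.iloc[0]`. The following is a minimal sketch of why the pandas Series can be used directly. It runs on hypothetical prices rather than the downloaded `Adj Close` column, and the names `idx`, `first` and `mean_from_series` are illustrative only. The commit message does not quote the warning text, so whether integer-key access is the `FutureWarning` named in the title is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly prices standing in for the downloaded 'Adj Close' column
idx = pd.date_range("2000-01-01", periods=4, freq="MS")
prices = pd.Series([100.0, 110.0, 99.0, 108.9], index=idx)

# Percent change per month, dropping the leading NaN (same expression as the lecture)
returns = prices.pct_change()[1:] * 100

# Explicit positional access; does not depend on how integer keys are
# interpreted for a date-indexed Series
first = returns.iloc[0]

# NumPy functions accept the Series directly, so np.asarray is not required
mean_from_series = np.mean(returns)

print(first, mean_from_series)
```

Using `.iloc` states the positional intent explicitly, which is likely why the diff prefers it over a bare integer key on a date-indexed Series.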
2 changes: 1 addition & 1 deletion environment.yml
@@ -4,7 +4,7 @@ channels:
- conda-forge
dependencies:
- python=3.11
- - anaconda=2024.02
+ - anaconda=2024.06
- pip
- pip:
- jupyter-book==0.15.1
49 changes: 11 additions & 38 deletions lectures/prob_dist.md
@@ -4,14 +4,13 @@ jupytext:
extension: .md
format_name: myst
format_version: 0.13
- jupytext_version: 1.14.5
+ jupytext_version: 1.16.1
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---


# Distributions and Probabilities

```{index} single: Distributions and Probabilities
@@ -23,6 +22,7 @@ In this lecture we give a quick introduction to data and probability distributio

```{code-cell} ipython3
:tags: [hide-output]
!pip install --upgrade yfinance
```

@@ -35,7 +35,6 @@ import scipy.stats
import seaborn as sns
```


## Common distributions

In this section we recall the definitions of some well-known distributions and explore how to manipulate them with SciPy.
@@ -99,7 +98,6 @@ n = 10
u = scipy.stats.randint(1, n+1)
```


Here's the mean and variance:

```{code-cell} ipython3
@@ -195,7 +193,6 @@ u.pmf(0)
u.pmf(1)
```


#### Binomial distribution

Another useful (and more interesting) distribution is the **binomial distribution** on $S=\{0, \ldots, n\}$, which has PMF:
@@ -232,7 +229,6 @@ Let's see if SciPy gives us the same results:
u.mean(), u.var()
```


Here's the PMF:

```{code-cell} ipython3
@@ -250,7 +246,6 @@ ax.set_ylabel('PMF')
plt.show()
```


Here's the CDF:

```{code-cell} ipython3
@@ -264,7 +259,6 @@ ax.set_ylabel('CDF')
plt.show()
```


```{exercise}
:label: prob_ex3
@@ -334,7 +328,6 @@ ax.set_ylabel('PMF')
plt.show()
```


#### Poisson distribution

The Poisson distribution on $S = \{0, 1, \ldots\}$ with parameter $\lambda > 0$ has PMF
@@ -372,7 +365,6 @@ ax.set_ylabel('PMF')
plt.show()
```


### Continuous distributions


@@ -449,7 +441,6 @@ plt.legend()
plt.show()
```


Here's a plot of the CDF:

```{code-cell} ipython3
@@ -466,7 +457,6 @@ plt.legend()
plt.show()
```


#### Lognormal distribution

The **lognormal distribution** is a distribution on $\left(0, \infty\right)$ with density
@@ -646,7 +636,6 @@ plt.legend()
plt.show()
```


#### Gamma distribution

The **gamma distribution** is a distribution on $\left(0, \infty\right)$ with density
@@ -730,7 +719,6 @@ df = pd.DataFrame(data, columns=['name', 'income'])
df
```


In this situation, we might refer to the set of their incomes as the "income distribution."

The terminology is confusing because this set is not a probability distribution
@@ -761,14 +749,10 @@ $$
For the income distribution given above, we can calculate these numbers via

```{code-cell} ipython3
- x = np.asarray(df['income']) # Pull out income as a NumPy array
- ```
-
- ```{code-cell} ipython3
+ x = df['income']
x.mean(), x.var()
```


```{exercise}
:label: prob_ex4
@@ -792,15 +776,13 @@ We will cover
We can histogram the income distribution we just constructed as follows

```{code-cell} ipython3
- x = df['income']
fig, ax = plt.subplots()
ax.hist(x, bins=5, density=True, histtype='bar')
ax.set_xlabel('income')
ax.set_ylabel('density')
plt.show()
```


Let's look at a distribution from real data.

In particular, we will look at the monthly return on Amazon shares between 2000/1/1 and 2024/1/1.
@@ -811,25 +793,21 @@ So we will have one observation for each month.

```{code-cell} ipython3
:tags: [hide-output]
- df = yf.download('AMZN', '2000-1-1', '2024-1-1', interval='1mo' )
+ df = yf.download('AMZN', '2000-1-1', '2024-1-1', interval='1mo')
prices = df['Adj Close']
- data = prices.pct_change()[1:] * 100
- data.head()
+ x_amazon = prices.pct_change()[1:] * 100
+ x_amazon.head()
```


The first observation is the monthly return (percent change) over January 2000, which was

```{code-cell} ipython3
- data[0]
+ x_amazon.iloc[0]
```

Let's turn the return observations into an array and histogram it.

- ```{code-cell} ipython3
- x_amazon = np.asarray(data)
- ```
-
```{code-cell} ipython3
fig, ax = plt.subplots()
ax.hist(x_amazon, bins=20)
@@ -838,7 +816,6 @@ ax.set_ylabel('density')
plt.show()
```


#### Kernel density estimates

Kernel density estimates (KDE) provide a simple way to estimate and visualize the density of a distribution.
@@ -893,10 +870,10 @@ For example, let's compare the monthly returns on Amazon shares with the monthly

```{code-cell} ipython3
:tags: [hide-output]
- df = yf.download('COST', '2000-1-1', '2024-1-1', interval='1mo' )
+ df = yf.download('COST', '2000-1-1', '2024-1-1', interval='1mo')
prices = df['Adj Close']
- data = prices.pct_change()[1:] * 100
- x_costco = np.asarray(data)
+ x_costco = prices.pct_change()[1:] * 100
```

```{code-cell} ipython3
@@ -907,7 +884,6 @@ ax.set_xlabel('KDE')
plt.show()
```


### Connection to probability distributions

Let's discuss the connection between observed distributions and probability distributions.
@@ -941,7 +917,6 @@ ax.set_ylabel('density')
plt.show()
```


The match between the histogram and the density is not bad but also not very good.

One reason is that the normal distribution is not really a good fit for this observed data --- we will discuss this point again when we talk about {ref}`heavy tailed distributions<heavy_tail>`.
@@ -967,8 +942,6 @@ ax.set_ylabel('density')
plt.show()
```


Note that if you keep increasing $N$, which is the number of observations, the fit will get better and better.

This convergence is a version of the "law of large numbers", which we will discuss {ref}`later<lln_mr>`.
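
The claim that the fit improves as $N$ grows can be checked numerically. The sketch below is not part of the lecture or of this commit. It compares the empirical CDF with the true CDF at the sample points, where the lecture compares a histogram with the density, because the CDF gap is easier to summarize in one number. The parameters `mu` and `sigma`, the seed, and the helper `max_cdf_gap` are all hypothetical.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)      # fixed seed, chosen arbitrarily
mu, sigma = 0.0, 1.0                # hypothetical parameters of the normal distribution
dist = scipy.stats.norm(mu, sigma)

def max_cdf_gap(n):
    """Largest gap between the empirical CDF of n draws and the true CDF,
    evaluated at the sorted sample points."""
    draws = np.sort(rng.normal(mu, sigma, size=n))
    ecdf = np.arange(1, n + 1) / n  # empirical CDF value at each sorted draw
    return np.max(np.abs(ecdf - dist.cdf(draws)))

# The gap should shrink as the number of observations grows
gaps = {n: max_cdf_gap(n) for n in (100, 10_000, 1_000_000)}
for n, gap in gaps.items():
    print(f"N = {n:>9,}: max CDF gap = {gap:.5f}")
```

The gap typically shrinks at roughly the rate $1/\sqrt{N}$, so each hundredfold increase in $N$ cuts it by about a factor of ten.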

1 comment on commit a58c4af

@github-actions