[inequality] Update exercise 3 #498

longye-tian · 2024-07-02T00:18:09Z

Hi Matt @mmcky ,

I have updated exercise 3 of the inequality lecture using your code from #410 and added the simulation part below your solution.

What do you think about this version of the solution?

Best,
Longye

netlify · 2024-07-02T00:18:23Z

✅ Deploy Preview for taupe-gaufre-c4e660 ready!

Name	Link
🔨 Latest commit	`6e2d53e`
🔍 Latest deploy log	https://app.netlify.com/sites/taupe-gaufre-c4e660/deploys/668774623d4e3f00080ac468
😎 Deploy Preview	https://deploy-preview-498--taupe-gaufre-c4e660.netlify.app

mmcky · 2024-07-02T00:23:38Z

Thanks @longye-tian, this is looking really good.

I had a quick look and I think it would be great to add timing comparisons between the two functions to demonstrate how much quicker the vectorized code runs.
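
A minimal sketch of such a timing comparison, for readers following the thread. The function names `gini_loop` and `gini_vectorized`, the synthetic lognormal data, and the sample size are all assumptions made for illustration; they are not the lecture's actual implementations. Both functions assume the standard pairwise-difference formula for the Gini coefficient, and `time.perf_counter` stands in for a notebook timing magic.

```python
import time
import numpy as np

def gini_loop(y):
    """Gini coefficient via an explicit double loop (hypothetical helper)."""
    n = len(y)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += abs(y[i] - y[j])
    return total / (2 * n * np.sum(y))

def gini_vectorized(y):
    """Same formula via broadcasting; builds an n x n matrix (hypothetical helper)."""
    n = len(y)
    diffs = np.abs(y[:, None] - y[None, :])
    return np.sum(diffs) / (2 * n * np.sum(y))

# Synthetic data, used only so the sketch is self-contained
rng = np.random.default_rng(0)
y = rng.lognormal(size=500)

start = time.perf_counter()
g_loop = gini_loop(y)
t_loop = time.perf_counter() - start

start = time.perf_counter()
g_vec = gini_vectorized(y)
t_vec = time.perf_counter() - start

print(f"loop:       {g_loop:.6f} in {t_loop:.4f} s")
print(f"vectorized: {g_vec:.6f} in {t_vec:.4f} s")
```

The two results should agree up to floating-point rounding; the measured times depend on the machine, so none are quoted here.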

longye-tian · 2024-07-02T00:41:02Z

> Thanks @longye-tian, this is looking really good.
>
> I had a quick look and I think it would be great to add timing comparisons between the two functions to demonstrate how much quicker the vectorized code runs.

Hi Matt @mmcky,

I just updated the solution and the main text by adding `%%time` to show the computation time.

What do you think about this comparison?

Best,
Longye

mmcky · 2024-07-02T01:51:51Z

@longye-tian we just have an undefined label:

/home/runner/work/lecture-python-intro/lecture-python-intro/lectures/inequality.md:1101: WARNING: undefined label: code:gini-coefficient

longye-tian · 2024-07-02T01:58:02Z

> @longye-tian we just have an undefined label:
>
> /home/runner/work/lecture-python-intro/lecture-python-intro/lectures/inequality.md:1101: WARNING: undefined label: code:gini-coefficient

Thank you very much Matt!

I just added the label:

(code:gini-coefficient)=

I hope this will pass the checks 👍

Best,
Longye

github-actions · 2024-07-02T02:21:11Z

🚀 Deployed on https://668776713756a8db55160eea--taupe-gaufre-c4e660.netlify.app

mmcky · 2024-07-03T00:03:53Z

@longye-tian the kernel is dying when testing against the Google Colab environment.

Would you mind running the notebook version of this PR on Google Colab to see if you can replicate this issue?

You can use jupytext to convert between the two formats.

https://manual.quantecon.org/writing/converting.html#myst-markdown-md-to-jupyter-notebook-ipynb

longye-tian · 2024-07-03T00:55:28Z

Hi Matt @mmcky ,

Thank you for this information. Google Colab returns the following error:

For the code:

!pip install quantecon
import quantecon as qe

varlist = ['n_wealth',   # net wealth 
           't_income',   # total income
           'l_income']   # labor income

df = df_income_wealth

# create lists to store Gini for each inequality measure
results = {}

for var in varlist:
    # create lists to store Gini
    gini_yr = []
    for year in years:
        # repeat the observations according to their weights
        counts = list(round(df[df['year'] == year]['weights'] ))
        y = df[df['year'] == year][var].repeat(counts)
        y = np.asarray(y)
        
        rd.shuffle(y)    # shuffle the sequence
      
        # calculate and store Gini
        gini = qe.gini_coefficient(y)
        gini_yr.append(gini)
        
    results[var] = gini_yr

# Convert to DataFrame
results = pd.DataFrame(results, index=years)
results.to_csv("_static/lecture_specific/inequality/usa-gini-nwealth-tincome-lincome.csv", index_label='year')

It returns

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
[<ipython-input-20-a7cb0b0c20b0>](https://localhost:8080/#) in <cell line: 32>()
     30 # Convert to DataFrame
     31 results = pd.DataFrame(results, index=years)
---> 32 results.to_csv("_static/lecture_specific/inequality/usa-gini-nwealth-tincome-lincome.csv", index_label='year')

4 frames
[/usr/local/lib/python3.10/dist-packages/pandas/io/common.py](https://localhost:8080/#) in check_parent_directory(path)
    598     parent = Path(path).parent
    599     if not parent.is_dir():
--> 600         raise OSError(rf"Cannot save file into a non-existent directory: '{parent}'")
    601 
    602 

OSError: Cannot save file into a non-existent directory: '_static/lecture_specific/inequality'

What do you think about this issue?

Best,
Longye
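
For context, a minimal sketch of one common way to avoid this `OSError`: create the parent directory before calling `to_csv`. This is an illustration only and not necessarily the fix adopted in this PR. The file name `gini.csv`, the use of a temporary directory, and the placeholder numbers in `results` are assumptions made up for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder values only; these are NOT results from the lecture
results = pd.DataFrame({"n_wealth": [0.80, 0.82]}, index=[2013, 2016])

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output path that mimics the nested layout from the traceback
    out_path = Path(tmp) / "_static" / "lecture_specific" / "inequality" / "gini.csv"

    # Create any missing parent directories; a no-op if they already exist
    out_path.parent.mkdir(parents=True, exist_ok=True)

    results.to_csv(out_path, index_label="year")
    loaded = pd.read_csv(out_path, index_col="year")

print(loaded)
```

Without the `mkdir` call, `to_csv` raises the same "non-existent directory" error shown in the traceback, because pandas does not create parent directories itself.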

mmcky · 2024-07-03T04:04:52Z

Thanks @longye-tian. That makes sense.

Do we import that data later in the lecture, or in another lecture? If not, we could remove the to_csv method. Otherwise, my thought is:

- we should have the code that generates that static data as a notebook in _static/lecture_specific/inequality/data.ipynb
- remove the to_csv method in the lecture code.

longye-tian · 2024-07-03T06:25:36Z

> Thanks @longye-tian. That makes sense.
>
> Do we import that data later in the lecture, or in another lecture? If not, we could remove the to_csv method. Otherwise, my thought is:
>
> - we should have the code that generates that static data as a notebook in _static/lecture_specific/inequality/data.ipynb
> - remove the to_csv method in the lecture code.

Hi Matt @mmcky,

I will look into it! Thank you for the idea.

Best,
Longye

longye-tian · 2024-07-04T02:00:20Z

Hi Matt @mmcky ,

Just updated the data.ipynb and the code.

What do you think?

Best,
Longye

mmcky · 2024-07-04T02:29:56Z

thanks @longye-tian can you let me know where the data is imported? Is it later in the same lecture or in another lecture?

longye-tian · 2024-07-04T02:48:02Z

> thanks @longye-tian can you let me know where the data is imported? Is it later in the same lecture or in another lecture?

Hi Matt @mmcky,

In the same lecture, just after the code that saves the CSV, we have

ginis = pd.read_csv("_static/lecture_specific/inequality/usa-gini-nwealth-tincome-lincome.csv", index_col='year')

which also raises an error in Google Colab, so I changed that code too.

Another issue I found in this lecture: when I run the updated version in Google Colab, some code in the exercise section, such as

gini_coefficient(data.n_wealth.values)

takes 10 minutes to run and then crashes my Colab session after using all available RAM.

Best,
Longye

mmcky · 2024-07-04T02:52:09Z

Thanks @longye-tian, let me dive into this PR after lunch. I suspect the initial thought was to fetch the raw data from GitHub directly (and there is probably a no-execute tag on the data-generating cell). I'll review this.

mmcky · 2024-07-04T04:21:06Z

@longye-tian I have made some adjustments and simplifications now that the data.ipynb has been committed. The code block that contained the data-generating code had skip-execution as a tag, but Google Colab doesn't understand that context. So I have switched to a download role.

This PR won't execute as the data is not yet available on GitHub. But if you are happy with everything else I can merge and test.

mmcky · 2024-07-04T04:45:25Z

@longye-tian can you test out this latest version of the lecture on Google Colab? The Jupyter kernel is dying on our test, but I'm not sure what would be driving that now. Thanks @longye-tian

longye-tian · 2024-07-04T11:35:22Z

> @longye-tian can you test out this latest version of the lecture on Google Colab? The Jupyter kernel is dying on our test, but I'm not sure what would be driving that now. Thanks @longye-tian

Hi Matt @mmcky,

I just tested in Google Colab. I think the build fails there because the dataset used in the exercise is too large:

data.n_wealth.values

Running the code on this array in Google Colab exhausts the RAM and crashes the session.

For now I have changed the code to use only 3000 observations rather than the whole dataset (30000+ observations):

gini_coefficient(data.n_wealth.values[1:3000])

gini(data.n_wealth.values[1:3000])

This resolves the current build failure. What do you think about this change?

Best,
Longye.

jstac · 2024-07-04T21:39:32Z

lectures/inequality.md

@@ -616,51 +619,11 @@ We will use US data from the {ref}`Survey of Consumer Finances<data:survey-consu
 df_income_wealth.year.describe()
 ```

-This code can be used to compute this information over the full dataset.
+{download}`This notebook <_static/lecture_specific/inequality/data.ipynb>` can be used to compute this information over the full dataset.


How will this look in the printed version?

@jstac this used to render in the PDF as a link, but it looks like that has changed in Sphinx. I'll open a meta issue as we use these in a few cases.

I have changed this to a link to GitHub, where you can download the notebook (or just view it).

jstac · 2024-07-04T21:42:28Z

Many thanks @longye-tian for your hard work on this. @mmcky , I'll leave this one for you to merge when ready. There's a small comment above.

I recommend that we go for the simplest option at each stage, focusing on ease of maintenance rather than on the most ambitious solution.

mmcky · 2024-07-05T00:12:43Z

@longye-tian the source data file is only 31 MB, so I don't understand why Google Colab would not be able to process it.

A general rule of thumb for pandas is that you need RAM of about 3x the size of the data you are processing. I think we should take a look at the code to make sure we aren't creating lots of copies somewhere. It doesn't make sense to me that we would run out of RAM.

mmcky · 2024-07-05T00:26:33Z

@longye-tian I am currently running a

%prun gini_coefficient(data.n_wealth.values)

and will post the results here when they come in.

longye-tian · 2024-07-05T00:36:24Z

> @longye-tian the source data file is only 31 MB, so I don't understand why Google Colab would not be able to process it.
>
> A general rule of thumb for pandas is that you need RAM of about 3x the size of the data you are processing. I think we should take a look at the code to make sure we aren't creating lots of copies somewhere. It doesn't make sense to me that we would run out of RAM.

Hi Matt @mmcky,

The length of data.n_wealth.values is 31240.

For the gini function, we use vectorization, which creates a matrix of size 31240 x 31240. At 8 bytes per float64, it requires around 8 GB of memory.

I think that could be the reason.

Best,
Longye
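
The arithmetic behind that estimate can be checked with a few lines of Python. This is a back-of-the-envelope sketch: it counts a single dense n x n float64 matrix and ignores any temporary arrays the vectorized code may also allocate, so actual peak usage could be higher. The variable names are chosen for illustration.

```python
n_full = 31240          # length of data.n_wealth.values, as reported above
n_sample = 3000         # reduced sample size discussed in this thread
bytes_per_float64 = 8

# One dense n x n matrix of float64 values
full_bytes = n_full * n_full * bytes_per_float64
sample_bytes = n_sample * n_sample * bytes_per_float64

print(f"full data: {full_bytes / 1e9:.2f} GB ({full_bytes / 2**30:.2f} GiB)")
print(f"sample:    {sample_bytes / 1e6:.0f} MB")
```

The full matrix comes to roughly 7.8 GB, while a 3000-observation sample needs about 72 MB for the same matrix.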

mmcky · 2024-07-05T00:54:17Z

> we use vectorization, which creates a matrix of size 31240 x 31240. At 8 bytes per float64

Nice investigative work @longye-tian. Spot on -- that's a big matrix!

I had just profiled the non-vectorized code, which looked fine from a memory perspective.

Its memory usage is

peak memory: 266.78 MiB, increment: 1.45 MiB

This is a really nice example of the tradeoffs between compute and memory :-)

mmcky · 2024-07-05T00:58:14Z

@longye-tian I think your approach of taking a sample from the full data is a good idea for the exercise. My only question is if the data is ordered or not -- should we take a random sample of 3000 or the first 3000 obs?

Thanks for digging into this with me. It's great we understand the problem now.

longye-tian · 2024-07-05T01:49:38Z

> @longye-tian I think your approach of taking a sample from the full data is a good idea for the exercise. My only question is if the data is ordered or not -- should we take a random sample of 3000 or the first 3000 obs?
>
> Thanks for digging into this with me. It's great we understand the problem now.

Hi Matt @mmcky ,

Thank you for running the test. I think the dataset we got is not ordered. Here is a screenshot of the first 30ish observations:

What do you think about this?

Best,
Longye

mmcky · 2024-07-05T02:04:31Z

thanks @longye-tian I think we should do a random sample of 3000 (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) with a seed for consistency.
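
A minimal sketch of a seeded random sample using `DataFrame.sample`. The synthetic `data` frame and the seed value `1` are assumptions for illustration; only the column name `n_wealth`, the sample size of 3000, and the full length of 31240 come from this thread.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the survey data, same length as reported in the thread
rng = np.random.default_rng(1)
data = pd.DataFrame({"n_wealth": rng.lognormal(size=31240)})

# random_state fixes the seed so the same rows are drawn on every run
sample = data.sample(n=3000, random_state=1)
sample_again = data.sample(n=3000, random_state=1)

print(len(sample))
print(sample.index.equals(sample_again.index))
```

Because sampling is without replacement by default, each observation appears at most once, and fixing `random_state` keeps the exercise output reproducible across builds.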

longye-tian · 2024-07-05T04:14:58Z

> thanks @longye-tian I think we should do a random sample of 3000 (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) with a seed for consistency.

Hi Matt @mmcky,

Thank you for this suggestion! I just committed a change to incorporate this.

Best,
Longye

mmcky · 2024-07-05T04:16:53Z

lectures/inequality.md

@@ -1084,7 +1084,7 @@ df_income_wealth.head(n=4)
 We will focus on wealth variable `n_wealth` to compute a Gini coefficient for the year 1990.


@longye-tian can we either change 1990 in the text to 2016 or the other way around, depending on which is correct in context?

mmcky · 2024-07-05T04:17:23Z

thanks @longye-tian -- love your work. I think the sample approach is a good approach and tidy.

mmcky

thanks @longye-tian -- once CI is green I will merge this.

Appreciate your work on this.

[inequality] Update exercise 3

6daf793

Hi Matt @mmcky , I have updated the exercise 3 of the inequality lecture using your code in #410 and add the simulation part below your solution. What do you think about this version of the solution? Best, Longye

longye-tian requested a review from mmcky July 2, 2024 00:18

Update inequality.md

73aab7b

Hi Matt, I have updated the solution and in the main text by adding ` %%time`. What do you think about this comparison?

Update inequality.md

f3c282b

add labels to the main text gini coefficient code.

Update inequality.md

384a3d9

github-actions bot temporarily deployed to pull request July 2, 2024 02:21 Inactive

add data.ipynb and delete to csv

f22b53d

Hi Matt, I have added the data.ipynb to the folder and I think it contains sufficient code to save the data. I have also modified the contain to deal with the saving and call issues related to the csv. What do you think about these changes? Best, Longye

remove skip-execution code as it is not compatible with google collab

3653c62

github-actions bot temporarily deployed to pull request July 4, 2024 04:32 Inactive

test the problem

395657e

This commit is to test whether the problem is due to this code.

github-actions bot temporarily deployed to pull request July 4, 2024 11:45 Inactive

longye-tian added 2 commits July 4, 2024 22:07

Revert "test the problem"

15d1ae6

This reverts commit 395657e.

test google colab RAM

a28f57c

this commit is to test whether the crash is led by the

github-actions bot temporarily deployed to pull request July 4, 2024 12:21 Inactive

github-actions bot temporarily deployed to pull request July 4, 2024 12:23 Inactive

jstac reviewed Jul 4, 2024

jstac approved these changes Jul 4, 2024

mmcky added 2 commits July 5, 2024 10:02

change link to notebook on github

bbab6ca

Merge branch 'inequality_exercise' of https://github.com/QuantEcon/le…

0aec0eb

…cture-python-intro into inequality_exercise

github-actions bot temporarily deployed to pull request July 5, 2024 00:11 Inactive

update_inequality_exercise

971b327

Hi Matt, This commit select 3000 random sample from the original dataset. Best, Longye

github-actions bot temporarily deployed to pull request July 5, 2024 03:57 Inactive

github-actions bot temporarily deployed to pull request July 5, 2024 03:58 Inactive

mmcky reviewed Jul 5, 2024

update year in the text

6e2d53e

mmcky self-requested a review July 5, 2024 04:20

mmcky approved these changes Jul 5, 2024

github-actions bot temporarily deployed to pull request July 5, 2024 04:26 Inactive

github-actions bot temporarily deployed to pull request July 5, 2024 04:28 Inactive

mmcky merged commit 2b7dd96 into main Jul 5, 2024
7 checks passed

mmcky deleted the inequality_exercise branch July 5, 2024 04:41

longye-tian mentioned this pull request Jul 19, 2024

[inequality] Incorporate a new exercise on vectorizing the gini_coefficient function #410

Closed
