Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[inequality] Update exercise 3 #498

Merged
merged 13 commits into from
Jul 5, 2024
Merged

[inequality] Update exercise 3 #498

merged 13 commits into from
Jul 5, 2024

Conversation

longye-tian
Copy link
Collaborator

Hi Matt @mmcky ,

I have updated the exercise 3 of the inequality lecture using your code in #410 and add the simulation part below your solution.

What do you think about this version of the solution?

Best,
Longye

Hi Matt @mmcky ,

I have updated the exercise 3 of the inequality lecture using your code in #410 and add the simulation part below your solution.

What do you think about this version of the solution?

Best,
Longye
@longye-tian longye-tian requested a review from mmcky July 2, 2024 00:18
Copy link

netlify bot commented Jul 2, 2024

Deploy Preview for taupe-gaufre-c4e660 ready!

Name Link
🔨 Latest commit 6e2d53e
🔍 Latest deploy log https://app.netlify.com/sites/taupe-gaufre-c4e660/deploys/668774623d4e3f00080ac468
😎 Deploy Preview https://deploy-preview-498--taupe-gaufre-c4e660.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@mmcky
Copy link
Contributor

mmcky commented Jul 2, 2024

thanks @longye-tian this is looking really good.

I had a Quick Look and I think it would be great to add timing comparisons between the two functions to demonstrate the how much quicker the vectorized code works.

Hi Matt,
I have updated the solution and in the main text by adding ` %%time`.

What do you think about this comparison?
@longye-tian
Copy link
Collaborator Author

thanks @longye-tian this is looking really good.

I had a Quick Look and I think it would be great to add timing comparisons between the two functions to demonstrate the how much quicker the vectorized code works.

Hi Matt @mmcky ,

I just updated the solution and main text by adding the %%time to show the computation time.

What do you think about this comparison?

Best,
Longye

@mmcky
Copy link
Contributor

mmcky commented Jul 2, 2024

@longye-tian just have an undefined label

  • /home/runner/work/lecture-python-intro/lecture-python-intro/lectures/inequality.md:1101: WARNING: undefined label: code:gini-coefficient

add labels to the main text gini coefficient code.
@longye-tian
Copy link
Collaborator Author

longye-tian commented Jul 2, 2024

@longye-tian just have an undefined label

  • /home/runner/work/lecture-python-intro/lecture-python-intro/lectures/inequality.md:1101: WARNING: undefined label: code:gini-coefficient

Thank you very much Matt!

I just added the label:

(code:gini-coefficient) = 

I hope this will pass the checks 👍

Best,
Longye

Copy link

github-actions bot commented Jul 2, 2024

@github-actions github-actions bot temporarily deployed to pull request July 2, 2024 02:21 Inactive
@mmcky
Copy link
Contributor

mmcky commented Jul 3, 2024

@longye-tian the kernel is dying when testing agains the google collab environment.

Would you mind running the notebook version of this PR on Google Collab to see if you can replicate this issue?

You can use jupytext to convert between the two formats.

https://manual.quantecon.org/writing/converting.html#myst-markdown-md-to-jupyter-notebook-ipynb

@longye-tian
Copy link
Collaborator Author

Hi Matt @mmcky ,

Thank you for this information. Google colab returns the following error:

For the code:

!pip install quantecon
import quantecon as qe

varlist = ['n_wealth',   # net wealth 
           't_income',   # total income
           'l_income']   # labor income

df = df_income_wealth

# create lists to store Gini for each inequality measure
results = {}

for var in varlist:
    # create lists to store Gini
    gini_yr = []
    for year in years:
        # repeat the observations according to their weights
        counts = list(round(df[df['year'] == year]['weights'] ))
        y = df[df['year'] == year][var].repeat(counts)
        y = np.asarray(y)
        
        rd.shuffle(y)    # shuffle the sequence
      
        # calculate and store Gini
        gini = qe.gini_coefficient(y)
        gini_yr.append(gini)
        
    results[var] = gini_yr

# Convert to DataFrame
results = pd.DataFrame(results, index=years)
results.to_csv("_static/lecture_specific/inequality/usa-gini-nwealth-tincome-lincome.csv", index_label='year')

It returns

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
[<ipython-input-20-a7cb0b0c20b0>](https://localhost:8080/#) in <cell line: 32>()
     30 # Convert to DataFrame
     31 results = pd.DataFrame(results, index=years)
---> 32 results.to_csv("_static/lecture_specific/inequality/usa-gini-nwealth-tincome-lincome.csv", index_label='year')

4 frames
[/usr/local/lib/python3.10/dist-packages/pandas/io/common.py](https://localhost:8080/#) in check_parent_directory(path)
    598     parent = Path(path).parent
    599     if not parent.is_dir():
--> 600         raise OSError(rf"Cannot save file into a non-existent directory: '{parent}'")
    601 
    602 

OSError: Cannot save file into a non-existent directory: '_static/lecture_specific/inequality'

What do you think about this issue?

Best,
Longye

@mmcky
Copy link
Contributor

mmcky commented Jul 3, 2024

thanks @longye-tian. That makes sense.

Do we import that data later in the lecture, or in another lecture? If not, we could remove the to_csv method otherwise

My thought is:

  1. we should have the code that generates that static data as a notebook in _static/lecture_specific/inequality/data.ipynb
  2. remove the to_csv method in the lecture code.

@longye-tian
Copy link
Collaborator Author

thanks @longye-tian. That makes sense.

Do we import that data later in the lecture, or in another lecture? If not, we could remove the to_csv method otherwise

My thought is:

  1. we should have the code that generates that static data as a notebook in _static/lecture_specific/inequality/data.ipynb
  2. remove the to_csv method in the lecture code.

Hi Matt @mmcky ,

I will look into it ! Thank you for the idea.

Best,
Longye

Hi Matt,

I have added the data.ipynb to the folder and I think it contains sufficient code to save the data.

I have also modified the contain to deal with the saving and call issues related to the csv.

What do you think about these changes?

Best,
Longye
@longye-tian
Copy link
Collaborator Author

Hi Matt @mmcky ,

Just updated the data.ipynb and the code.

What do you think?

Best,
Longye

@mmcky
Copy link
Contributor

mmcky commented Jul 4, 2024

thanks @longye-tian can you let me know where the data is imported? Is it later in the same lecture or in another lecture?

@longye-tian
Copy link
Collaborator Author

thanks @longye-tian can you let me know where the data is imported? Is it later in the same lecture or in another lecture?

Hi Matt @mmcky ,

In the same lecture, just after the save csv code, we have,

ginis = pd.read_csv("_static/lecture_specific/inequality/usa-gini-nwealth-tincome-lincome.csv", index_col='year')

which also reports an error in the google colab. So I changed that code too.

Another issue I find in this lecture is when I use Google colab to run the updated version, some code such as

gini_coefficient(data.n_wealth.values)

in the exercise section, takes 10 minutes to run and later lead my colab page to crash after using all available RAM.

Best,
Longye

@mmcky
Copy link
Contributor

mmcky commented Jul 4, 2024

thanks @longye-tian let me dive into this PR after lunch. I suspect the initial thought was to fetch the raw data from github directly (and there is probabably a no-execute tag on the data generating cell. I'll review this.

@mmcky
Copy link
Contributor

mmcky commented Jul 4, 2024

@longye-tian I have made some adjustments and simplifications now that the data.ipynb has been committed. The code block that contained the data generating code had skip-execution as a tag but google collab doesn't understand that context. So I have switched to a download role.

This PR won't execute as the data is not yet available on github. But if you are happy with everything else I can merge and test.

@github-actions github-actions bot temporarily deployed to pull request July 4, 2024 04:32 Inactive
@mmcky
Copy link
Contributor

mmcky commented Jul 4, 2024

@longye-tian can you test out this latest version of of the lecture on google collab. The jupyter kernel is dying on our test but not sure what would be driving that now. Thanks @longye-tian

@longye-tian
Copy link
Collaborator Author

longye-tian commented Jul 4, 2024

@longye-tian can you test out this latest version of of the lecture on google collab. The jupyter kernel is dying on our test but not sure what would be driving that now. Thanks @longye-tian

Hi Matt @mmcky ,

I just tested in Google Colab; I think the failure to build with Google colab is because the dataset used in the exercise is to large.

data.n_wealth.values

When running the code in Google Colab, this leads the RAM to exhaust and leads to a crash.

I currently changed the code by not computing the whole dataset (30000+ obs) but only with 3000 observations by the following code:

gini_coefficient(data.n_wealth.values[1:3000])

gini(data.n_wealth.values[1:3000])

This resolves the current building failure problem. What do you think about this change?

Best,
Longye.

This commit is to test whether the problem is due to this code.
@github-actions github-actions bot temporarily deployed to pull request July 4, 2024 11:45 Inactive
This reverts commit 395657e.
this commit is to test whether the crash is led by the
@github-actions github-actions bot temporarily deployed to pull request July 4, 2024 12:21 Inactive
@github-actions github-actions bot temporarily deployed to pull request July 4, 2024 12:23 Inactive
@@ -616,51 +619,11 @@ We will use US data from the {ref}`Survey of Consumer Finances<data:survey-consu
df_income_wealth.year.describe()
```

This code can be used to compute this information over the full dataset.
{download}`This notebook <_static/lecture_specific/inequality/data.ipynb>` can be used to compute this information over the full dataset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will this look in the printed version?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jstac this used to render in the pdf as a link but it looks like that has changed in sphinx. I'll open a meta issue as we use these in a few cases.

I have changed this to a link to github which you can download the notebook (or just view it).

@jstac
Copy link
Contributor

jstac commented Jul 4, 2024

Many thanks @longye-tian for your hard work on this. @mmcky , I'll leave this one for you to merge when ready. There's a small comment above.

I recommend that we go for the simplest options at each stage, focusing on ease of maintenance, not the most ambitious.

@github-actions github-actions bot temporarily deployed to pull request July 5, 2024 00:11 Inactive
@github-actions github-actions bot temporarily deployed to pull request July 5, 2024 00:11 Inactive
@mmcky
Copy link
Contributor

mmcky commented Jul 5, 2024

@longye-tian the source data file is only 31mb so I don't understand why google colab would not be able to process it.

A general rule of thumb for pandas is you need 3 x RAM for the amount of data you a processing. I think we should take a look at the code to make sure we aren't creating lots of copies somewhere. It doesn't make sense to me that we would run out of RAM.

@mmcky
Copy link
Contributor

mmcky commented Jul 5, 2024

@longye-tian I am currently running a

%prun gini_coefficient(data.n_wealth.values)

and will post the results here when the come in.

@longye-tian
Copy link
Collaborator Author

@longye-tian the source data file is only 31mb so I don't understand why google colab would not be able to process it.

A general rule of thumb for pandas is you need 3 x RAM for the amount of data you a processing. I think we should take a look at the code to make sure we aren't creating lots of copies somewhere. It doesn't make sense to me that we would run out of RAM.

Hi Matt @mmcky ,

The data length for data.n_wealth.values is 31240.

And for the gini function, we use vectorization which create a matrix of size (31240*31240) with 8 bytes per float64, it requires around 8GB memory.

I think that could be a potential reason.

Best,
Longye

@mmcky
Copy link
Contributor

mmcky commented Jul 5, 2024

we use vectorization which create a matrix of size (31240*31240) with 8 bytes per float64

Nice investigative work @longye-tian. Spot on -- that'a a big matrix!

Screenshot 2024-07-05 at 10 53 52 AM

I had just profiled the non-vectorized code which looked fine from memory perspective.
Screenshot 2024-07-05 at 10 52 19 AM

and memory is

peak memory: 266.78 MiB, increment: 1.45 MiB

this is a really nice example of the tradeoffs between compute and memory :-)

@mmcky
Copy link
Contributor

mmcky commented Jul 5, 2024

@longye-tian I think your approach of taking a sample from the full data is a good idea for the exercise. My only question is if the data is ordered or not -- should we take a random sample of 3000 or the first 3000 obs?

Thanks for digging into this with me. It's great we understand the problem now.

@longye-tian
Copy link
Collaborator Author

@longye-tian I think your approach of taking a sample from the full data is a good idea for the exercise. My only question is if the data is ordered or not -- should we take a random sample of 3000 or the first 3000 obs?

Thanks for digging into this with me. It's great we understand the problem now.

Hi Matt @mmcky ,

Thank you for running the test. I think the dataset we got is not ordered. Here is a screenshot of the first 30ish observations:

Screenshot 2024-07-05 at 11 47 26 AM

What do you think about this?

Best,
Longye

@mmcky
Copy link
Contributor

mmcky commented Jul 5, 2024

thanks @longye-tian I think we should do a random sample of 3000 (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) with a seed for consistency.

Hi Matt,

This commit select 3000 random sample from the original dataset.

Best,
Longye
@github-actions github-actions bot temporarily deployed to pull request July 5, 2024 03:57 Inactive
@github-actions github-actions bot temporarily deployed to pull request July 5, 2024 03:58 Inactive
@longye-tian
Copy link
Collaborator Author

thanks @longye-tian I think we should do a random sample of 3000 (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) with a seed for consistency.

Hi Matt @mmcky ,

Thank you for this suggestion! Just commit to incorporate this.

Best,
Longye

@@ -1084,7 +1084,7 @@ df_income_wealth.head(n=4)
We will focus on wealth variable `n_wealth` to compute a Gini coefficient for the year 1990.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@longye-tian can we either change 1990 in the text to 2016 or the otherway around based on the correct context?

@mmcky
Copy link
Contributor

mmcky commented Jul 5, 2024

thanks @longye-tian -- love your work. I think the sample approach is a good approach and tidy.

update year in the text
@mmcky mmcky self-requested a review July 5, 2024 04:20
Copy link
Contributor

@mmcky mmcky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @longye-tian -- once CI is green I will merge this.

Appreciate your work on this.

@github-actions github-actions bot temporarily deployed to pull request July 5, 2024 04:26 Inactive
@github-actions github-actions bot temporarily deployed to pull request July 5, 2024 04:28 Inactive
@mmcky mmcky merged commit 2b7dd96 into main Jul 5, 2024
7 checks passed
@mmcky mmcky deleted the inequality_exercise branch July 5, 2024 04:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants