-
-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[inequality] Update exercise 3 #498
Conversation
✅ Deploy Preview for taupe-gaufre-c4e660 ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
thanks @longye-tian this is looking really good. I had a Quick Look and I think it would be great to add timing comparisons between the two functions to demonstrate the how much quicker the vectorized code works. |
Hi Matt, I have updated the solution and in the main text by adding ` %%time`. What do you think about this comparison?
Hi Matt @mmcky , I just updated the solution and main text by adding the What do you think about this comparison? Best, |
@longye-tian just have an undefined label
|
add labels to the main text gini coefficient code.
Thank you very much Matt! I just added the label:
I hope this will pass the checks 👍 Best, |
@longye-tian the kernel is dying when testing agains the Would you mind running the notebook version of this PR on Google Collab to see if you can replicate this issue? You can use https://manual.quantecon.org/writing/converting.html#myst-markdown-md-to-jupyter-notebook-ipynb |
Hi Matt @mmcky , Thank you for this information. Google colab returns the following error: For the code: !pip install quantecon
import quantecon as qe
varlist = ['n_wealth', # net wealth
't_income', # total income
'l_income'] # labor income
df = df_income_wealth
# create lists to store Gini for each inequality measure
results = {}
for var in varlist:
# create lists to store Gini
gini_yr = []
for year in years:
# repeat the observations according to their weights
counts = list(round(df[df['year'] == year]['weights'] ))
y = df[df['year'] == year][var].repeat(counts)
y = np.asarray(y)
rd.shuffle(y) # shuffle the sequence
# calculate and store Gini
gini = qe.gini_coefficient(y)
gini_yr.append(gini)
results[var] = gini_yr
# Convert to DataFrame
results = pd.DataFrame(results, index=years)
results.to_csv("_static/lecture_specific/inequality/usa-gini-nwealth-tincome-lincome.csv", index_label='year') It returns
What do you think about this issue? Best, |
thanks @longye-tian. That makes sense. Do we import that data later in the lecture, or in another lecture? If not, we could remove the My thought is:
|
Hi Matt @mmcky , I will look into it ! Thank you for the idea. Best, |
Hi Matt, I have added the data.ipynb to the folder and I think it contains sufficient code to save the data. I have also modified the contain to deal with the saving and call issues related to the csv. What do you think about these changes? Best, Longye
Hi Matt @mmcky , Just updated the data.ipynb and the code. What do you think? Best, |
thanks @longye-tian can you let me know where the data is imported? Is it later in the same lecture or in another lecture? |
Hi Matt @mmcky , In the same lecture, just after the save csv code, we have, ginis = pd.read_csv("_static/lecture_specific/inequality/usa-gini-nwealth-tincome-lincome.csv", index_col='year') which also reports an error in the google colab. So I changed that code too. Another issue I find in this lecture is when I use Google colab to run the updated version, some code such as gini_coefficient(data.n_wealth.values) in the exercise section, takes 10 minutes to run and later lead my colab page to crash after using all available RAM. Best, |
thanks @longye-tian let me dive into this PR after lunch. I suspect the initial thought was to fetch the raw data from github directly (and there is probabably a |
@longye-tian I have made some adjustments and simplifications now that the This PR won't execute as the data is not yet available on github. But if you are happy with everything else I can merge and test. |
@longye-tian can you test out this latest version of of the lecture on google collab. The jupyter kernel is dying on our test but not sure what would be driving that now. Thanks @longye-tian |
Hi Matt @mmcky , I just tested in Google Colab; I think the failure to build with Google colab is because the dataset used in the exercise is to large. data.n_wealth.values When running the code in Google Colab, this leads the RAM to exhaust and leads to a crash. I currently changed the code by not computing the whole dataset (30000+ obs) but only with 3000 observations by the following code: gini_coefficient(data.n_wealth.values[1:3000])
gini(data.n_wealth.values[1:3000]) This resolves the current building failure problem. What do you think about this change? Best, |
This commit is to test whether the problem is due to this code.
This reverts commit 395657e.
this commit is to test whether the crash is led by the
lectures/inequality.md
Outdated
@@ -616,51 +619,11 @@ We will use US data from the {ref}`Survey of Consumer Finances<data:survey-consu | |||
df_income_wealth.year.describe() | |||
``` | |||
|
|||
This code can be used to compute this information over the full dataset. | |||
{download}`This notebook <_static/lecture_specific/inequality/data.ipynb>` can be used to compute this information over the full dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will this look in the printed version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jstac this used to render in the pdf
as a link but it looks like that has changed in sphinx. I'll open a meta
issue as we use these in a few cases.
I have changed this to a link to github which you can download the notebook (or just view it).
Many thanks @longye-tian for your hard work on this. @mmcky , I'll leave this one for you to merge when ready. There's a small comment above. I recommend that we go for the simplest options at each stage, focusing on ease of maintenance, not the most ambitious. |
@longye-tian the source data file is only 31mb so I don't understand why google colab would not be able to process it. A general rule of thumb for |
@longye-tian I am currently running a
and will post the results here when the come in. |
Hi Matt @mmcky , The data length for And for the gini function, we use vectorization which create a matrix of size (31240*31240) with 8 bytes per float64, it requires around 8GB memory. I think that could be a potential reason. Best, |
Nice investigative work @longye-tian. Spot on -- that'a a big matrix! I had just profiled the non-vectorized code which looked fine from memory perspective. and memory is
this is a really nice example of the tradeoffs between compute and memory :-) |
@longye-tian I think your approach of taking a sample from the full data is a good idea for the exercise. My only question is if the data is ordered or not -- should we take a random sample of 3000 or the first 3000 obs? Thanks for digging into this with me. It's great we understand the problem now. |
Hi Matt @mmcky , Thank you for running the test. I think the dataset we got is not ordered. Here is a screenshot of the first 30ish observations: What do you think about this? Best, |
thanks @longye-tian I think we should do a random sample of 3000 (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) with a seed for consistency. |
Hi Matt, This commit select 3000 random sample from the original dataset. Best, Longye
Hi Matt @mmcky , Thank you for this suggestion! Just commit to incorporate this. Best, |
lectures/inequality.md
Outdated
@@ -1084,7 +1084,7 @@ df_income_wealth.head(n=4) | |||
We will focus on wealth variable `n_wealth` to compute a Gini coefficient for the year 1990. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@longye-tian can we either change 1990
in the text to 2016
or the otherway around based on the correct context?
thanks @longye-tian -- love your work. I think the sample approach is a good approach and tidy. |
update year in the text
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @longye-tian -- once CI is green I will merge this.
Appreciate your work on this.
Hi Matt @mmcky ,
I have updated the exercise 3 of the inequality lecture using your code in #410 and add the simulation part below your solution.
What do you think about this version of the solution?
Best,
Longye