incidence_table with nan values #185

Open
LotteNotelaers opened this issue Sep 23, 2024 · 5 comments

Comments

@LotteNotelaers

Hello,

I get an error when running the setup_data_structures.py step, specifically at line 87 in the build_incidence_table function.

This is the incidence dataframe (result line 85)
image

When the next line (line 87) is run: incidence_table[control_row.target] = incidence
the result is:
image
So it looks like the numbers are not transferred to the incidence_table dataframe; the column contains NaN values instead.

Can you help me with this issue?

Kind regards,
Lotte

@bettinardi
Collaborator

I'm guessing there's an inconsistency in either the configuration file or the crosswalk file. Would you like to share your configuration YAML and/or your geography crosswalk file?

@LotteNotelaers
Author

Hi bettinardi,

Thanks for your help. Here are the files:
geo_cross_walk.csv
settings.zip
I had to zip settings.yaml because GitHub does not support this file type for attachments.

Kind regards,
Lotte

@LotteNotelaers
Author

Hi,

I think the problem is that the household_df index is a string, while the hh_id column in person_df is of mixed type.

In the input seed data, the SERIALNO (= hh_id) column is of mixed type: it contains both int and str values. I found that you can specify the dtypes in settings.yaml, which makes sure the values are consistently read as strings from the CSV files. This resolved the problem.

image
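For anyone hitting the same NaN column, here is a minimal sketch of the mechanism in plain pandas. It is not the PopulationSim code itself: the toy IDs and the "persons" column name are made up for illustration, and the read_csv call is only the pandas-level analogue of setting dtypes in settings.yaml. Assigning a Series to a DataFrame column aligns on index labels, so the string "1001" does not match the integer 1001.

```python
import io
import pandas as pd

# Household table indexed by string IDs.
households = pd.DataFrame(
    {"n": [1, 1]},
    index=pd.Index(["1001", "2019GQ01"], name="hh_id"),
)

# Incidence values whose index is of mixed type: the int 1001 and a str.
incidence = pd.Series(
    [2, 3],
    index=pd.Index([1001, "2019GQ01"], dtype=object, name="hh_id"),
)

# Column assignment aligns on index labels; "1001" (str) != 1001 (int),
# so the first household gets NaN.
incidence_table = pd.DataFrame(index=households.index)
incidence_table["persons"] = incidence
n_missing_before = int(incidence_table["persons"].isna().sum())

# Casting the labels to str on both sides makes them match.
incidence_fixed = incidence.copy()
incidence_fixed.index = incidence_fixed.index.astype(str)
incidence_table["persons"] = incidence_fixed
n_missing_after = int(incidence_table["persons"].isna().sum())

# Reading the ID column as str from the start avoids the mixed types.
csv_text = "SERIALNO,NP\n1001,2\n2019GQ01,1\n"
seed = pd.read_csv(io.StringIO(csv_text), dtype={"SERIALNO": str})
all_str = all(isinstance(v, str) for v in seed["SERIALNO"])
```

Forcing the ID column to a single type at read time is usually less fragile than casting later, because every table that joins on the ID then sees the same labels.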

Thanks for the help!

@LotteNotelaers
Author

Hi,

At line 242 of setup_data_structures.py I now get an error, because the code tries to cast hh_id to int:

household_groups[household_id_col] = household_groups.index.astype(int)

{OverflowError}Python int too large to convert to C long

This doesn't work because hh_id contains numbers but sometimes also letters.
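A quick way to see which IDs would block an integer cast is to check them before casting. The sketch below is only a diagnostic under assumptions: the three example IDs are invented, and it does not reproduce the exact code path in setup_data_structures.py. It flags IDs that contain non-digit characters and digit-only IDs that are too large for a 64-bit integer.

```python
import numpy as np
import pandas as pd

# Invented example IDs: one plain number, one with letters, one very long number.
ids = pd.Series(
    ["1001", "2019GQ0000012", "2019000000000123456789"],
    dtype=object,
)

# IDs containing anything other than digits cannot be cast to int at all.
is_digits = ids.str.isdigit()
non_numeric = ids[~is_digits].tolist()

# Digit-only IDs can still be too large for a 64-bit integer.
int64_max = np.iinfo(np.int64).max
too_large = [s for s in ids[is_digits] if int(s) > int64_max]

print("non-numeric IDs:", non_numeric)
print("IDs exceeding int64:", too_large)
```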

What would be the best way to resolve this?

  • change the code to household_groups[household_id_col] = household_groups.index.astype(str)
  • or add new household IDs to the seed data and make sure the new IDs only contain numbers?

Kind regards,
Lotte

@bettinardi
Collaborator

Add new household IDs to the seed data, and make sure the new IDs contain only numbers.

HH_ID is different than a PUMS serial number
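One way to follow this advice is sketched below, assuming the seed households and persons are both keyed by SERIALNO. The toy rows and the NP and AGEP columns are illustrative, not taken from the attached files. Each household gets a new consecutive integer hh_id, and the persons table picks up the same hh_id through a lookup on SERIALNO, which is kept as a string for reference.

```python
import pandas as pd

# Toy seed tables; in practice these would be read from the seed CSV files.
households = pd.DataFrame(
    {"SERIALNO": ["1001", "2019GQ01", "77"], "NP": [2, 1, 1]}
)
persons = pd.DataFrame(
    {"SERIALNO": ["1001", "1001", "2019GQ01", "77"], "AGEP": [34, 36, 70, 22]}
)

# Treat every serial number as a string so the lookup keys match exactly.
households["SERIALNO"] = households["SERIALNO"].astype(str)
persons["SERIALNO"] = persons["SERIALNO"].astype(str)

# New numeric household IDs: consecutive integers starting at 1.
households["hh_id"] = range(1, len(households) + 1)

# Carry the new ID over to persons via the original serial number.
id_map = households.set_index("SERIALNO")["hh_id"]
persons["hh_id"] = persons["SERIALNO"].map(id_map)

# Guard: every person must belong to a known household.
assert persons["hh_id"].notna().all()
persons["hh_id"] = persons["hh_id"].astype("int64")

print(persons)
```

This assumes SERIALNO is unique in the households table; if it is not, the lookup would need to be built from de-duplicated serial numbers first.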
