modified script in preprocess.py which downloads data and process
Harsha-chandaluri committed Nov 1, 2024
1 parent d8e35d9 commit 3ccd9bb
Showing 29 changed files with 5,102 additions and 2,921 deletions.
12 changes: 5 additions & 7 deletions scripts/us_census/pep/population_estimate_by_race/README.md
@@ -1,14 +1,14 @@
# US Census PEP: Population Estimate by Race

## About the Dataset
-This import includes the Population Count Estimates by Race for the United States from the year 1900 to 2020 on a yearly basis.
+This import includes the Population Count Estimates by Race for the United States from the year 1900 to 2023 on a yearly basis.

The population is categorized by Race:
AmericanIndianAndAlaskaNativeAlone,AsianAlone,BlackOrAfricanAmericanAlone,WhiteAlone,NativeHawaiianAndOtherPacificIslanderAlone


### Download URL
-The data in txt/csv formats are downloadable from within "https://www2.census.gov/programs-surveys/popest/tables" and "https://www2.census.gov/programs-surveys/popest/datasets". The actual URLs are listed in file_urls.json and file_url.txt(County data from 2020).
+The data in txt/csv formats are downloadable from within "https://www2.census.gov/programs-surveys/popest/tables" and "https://www2.census.gov/programs-surveys/popest/datasets". The actual URLs are listed in input_url.json (County data from 2023).

#### API Output
These are the attributes that we will use
@@ -52,9 +52,7 @@ Run the test cases

### Import Procedure

-The below scripts will download the data
-`/bin/python3 scripts/us_census/pep/population_estimate_by_race/download.py`
-`sh scripts/us_census/pep/population_estimate_by_race/download1.sh`
+The script below downloads and processes the data, then generates the csv and mcf files.
+preprocess.py


-The below script will generate csv and mcf files.
-`/bin/python3 scripts/us_census/pep/population_estimate_by_race/preprocess.py`
12 changes: 10 additions & 2 deletions scripts/us_census/pep/population_estimate_by_race/download.py
@@ -105,7 +105,7 @@ def _save_data(_url: str, download_local_path: str) -> None:
         df = pd.read_csv(_url, on_bad_lines='skip', names=cols)
         df.to_excel(download_local_path + os.sep + file_name,\
                     index=False,engine='xlsxwriter')
-    elif "co-est00int-alldata" in _url or "CC-EST2020-ALLDATA" in _url:
+    elif "co-est00int-alldata" in _url or "CC-EST2020-ALLDATA" in _url or "cc-est2022-all" in _url:
         df = pd.read_csv(_url,
                          on_bad_lines='skip',
                          encoding='ISO-8859-1',
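A note on conditions of this shape: every alternative in an `or` chain needs its own `in _url` test, because a bare non-empty string literal is always truthy and would make the whole branch match every URL. One way to keep such checks explicit is to move the markers into a tuple and test them with `any`. The sketch below is only an illustration: `matches_any` and `COUNTY_ALLDATA_MARKERS` are hypothetical names that do not appear in download.py, and the case-insensitive comparison is an assumption (the original checks are case-sensitive).

```python
from typing import Iterable

# Illustrative marker list (hypothetical name): the substrings tested in the
# condition above.
COUNTY_ALLDATA_MARKERS = (
    "co-est00int-alldata",
    "CC-EST2020-ALLDATA",
    "cc-est2022-all",
)


def matches_any(url: str, markers: Iterable[str]) -> bool:
    """Return True if any marker occurs in the URL, ignoring case."""
    lowered = url.lower()
    return any(marker.lower() in lowered for marker in markers)


# A bare non-empty string is truthy, so `x or "text"` is never falsy.
always_true = bool(False or "cc-est2022-all")
```

With these markers, a URL ending in `cc-est2023-alldata.csv` would not match, so a marker for that file would have to be added if it is meant to take the same branch.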
@@ -144,8 +144,16 @@ def _download(download_path: str, file_urls: list) -> None:
"""
     if not os.path.exists(download_path):
         os.mkdir(download_path)
+    all_files = os.listdir(download_path)
     for _url in file_urls:
-        _save_data(_url, download_path)
+        file_name = _url.split("/")[-1]
+        if file_name not in all_files:
+            try:
+                _save_data(_url, download_path)
+            except Exception as err:
+                print(f"Unable to download {_url}: {err}")
+        else:
+            print(f"File already downloaded {file_name}")


 def main(_):
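The skip-if-present loop in `_download` can be sketched on its own as follows. This is a minimal illustration under stated assumptions, not the script's implementation: `download_missing` and its `save` parameter are hypothetical, with `save` standing in for `_save_data` so the logic can be exercised without network access.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def download_missing(urls: Iterable[str], download_dir: str,
                     save: Callable[[str, str], None]) -> List[str]:
    """Call save(url, download_dir) for each URL whose file name is not
    already present in download_dir; return the URLs that failed."""
    target = Path(download_dir)
    target.mkdir(parents=True, exist_ok=True)
    existing = {path.name for path in target.iterdir()}
    failed = []
    for url in urls:
        file_name = url.rsplit("/", 1)[-1]
        if file_name in existing:
            print(f"File already downloaded {file_name}")
            continue
        try:
            save(url, download_dir)
            existing.add(file_name)
        except Exception as err:  # keep going, but record what went wrong
            print(f"Unable to download {url}: {err}")
            failed.append(url)
    return failed
```

Returning the failed URLs lets a caller retry or report them. The presence check compares the last path segment of the URL with the names on disk, so it only helps when the saver writes each file under that same name.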
@@ -654,6 +654,8 @@
 "https://www2.census.gov/programs-surveys/popest/datasets/2000-2010/intercensal/county/co-est00int-alldata-54.csv",
 "https://www2.census.gov/programs-surveys/popest/datasets/2000-2010/intercensal/county/co-est00int-alldata-55.csv",
 "https://www2.census.gov/programs-surveys/popest/datasets/2000-2010/intercensal/county/co-est00int-alldata-56.csv",
-"https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/asrh/CC-EST2020-ALLDATA.csv"
+"https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/asrh/CC-EST2020-ALLDATA.csv",
+"https://www2.census.gov/programs-surveys/popest/datasets/2020-2023/counties/asrh/cc-est2023-alldata.csv"
]
}
}
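
Because the URL list sits inside nested JSON objects, a loader can walk whatever nesting is present rather than rely on specific keys. The sketch below assumes nothing about the real schema: `collect_urls` and `load_urls` are hypothetical helpers, and the keys in the usage example are invented for illustration.

```python
import json
from typing import Any, List


def collect_urls(node: Any) -> List[str]:
    """Recursively gather every http(s) string found in parsed JSON."""
    if isinstance(node, str):
        return [node] if node.startswith(("http://", "https://")) else []
    if isinstance(node, list):
        return [url for item in node for url in collect_urls(item)]
    if isinstance(node, dict):
        return [url for value in node.values() for url in collect_urls(value)]
    return []


def load_urls(json_text: str) -> List[str]:
    """Parse JSON text and return the URLs it contains, in document order."""
    return collect_urls(json.loads(json_text))


# Hypothetical keys, for illustration only.
sample = '{"group": {"files": ["https://example.com/x.csv"], "note": "n/a"}}'
urls = load_urls(sample)
```

`json.loads` raises `json.JSONDecodeError` when two array entries are not separated by a comma, so a missing separator after an appended URL is reported at load time instead of being silently accepted.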
