This repository contains scripts and data for identifying and correcting discrepancies in the COVIDcast data between the epimetric_latest and epidata_full tables. The goal is to ensure that the latest updates are accurately reflected in the epimetric_latest table.
Due to the dynamic nature of epidemiological data, discrepancies can arise where the latest update in the epimetric_latest table does not match the actual latest values in the epidata_full table. This can lead to inaccuracies in the data representation.
To address this issue, we have developed a methodology outlined in the correction script (fix_data.py). This script identifies offending entries in the df_latest table, retrieves the actual latest values from the df_full table, and updates the df_latest table accordingly. The script also includes validation steps to ensure the correctness and completeness of the fix.
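The three steps above (identify, retrieve, update) plus a validation check can be sketched in pandas as follows. This is a minimal illustration, not the contents of fix_data.py: the column names (signal_key_id, geo_key_id, time_value, issue, value), the helper names, and the rule that a row is "offending" when its issue is older than the newest issue in df_full are all assumptions made for the example.

```python
import pandas as pd

# Assumed columns that identify one data point; the real schema may differ.
KEY_COLS = ["signal_key_id", "geo_key_id", "time_value"]


def newest_per_key(df_full: pd.DataFrame) -> pd.DataFrame:
    """Keep, for every key, the df_full row with the highest issue."""
    idx = df_full.groupby(KEY_COLS)["issue"].idxmax()
    return df_full.loc[idx].reset_index(drop=True)


def find_offending(df_full: pd.DataFrame, df_latest: pd.DataFrame) -> pd.DataFrame:
    """Rows of df_latest whose issue is older than the newest issue in df_full."""
    cmp = df_latest.merge(
        newest_per_key(df_full), on=KEY_COLS, suffixes=("_latest", "_full")
    )
    return cmp[cmp["issue_latest"] < cmp["issue_full"]]


def fix_latest(df_full: pd.DataFrame, df_latest: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df_latest with offending rows replaced from df_full."""
    bad_keys = find_offending(df_full, df_latest)[KEY_COLS].drop_duplicates()
    # Keep the rows of df_latest that are already up to date.
    marked = df_latest.merge(bad_keys, on=KEY_COLS, how="left", indicator=True)
    keep = marked[marked["_merge"] == "left_only"].drop(columns="_merge")
    # Pull the replacement rows from the newest issue in df_full.
    # Assumes df_full carries every column that df_latest has.
    replacement = newest_per_key(df_full).merge(bad_keys, on=KEY_COLS)
    replacement = replacement[list(df_latest.columns)]
    fixed = pd.concat([keep, replacement], ignore_index=True)
    return fixed.sort_values(KEY_COLS).reset_index(drop=True)


# Tiny made-up example: the first key has a newer issue in df_full.
df_full = pd.DataFrame({
    "signal_key_id": [1, 1, 1],
    "geo_key_id": [1, 1, 2],
    "time_value": [20200401, 20200401, 20200401],
    "issue": [20200402, 20200405, 20200402],
    "value": [1.0, 1.5, 2.0],
})
df_latest = pd.DataFrame({
    "signal_key_id": [1, 1],
    "geo_key_id": [1, 2],
    "time_value": [20200401, 20200401],
    "issue": [20200402, 20200402],
    "value": [1.0, 2.0],
})

offending = find_offending(df_full, df_latest)
fixed = fix_latest(df_full, df_latest)

# Validation: no stale rows remain and no rows were lost or duplicated.
assert find_offending(df_full, fixed).empty
assert len(fixed) == len(df_latest)
```

Building a new frame instead of mutating df_latest in place keeps the original available for comparison, which makes the validation step a simple before-and-after check.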
- epimetric_full.csv: Contains data similar to the df_full table.
- epimetric_latest.csv: Contains data similar to the df_latest table.
- fix_data.py: Python script for correcting discrepancies in the data.
- fixed_epimetric_latest.csv: Output CSV file containing the fixed version of the epimetric_latest_fixed table.
- Clone this repository to your local machine.
- Navigate to the repository directory.
- Run the fix_data.py script to execute the correction process.
- Verify the correctness of the fixed epimetric_latest.csv file, or use it directly in your database.
- Data manipulation is done with Python, Jupyter Notebook, and SQL running on Docker.
- Python 3.x
- pandas library
- Jupyter notebook
- ydata-profiling
- matplotlib
docker run --rm -p 3307:3307 -e MYSQL_ROOT_PASSWORD=strong_password -e MYSQL_DATABASE=delphi -e MYSQL_USER=foo -e MYSQL_PASSWORD=bar -v $(pwd)/database/init/:/run/init -v $(pwd)/data/:/run/init/data -v $(pwd)/database/my.cnf:/etc/mysql/my.cnf --name delphi_db mysql:latest
docker exec delphi_db /bin/sh -c 'chmod +x /run/init/initialize_database.sh && ./run/init/initialize_database.sh'
docker exec -it delphi_db mysql --user=foo --password=bar
SQL is used to identify the outdated data in epimetric_latest; the query is saved as get_outdate_from_el.sql.
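For illustration, a query of this kind might look like the sketch below. It is not the contents of get_outdate_from_el.sql: the column names and the comparison rule (a row is outdated when its issue is older than the newest issue for the same key) are assumptions, the full table is called epidata_full as it is in the prose of this README, and the query runs against an in-memory SQLite database through Python's standard sqlite3 module rather than the MySQL container, purely so the example is self-contained.

```python
import sqlite3

# Hypothetical query: rows of epimetric_latest that lag behind the newest
# issue recorded for the same key in epidata_full.
QUERY = """
SELECT l.signal_key_id, l.geo_key_id, l.time_value,
       l.issue AS latest_issue, f.max_issue
FROM epimetric_latest AS l
JOIN (
    SELECT signal_key_id, geo_key_id, time_value, MAX(issue) AS max_issue
    FROM epidata_full
    GROUP BY signal_key_id, geo_key_id, time_value
) AS f
  ON  l.signal_key_id = f.signal_key_id
  AND l.geo_key_id    = f.geo_key_id
  AND l.time_value    = f.time_value
WHERE l.issue < f.max_issue
"""

# Assumed, simplified schema with a few made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE epidata_full (
    signal_key_id INTEGER, geo_key_id INTEGER,
    time_value INTEGER, issue INTEGER, value REAL
);
CREATE TABLE epimetric_latest (
    signal_key_id INTEGER, geo_key_id INTEGER,
    time_value INTEGER, issue INTEGER, value REAL
);
INSERT INTO epidata_full VALUES
    (1, 1, 20200401, 20200402, 1.0),
    (1, 1, 20200401, 20200405, 1.5),
    (1, 2, 20200401, 20200402, 2.0);
INSERT INTO epimetric_latest VALUES
    (1, 1, 20200401, 20200402, 1.0),
    (1, 2, 20200401, 20200402, 2.0);
""")

outdated = conn.execute(QUERY).fetchall()
print(outdated)  # [(1, 1, 20200401, 20200402, 20200405)]
conn.close()
```

The query uses only standard SQL (a grouped subquery and an inner join), so the same statement should carry over to the MySQL session opened above once the table and column names are adjusted to the real schema.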