Shell scripts to generate a minimal set of SQL commands for persistent yet efficient storage of the RKI's COVID-19 data. A daily updated dump of the resulting SQL database can be found in the `dump` branch.
This repository contains scripts to process RKI data given as unsorted CSV files and to create a minimal set of SQL commands that preserves the full history of updates. It consists of four parts:

- `csv-transform.sh`, which transforms a CSV provided by the RKI into a cleaned form that can be imported as the initial state into SQL.
- `csv-sort.sh`, which sorts this CSV file.
- `create-sql-query.sh`, which creates the SQL query to dump data in the same form as a cleaned CSV.
- `patch.sh`, which takes two CSV files and generates the SQL commands to get from the current state stored in the SQL table to the new state provided by the RKI dump.
In addition, `replay.sh` combines all four parts and automates this process for a given time span. It is, for instance, used in our GitHub Actions recipe.
Each Bash script comes with its own command-line arguments; simply call it with `--help` to get a full list of options.
The overall process to keep track of all data changes consists of three phases: the initialisation phase creates the SQL table and imports the initial state from a single CSV file provided by the RKI. The update phase calculates a set of updates to keep track of all changes; it should be run once a day. To verify the correctness of some state, the same scripts can be applied in an optional check phase.
Make sure the SQL user is allowed to read from and write to the data directory (`/path/to/data/` in the following usage examples). For the combined `replay.sh`, you first need to create a MySQL defaults file, which must be provided as the script's last argument.
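Such a defaults file might look like the following sketch; the credentials and database name are placeholders for your own setup:

```ini
# hypothetical MySQL defaults file for replay.sh; replace the values with your own
[client]
user=rki
password=your_password

[mysql]
database=covid19
```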
First, create the SQL table `rki_csv`:

```sh
cat create-table.sql | mysql # -u [username] -p [database]
```
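For orientation, a minimal sketch of what `create-table.sql` defines, with assumed column types (the actual file in this repository is authoritative), could look like:

```sql
-- minimal sketch with assumed column types; see create-table.sql for the real definition
CREATE TABLE rki_csv (
    IdBundesland         TINYINT,
    IdLandkreis          CHAR(5),
    Meldedatum           DATE,
    Altersgruppe         VARCHAR(20),
    Geschlecht           VARCHAR(20),
    NeuerFall            TINYINT,
    NeuerTodesfall       TINYINT,
    NeuGenesen           TINYINT,
    AnzahlFall           INT,
    AnzahlTodesfall      INT,
    AnzahlGenesen        INT,
    Refdatum             DATE,
    IstErkrankungsbeginn TINYINT,
    Altersgruppe2        VARCHAR(50),
    GueltigAb            DATE,
    GueltigBis           DATE,
    DFID                 BIGINT
);
```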
The initial data can be created from a given RKI CSV file as follows:

```sh
cat /path/to/data/RKI_COVID19_2021-04-22.csv | ./csv-transform.sh --date=2021-04-22 | ./csv-sort.sh > /path/to/data/RKI_COVID19_2021-04-22_init.csv
```
Load this CSV into SQL:

```sql
LOAD DATA LOCAL INFILE '/path/to/data/RKI_COVID19_2021-04-22_init.csv' INTO TABLE rki_csv CHARACTER SET UTF8 FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' IGNORE 1 LINES;
```
Create the cleaned and sorted CSV from a given RKI CSV file. This is the same procedure as for the initialisation:

```sh
cat /path/to/data/RKI_COVID19_2021-04-23.csv | ./csv-transform.sh --date=2021-04-23 | ./csv-sort.sh > /path/to/data/RKI_COVID19_2021-04-23_init.csv
```
Create an unsorted CSV dump of the state currently stored in the SQL table:

```sh
./create-sql-query.sh --known-before --date=2021-04-23 /path/to/data/RKI_COVID19_2021-04-23_tmp.csv | mysql # -u [username] -p [database]
```
Sort the generated file:

```sh
cat /path/to/data/RKI_COVID19_2021-04-23_tmp.csv | ./csv-sort.sh > /path/to/data/RKI_COVID19_2021-04-23_predump.csv
```
Compare it to the latest RKI CSV, generate a minimal set of changes, and load them into SQL:

```sh
./patch.sh /path/to/data/RKI_COVID19_2021-04-23_predump.csv /path/to/data/RKI_COVID19_2021-04-23_init.csv | mysql # -u [username] -p [database]
```
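The statements emitted by `patch.sh` are not reproduced here verbatim; as a rough, hypothetical sketch of the underlying model, a changed row could be closed by setting its `GueltigBis`, while its new version is inserted with a fresh `GueltigAb` and `DFID`:

```sql
-- hypothetical sketch of the patching model; the actual statements generated
-- by patch.sh may look different, and all values here are made up
UPDATE rki_csv SET GueltigBis = '2021-04-22' WHERE DFID = 202104220001813;
INSERT INTO rki_csv
    (IdBundesland, IdLandkreis, Meldedatum, Altersgruppe, Geschlecht,
     NeuerFall, NeuerTodesfall, NeuGenesen, AnzahlFall, AnzahlTodesfall,
     AnzahlGenesen, Refdatum, IstErkrankungsbeginn, Altersgruppe2,
     GueltigAb, GueltigBis, DFID)
VALUES
    (1, '01001', '2021-04-10', 'A35-A59', 'M',
     0, -9, 0, 2, 0,
     2, '2021-04-10', 0, 'Nicht übermittelt',
     '2021-04-23', NULL, 202104230000001);
```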
To get the data as it was known at a given date, just run the query created by `./create-sql-query.sh`:

```sh
./create-sql-query.sh --date=2021-04-23 /path/to/data/RKI_COVID19_2021-04-23_dump.csv | mysql # -u [username] -p [database]
```
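Conceptually, the generated query selects exactly the rows whose validity window covers the requested date. A rough sketch of this idea, assuming an open `GueltigBis` marks rows that are still valid:

```sql
-- sketch of the point-in-time reconstruction idea (not the literal generated query)
SELECT *
FROM rki_csv
WHERE GueltigAb <= '2021-04-23'
  AND (GueltigBis IS NULL OR GueltigBis >= '2021-04-23');
```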
The generated file can be checked against the official RKI CSV dump for this date as follows:

```sh
diff <(cat /path/to/data/RKI_COVID19_2021-04-23_dump.csv | ./csv-sort.sh --without-metadata) <(cat /path/to/data/RKI_COVID19_2021-04-23_init.csv | ./csv-sort.sh --without-metadata)
```
Every day, the RKI publishes a new CSV dump of all COVID-19 cases in Germany, of which only about 0.2% to 2.0% of the data rows change per day. However, only these full, aggregated CSV dumps are available for synchronisation. This repository was created to derive a minimal set of instructions to get from one data version to the next. It adopts ideas and code from the more generic `tablediff` tool, which serves a similar purpose for any pair of CSV dumps.
The Bash scripts use a combination of the shell tools `awk`, `sort`, and `diff` to split, sort, and compare large CSV files and SQL dumps in a best-effort manner.
Some example SQL queries to get the total number of cases, new cases reported on a specific date, etc. are given in `Queries.md`.
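For a flavour of such queries, the total number of cases in the current state could be computed as follows; this is a hedged sketch that assumes the usual RKI convention that `NeuerFall` values 0 and 1 mark currently counted cases, and that an open `GueltigBis` marks valid rows:

```sql
-- total number of reported cases in the current state of the data;
-- assumes NeuerFall IN (0, 1) selects currently counted cases and an
-- open GueltigBis marks rows that are still valid
SELECT SUM(AnzahlFall) AS total_cases
FROM rki_csv
WHERE NeuerFall IN (0, 1)
  AND GueltigBis IS NULL;
```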
We provide a daily updated dump of the resulting SQL database via GitHub Actions. It can be found in the `dump` branch.
With a single CSV dump provided by the RKI, you lose all information about previous corrections in the data. If you have ever wondered *What were the numbers known at this specific date in the past?*, i.e. the exact numbers that the RKI Dashboard listed, this repository is for you. This holds in particular if you would rather not store the original CSV files for each day, but instead wish to run ad-hoc SQL queries over all these data.
The idea is to identify the rows that have changed between the RKI CSV dumps of two consecutive days. As of April 2021, this saves around 98% of space: per day, only between 25k and 35k rows are added, instead of the roughly 1.7 million rows of a full RKI CSV dump.
The `FID` is consistent and unique only within the CSV dump of a single day. Even if the values of a data row do not change, the row is not guaranteed to have the same `FID` in the CSV dump provided by the RKI on the next day.
The field `DFID` is a 15-digit integer: the first 8 digits are the date on which this row first appeared, in the format `YYYYMMDD`; the remaining 7 digits are the corresponding row number, padded with leading zeroes. For example, a `DFID` of `202104220001813` describes row 1813 from the RKI CSV file of 2021-04-22.
This format makes it easy to sort the CSV dumps by their `DFID` values via the shell's `sort` command.
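For illustration, the example `DFID` from above can be reconstructed with a `printf`-style format string:

```sh
# builds a DFID from a dump date (YYYYMMDD) and a row number padded to 7 digits
awk 'BEGIN { printf "%d%07d\n", 20210422, 1813 }'   # prints 202104220001813
```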
Some numbers on an i5, 4x 2.30 GHz:

- The initialisation phase takes about 75 seconds: ~60 sec for `csv-transform.sh` and `csv-sort.sh`, plus ~15 sec for loading the data into SQL.
- The update phase takes about 5 minutes: ~1 min for `csv-transform.sh` and `csv-sort.sh`, plus ~3 min for `./patch.sh` and the SQL updates, and less than a minute for the intermediate steps.
- The optional check phase takes about 2 minutes, with most of this time spent on calculating the (empty) `diff`.
We constantly refer to the column numbers in our scripts, so the following list might come in useful:

1. `FID`
2. `IdBundesland`
3. `Bundesland`
4. `Landkreis`
5. `Altersgruppe`
6. `Geschlecht`
7. `AnzahlFall`
8. `AnzahlTodesfall`
9. `Meldedatum`
10. `IdLandkreis`
11. `Datenstand`
12. `NeuerFall`
13. `NeuerTodesfall`
14. `Refdatum`
15. `NeuGenesen`
16. `AnzahlGenesen`
17. `IstErkrankungsbeginn`
18. `Altersgruppe2`
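For example, extracting a single column of the raw RKI CSV by its number looks like this; a simplified sketch that ignores quoted fields containing commas:

```sh
# print column 9 (Meldedatum) of a raw RKI dump, skipping the header line;
# naive splitting on commas, so quoted fields with embedded commas are not handled
awk -F, 'NR > 1 { print $9 }' /path/to/data/RKI_COVID19_2021-04-22.csv
```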
After applying `csv-transform.sh`, the CSV files consist of the following columns:

1. `IdBundesland`
2. `IdLandkreis`
3. `Meldedatum` (format `YYYY-MM-DD`)
4. `Altersgruppe`
5. `Geschlecht`
6. `NeuerFall`
7. `NeuerTodesfall`
8. `NeuGenesen`
9. `AnzahlFall`
10. `AnzahlTodesfall`
11. `AnzahlGenesen`
12. `Refdatum` (format `YYYY-MM-DD`)
13. `IstErkrankungsbeginn`
14. `Altersgruppe2`
15. `GueltigAb`
16. `GueltigBis`
17. `DFID`
Note that the flag `--without-metadata` provided by the scripts `csv-transform.sh` and `csv-sort.sh` removes the last three columns `GueltigAb`, `GueltigBis`, and `DFID`. This makes it easier to compare two CSV files, as done, for instance, in the optional check phase.
`The MySQL server is running with the --secure-file-priv option so it cannot execute this statement`

- Please make sure the SQL user has `FILE` privileges and is allowed to access the specified data directory.

`cat: 2-tmp.csv: Permission denied`

- The user that executes the script cannot access the CSV dump created by the MySQL process. Make sure to adjust the user's group or the file permissions of the files created by MySQL.