On 23 June 2022, we stopped adding new datapoints to our COVID-19 testing dataset. We continue to update all other metrics in our COVID-19 dataset. You can read more here
We welcome contributions to our testing dataset!
Automated countries can be found under the cowidev.testing folder. Some countries have a batch collection process, while others have an incremental one.
- batch: The complete timeseries is updated at every execution. This process is preferred, as it means the source can correct past data.
- incremental: Only the last data point is added.
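The difference between the two processes can be sketched as follows. This is a minimal illustration, not the project's actual code: the functions update_batch and update_incremental and the sample rows are hypothetical, and dates are assumed to be ISO strings so that they sort chronologically.

```python
def update_batch(existing, fetched):
    """Batch: replace the stored series with whatever the source reports now."""
    return sorted(fetched, key=lambda row: row["Date"])


def update_incremental(existing, latest):
    """Incremental: append the latest data point unless its date is already stored."""
    if any(row["Date"] == latest["Date"] for row in existing):
        return list(existing)
    return sorted([*existing, latest], key=lambda row: row["Date"])


# Hypothetical stored series.
stored = [
    {"Date": "2022-01-01", "Cumulative total": 100},
    {"Date": "2022-01-02", "Cumulative total": 150},
]

# The source later corrects 2022-01-02 and publishes a new day.
source_now = [
    {"Date": "2022-01-01", "Cumulative total": 100},
    {"Date": "2022-01-02", "Cumulative total": 140},
    {"Date": "2022-01-03", "Cumulative total": 200},
]

batch_result = update_batch(stored, source_now)
incremental_result = update_incremental(stored, source_now[-1])

print(batch_result[1]["Cumulative total"])        # corrected value is picked up
print(incremental_result[1]["Cumulative total"])  # old value is kept
```

In this sketch the batch update picks up the source's correction for 2022-01-02, while the incremental update only adds the new day and keeps the outdated value.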
The code consists of a mixture of Python and R scripts. As we are slowly moving our entire code base to Python, we currently accept only contributions written in Python.
To automate the data import process for a country, make sure that:
- The source is reliable.
- The source provides data in a format that can be easily read:
  - As a file (e.g. csv, json, xls)
  - As plain text in source HTML, which can be easily scraped.
- Decide if the import is batch (i.e. the entire time series) or incremental (last value). See the scripts in cowidev.testing.batch and cowidev.testing.incremental for more details. Note: Batch is preferred over incremental.
- Create a script in the right location, based on your decision at step 1: either in cowidev.testing.batch or cowidev.testing.incremental. Note that each source is different and there is no single pattern that works for all sources; however, you can take some inspiration from the scripts below:
  - Batch imports:
    - CSV: France
    - API/JSON: Portugal
    - HTML: Bosnia & Herzegovina
    - HTML, with JS: Turkey
  - Incremental imports:
    - CSV: Equatorial Guinea
    - HTML: Bahrain
- Make sure that you are collecting the right metrics (for more details, read the Metrics collected section).
- Test that the script works and is stable. For this you need to have the library installed. Run: cowid test get [country-name]
- Create a pull request with your code.
- Limit your pull request to a single country or a single feature.
- We welcome code improvements/bug fixes. As an example, you can read #465.
You are of course welcome to create pull requests for other cases, and we appreciate them very much.
Note that files in the public folder are not to be modified via pull requests.
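As a rough illustration of the script-writing step above, a batch import from a CSV source might be sketched as follows. Everything here is hypothetical: the input columns date and total_tests, the country name, the URL and the function parse_batch_csv are assumptions made for this example and are not part of cowidev. Real scripts in cowidev.testing.batch may be structured quite differently.

```python
import csv
import io

# Hypothetical CSV as a source might publish it (complete time series).
SAMPLE_CSV = """date,total_tests
2022-01-02,1250
2022-01-01,1000
2022-01-03,1600
"""


def parse_batch_csv(text, country, units, source_url, source_label):
    """Turn the full CSV time series into one row per date, sorted by date."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append(
            {
                "Country": country,
                "Date": record["date"],  # assumed to be ISO formatted already
                "Cumulative total": int(record["total_tests"]),
                "Units": units,
                "Source URL": source_url,
                "Source label": source_label,
            }
        )
    return sorted(rows, key=lambda row: row["Date"])


rows = parse_batch_csv(
    SAMPLE_CSV,
    country="Exampleland",                      # hypothetical country
    units="tests performed",
    source_url="https://example.org/tests.csv",  # placeholder URL
    source_label="Example Ministry of Health",   # hypothetical source name
)

for row in rows:
    print(row["Date"], row["Cumulative total"])
```

Because the whole series is rebuilt on every run, any correction the source makes to past dates would flow through automatically, which is the reason batch imports are preferred.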
For each country we collect metadata variables such as:
- Country: Name of the country or territory.
- Date: Date of the reported data.
- Units: Units of the reported data. This can be just one of people tested, tests performed and samples tested. That is, a country file can't contain mixed units.
  - people tested: Number of people tested.
  - tests performed: Number of tests performed. A single person can be tested more than once in a given day.
  - samples tested: Number of samples tested. In some cases, more than one sample may be required to perform a given test.
- Source URL: URL of the source.
- Source label: Name of the source.
- Notes: Additional notes (optional).
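The rule that a country file cannot mix units can be checked mechanically. The following is a minimal sketch under stated assumptions: the function check_units and the sample rows are hypothetical and do not come from cowidev; only the three allowed unit names are taken from the list above.

```python
ALLOWED_UNITS = {"people tested", "tests performed", "samples tested"}


def check_units(rows):
    """Return the single unit used by all rows, or raise ValueError."""
    units = {row["Units"] for row in rows}
    unknown = units - ALLOWED_UNITS
    if unknown:
        raise ValueError(f"Unknown units: {sorted(unknown)}")
    if len(units) > 1:
        raise ValueError(f"Mixed units in one country file: {sorted(units)}")
    return units.pop() if units else None


# Hypothetical rows for one country file.
consistent = [
    {"Date": "2022-01-01", "Units": "tests performed"},
    {"Date": "2022-01-02", "Units": "tests performed"},
]
mixed = [
    {"Date": "2022-01-01", "Units": "tests performed"},
    {"Date": "2022-01-02", "Units": "people tested"},
]

print(check_units(consistent))

try:
    check_units(mixed)
except ValueError as error:
    print(error)
```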
In addition, we may collect one or both of the following metrics:
- Cumulative total: Cumulative number of people tested, tests performed or samples tested (depending on Units).
- Daily change in cumulative total: Daily number of new people tested, tests performed or samples tested (depending on Units).
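The two metrics are related: when a value is available for every consecutive day, each can be derived from the other. The sketch below illustrates this with made-up numbers. The helper names daily_change and cumulative_from_daily are hypothetical, and the code assumes one value per consecutive day with no gaps.

```python
from itertools import accumulate


def daily_change(cumulative):
    """Differences between consecutive cumulative totals."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def cumulative_from_daily(start, daily):
    """Running total of daily changes, starting from a known cumulative value."""
    return list(accumulate(daily, initial=start))


# Made-up cumulative totals for four consecutive days.
cumulative = [100, 150, 210, 300]

daily = daily_change(cumulative)
rebuilt = cumulative_from_daily(cumulative[0], daily)

print(daily)
print(rebuilt)
```

Note that rebuilding the cumulative series needs a known starting total as well as an unbroken daily series.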
Please read the following section to understand better which metric is preferred in each case.
Finally, if we deem it appropriate, we also estimate the positive rate (Positive rate). This is done whenever we consider that the data provided by Johns Hopkins University on confirmed cases might not be usable for this purpose (for example because the country doesn't report its cases every day of the week).
- The ideal situation is to collect Cumulative total. From it, we can infer intermediate totals (with a linear forward-fill) and create the 7-day average daily series. In this situation, our dataset will have both the cumulative total and the 7-day average, which is perfect.
- The second-best situation, if there is no Cumulative total, is to collect Daily change in cumulative total every day instead. If the daily number is really present each day, then our script will calculate the 7-day average.
  - Examples: chile.py, albania.py
- The "worst" situation is to have an irregular series of Daily change in cumulative total. This is of little use: because some days are missing, we can calculate neither a cumulative total nor the 7-day average.
  - Examples: moldova.py
Examples: argentina.py, france.py
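To see why a cumulative total is the most useful metric, the sketch below fills missing days between reported cumulative totals linearly and then derives a 7-day average of daily change. It is an illustration under stated assumptions, not the project's actual processing code: the functions fill_linear and seven_day_average and the sample values are hypothetical, and the average is taken as the change in the filled cumulative total over the previous 7 days, divided by 7.

```python
from datetime import date, timedelta


def fill_linear(points):
    """Fill missing days between observed cumulative totals by linear interpolation.

    `points` maps dates to cumulative totals and must not be empty.
    """
    days = sorted(points)
    filled = {}
    for start, end in zip(days, days[1:]):
        gap = (end - start).days
        step = (points[end] - points[start]) / gap
        for offset in range(gap):
            filled[start + timedelta(days=offset)] = points[start] + step * offset
    filled[days[-1]] = points[days[-1]]
    return filled


def seven_day_average(filled):
    """Average daily change over the previous 7 days, where a value 7 days back exists."""
    averages = {}
    for day, value in filled.items():
        week_ago = day - timedelta(days=7)
        if week_ago in filled:
            averages[day] = (value - filled[week_ago]) / 7
    return averages


# Made-up cumulative totals reported on irregular dates.
observed = {
    date(2022, 1, 1): 0,
    date(2022, 1, 8): 700,
    date(2022, 1, 10): 1000,
}

filled = fill_linear(observed)
averages = seven_day_average(filled)

print(filled[date(2022, 1, 4)])
print(averages[date(2022, 1, 8)])
```

With an irregular series of daily changes there is no equivalent trick: the missing days cannot be inferred, so neither the cumulative total nor the 7-day average can be recovered.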