Data pipeline

To produce our dataset, we maintain a dedicated library, cowidev. It provides the command-line tool cowid, which simplifies:

  1. Running several sub-processes (or pipelines) that generate intermediate datasets.
  2. Jointly processing and merging all these intermediate datasets into the final and complete dataset.

Consequently, the dataset is updated multiple times a day (at least at 06:00 and 18:00 UTC), using the latest generated intermediate datasets.
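
As a rough illustration of this schedule, the sketch below computes the next guaranteed update slot after a given moment. The function `next_scheduled_update` and the constant `UPDATE_HOURS_UTC` are hypothetical names, not part of cowidev, and only the two documented slots (06:00 and 18:00 UTC) are modelled; the dataset may also be updated at other times.

```python
from datetime import datetime, timedelta, timezone

# Documented minimum update slots (hours, UTC). Extra updates may happen
# at other times; they are not modelled here.
UPDATE_HOURS_UTC = (6, 18)


def next_scheduled_update(now: datetime) -> datetime:
    """Return the first 06:00/18:00 UTC slot strictly after `now`.

    `now` is expected to be timezone-aware.
    """
    now_utc = now.astimezone(timezone.utc)
    midnight = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    # Check today's slots first, then tomorrow's.
    for day_offset in (0, 1):
        for hour in UPDATE_HOURS_UTC:
            slot = midnight + timedelta(days=day_offset, hours=hour)
            if slot > now_utc:
                return slot
    raise AssertionError("unreachable: tomorrow's 06:00 slot is always later")


if __name__ == "__main__":
    print(next_scheduled_update(datetime.now(timezone.utc)))
```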

Overview

The dataset pipeline is built from several pipelines, which are executed independently and whose outputs are combined in a final step. The complexity of the pipelines varies. For instance, for vaccinations, testing and hospitalization we are responsible for collecting, processing and publishing the data, but for cases and deaths we leave the collection step to the WHO and only transform and publish the data. Note that on 23 June 2022, we stopped adding new data points to our COVID-19 testing dataset (read more).
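
The final combination step can be pictured as an outer join of the intermediate datasets on a shared key. The sketch below is not the cowidev implementation: the function `merge_intermediate`, the join key `(location, date)` and the column names are assumptions chosen for illustration, and the rows are placeholder values rather than real figures.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]
Key = Tuple[str, str]  # (location, date): an assumed join key


def merge_intermediate(*datasets: Iterable[Row]) -> List[Row]:
    """Outer-join rows from several datasets on (location, date)."""
    merged: Dict[Key, Row] = defaultdict(dict)
    for dataset in datasets:
        for row in dataset:
            key = (row["location"], row["date"])
            # Later datasets add their columns to the row for this key.
            merged[key].update(row)
    # Sort by key so the output order is deterministic.
    return [merged[key] for key in sorted(merged)]


if __name__ == "__main__":
    # Placeholder rows, not real figures.
    vaccinations = [{"location": "A", "date": "2022-01-01", "total_vaccinations": 10}]
    hospitalizations = [
        {"location": "A", "date": "2022-01-01", "icu_patients": 2},
        {"location": "B", "date": "2022-01-01", "icu_patients": 5},
    ]
    for row in merge_intermediate(vaccinations, hospitalizations):
        print(row)
```

Keys that appear in only one intermediate dataset are kept, with the other columns simply absent, which mirrors how an outer join leaves gaps where a source has no data.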

The table below lists the constituent pipelines, their execution frequencies, and their tasks.

| Pipeline | Frequency | Tasks |
| --- | --- | --- |
| Vaccinations | every weekday at 12:00 UTC | {abbr}`Collection (Scraping primary sources (e.g. country governmental sites) and extracting relevant datapoints.)`, {abbr}`transformation (Transforming and cleaning the downloaded data into a human-readable format.)`, {abbr}`presentation (Presenting the cleaned data to the public (e.g. charts, dataset files, etc.).)` |
| Testing | Phased out (read more) | Collection, transformation, presentation |
| Hospitalization & ICU | daily at 06:00 and 18:00 UTC | Collection, transformation, presentation |
| Cases & Deaths | daily (multiple times) | Transformation, presentation |
| Excess mortality | weekly | Transformation, presentation |
| Variants | daily at 20:00 UTC | Transformation, presentation |
| Reproduction rate | daily | Presentation |
| Policy responses (OxCGRT) | daily | Transformation, presentation |
| Public monitor (YouGov) | weekly | Transformation, presentation |

You can find all the automation details in this file.

Vaccinations pipeline

The vaccination pipeline is probably the most complete one, where we scrape and extract data for each country in the dataset.

The pipeline is executed manually by @edomt or @lucasrodes every weekday (Monday to Friday) before 12:00 UTC.

Execution steps

# Download/scrape data
cowid vax get

# Process/check data
cowid vax process

# Generate dataset
cowid vax generate

# Integrate into full dataset
cowid vax export

[Intermediate dataset](https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/), including per-country files and data technical details.
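
The execution steps are listed in the order in which they are run. A minimal sketch of a wrapper that runs them in that order and stops at the first failing step is shown below. It assumes cowid is installed and available on the `PATH`; `run_steps` and its `runner` parameter are hypothetical helpers for illustration, not part of cowidev.

```python
import subprocess
from typing import Callable, List, Sequence

# Sub-commands taken from the execution steps above, in the listed order.
VAX_STEPS: Sequence[str] = ("get", "process", "generate", "export")


def run_steps(
    steps: Sequence[str] = VAX_STEPS,
    runner: Callable[[List[str]], int] = subprocess.call,
) -> List[str]:
    """Run `cowid vax <step>` for each step; stop at the first non-zero exit.

    Returns the list of steps that completed successfully.
    """
    completed: List[str] = []
    for step in steps:
        command = ["cowid", "vax", step]
        exit_code = runner(command)
        if exit_code != 0:
            print(f"Step {step!r} failed with exit code {exit_code}; stopping.")
            break
        completed.append(step)
    return completed
```

Calling `run_steps()` with no arguments invokes the four commands through `subprocess.call`. Passing a stub as `runner` gives a dry run that never touches the real tool, which is convenient for checking the ordering logic.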

Testing pipeline

We scrape and process data for multiple countries, similarly to the vaccinations pipeline. The pipeline is executed manually by @camapel on Mondays and Fridays.

:::{warning}
On 23 June 2022, we stopped adding new datapoints to our COVID-19 testing dataset. We continue to update all other metrics in our COVID-19 dataset. You can read more here.
:::

Execution steps

# Download/scrape data
cowid testing get

[Intermediate datasets](https://github.com/owid/covid-19-data/tree/master/public/data/testing).

Hospitalization & ICU pipeline

We scrape and process the data in the same way as for testing and vaccinations. The pipeline is run daily.

Execution steps

# Download data & generate dataset
cowid hosp generate

# Update Grapher-ready files
cowid hosp grapher-io

[Intermediate dataset and data technical details](https://github.com/owid/covid-19-data/tree/master/public/data/hospitalizations).

Cases & Deaths pipeline

We source case and death figures from the COVID-19 Dashboard by the WHO. We transform some of the variables and re-publish the dataset.

Execution steps

# Generate dataset
cowid casedeath generate

[Intermediate datasets](https://github.com/owid/covid-19-data/tree/master/public/data/cases_deaths).

Excess Mortality pipeline

The pipeline is manually executed once a week. The reported all-cause mortality data is from the Human Mortality Database (HMD) Short-term Mortality Fluctuations project and the World Mortality Dataset (WMD). Both sources are updated weekly. We also present estimates of excess deaths globally that are published by The Economist.

Execution steps

# Download data and generate dataset
cowid xm generate

[Intermediate dataset and data technical details](https://github.com/owid/covid-19-data/tree/master/public/data/excess_mortality).

Variants pipeline

We run this pipeline daily.

Execution steps

# Download data and generate dataset
cowid variants generate

# Update Grapher-ready files
cowid variants grapher-io

The data on variants and sequencing is no longer available to download. It is published by GISAID under a license that does not allow us to redistribute it. Please visit [the data publisher's website](https://www.gisaid.org/) for more details; you may want to register an account there if you intend to use this data.

Reproduction rate pipeline

We source the data from crondonm/TrackingR/.

[_Tracking R of COVID-19 A New Real-Time Estimation Using the Kalman Filter_](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0244474), by Francisco Arroyo, Francisco Bullano, Simas Kucinskas, and Carlos Rondón-Moreno

Public monitor (YouGov) pipeline

:::{warning}
The YouGov pipeline is under construction.
:::