The Philippines Department of Health (DoH) has made COVID-19 related data publicly available as part of its mandate to promote transparency and accountability in governance. The Philippines COVID-19 Data Drop is distributed via Google Drive, with updated data generally provided daily. Data from previous days are archived and also made available through Google Drive. This package provides a coherent, robust and performant API to the latest and archived Philippines COVID-19 data.
In early April 2020, as part of CoMo Philippines's contribution to the CoMo Consortium's COVID-19 modelling work for the Philippines context, we developed an R package called comoparams that provided R functions for accessing, handling and processing the data required to estimate COVID-19 modelling parameters for the Philippines. Within the comoparams package was a set of functions developed to interface with the then newly-announced Philippines Department of Health Data Drop for COVID-19.
From then until late June 2020, we continued to maintain the comoparams package. In particular, the set of functions for interfacing with the DoH Data Drop required several updates (see the history of issues and changes to the package here) in response to issues with the DoH Data Drop system and the data it provides. A good discussion of some of these issues by the UP COVID-19 Pandemic Response Team can be found here.
Other issues that required syntax-breaking changes to the comoparams package functions related to how the DoH Data Drop was released and distributed. At its first release, the DoH Data Drop was distributed through Google Sheets via a single unchanging link, for both the latest release and the archived data. A few weeks later, the DoH Data Drop moved to Google Drive, again via a single unchanging link for both the latest and the archive data. A month or so after that, the latest DoH Data Drop was distributed through a link that changed with every release (usually daily), while the archive data was distributed through a different but constant link. The DoH Data Drop distribution system has remained as such to date, though the archive now only includes the current month, compared to late June 2020 when all previous months of archive data were available.
Given the persistent issues raised about the DoH Data Drop system and the datasets it distributes, we felt it was important to create a separate, focused R package just for accessing, handling and processing the DoH Data Drop that can be used by any R user regardless of their end-use for the data (e.g., reporting, visualisation, modelling). We also wanted to contribute to the work of fellow scientists and researchers in the Philippines who use R for the epidemiologic analyses they perform and share with the general public. From our own experience with the DoH Data Drop, we felt that consistent and performant data access, handling and processing functions would be extremely helpful to fellow scientists and researchers who use R, potentially reducing the daily and/or weekly workload of producing COVID-19 analyses and reports and streamlining their routine analytical workflows.
To this end, we took inspiration from the functions we developed in the comoparams package and developed the covidphdata package based on the following R package design principles:
- use of a modular and refactorable approach to determining and developing the functions required;
- creation of a robust and performant R-based application programming interface (API) to the Google Drive-based DoH Data Drop;
- application of modern data handling and processing techniques available in R; and,
- output of coherent and compliant data structures ready for various applications.
Finally, we want to contribute to both the public discourse and the practice of open data and open science in the Philippines, and to encourage others to do the same. Our group's codebase for our work on COVID-19 modelling (primarily in the R language for statistical computing) is publicly available via GitHub, including the one for the covidphdata package (see https://github.com/como-ph/covidphdata). Our hope is that by creating this package, making it available to all R users and keeping its codebase open source, we can standardise the accessing, handling and processing of the DoH Data Drop, thereby providing transparency on how the data are treated before they are analysed or visualised.
The covidphdata package primarily provides functions that serve as low-level wrappers to specific googledrive package functions that support access to files and folders on Google Drive, given that the DoH Data Drop is distributed through this platform. Currently, the covidphdata package:
- Provides functions (datadrop_id*) to dynamically retrieve the unique DoH Data Drop file and folder identifiers used by Google Drive;
- Provides a function (datadrop_ls) to list the files and folders within a specified DoH Data Drop folder in Google Drive;
- Provides a function (datadrop_download) to download a specified file within the DoH Data Drop in Google Drive; and,
- Provides functions (datadrop_get) to retrieve a specified file within the DoH Data Drop in Google Drive into R.
covidphdata is not yet available on CRAN. The development version of covidphdata can be installed from GitHub using the remotes package:
if(!require(remotes)) install.packages("remotes")
remotes::install_github("como-ph/covidphdata")
The DoH Data Drop is distributed via Google Drive as shared folders and files that can be accessed by anyone with the link. For a user who accesses the DoH Data Drop interactively (i.e., clicking the DoH Data Drop daily link and then browsing the folders and files on Google Drive), this means that one does not have to be an authorised user to access the files. For an R user, on the other hand, it is more convenient to access and retrieve the DoH Data Drop folders and files from within R, without the separate steps of going to a browser and downloading the data before being able to work with them in R. This is the main use case for the covidphdata package.
The covidphdata package is built on the key functionalities of the googledrive package and follows its basic workflow for data retrieval from Google Drive. The googledrive package requires authenticating with Google Drive and provides functions that facilitate this, specifically its drive_auth function. The easiest way to authenticate as a Google Drive user is therefore to make a call to the drive_auth function as follows:
## Load googledrive package
library(googledrive)
## Authenticate with Google
drive_auth()
When called, this function opens the browser so the user can log in with the username and password of their Google account. This is the easiest approach, as most users will be able to implement it readily. This step only has to be done once per session, which is enough for covidphdata to retrieve the data one needs into R. After that, all operations deal with data that is already within R and do not require accessing Google Drive again. However, if at some point later in the same session you need to access a different dataset from the archive, you may need to authenticate again. If so, you will be asked on your R console to confirm the email account that you previously logged in with, and your R session will continue to use those authentication details.
This is the most straightforward way for most users to authenticate with Google Drive. However, if you prefer a non-interactive way of authenticating, read this good article from the googledrive package creators on alternative ways to authenticate that do not require interactive input.
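As a minimal sketch of one non-interactive option: the gargle package (which googledrive uses for authentication) honours the gargle_oauth_email option, so pointing it at an account whose token was cached by a previous interactive run of drive_auth lets later calls authenticate without a browser prompt. The email address and key-file path below are placeholders.

```r
## Point googledrive/gargle at a previously cached OAuth token so that
## drive_auth() can find it without opening a browser. The email address
## is a placeholder; use the account you authenticated with before.
options(gargle_oauth_email = "your.name@example.com")

## A service account key file can also be supplied directly
## (the path below is hypothetical):
# googledrive::drive_auth(path = "~/keys/covidph-service-account.json")
```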
Once authenticated, one can proceed with the covidphdata data retrieval workflow outlined below. The different functions currently available through covidphdata are linked through the following general workflow:
## Load covidphdata and googledrive package
library(covidphdata)
library(googledrive)
## Step 1: Get Google Drive ID for latest DoH Data Drop
gid <- datadrop_id()
## Step 2: List the files and folders available in the latest DoH Data Drop
data_list <- datadrop_ls(id = gid)
## Step 3: Retrieve the specified/required dataset and load into R
datadrop_get(tbl = data_list, fn = "Case Information", path = tempfile())
This workflow produces the following output:
#> # A tibble: 416,852 x 22
#> CaseCode Age AgeGroup Sex DateSpecimen DateResultRelea… DateRepConf
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 C581144 53 50 to 54 MALE "2020-07-15" "2020-07-16" 2020-07-21
#> 2 C352619 56 55 to 59 FEMA… "" "" 2020-05-04
#> 3 C466590 36 35 to 39 MALE "2020-07-14" "2020-07-17" 2020-07-20
#> 4 C971812 44 40 to 44 MALE "2020-05-07" "" 2020-05-21
#> 5 C445586 57 55 to 59 MALE "2020-07-14" "2020-07-16" 2020-07-19
#> 6 C354376 34 30 to 34 FEMA… "2020-04-14" "2020-04-16" 2020-09-01
#> 7 C583142 54 50 to 54 MALE "2020-07-20" "2020-07-25" 2020-08-02
#> 8 C329630 56 55 to 59 FEMA… "2020-07-20" "2020-07-22" 2020-07-27
#> 9 C349878 22 20 to 24 MALE "2020-07-22" "2020-07-26" 2020-07-30
#> 10 C956795 48 45 to 49 FEMA… "2020-07-30" "2020-08-01" 2020-08-04
#> # … with 416,842 more rows, and 15 more variables: DateDied <chr>,
#> # DateRecover <chr>, RemovalType <chr>, Admitted <chr>, RegionRes <chr>,
#> # ProvRes <chr>, CityMunRes <chr>, CityMuniPSGC <chr>, BarangayRes <chr>,
#> # BarangayPSGC <chr>, HealthStatus <chr>, Quarantined <chr>, DateOnset <chr>,
#> # Pregnanttab <chr>, ValidationStatus <chr>
The functions in the covidphdata workflow were designed to be usable with the piped operations provided by the magrittr package. The pipe operator %>% can be used with the covidphdata workflow as follows:
## Load magrittr package
library(magrittr)
## Retrieve latest Case Information dataset from DoH Data Drop
datadrop_id() %>% ## Step 1
datadrop_ls() %>% ## Step 2
datadrop_get(fn = "Case Information", ## Step 3
path = tempfile())
This produces the following result:
#> # A tibble: 416,852 x 22
#> CaseCode Age AgeGroup Sex DateSpecimen DateResultRelea… DateRepConf
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 C581144 53 50 to 54 MALE "2020-07-15" "2020-07-16" 2020-07-21
#> 2 C352619 56 55 to 59 FEMA… "" "" 2020-05-04
#> 3 C466590 36 35 to 39 MALE "2020-07-14" "2020-07-17" 2020-07-20
#> 4 C971812 44 40 to 44 MALE "2020-05-07" "" 2020-05-21
#> 5 C445586 57 55 to 59 MALE "2020-07-14" "2020-07-16" 2020-07-19
#> 6 C354376 34 30 to 34 FEMA… "2020-04-14" "2020-04-16" 2020-09-01
#> 7 C583142 54 50 to 54 MALE "2020-07-20" "2020-07-25" 2020-08-02
#> 8 C329630 56 55 to 59 FEMA… "2020-07-20" "2020-07-22" 2020-07-27
#> 9 C349878 22 20 to 24 MALE "2020-07-22" "2020-07-26" 2020-07-30
#> 10 C956795 48 45 to 49 FEMA… "2020-07-30" "2020-08-01" 2020-08-04
#> # … with 416,842 more rows, and 15 more variables: DateDied <chr>,
#> # DateRecover <chr>, RemovalType <chr>, Admitted <chr>, RegionRes <chr>,
#> # ProvRes <chr>, CityMunRes <chr>, CityMuniPSGC <chr>, BarangayRes <chr>,
#> # BarangayPSGC <chr>, HealthStatus <chr>, Quarantined <chr>, DateOnset <chr>,
#> # Pregnanttab <chr>, ValidationStatus <chr>
The dataset retrieved by the piped operation is exactly the same as in the first example, but the piped workflow is more streamlined. Either approach can be used when retrieving datasets from the DoH Data Drop.
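When a local raw copy of a file is wanted (e.g., for archiving or reproducibility) rather than reading it straight into R, the datadrop_download function mentioned earlier can be used in place of datadrop_get. The sketch below assumes datadrop_download accepts the same tbl, fn and path arguments as datadrop_get; consult the package documentation for its exact signature.

```r
## Sketch: save a dated raw copy of the Case Information file to disk
## instead of reading it directly into R. Assumes datadrop_download()
## takes the same tbl/fn/path arguments as datadrop_get(); check
## ?datadrop_download for the actual signature.
library(covidphdata)
library(magrittr)

datadrop_id() %>%
  datadrop_ls() %>%
  datadrop_download(fn = "Case Information",
                    path = paste0("data-drop-cases-", Sys.Date(), ".csv"))
```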
The Philippines Department of Health (DoH) currently distributes the latest Data Drop via a fixed shortened URL (bit.ly/DataDropPH) which points to a new Google Drive endpoint daily, or whenever the updated DoH Data Drop becomes available. This endpoint is a README document in portable document format (PDF) containing a privacy and confidentiality statement, technical notes on the latest data, technical notes on the previous (archive) data, and two shortened URLs: one linking to the Google Drive folder that contains all the latest officially released datasets, and the other linking to the datasets released previously (the archives). Of these, the first shortened URL, linking to the Google Drive folder containing the latest officially released datasets, is different for every release and can only be obtained from the README document released on a specific day.
The function datadrop_id_latest() reads that PDF file, extracts the shortened URL for the latest officially released datasets written in it, expands that shortened URL, and then extracts the unique Google Drive ID of the latest officially released datasets. With this Google Drive ID, other functions can then be used to retrieve information and data from the Google Drive folder it specifies.
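The URL-expansion mechanics can be illustrated with a small base-R sketch. This is not the package's internal code; expand_url and extract_drive_id are hypothetical helpers written for illustration only.

```r
## Illustration only -- not the package's actual implementation.
## expand_url() and extract_drive_id() are hypothetical helpers that
## show the general idea behind resolving a shortened URL to a Google
## Drive ID.

expand_url <- function(short_url) {
  ## curlGetHeaders() follows redirects by default; the last "Location:"
  ## header in the chain holds the fully expanded URL
  h <- curlGetHeaders(short_url)
  loc <- grep("^[Ll]ocation: ", h, value = TRUE)
  sub("^[Ll]ocation: ", "", trimws(loc[length(loc)]))
}

extract_drive_id <- function(url) {
  ## Google Drive IDs appear after "folders/" or "id=" in Drive URLs
  sub(".*(folders/|id=)([A-Za-z0-9_-]+).*", "\\2", url)
}

## e.g. extract_drive_id("https://drive.google.com/drive/folders/<ID>")
```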
The DoH Data Drop archives, on the other hand, are distributed via a fixed shortened URL (bit.ly/DataDropArchives) which points to a Google Drive folder containing the previous DoH Data Drop releases.
The function datadrop_id_archive() expands that shortened URL and then extracts the unique Google Drive ID of the DoH Data Drop archives folder. With this Google Drive ID, other functions can then be used to retrieve information and data from the Google Drive folder it specifies.
The function datadrop_id wraps these two datadrop_id_* functions and outputs the appropriate Google Drive ID based on the specified parameters. By default, datadrop_id returns the Google Drive ID of the latest officially released datasets, as shown below:
datadrop_id()
#> [1] "1N_u-9xjcatJJ08JqUKwpq2eFku_LWOjl"
To get the Google Drive ID of a DoH Data Drop archive, the following parameters need to be set:
datadrop_id(version = "archive", .date = "2020-11-01")
#> [1] "1O2Gt_MUPKtWWPK6ainagRiehK0pCH7Nv"
Once an appropriate Google Drive ID has been obtained, it can be used with the datadrop_ls function to list all the files and folders within the specified Google Drive folder:
## List the files and folders inside the latest Google Drive DoH Data Drop
gid <- datadrop_id()
datadrop_ls(id = gid)
#> # A tibble: 15 x 3
#> name id drive_resource
#> * <chr> <chr> <list>
#> 1 DOH COVID Data Drop_ 20201121 - 09 Quara… 1hWAgqykG3kPKT5Fgr… <named list [3…
#> 2 DOH COVID Data Drop_ 20201121 - 04 Case … 1nyEEw4hrwHuq3Oc1y… <named list [3…
#> 3 DOH COVID Data Drop_ 20201121 - 02 Metad… 1Umc2_nxW2I1ZLjmUA… <named list [3…
#> 4 DOH COVID Data Drop_ 20201121 - 08 Quara… 1RGyLcAevZZ2YITRS7… <named list [3…
#> 5 DOH COVID Data Drop_ 20201121 - 10 DOH D… 16byfiN1KVYENxhW39… <named list [3…
#> 6 DOH Data Drop 20201121 - Changelog.xlsx 1TAVweGNepRkvGOtIX… <named list [3…
#> 7 DOH COVID Data Drop_ 20201121 - 06 DOH D… 1Epa-Bz4WdjXK9rLH-… <named list [3…
#> 8 DOH COVID Data Drop_ 20201121 - 03 Metad… 1_WxunmnV9PLAHzVtZ… <named list [3…
#> 9 DOH COVID Data Drop_ 20201121 - 05 DOH D… 1AA6keECjIwUQkzVet… <named list [3…
#> 10 DOH COVID Data Drop_ 20201121 - 07 Testi… 1yJqWFOpB-X7gBkzSH… <named list [3…
#> 11 DOH COVID Data Drop_ 20201121 - 12 DDC T… 1pMrknQpIR7XJuLeFA… <named list [3…
#> 12 DOH COVID Data Drop_ 20201121 - 02 Metad… 1CDs8lJ-aM-MnWQOlR… <named list [3…
#> 13 DOH COVID Data Drop_ 20201121 - 03 Metad… 1rlhqnLZyU3Ri_COgC… <named list [3…
#> 14 DOH COVID Data Drop_ 20201121 - 11 DOH D… 1OtL3G_bN2Ri1C6-WZ… <named list [3…
#> 15 01 READ ME FIRST (11_21).pdf 1ywcS448kmrWZra2vu… <named list [3…
## List the files and folders inside the Google Drive DoH Data Drop on
## 1 November 2020
gid <- datadrop_id(version = "archive", .date = "2020-11-01")
datadrop_ls(id = gid)
#> # A tibble: 13 x 3
#> name id drive_resource
#> * <chr> <chr> <list>
#> 1 DOH COVID Data Drop_ 20201101 - 02 Metad… 1m8zDRFJ5qKHky60zb… <named list [3…
#> 2 DOH Data Drop 20201101 - Changelog.xlsx 1KHWj0Mo__y0UqkXfm… <named list [3…
#> 3 01 READ ME FIRST (11_01).pdf 1sipJeeWgMqmS5cYv5… <named list [3…
#> 4 DOH COVID Data Drop_ 20201101 - 12 DDC T… 1DRYzj6m7HALBM2sVy… <named list [3…
#> 5 DOH COVID Data Drop_ 20201101 - 06 DOH D… 1XAo_lunz5HXZpGnF1… <named list [3…
#> 6 DOH COVID Data Drop_ 20201101 - 09 Quara… 1YarEezOL_wtdtBhFZ… <named list [3…
#> 7 DOH COVID Data Drop_ 20201101 - 11 DOH D… 1_Y8dx3ogvJN0RMHYv… <named list [3…
#> 8 DOH COVID Data Drop_ 20201101 - 04 Case … 1G3RITnBxmO0qIYAOG… <named list [3…
#> 9 DOH COVID Data Drop_ 20201101 - 10 DOH D… 16KEnIvD4ol4oQrtu8… <named list [3…
#> 10 DOH COVID Data Drop_ 20201101 - 08 Quara… 19qpm1UseQrSmELji8… <named list [3…
#> 11 DOH COVID Data Drop_ 20201101 - 03 Metad… 1a4yFn9jxGgDQ3_StU… <named list [3…
#> 12 DOH COVID Data Drop_ 20201101 - 05 DOH D… 1D6sO0tNudLENeiMzK… <named list [3…
#> 13 DOH COVID Data Drop_ 20201101 - 07 Testi… 1eW8C062WTfMJC9mn2… <named list [3…
Finally, with the Google Drive ID of the folder of interest and a listing of its contents, the datadrop_get function is used to retrieve the specified file and read it into R:
## Retrieve the latest Case Information file in the DoH Data Drop
gid <- datadrop_id()
tab <- datadrop_ls(id = gid)
datadrop_get(tbl = tab, fn = "Case Information", path = tempfile())
#> # A tibble: 416,852 x 22
#> CaseCode Age AgeGroup Sex DateSpecimen DateResultRelea… DateRepConf
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 C581144 53 50 to 54 MALE "2020-07-15" "2020-07-16" 2020-07-21
#> 2 C352619 56 55 to 59 FEMA… "" "" 2020-05-04
#> 3 C466590 36 35 to 39 MALE "2020-07-14" "2020-07-17" 2020-07-20
#> 4 C971812 44 40 to 44 MALE "2020-05-07" "" 2020-05-21
#> 5 C445586 57 55 to 59 MALE "2020-07-14" "2020-07-16" 2020-07-19
#> 6 C354376 34 30 to 34 FEMA… "2020-04-14" "2020-04-16" 2020-09-01
#> 7 C583142 54 50 to 54 MALE "2020-07-20" "2020-07-25" 2020-08-02
#> 8 C329630 56 55 to 59 FEMA… "2020-07-20" "2020-07-22" 2020-07-27
#> 9 C349878 22 20 to 24 MALE "2020-07-22" "2020-07-26" 2020-07-30
#> 10 C956795 48 45 to 49 FEMA… "2020-07-30" "2020-08-01" 2020-08-04
#> # … with 416,842 more rows, and 15 more variables: DateDied <chr>,
#> # DateRecover <chr>, RemovalType <chr>, Admitted <chr>, RegionRes <chr>,
#> # ProvRes <chr>, CityMunRes <chr>, CityMuniPSGC <chr>, BarangayRes <chr>,
#> # BarangayPSGC <chr>, HealthStatus <chr>, Quarantined <chr>, DateOnset <chr>,
#> # Pregnanttab <chr>, ValidationStatus <chr>
## Retrieve the archive Case Information file in the DoH Data Drop
## on 1 November 2020
gid <- datadrop_id(version = "archive", .date = "2020-11-01")
tab <- datadrop_ls(id = gid)
datadrop_get(tbl = tab, fn = "Case Information", path = tempfile())
#> # A tibble: 383,113 x 22
#> CaseCode Age AgeGroup Sex DateSpecimen DateResultRelea… DateRepConf
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 C545959 23 20 to 24 FEMA… "2020-08-15" "2020-08-16" 2020-08-19
#> 2 C542194 40 40 to 44 MALE "2020-07-28" "2020-07-30" 2020-08-03
#> 3 C805584 31 30 to 34 MALE "" "" 2020-07-31
#> 4 C656786 51 50 to 54 FEMA… "2020-07-13" "2020-07-14" 2020-07-17
#> 5 C345520 23 20 to 24 MALE "" "" 2020-05-20
#> 6 C700565 50 50 to 54 MALE "2020-07-29" "2020-07-31" 2020-08-05
#> 7 C290814 68 65 to 69 MALE "2020-05-23" "" 2020-05-30
#> 8 C197757 34 30 to 34 MALE "2020-07-16" "2020-07-20" 2020-07-23
#> 9 C679290 49 45 to 49 MALE "2020-06-03" "2020-06-04" 2020-06-07
#> 10 C667799 28 25 to 29 FEMA… "2020-07-10" "2020-07-15" 2020-08-08
#> # … with 383,103 more rows, and 15 more variables: DateDied <chr>,
#> # DateRecover <chr>, RemovalType <chr>, Admitted <chr>, RegionRes <chr>,
#> # ProvRes <chr>, CityMunRes <chr>, CityMuniPSGC <chr>, BarangayRes <chr>,
#> # BarangayPSGC <chr>, HealthStatus <chr>, Quarantined <chr>, DateOnset <chr>,
#> # Pregnanttab <chr>, ValidationStatus <chr>
The covidphdata package is in active development. The following are the planned development milestones for covidphdata, in order of priority:
- Develop data retrieval functions - done; undergoing use tests by the author and other potential users.
- Develop basic data cleaning, processing and structuring functions - It should be noted that the functions currently available for the covidphdata data retrieval workflow get the data from the DoH Data Drop as is; that is, no other processing or manipulation is done to the data. This component will be served by another set of functions for cleaning, processing and structuring the retrieved datasets. These functions are currently under development.
- Develop complex data checking and quality assessment functions - Depending on feedback from users and fellow scientists and researchers using the package in their COVID-19-related R workflows, it would be good to include more advanced and complex data checking and quality assessments, which may include partly-supervised or semi-automated duplicate record detection, data consistency checks, and re-generation of dropped variables. Once we learn what users need in their usual workflows, we will plan which of these to develop robust functions for.
- Any other functionalities suggested by users - These will be considered on a case-by-case basis.
- Submit to CRAN - Once the package reaches a sufficiently mature stage in its development lifecycle, prepare and submit the covidphdata package for inclusion in the Comprehensive R Archive Network (CRAN).
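While the dedicated cleaning functions are under development, a minimal user-side sketch of the kind of processing they might cover, namely recoding the Data Drop's empty strings to NA and parsing the ISO-formatted date columns, is shown below. clean_dates is a hypothetical helper written for illustration, not part of the package.

```r
## Hypothetical helper for illustration: recode empty strings to NA and
## parse ISO "YYYY-MM-DD" date columns in a retrieved Data Drop table
clean_dates <- function(df, date_cols) {
  for (col in date_cols) {
    x <- df[[col]]
    x[x == ""] <- NA        # the Data Drop encodes missing dates as ""
    df[[col]] <- as.Date(x) # ISO-formatted strings parse directly
  }
  df
}

## Stand-in data frame mimicking two Case Information columns
cases <- data.frame(DateSpecimen = c("2020-07-15", ""),
                    DateRepConf  = c("2020-07-21", "2020-05-04"),
                    stringsAsFactors = FALSE)
cases <- clean_dates(cases, c("DateSpecimen", "DateRepConf"))
```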
The main limitations and challenges in developing the covidphdata package are the uncertainties and instabilities of the DoH Data Drop. As described above, since the launch of the DoH Data Drop, several interface changes have been implemented that changed the way the DoH Data Drop is distributed and can be accessed. Changes of this kind will almost always require a syntax-breaking update to the current data retrieval functions. The latest README document also indicates that the archive data has been limited to the current month (November 2020) because of ongoing efforts to transfer the archive data to a new website. This may mean that archive data will be distributed via this new website while the latest updates continue to be released via Google Drive. Depending on what these changes end up being, updates to the syntax of the current data retrieval functions are likely. We will monitor these changes and adapt the functions as soon as possible to ensure compatibility with the new interfaces.
Feedback, bug reports and feature requests are welcome; file issues or seek support here. If you would like to contribute to the package, please see our contributing guidelines.
This project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.