This package is dedicated to simplifying the cleaning and standardisation of line list data. Given a case line list stored as a data.frame, it aims to:
- standardise the variable names, replacing all non-ASCII characters with their closest Latin equivalent, removing blank spaces and other separators, enforcing lower case, and using a single separator between words
- standardise the labels used in all variables of type character and factor, as above
- convert POSIXct and POSIXlt to Date objects
- extract dates from a messy variable, automatically detecting formats, allowing inconsistent formats and dates flanked by other text
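As a rough illustration of the first aim, the name clean-up can be approximated in base R. This is only a sketch and not the package's implementation: the function name standardise_names is hypothetical, the separator is fixed to an underscore, and the result of transliterating non-ASCII characters with iconv() depends on the platform.

```r
# Hypothetical helper, not part of linelist: a base R approximation of
# variable name standardisation.
standardise_names <- function(x) {
  # Transliterate to ASCII; the exact output of //TRANSLIT is platform-dependent
  x <- iconv(x, to = "ASCII//TRANSLIT")
  # Enforce lower case
  x <- tolower(x)
  # Replace every run of non-alphanumeric characters with a single underscore
  x <- gsub("[^a-z0-9]+", "_", x)
  # Drop underscores left at the start or end of the name
  gsub("^_+|_+$", "", x)
}

standardise_names(c("'ID", "Date of Onset.", "DisCharge..", "GENDER_", "messy/dates"))
```

Applied to a data.frame, the same helper could be used as names(df) <- standardise_names(names(df)).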
To install the current stable version of the package from CRAN, type:
install.packages("linelist")
To benefit from the latest features and bug fixes, install the development version of the package from GitHub using:
devtools::install_github("reconhub/linelist")
Note that this requires the devtools package to be installed.
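If it is unclear whether devtools is already available, a small guard can check before installing. The sketch below uses a hypothetical helper, ensure_package, whose installer argument is an illustrative addition so the check can be tried without touching the network; it only relies on requireNamespace() and install.packages() from base R.

```r
# Hypothetical helper: install a package only when it is not already available.
ensure_package <- function(pkg, installer = utils::install.packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Only reached when the package cannot be loaded
    installer(pkg)
  }
  # Report (invisibly) whether the package is available afterwards
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Typical use before installing the development version:
# ensure_package("devtools")
```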
Let us consider a messy data.frame as an example:
library(linelist)
example_data <- messy_data(10)
example_data
#>       'ID Date of Onset. DisCharge.. GENDER_ Épi.Case_définition
#> 1  cmfvdr     2018-01-07  17/01/2018    male           suspected
#> 2  opwfwa     2018-01-07  17/01/2018  FEMALE          not a case
#> 3  zsjyva     2018-01-09  19/01/2018    MALE            PROBABLE
#> 4  exqmoe     2018-01-09  19/01/2018  female           suspected
#> 5  vxvrqq     2018-01-09  19/01/2018    male            probable
#> 6  yrtvea     2018-01-02  12/01/2018    Male            probable
#> 7  tnnlfm     2018-01-10  20/01/2018    male           suspected
#> 8  ihpqxz     2018-01-10  20/01/2018    Male            PROBABLE
#> 9  feabyd     2018-01-06  16/01/2018  female           confirmed
#> 10 bpfkiu     2018-01-03  13/01/2018    male          not a case
#>           messy/dates      lat      lon
#> 1          2018_10_17 13.34652 48.84905
#> 2                <NA> 14.87742 46.98764
#> 3  that's 24/12/1989! 13.66359 49.67178
#> 4          2018_10_17 12.67856 49.25075
#> 5          2018-10-18 12.66741 47.02325
#> 6     // 24//12//1989 13.66111 48.93061
#> 7          2018-10-18 13.60725 47.34525
#> 8                <NA> 10.87389 47.50097
#> 9          2018-10-18 15.94802 48.30479
#> 10 that's 24/12/1989! 14.25785 48.17981
We then use the clean_data() command to get nice, clean data:
clean_data(example_data, guess_dates = TRUE)
#>        id date_of_onset  discharge gender epi_case_definition messy_dates
#> 1  cmfvdr    2018-01-07 2018-01-17   male           suspected  2018-10-17
#> 2  opwfwa    2018-01-07 2018-01-17 female          not_a_case        <NA>
#> 3  zsjyva    2018-01-09 2018-01-19   male            probable  1989-12-24
#> 4  exqmoe    2018-01-09 2018-01-19 female           suspected  2018-10-17
#> 5  vxvrqq    2018-01-09 2018-01-19   male            probable  2018-10-18
#> 6  yrtvea    2018-01-02 2018-01-12   male            probable  1989-12-24
#> 7  tnnlfm    2018-01-10 2018-01-20   male           suspected  2018-10-18
#> 8  ihpqxz    2018-01-10 2018-01-20   male            probable        <NA>
#> 9  feabyd    2018-01-06 2018-01-16 female           confirmed  2018-10-18
#> 10 bpfkiu    2018-01-03 2018-01-13   male          not_a_case  1989-12-24
#>         lat      lon
#> 1  13.34652 48.84905
#> 2  14.87742 46.98764
#> 3  13.66359 49.67178
#> 4  12.67856 49.25075
#> 5  12.66741 47.02325
#> 6  13.66111 48.93061
#> 7  13.60725 47.34525
#> 8  10.87389 47.50097
#> 9  15.94802 48.30479
#> 10 14.25785 48.17981
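The label clean-up visible in the gender and epi_case_definition columns can be approximated in base R. The sketch below is an assumption about the general idea, not the package's code: standardise_labels and clean_labels_df are hypothetical names, and the helper is applied naively to every character or factor column, so it would also rewrite text columns that hold dates.

```r
# Hypothetical helper: lower-case labels and collapse separators to "_".
standardise_labels <- function(x) {
  was_factor <- is.factor(x)
  out <- tolower(as.character(x))
  # Replace runs of non-alphanumeric characters with a single underscore
  out <- gsub("[^a-z0-9]+", "_", out)
  # Drop leading and trailing underscores; NA values stay NA throughout
  out <- gsub("^_+|_+$", "", out)
  if (was_factor) factor(out) else out
}

# Hypothetical helper: apply the clean-up to all character and factor columns.
clean_labels_df <- function(df) {
  is_text <- vapply(df, function(col) is.character(col) || is.factor(col), logical(1))
  for (nm in names(df)[is_text]) {
    df[[nm]] <- standardise_labels(df[[nm]])
  }
  df
}

toy <- data.frame(
  gender = c("male", "FEMALE", "Male"),
  status = c("not a case", "PROBABLE", NA),
  lat = c(13.3, 14.8, 13.6),
  stringsAsFactors = FALSE
)
clean_labels_df(toy)
```

Numeric columns such as lat are left untouched because only character and factor columns are selected.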
Procedures to clean data, first and foremost aimed at data.frame formats, include:
- clean_data(): the main function, taking a data.frame as input, and doing all the variable name, internal label, and date processing described above
- clean_variable_names(): like clean_data(), but only the variable names
- clean_variable_labels(): like clean_data(), but only the variable labels
- clean_variable_spelling(): provided with a dictionary, will correct the spelling of values in a variable and can globally correct commonly misspelled words
- clean_dates(): like clean_data(), but only the dates
- guess_dates(): find dates in various, unspecified formats in a messy character vector
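To give a feel for what finding dates in messy text involves, here is a deliberately simplified base R sketch. It is not the algorithm used by guess_dates(): the function name extract_simple_dates is hypothetical, only year-month-day and day-month-year orders are recognised, and any entry containing extra digits outside the date is returned as NA.

```r
# Hypothetical helper: pull a single date out of messy text.
extract_simple_dates <- function(x) {
  x <- as.character(x)
  # Collapse every run of non-digits into "-" and trim the ends,
  # e.g. "that's 24/12/1989!" becomes "24-12-1989"
  norm <- gsub("[^0-9]+", "-", x)
  norm <- gsub("^-+|-+$", "", norm)

  out <- as.Date(rep(NA_character_, length(x)))
  # Decide the field order from the position of the 4-digit year
  ymd <- grepl("^[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}$", norm)
  dmy <- grepl("^[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}$", norm)
  out[ymd] <- as.Date(norm[ymd], format = "%Y-%m-%d")
  out[dmy] <- as.Date(norm[dmy], format = "%d-%m-%Y")
  out
}

messy <- c("2018_10_17", NA, "that's 24/12/1989!", "// 24//12//1989", "2018-10-18")
extract_simple_dates(messy)
```

Treating every non-digit run as a separator is what lets inconsistent delimiters and surrounding text be handled by the same two patterns; the cost is that ambiguous orders such as month-day-year are not distinguished.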
Bug reports and feature requests should be posted on GitHub using the issue system. All other questions should be posted on the RECON forum:
http://www.repidemicsconsortium.org/forum/
Contributions are welcome via pull requests.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.