Welcome to the linelist package!

This package is dedicated to simplifying the cleaning and standardisation of line list data. Given a case line list stored as a data.frame, it aims to:

  • standardise the variable names, replacing all non-ASCII characters with their closest Latin equivalent, removing blank spaces and other separators, enforcing lower case, and using a single separator between words

  • standardise the labels used in all variables of type character and factor, as above

  • convert POSIXct and POSIXlt variables to Date objects

  • extract dates from a messy variable, automatically detecting formats and tolerating both inconsistent formats and dates flanked by other text
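As a rough illustration of the first and third points, the sketch below mimics the name standardisation and the POSIXct-to-Date conversion in base R. It is not linelist's implementation: `standardise_names()` is a hypothetical helper written for this example, and the ASCII transliteration performed by `iconv()` can differ between platforms for accented characters.

```r
# Illustrative base-R sketch only; NOT the code used inside linelist.
# standardise_names() is a hypothetical helper.
standardise_names <- function(x) {
  # Transliterate to ASCII (result for accented characters is platform-dependent)
  x <- iconv(x, to = "ASCII//TRANSLIT")
  # Enforce lower case
  x <- tolower(x)
  # Collapse every run of non-alphanumeric characters into a single "_"
  x <- gsub("[^a-z0-9]+", "_", x)
  # Drop leading and trailing separators
  gsub("^_+|_+$", "", x)
}

cleaned <- standardise_names(c("'ID", "Date of Onset.", "DisCharge..", "GENDER_"))
cleaned
# returns "id", "date_of_onset", "discharge", "gender"

# Converting a date-time to a plain Date, with the time zone made explicit
onset_time <- as.POSIXct("2018-01-07 13:45:00", tz = "UTC")
onset_date <- as.Date(onset_time, tz = "UTC")
onset_date
```

The same steps can be applied to the labels of character and factor variables.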

Installing the package

To install the current stable, CRAN version of the package, type:

install.packages("linelist")

To benefit from the latest features and bug fixes, install the development version of the package from GitHub using:

devtools::install_github("reconhub/linelist")

Note that this requires the devtools package to be installed.

Quick example

Let us consider a messy data.frame as an example:

library(linelist)
example_data <- messy_data(10)
example_data
#>       'ID Date of Onset. DisCharge.. GENDER_  Épi.Case_définition
#> 1  cmfvdr     2018-01-07  17/01/2018     male         suspected  
#> 2  opwfwa     2018-01-07  17/01/2018   FEMALE          not a case
#> 3  zsjyva     2018-01-09  19/01/2018     MALE            PROBABLE
#> 4  exqmoe     2018-01-09  19/01/2018   female         suspected  
#> 5  vxvrqq     2018-01-09  19/01/2018     male            probable
#> 6  yrtvea     2018-01-02  12/01/2018     Male            probable
#> 7  tnnlfm     2018-01-10  20/01/2018     male           suspected
#> 8  ihpqxz     2018-01-10  20/01/2018     Male            PROBABLE
#> 9  feabyd     2018-01-06  16/01/2018   female           confirmed
#> 10 bpfkiu     2018-01-03  13/01/2018     male          not a case
#>           messy/dates      lat      lon
#> 1          2018_10_17 13.34652 48.84905
#> 2                <NA> 14.87742 46.98764
#> 3  that's 24/12/1989! 13.66359 49.67178
#> 4          2018_10_17 12.67856 49.25075
#> 5          2018-10-18 12.66741 47.02325
#> 6     // 24//12//1989 13.66111 48.93061
#> 7          2018-10-18 13.60725 47.34525
#> 8                <NA> 10.87389 47.50097
#> 9          2018-10-18 15.94802 48.30479
#> 10 that's 24/12/1989! 14.25785 48.17981

We then use the clean_data() function to get nice, clean data!

clean_data(example_data, guess_dates = TRUE)
#>        id date_of_onset  discharge gender epi_case_definition messy_dates
#> 1  cmfvdr    2018-01-07 2018-01-17   male           suspected  2018-10-17
#> 2  opwfwa    2018-01-07 2018-01-17 female          not_a_case        <NA>
#> 3  zsjyva    2018-01-09 2018-01-19   male            probable  1989-12-24
#> 4  exqmoe    2018-01-09 2018-01-19 female           suspected  2018-10-17
#> 5  vxvrqq    2018-01-09 2018-01-19   male            probable  2018-10-18
#> 6  yrtvea    2018-01-02 2018-01-12   male            probable  1989-12-24
#> 7  tnnlfm    2018-01-10 2018-01-20   male           suspected  2018-10-18
#> 8  ihpqxz    2018-01-10 2018-01-20   male            probable        <NA>
#> 9  feabyd    2018-01-06 2018-01-16 female           confirmed  2018-10-18
#> 10 bpfkiu    2018-01-03 2018-01-13   male          not_a_case  1989-12-24
#>         lat      lon
#> 1  13.34652 48.84905
#> 2  14.87742 46.98764
#> 3  13.66359 49.67178
#> 4  12.67856 49.25075
#> 5  12.66741 47.02325
#> 6  13.66111 48.93061
#> 7  13.60725 47.34525
#> 8  10.87389 47.50097
#> 9  15.94802 48.30479
#> 10 14.25785 48.17981

What does it do?

The data-cleaning functions, aimed first and foremost at data.frame inputs, include:

  • clean_data(): the main function, which takes a data.frame as input and performs all the variable name, label, and date processing described above

  • clean_variable_names(): like clean_data(), but only cleans the variable names

  • clean_variable_labels(): like clean_data(), but only cleans the variable labels

  • clean_variable_spelling(): provided with a dictionary, corrects the spelling of values in a variable and can globally correct commonly misspelled words

  • clean_dates(): like clean_data(), but only processes the dates

  • guess_dates(): finds dates in various, unspecified formats in a messy character vector
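To give a feel for what extracting dates from messy text involves, here is a minimal base-R sketch. It does not reproduce how guess_dates() works internally: `extract_date()` is a hypothetical helper that recognises only year-month-day and day-month-year tokens, whereas the package detects formats automatically.

```r
# Illustrative base-R sketch only; NOT the algorithm used by guess_dates().
# extract_date() is a hypothetical helper.
extract_date <- function(x) {
  # Collapse runs of "/", "_" or "-" into a single "-"
  x <- gsub("[/_-]+", "-", x)
  # First token that looks like yyyy-mm-dd or dd-mm-yyyy
  pattern <- "[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}|[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}"
  hit <- regexpr(pattern, x)
  found <- !is.na(hit) & hit > 0
  token <- rep(NA_character_, length(x))
  token[found] <- regmatches(x, hit)
  # Decide the format from the position of the four-digit year
  ymd <- grepl("^[0-9]{4}-", token)
  dmy <- !is.na(token) & !ymd
  out <- rep(as.Date(NA), length(x))
  out[ymd] <- as.Date(token[ymd], format = "%Y-%m-%d")
  out[dmy] <- as.Date(token[dmy], format = "%d-%m-%Y")
  out
}

messy <- c("2018_10_17", NA, "that's 24/12/1989!", "// 24//12//1989", "2018-10-18")
parsed <- extract_date(messy)
parsed
```

Entries with no recognisable date, including missing values, come back as NA.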

Getting help online

Bug reports and feature requests should be posted on GitHub using the issue system. All other questions should be posted on the RECON forum:
http://www.repidemicsconsortium.org/forum/

Contributions are welcome via pull requests.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
