Commit

rev
Edouard-Legoupil committed Nov 8, 2023
1 parent 74d6325 commit 674e641
Showing 72 changed files with 902 additions and 813 deletions.
46 changes: 39 additions & 7 deletions README.Rmd
@@ -21,20 +21,38 @@ options(scipen = 999)
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](code_of_conduct.md)
[![R-CMD-check](https://github.com/impact-initiatives/cleaningtools/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/impact-initiatives/cleaningtools/actions/workflows/R-CMD-check.yaml)
[![codecov](https://codecov.io/gh/impact-initiatives/cleaningtools/branch/master/graph/badge.svg?token=SOH3NGXQDU)](https://codecov.io/gh/impact-initiatives/cleaningtools)

<!-- badges: end -->

The `{cleaningtools}` package focuses on the survey data cleaning process. It allows for a fully documented and reproducible cleaning process, based on the generation of a standardized `cleaning and deletion log`. With this type of process, quality assurance and auditing can be performed easily.

This tool supports the implementation of the IMPACT Initiatives / REACH guidance: [Data Cleaning Guidelines for Structured Data](https://www.reachresourcecentre.info/wp-content/uploads/2022/05/IMPACT_Data-Cleaning-Guidelines_FINAL_To-share-11.pdf) & [Data Cleaning Minimum Standards Checklist](https://www.reachresourcecentre.info/wp-content/uploads/2020/03/IMPACT_Memo_Data-Cleaning-Min-Standards-Checklist_28012020-1.pdf).

The workflow supported by the tool includes:

The `cleaningtools` package focuses on cleaning, and has three components:
1. Get your raw data and your form from your Kobo/ODK/ONA server.

2. Define a __list of logical checks__ based on the specific content of your form. This is an Excel spreadsheet in which each row defines a check describing incompatible responses (`check_id`, `description`, `check_to_perform`, `columns_to_clean`), such as "_primary_livelihood is rented but expenses less than 500000_" or "_access water and tank emptied_".

3. Pipe a series of __systematic check__ functions over the data (_outliers, shortest path, personally identifiable information, duration..._), including the logical checks previously defined. Each check produces its own log.

4. Assemble the logs and export the __`cleaning log`__ to a dedicated Excel spreadsheet (`create_xlsx_cleaning_log()`) so that the person responsible for the cleaning can manually decide which cleaning action to perform, among the following values:

**1. Check**, which includes a set of functions that flag values, such as check_outliers and check_logical.
|value|Definition|
|-----|----------|
|`change_response`|Change the response to new.value|
|`blank_response`|Remove the response and set it to NA|
|`remove_survey`|Delete the survey|
|`no_action_value`|No action to take|

**2. Create**, which includes a set of functions to create different items for use in cleaning, such as the cleaning log from the checks, clean data, and enumerator performance.
5. Apply the manually reviewed `cleaning log` to the raw data to obtain the __cleaned data__, aka `checked_dataset`.

6. Then __review__ how the cleaning was applied through the dedicated reports: `review_cleaning()`, `review_the_others_log`, and `review_sf` for the sampling frame.


Please check the package tutorial vignette to review the content in more detail.
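
To make step 2 of the workflow more concrete, here is a minimal base R sketch of what a logical check list could look like and how one check might flag rows and produce a log. It is not the package's implementation: the `survey` data, the `run_checks()` helper, and the log columns (`question`, `old_value`, `issue`) are hypothetical names chosen for illustration. Only the four check list columns come from the description above.

```r
# Hypothetical survey data; column names are illustrative only.
survey <- data.frame(
  uuid = c("a1", "b2", "c3"),
  primary_livelihood = c("rented", "owned", "rented"),
  expenses = c(350000, 250000, 900000),
  stringsAsFactors = FALSE
)

# A logical check list with the four columns named in the workflow.
checks <- data.frame(
  check_id = "check_1",
  description = "primary_livelihood is rented but expenses less than 500000",
  check_to_perform = "primary_livelihood == \"rented\" & expenses < 500000",
  columns_to_clean = "primary_livelihood, expenses",
  stringsAsFactors = FALSE
)

# Hypothetical helper: evaluate each check against the data and return
# one log row per flagged survey and per column to clean.
run_checks <- function(data, checks) {
  logs <- lapply(seq_len(nrow(checks)), function(i) {
    # Evaluate the check expression using the data columns as variables.
    flagged <- eval(parse(text = checks$check_to_perform[i]), envir = data)
    flagged <- !is.na(flagged) & flagged
    if (!any(flagged)) return(NULL)
    columns <- trimws(strsplit(checks$columns_to_clean[i], ",")[[1]])
    do.call(rbind, lapply(columns, function(col) {
      data.frame(
        uuid = data$uuid[flagged],
        question = col,
        old_value = as.character(data[[col]][flagged]),
        check_id = checks$check_id[i],
        issue = checks$description[i],
        stringsAsFactors = FALSE
      )
    }))
  })
  do.call(rbind, logs)
}

logical_log <- run_checks(survey, checks)
print(logical_log)
```

In this toy data only survey `a1` is both rented and below the expense threshold, so the log contains one row for each of the two columns to clean. The log format produced by the package itself may differ.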

**3. Review**, which includes a set of functions to review the cleaning.

## Installation
## Installation & Usage

You can install the development version from [GitHub](https://github.com/) with:

@@ -43,10 +61,24 @@ You can install the development version from [GitHub](https://github.com/) with:
devtools::install_github("impact-initiatives/cleaningtools")
```

The package comes with a parameterised report template to ease and speed up the full process.

Once you have a good understanding of the process above, create an RStudio project, install the package, download your data and your form into a dedicated sub-folder (for instance `data-raw`), create an Excel file for your `logical checks`, and add the file defining your `sampling plan`, if any.

Then create a notebook using the `clean` notebook template included in the package and start documenting all the parameters.

Once done, you can run the code chunks one after the other. After the first chapter, you should have a `cleaning log` file created within the same `data-raw` folder. Open it and manually set the cleaning action for each check.

Then run the last few chunks to apply the log and review the results.

Et voilà, you should then have the `cleaned_data` in your `data-raw` folder.

## Current Limitation

The package assumes that the survey data is a single data frame. It does not work out of the box with a datalist, i.e. a survey dataset that has more than one data frame.

## Code of Conduct

Please note that the cleaningtools project is released with a [Contributor Code of Conduct](https://impact-initiatives.github.io/cleaningtools/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.
Please note that the {cleaningtools} project is released with a [Contributor Code of Conduct](https://impact-initiatives.github.io/cleaningtools/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms. Developers can check the `dev/function_documentation.Rmd` notebook created with [{fusen}](https://thinkr-open.github.io/fusen/index.html).


91 changes: 79 additions & 12 deletions README.md
@@ -7,22 +7,59 @@
Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](code_of_conduct.md)
[![R-CMD-check](https://github.com/impact-initiatives/cleaningtools/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/impact-initiatives/cleaningtools/actions/workflows/R-CMD-check.yaml)
[![codecov](https://codecov.io/gh/impact-initiatives/cleaningtools/branch/master/graph/badge.svg?token=SOH3NGXQDU)](https://codecov.io/gh/impact-initiatives/cleaningtools)

<!-- badges: end -->

The `cleaningtools` package focuses on cleaning, and has three
components:
The `{cleaningtools}` package focuses on the survey data cleaning
process. It allows for a fully documented and reproducible cleaning
process, based on the generation of a standardized `cleaning and
deletion log`. With this type of process, quality assurance and auditing
can be performed easily.

This tool supports the implementation of the IMPACT Initiatives / REACH
guidance: [Data Cleaning Guidelines for Structured
Data](https://www.reachresourcecentre.info/wp-content/uploads/2022/05/IMPACT_Data-Cleaning-Guidelines_FINAL_To-share-11.pdf)
& [Data Cleaning Minimum Standards
Checklist](https://www.reachresourcecentre.info/wp-content/uploads/2020/03/IMPACT_Memo_Data-Cleaning-Min-Standards-Checklist_28012020-1.pdf).

The workflow supported by the tool includes:

1. Get your raw data and your form from your Kobo/ODK/ONA server.

2. Define a **list of logical checks** based on the specific content of
your form. This is an Excel spreadsheet in which each row defines a
check describing incompatible responses (`check_id`, `description`,
`check_to_perform`, `columns_to_clean`), such as
“*primary_livelihood is rented but expenses less than 500000*” or
“*access water and tank emptied*”.

3. Pipe a series of **systematic checks** over the data
(*outliers, shortest path, personally identifiable information,
duration…*), including the logical checks previously defined. Each
check produces its own log.

4. Assemble the logs and export the **`cleaning log`** to a dedicated
Excel spreadsheet (`create_xlsx_cleaning_log()`) so that the person
responsible for the cleaning can manually decide which cleaning
action to perform, among the following values:

| Value             | Definition                           |
|-------------------|--------------------------------------|
| `change_response` | Change the response to new.value     |
| `blank_response`  | Remove the response and set it to NA |
| `remove_survey`   | Delete the survey                    |
| `no_action_value` | No action to take                    |

**1. Check**, which includes a set of functions that flag values, such
as check_outliers and check_logical.
5. Apply the manually reviewed `cleaning log` to the raw data to obtain
the **cleaned data**, aka `checked_dataset`.

**2. Create**, which includes a set of functions to create different
items for use in cleaning, such as the cleaning log from the checks,
clean data, and enumerator performance.
6. Then **review** how the cleaning was applied through the dedicated
reports: `review_cleaning()`, `review_the_others_log`, and `review_sf`
for the sampling frame.

**3. Review**, which includes a set of functions to review the cleaning.
Please check the package tutorial vignette to review the content in more
detail.
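
As an illustration of step 5, the sketch below applies a reviewed log to raw data in base R, using the four action values from the table above. It is only a sketch under stated assumptions, not the package's own code: the `apply_log()` helper and the log column names (`uuid`, `question`, `change_type`) are hypothetical, and `new.value` simply follows the wording of the table.

```r
# Hypothetical raw data; column names are illustrative only.
raw <- data.frame(
  uuid = c("a1", "b2", "c3"),
  water_source = c("rivr", "tap", "well"),
  expenses = c(350000, 99999999, 120000),
  stringsAsFactors = FALSE
)

# Hypothetical reviewed cleaning log: one decision per flagged value.
cleaning_log <- data.frame(
  uuid = c("a1", "b2", "c3", "a1"),
  question = c("water_source", "expenses", NA, "expenses"),
  change_type = c("change_response", "blank_response",
                  "remove_survey", "no_action_value"),
  new.value = c("river", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply each log entry according to its action.
apply_log <- function(data, log) {
  for (i in seq_len(nrow(log))) {
    row <- data$uuid == log$uuid[i]
    action <- log$change_type[i]
    if (action == "change_response") {
      col <- log$question[i]
      # Coerce the new value to the type of the target column.
      data[row, col] <- methods::as(log$new.value[i], class(data[[col]])[1])
    } else if (action == "blank_response") {
      data[row, log$question[i]] <- NA
    }
    # "no_action_value" leaves the data untouched;
    # "remove_survey" is handled once, after the loop.
  }
  removed <- log$uuid[log$change_type == "remove_survey"]
  data[!data$uuid %in% removed, , drop = FALSE]
}

cleaned <- apply_log(raw, cleaning_log)
print(cleaned)
```

Removing surveys only after all value-level changes have been applied keeps the loop simple, because every log entry can still find its row while the loop runs.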

## Installation
## Installation & Usage

You can install the development version from
[GitHub](https://github.com/) with:
@@ -32,9 +69,39 @@ You can install the development version from
devtools::install_github("impact-initiatives/cleaningtools")
```

The package comes with a parameterised report template to ease and
speed up the full process.

Once you have a good understanding of the process above, create an
RStudio project, install the package, download your data and your form
into a dedicated sub-folder (for instance `data-raw`), create an Excel
file for your `logical checks`, and add the file defining your
`sampling plan`, if any.

Then create a notebook using the `clean` notebook template included in
the package and start documenting all the parameters.

Once done, you can run the code chunks one after the other. After
the first chapter, you should have the `cleaning log` file created
within the same `data-raw` folder. Open it and manually set the
cleaning action for each check.

Then run the last few chunks to apply the log and review the results.

Et voilà, you should then have the `cleaned_data` in your `data-raw`
folder.

## Current Limitation

The package assumes that the survey data is a single data frame. It does
not work out of the box with a datalist, i.e. a survey dataset that has
more than one data frame.

## Code of Conduct

Please note that the cleaningtools project is released with a
Please note that the {cleaningtools} project is released with a
[Contributor Code of
Conduct](https://impact-initiatives.github.io/cleaningtools/CODE_OF_CONDUCT.html).
By contributing to this project, you agree to abide by its terms.
By contributing to this project, you agree to abide by its terms.
Developers can check the `dev/function_documentation.Rmd` notebook
created with [{fusen}](https://thinkr-open.github.io/fusen/index.html).
Binary file modified data-raw/logical_check_list.xlsx
Binary file not shown.
Binary file modified data-raw/review.xlsx
Binary file not shown.
3 changes: 1 addition & 2 deletions docs/404.html

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 1 addition & 2 deletions docs/CODE_OF_CONDUCT.html

3 changes: 1 addition & 2 deletions docs/LICENSE-text.html

3 changes: 1 addition & 2 deletions docs/LICENSE.html
