This repo contains resources for dealing with data problems. In particular, the tool Open Refine.
- Open Refine
- Crosswalker - allows you to match datasets where the matching columns may not match entirely
- Dataproofer
- Gender API and Genderize.io (also available as an R package) can be used to classify names by gender
- Parserator can be used to split addresses
- The tidyverse in R: a collection of libraries for making data 'tidy'. Also janitor
- Python has pandas, and pyjanitor
- Workbench (no longer supported but can be run using Docker) - includes much of Refine's functionality including clustering
- Data Wrangler (no longer supported)
- I keep a list of data cleaning tools on Pinboard
- Numbers/dates treated as strings (often because of currency or percentage signs, or even spaces - try find and replace)
- Strings treated as numbers: e.g. company ‘numbers’, phone numbers and codes often have leading zeroes removed when they are an integral part of the code.
- Numbers and units combined in sentences structures
- Combined data (addresses)
- Different data in one column (country, region and authority, for example, with spaces or formatting used to indicate the difference)
- Variant spellings
- Inconsistently entered info (e.g. £5k vs £5,000)
- Different terms for same thing
- Mistypings - missing decimals etc.
- Merged cells
- Empty rows
- Headings across multiple rows
- Converted PDFs
- Missing information
- Duplicate information
- Format
- Need to extract information - e.g. first name/surname; street name/region; year/month
- Need to classify information - e.g. male vs female name
I keep a series of bookmarked materials on cleaning using Pinboard at https://pinboard.in/u:paulbradshaw/t:cleaning
See the dirtydata folder in this repo for examples of dirty data.
This sample dirty dataset can be used for basic data cleaning in Open Refine
The European Investment Bank database can be downloaded (look for Export to Excel near the bottom) and provides a useful example of data where dates are formatted as strings.
I also bookmark examples of dirty data at https://pinboard.in/u:paulbradshaw/t:dirtydata
For working with XML files try the ones that can be downloaded from the Food Standards Agency API page
For JSON files try petition.parliament.uk - go to any petition and look for the JSON link at the bottom of the page.
- Open Refine's 'common transforms' features for basic cleaning
- How to: clean up spreadsheet headings that run across multiple rows using Open Refine
- Converting JSON or XML into spreadsheets using Open Refine
- Tony Hirst explains Data Shaping in Google Refine – Generating New Rows from Multiple Values in a Single Column
- I've written about Scraping data with Google Refine
- Former MA Online Journalism student Ion Mates explains how he used some of Refine's more advanced functionality, and regex, in his post on cleaning a converted PDF
- Former MA Online Journalism student Cristian Giuletti explains How to: combine multiple rows in a dataset where text is split across them
- More at https://onlinejournalismblog.com/tag/google-refine/page/2/
- Grabbing or checking the first, middle or last part of a piece of information: RIGHT, LEFT and MID
- What day did that date fall on? Which year was the worst? Extracting days, months and years from full dates
- Converting sentences to numbers using a spreadsheet
Tutorials: Cleaning in R (dedicated folder here)
- Cleaning and reshaping data on firefighters
- Cleaning data while importing
- Reshaping wide to long: police misconduct data
- Reshaping long to wide and filtering data and calculating year on year changes and percentages
- Emily Halford's guide to cleaning with the Janitor package
- Matt Waite's Gitbook on data journalism, which covers cleaning in chapter 10...
- ...Janitor in chapter 11 and
- ...Refinr, Open Refine in R in chapter 12