Resources on dealing with dirty data problems and using Open Refine

paulbradshaw/cleaning

Dealing with dirty data problems (and using Open Refine)

This repo contains resources for dealing with data problems, in particular using the tool Open Refine.

Data cleaning tools

Typical data problems

  • Numbers/dates treated as strings (often because of currency or percentage signs, or even spaces - try find and replace)
  • Strings treated as numbers: e.g. company ‘numbers’, phone numbers and codes often have leading zeroes removed when they are an integral part of the code.
  • Numbers and units combined in sentence structures
  • Combined data (addresses)
  • Different data in one column (country, region and authority, for example, with spaces or formatting used to indicate the difference)
  • Variant spellings
  • Inconsistently entered info (e.g. £5k vs £5,000)
  • Different terms for same thing
  • Mistypings - missing decimals etc.
  • Merged cells
  • Empty rows
  • Headings across multiple rows
  • Converted PDFs
  • Missing information
  • Duplicate information
  • Format
  • Need to extract information - e.g. first name/surname; street name/region; year/month
  • Need to classify information - e.g. male vs female name
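Several of the problems above (currency or percentage signs, inconsistent entries such as £5k vs £5,000, and codes that have lost their leading zeroes) can also be handled with a few lines of code. The following is a minimal Python sketch, not a complete solution: the function names `parse_amount` and `pad_code`, the list of symbols stripped, and the `k`/`m`/`bn` suffixes are all assumptions made for illustration.

```python
import re

# Assumed shorthand suffixes; extend to match your own data.
_MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "bn": 1_000_000_000}


def parse_amount(text):
    """Turn strings like '£5k' or '£5,000' into a float.

    Returns None when the text does not look like a number,
    so that problem cells can be reviewed by hand.
    """
    # Strip currency/percentage signs, thousands separators and spaces.
    cleaned = re.sub(r"[£$€%,\s]", "", text.strip().lower())
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)(k|m|bn)?", cleaned)
    if match is None:
        return None
    number, suffix = match.groups()
    return float(number) * _MULTIPLIERS.get(suffix, 1)


def pad_code(value, width):
    """Restore leading zeroes to a code that was stored as a number."""
    return str(value).zfill(width)
```

For example, `parse_amount("£5k")` and `parse_amount("£5,000")` both give `5000.0`, and `pad_code(1234567, 8)` gives `"01234567"`. Returning `None` rather than guessing keeps unparseable cells visible.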

I keep a series of bookmarked materials on cleaning using Pinboard at https://pinboard.in/u:paulbradshaw/t:cleaning

Examples of dirty data

See the dirtydata folder in this repo for examples of dirty data.

This sample dirty dataset can be used for basic data cleaning in Open Refine.

The European Investment Bank database can be downloaded (look for Export to Excel near the bottom) and provides a useful example of data where dates are formatted as strings.
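Dates stored as strings can be converted by trying a list of likely formats in turn. This is a rough Python sketch; the formats in `CANDIDATE_FORMATS` are assumptions, not the formats actually used in the European Investment Bank export, so check a few rows of your own data and adjust the list.

```python
from datetime import date, datetime

# Assumed formats; replace with the ones that appear in your data.
CANDIDATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%d.%m.%Y")


def parse_date(text):
    """Return a date for the first format that fits, or None if none do."""
    text = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # Try the next candidate format.
    return None
```

The order of the formats matters when a string is ambiguous (day first vs month first), so put the format you trust most at the start of the list.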

I also bookmark examples of dirty data at https://pinboard.in/u:paulbradshaw/t:dirtydata

For working with XML files try the ones that can be downloaded from the Food Standards Agency API page.
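If you want to see how XML maps onto rows and columns before loading it into a tool, a short script can help. This Python sketch uses the standard library's `xml.etree.ElementTree`; the sample XML and its element names (`records`, `record`, `name`, `rating`) are invented for illustration and are not the structure of the Food Standards Agency files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real files will use different element names.
SAMPLE_XML = """<records>
  <record><name>Cafe A</name><rating>5</rating></record>
  <record><name>Cafe B</name><rating>3</rating></record>
</records>"""


def xml_to_rows(xml_text, record_tag):
    """Return one dict per record element, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in record}
        for record in root.iter(record_tag)
    ]


rows = xml_to_rows(SAMPLE_XML, "record")
```

Every value comes back as a string, so numbers and dates still need converting afterwards.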

For JSON files try petition.parliament.uk - go to any petition and look for the JSON link at the bottom of the page.
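JSON is often nested, which makes it awkward to view as a table. One common approach is to flatten it into a single row with dotted column names. The Python sketch below shows the idea; the `flatten` helper and the sample data (with keys `title`, `counts`, `area` and `count`) are made up for illustration and do not reflect the actual structure of the petitions JSON.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into one dict with dotted keys."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)  # List positions become part of the key.
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


# Hypothetical sample; real files will have different keys.
SAMPLE = (
    '{"title": "Example", '
    '"counts": [{"area": "A", "count": 3}, {"area": "B", "count": 5}]}'
)
row = flatten(json.loads(SAMPLE))
```

Here `row` has keys such as `title`, `counts.0.area` and `counts.1.count`, which can then be written out as column headings.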

Tutorials: Open Refine

A series of introductory guides to Open Refine can be found in the GitHub repo for one of my modules at Birmingham City University here.

Tutorials: Cleaning in spreadsheets

Tutorials: Cleaning in R (dedicated folder here)
