-
Notifications
You must be signed in to change notification settings - Fork 11
Cleanup recipes
These recipes are part of the HXL cookbook. Each recipe describes a general problem related to data cleanup, then shows how to use the HXL standard and the command-line tools to solve the problem.
Normally, you will put these recipes in a script or batchfile to automate or semi-automate data cleanup.
Problem: your dataset sometimes has extra spaces, which makes sorting and analysis difficult.
Recipe: use hxlclean (command) to normalise whitespace, either for specific columns or for the entire dataset.
(Note: this recipe uses the character "␣" in examples to represent spaces.)
Whitespace normalisation involves removing all leading and trailing space from a value, and then replacing any internal whitespace sequences with a single space. That means that any internal formatting (e.g. newlines, indentation) will be lost, so you should not use this process on fields that contain lists, paragraphs, etc. For example, with whitespace normalisation, the string “␣Food␣␣␣security␣
” becomes “Food␣security
”.
The first way to normalise whitespace is with the -W
or --whitespace-all
option, which applies to every field in the dataset:
hxlclean -W <dataset-in.csv >dataset-out.csv
Sometimes, however, you'll want to be more selective, and normalise only specific columns. In that case, you use the -w
or --whitespace
option followed by a comma-separate list of HXL tags:
hxlclean --whitespace sector,org,adm1 \
<dataset-in.csv >dataset-out.csv
In this example, hxlclean will normalise whitespace only in columns with the #sector, #org, or #adm1 tags. In either case, sorting and analysis become much simpler when “Coastal␣region
” and “Coastal␣␣␣region
” no longer sort as different values.
Problem: your dataset contains codes or standard terms for sorting, but they've been entered inconsistently, e.g. "Hospital" and "HOspital".
Recipe: use hxlclean (command) to normalise both whitespace and character case for the columns affected.
For example, if your dataset has a "Cluster or sector" column (tagged #sector), you can both normalise the whitespace and convert everything to uppercase, so that " fOod Security ", "food security", and "FOOD Security" all end up as "food security" and sort together for analysis, graphs, etc:
hxlclean --whitespace sector --lower sector \
<dataset-in.csv >dataset-out.csv
Problem: users have entered dates inconsistently in a dataset, and they're difficult to sort or analyse.
Recipe: use hxlclean (command) to attempt to normalise the dates.
hxlclean has a --date
(or -d
) option that will attempt to parse dates and normalise them to the form YYYY-MM-DD:
hxlclean --date period_date <dataset-in.csv >dataset-out.csv
This command will take dates like "March 31, 2015", "03-31-15", "31/03/2015", etc, and normalise them all to ISO 8601 standard format "2015-03-31."
Alternatively, you can use the --date-all
(or -D
) option to normalise dates in columns under any HXL hashtag ending with "_date":
hxlclean --date-all <dataset-in.csv >dataset-out.csv
Caution: With the --date
option, you should still review the output carefully. hxlclean will attempt to use your local computer's language/locale and time to make the best guesses possible, and will fill in a missing month or year, but sometimes it simply will be unable to guess: for example, "3/4/15" might mean "2014-03-04" (4 March) or "2014-04-03" (3 April). The only way to be certain about dates is to encourage people to enter them unambiguously in the first place, and it's strongly recommended always to use the international standard ISO 8601 format.
Problem: numbers in your dataset are written inconsistently (for example, a spreadsheet might add two spurious decimal places to every number, so that there are 2500.00 people affected, or human authors might add commas, writing a million as "1,000,000").
Recipe: use hxlclean (command) with the --number
(or -n
) option to simplify and standardise numbers:
hxlclean --number affected_num <dataset-in.csv >dataset-out.csv
Alternatively, you can use the --number-all
(or -N
) option to normalise numbers in any column under a HXL hashtag ending in "_num":
hxlclean --number-all <dataset-in.csv >dataset-out.csv
Number normalisation works as follows:
- Ignore any field that doesn't contain a digit (0–9).
- Remove every character except 0–9 and ".".
- Remove any leading zeros.
- Remove any trailing zeros if there is a decimal.
- Remove a decimal if there are no digits after it.
As a result, "0100000", "100,000", and "100,000.00" will all become "100000".
Standard: http://hxlstandard.org | Mailing list: hxlproject@googlegroups.com