HXL cookbook
This section contains a series of libhxl-python recipes — simple ways to use the command-line tools together to accomplish common information-management tasks on datasets encoded using the HXL standard.
These recipes work for any kind of HXL data, even data that uses tags outside the core standard. That means that you can use them, and the library and tools, even if the HXL standard doesn't currently have tags for the kinds of data you're working with.
- Cleanup recipes — automating tasks in data cleaning.
- Privacy recipes — transforming data to reduce or eliminate personally-identifiable information.
- Reporting recipes — extracting intelligence from data, or alerting when things change.
- Quality-assurance recipes — automatically finding potential errors in data.
- Interoperability recipes — transforming data to work better with other data.
To try these recipes, you can add HXL hashtags to your own datasets, or download some of the HXL datasets available in the Humanitarian Data Exchange or the HXL showcase.
The rest of this article contains tips and tricks that you can use with any of the recipes in the cookbook. There's no need to read any further at first — you probably want to jump straight to the recipes — but you might want to come back later to find ideas for how to deploy those recipes for real-world use.
Both Unix (e.g. Linux or MacOS) and Windows support a command pipeline, though pipelines can be a bit unreliable on Windows. In a pipeline, the output of one command becomes the input of the next one, so you can construct fairly-complex data-processing routines with only a line or two of typing.
Since most of the HXL command-line tools accept HXL data as their input and produce HXL data as their output, they are especially good candidates for filters in a pipeline. Most of the recipes linked to from this cookbook rely on pipelines to perform complex tasks by combining simple components. For example, the following pipeline has three steps:
- Filter a HXL dataset to contain only the rows where #adm1 is "Coast".
- Filter the result of the above step to contain only the rows where #sector is "WASH".
- Count the number of rows for each #org in that region.
(In these examples, the backslash character ("\") at the end of a line splits a long command across multiple lines.)
hxlselect -q adm1=Coast my-data.csv | \
hxlselect -q sector=WASH | \
hxlcount -t org
There is no single HXL command-line tool to produce a report counting organisations for a specific sector in a specific top-level administrative subdivision, but a simple pipeline of the utilities produces precisely that result.
When you use the same recipes frequently, it's a good idea to put them in a shell script (called a "batch file" in Windows) to create a simple, new command to do precisely the work you want. For example, the following Unix shell script takes the pipeline example above — count organisations in a specific top-level administrative region and sector — and makes it into a new command (this is a fairly-naive script, without error handling):
#!/bin/sh
# Usage: count-orgs <adm1> <sector> <dataset>
adm1=$1
sector=$2
dataset=$3
hxlselect -q adm1="$adm1" "$dataset" | \
hxlselect -q sector="$sector" | \
hxlcount -t org
If you save this into the file count-orgs and make it executable, then you can count organisations in a sector with a single, simple command:
count-orgs Coast WASH my-data.csv > my-report.csv
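If you haven't already made the script executable, that takes a single extra command on a Unix-like system (this assumes count-orgs is saved in your current directory; you may also want to move it to a directory on your PATH so that you can run it by name from anywhere):
chmod +x count-orgs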
Writing shell scripts for the data recipes that you use most is an important part of building up a data toolset for automating your work.
Operating systems used for web servers include some kind of scheduling system that will execute commands automatically at specific times. These job schedulers allow you to automate running the kinds of scripts shown above, for example, to produce a daily report from a HXL dataset and make it available online.
This example uses the Unix cron scheduler to run the script from the previous section every morning at 3:00 am, depositing the result in the file /srv/www/my-site/hxl/org-report.csv. You create a new cron job in Unix by running the command crontab -e and then adding a line like the following to the file:
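# field order: minute hour day-of-month month day-of-week, followed by the command to run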
0 3 * * * /bin/sh /var/local/scripts/count-orgs Coast WASH /var/local/hxl/3w-data.csv > /srv/www/my-site/hxl/org-report.csv
Read the documentation for your operating system's job scheduler for more information (for Windows, see the Windows Task Scheduler).
TODO
Since all of the command-line tools can read data from standard input, you can process data live from an online source by using a utility like wget or cURL, both of which you may find pre-installed on Unix-like systems such as Linux or MacOS (you can also download them for Windows).
For example, the following pipeline will read an online HXL file and pass it directly to hxlselect to filter rows for a specific ADM1, then to hxlcount to produce a summary:
wget -q -O - http://example.org/hxl/3w-data.csv | \
hxlselect -q adm1_id=010203 | \
hxlcount -t org,sector
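If you prefer cURL, which writes downloaded data to standard output by default, an equivalent pipeline (using the same placeholder URL) would be:
curl -s http://example.org/hxl/3w-data.csv | \
hxlselect -q adm1_id=010203 | \
hxlcount -t org,sector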
Using cURL, you could also upload the result back to an HTTP server (if it accepts uploads), allowing you to create complex online data transforms with a couple of lines of shell script.
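For example, the following sketch pipes a report straight back up to the web; the upload URL is purely hypothetical, and the server at the other end would have to accept HTTP PUT requests (curl's -T - option reads the upload body from standard input):
# the upload URL below is a hypothetical example; the server must accept HTTP PUT
curl -s http://example.org/hxl/3w-data.csv | \
hxlselect -q adm1_id=010203 | \
hxlcount -t org,sector | \
curl -s -T - https://example.org/reports/org-report.csv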
(Google link formats may change. This information was last verified 2014-12-22.)
One of the easiest ways to host HXL data online is to use Google Sheets. Any tab in a Google sheet is instantly available as a live CSV download, as long as you make the sheet publicly-readable.
The CSV download URL for a tab of a public Google Sheet has the following pattern:
https://docs.google.com/spreadsheets/d/<sheet-id>/export?format=csv&gid=<tab-id>
The sheet id is a long identifier like "1paoIpHiYo7dy_dnf_luUSfowWDwNAWwS3z4GHL2J7Rc", while the tab id is a short identifier like "2077872077." The Google Sheet URLs you see in your web browser's address bar will take various forms, but you will always find these two identifiers buried in them somewhere. For example, if you see a spreadsheet URL like https://docs.google.com/spreadsheets/d/1paoIpHiYo7dy_dnf_luUSfowWDwNAWwS3z4GHL2J7Rc/edit#gid=2077872077 you can extract the two identifiers to create the CSV download URL https://docs.google.com/spreadsheets/d/1paoIpHiYo7dy_dnf_luUSfowWDwNAWwS3z4GHL2J7Rc/export?format=csv&gid=2077872077 — after that, you can use it with wget or a similar utility, as described above, to start your HXL processing pipeline:
wget -q -O - 'https://docs.google.com/spreadsheets/d/1paoIpHiYo7dy_dnf_luUSfowWDwNAWwS3z4GHL2J7Rc/export?format=csv&gid=2077872077' | \
hxlcount -t adm1
As of 2014-12-22, this command produces the following output (counting Ebola treatment locations in Guinea's top-level administrative regions):
#adm1,#x_total_num
Boké,121
Conakry,126
Faranah,120
Kankan,179
Kankan ,5
Kindia,174
Labe,2
Labé,143
Mamou,79
Mamou ,3
N'zérékoré,267
Nzerekore,7
Note that this list shows the value of even simple HXL processing, by revealing inconsistencies in how the data is encoded.
Standard: http://hxlstandard.org | Mailing list: hxlproject@googlegroups.com