Skip to content

Merge columns filter

David Megginson edited this page Sep 1, 2015 · 2 revisions

TODO: update to include Python and the HXL Proxy as well.

The hxlmerge command-line tool creates a new HXL dataset that combines two other datasets. It starts with a primary dataset, then adds fields from another based on shared keys. For example, you could extract the #adm1 field from a second dataset listing administrative subdivisions, and add it the a 3W dataset, based on a shared #adm1_id (e.g. a p-code).

Along with hxlcount (command), this is one of the most difficult of the tools to learn, but also one of the most powerful, because it lets you create something entirely new.

Usage

usage: hxlmerge [-h] -m filename -k tag,tag... -t tag,tag... [-r] [-O]
                [infile] [outfile]

Merge part of one HXL dataset into another.

positional arguments:
  infile                HXL file to read (if omitted, use standard input).
  outfile               HXL file to write (if omitted, use standard output).

optional arguments:
  -h, --help            show this help message and exit
  -m filename, --merge filename
                        HXL file to write (if omitted, use standard output).
  -k tag,tag..., --keys tag,tag...
                        HXL tag(s) to use as a shared key.
  -t tag,tag..., --tags tag,tag...
                        Comma-separated list of column tags to include from
                        the merge dataset.
  -r, --replace         Replace empty values in existing columns (when
                        available) instead of adding new ones.
  -O, --overwrite       Used with --replace, overwrite existing values.

Keys

The command will take the first value for each tag provided in the --keys (or -k) argument to match the datasets. In most cases, you'll provide just one key, but it's possible to use several, e.g.

--keys sector,subsector

This would include values from the merge dataset where both the #sector and #subsector tags have the same value in the primary dataset.

Overwriting

By default, this command adds new columns to the right side of a dataset. There are, however, two options that allow replacing values in existing columns.

By itself, the --replace (or -r) option will fill in empty values, but leave non-empty values alone. For example, in a row that has a value for #adm1, nothing will happen; in a row that has a blank cell for #adm1, the command will fill in the value from the merge dataset. This is useful for a multi-stage merge (try merge dataset 1, then merge dataset 2 if nothing found, etc.).

With the --overwrite (or -O) option, this command will always overwrite the value in an existing cell, even if the value from the merge dataset is empty.

Examples

Input file: 3w-in.csv

#org #sector #adm1_id
OXFAM WASH 003
CARE WASH 001
Save the Children Education 001
Red Cross Protection 002
UNHCR Shelter 001

Merge file: adm.csv

#adm1_id #adm1 #people_num
001 Coastal Region 300000
002 Mountain Region 120000
003 Capital Region 650000

Command:

hxlmerge --merge adm.csv --keys adm1_id --tags adm1,people_num \
  <3w-in.csv >3w-out.csv

Output file: 3w-out.csv

#org #sector #adm1_id #adm1 #people_num
OXFAM WASH 003 Capital Region 650000
CARE WASH 001 Coastal Region 300000
Save the Children Education 001 Coastal Region 300000
Red Cross Protection 002 Mountain Region 120000
UNHCR Shelter 001 Coastal Region 300000