-
Notifications
You must be signed in to change notification settings - Fork 11
Merge columns filter
TODO: update to include Python and the HXL Proxy as well.
The hxlmerge command-line tool creates a new HXL dataset that combines two other datasets. It starts with a primary dataset, then adds fields from another based on shared keys. For example, you could extract the #adm1 field from a second dataset listing administrative subdivisions, and add it the a 3W dataset, based on a shared #adm1_id (e.g. a p-code).
Along with hxlcount (command), this is one of the most difficult of the tools to learn, but also one of the most powerful, because it lets you create something entirely new.
usage: hxlmerge [-h] -m filename -k tag,tag... -t tag,tag... [-r] [-O]
[infile] [outfile]
Merge part of one HXL dataset into another.
positional arguments:
infile HXL file to read (if omitted, use standard input).
outfile HXL file to write (if omitted, use standard output).
optional arguments:
-h, --help show this help message and exit
-m filename, --merge filename
HXL file to write (if omitted, use standard output).
-k tag,tag..., --keys tag,tag...
HXL tag(s) to use as a shared key.
-t tag,tag..., --tags tag,tag...
Comma-separated list of column tags to include from
the merge dataset.
-r, --replace Replace empty values in existing columns (when
available) instead of adding new ones.
-O, --overwrite Used with --replace, overwrite existing values.
The command will take the first value for each tag provided in the --keys
(or -k
) argument to match the datasets. In most cases, you'll provide just one key, but it's possible to use several, e.g.
--keys sector,subsector
This would include values from the merge dataset where both the #sector and #subsector tags have the same value in the primary dataset.
By default, this command adds new columns to the right side of a dataset. There are, however, two options that allow replacing values in existing columns.
By itself, the --replace
(or -r
) option will fill in empty values, but leave non-empty values alone. For example, in a row that has a value for #adm1, nothing will happen; in a row that has a blank cell for #adm1, the command will fill in the value from the merge dataset. This is useful for a multi-stage merge (try merge dataset 1, then merge dataset 2 if nothing found, etc.).
With the --overwrite
(or -O
) option, this command will always overwrite the value in an existing cell, even if the value from the merge dataset is empty.
Input file: 3w-in.csv
#org | #sector | #adm1_id |
---|---|---|
OXFAM | WASH | 003 |
CARE | WASH | 001 |
Save the Children | Education | 001 |
Red Cross | Protection | 002 |
UNHCR | Shelter | 001 |
Merge file: adm.csv
#adm1_id | #adm1 | #people_num |
---|---|---|
001 | Coastal Region | 300000 |
002 | Mountain Region | 120000 |
003 | Capital Region | 650000 |
Command:
hxlmerge --merge adm.csv --keys adm1_id --tags adm1,people_num \
<3w-in.csv >3w-out.csv
Output file: 3w-out.csv
#org | #sector | #adm1_id | #adm1 | #people_num |
---|---|---|---|---|
OXFAM | WASH | 003 | Capital Region | 650000 |
CARE | WASH | 001 | Coastal Region | 300000 |
Save the Children | Education | 001 | Coastal Region | 300000 |
Red Cross | Protection | 002 | Mountain Region | 120000 |
UNHCR | Shelter | 001 | Coastal Region | 300000 |
Standard: http://hxlstandard.org | Mailing list: hxlproject@googlegroups.com