-
Notifications
You must be signed in to change notification settings - Fork 11
Deduplicate rows filter
The deduplicate rows HXL filter removes duplicate rows from a dataset. You can either require that all columns in a row match, or specify a list of tags that need to match to count as a duplicate (ignore other columns).
The filter will keep the first matching version of each row.
The filter takes zero or more tag patterns as options, specifying the columns that must match to count as a duplicate. If omitted, all columns must match.
Goal: remove duplicate activity reports from a dataset. In this example, we consider two rows duplicates only if all cells match.
Original dataset:
Registro | Sector/Cluster | Subsector | Organización | Hombres | Mujeres | País | Departamento / Provincia / Estado | Dato | |
---|---|---|---|---|---|---|---|---|---|
#record_id | #sector+lang_es | #subsector+lang_es | #org+lang_es | #targeted_num+m | #targeted_num+f | #country | #adm1 | #date | |
0001 | WASH | XHigiene | ACNUR | "1 | 100" | 100 | Panamá | Los Santos | 2015-03-01 |
0002 | Salud | Vacunación | OMS | Colombia | Cauca | ||||
0001 | XXX | XXX | ACNUR | XXX | XXX | XXX | XXX | XXX | |
0001 | WASH | XHigiene | ACNUR | "1 | 100" | 100 | Panamá | Los Santos | 2015-03-01 |
Filtered dataset:
Registro | Sector/Cluster | Subsector | Organización | Hombres | Mujeres | País | Departamento / Provincia / Estado | Dato | |
---|---|---|---|---|---|---|---|---|---|
#record_id | #sector+lang_es | #subsector+lang_es | #org+lang_es | #targeted_num+m | #targeted_num+f | #country | #adm1 | #date | |
0001 | WASH | XHigiene | ACNUR | "1 | 100" | 100 | Panamá | Los Santos | 2015-03-01 |
0002 | Salud | Vacunación | OMS | Colombia | Cauca | ||||
0001 | XXX | XXX | ACNUR | XXX | XXX | XXX | XXX | XXX |
Note that the bottom dataset isn't filtered out, even though it has the same identifier, because some of the other cells are different.
Goal: remove duplicate activity reports from a dataset, where duplicates are any two rows with the same identifier (even if the other cells are different).
Original dataset:
Registro | Sector/Cluster | Subsector | Organización | Hombres | Mujeres | País | Departamento / Provincia / Estado | Dato | |
---|---|---|---|---|---|---|---|---|---|
#record_id | #sector+lang_es | #subsector+lang_es | #org+lang_es | #targeted_num+m | #targeted_num+f | #country | #adm1 | #date | |
0001 | WASH | XHigiene | ACNUR | "1 | 100" | 100 | Panamá | Los Santos | 2015-03-01 |
0002 | Salud | Vacunación | OMS | Colombia | Cauca | ||||
0001 | XXX | XXX | ACNUR | XXX | XXX | XXX | XXX | XXX | |
0001 | WASH | XHigiene | ACNUR | "1 | 100" | 100 | Panamá | Los Santos | 2015-03-01 |
Filtered dataset:
Registro | Sector/Cluster | Subsector | Organización | Hombres | Mujeres | País | Departamento / Provincia / Estado | Dato | |
---|---|---|---|---|---|---|---|---|---|
#record_id | #sector+lang_es | #subsector+lang_es | #org+lang_es | #targeted_num+m | #targeted_num+f | #country | #adm1 | #date | |
0001 | WASH | XHigiene | ACNUR | "1 | 100" | 100 | Panamá | Los Santos | 2015-03-01 |
0002 | Salud | Vacunación | OMS | Colombia | Cauca |
Note that both rows with the identifier '0001' are missing this time.
On the command line, use the hxldedup program (hxldup -h
for help):
usage: hxldedup [-h] [--sheet [number]] [--strip-tags] [-t tag,tag...]
[infile] [outfile]
Remove duplicate rows from a HXL dataset.
positional arguments:
infile HXL file to read (if omitted, use standard input).
outfile HXL file to write (if omitted, use standard output).
optional arguments:
-h, --help show this help message and exit
--sheet [number] Select sheet from a workbook (1 is first sheet)
--strip-tags Strip HXL tags from the CSV output
-t tag,tag..., --tags tag,tag...
Comma-separated list of column tags to use for
deduplication (by default, use all values).
In a Python script, use the dedup method:
source = hxl.data(url).dedup()
You can specify one or more tag patterns to use for deduplication:
source = hxl.data(url).dedup('id')
source = hxl.data(url).dedup(['org', 'sector', 'date'])
In the HXL Proxy, choose the "Deduplicate rows" filter type. You can optionally enter a comma-separated list of tag patterns for row matching:
Standard: http://hxlstandard.org | Mailing list: hxlproject@googlegroups.com