Deduplicate rows filter

The deduplicate rows HXL filter removes duplicate rows from a dataset. You can either require that all columns in a row match, or specify a list of tags that need to match to count as a duplicate (ignore other columns).

The filter will keep the first matching version of each row.

Options

The filter takes zero or more tag patterns as options, specifying the columns that must match to count as a duplicate. If omitted, all columns must match.

Examples

Example 1: full matches

Goal: remove duplicate activity reports from a dataset. In this example, we consider two rows duplicates only if all cells match.

Original dataset:

Registro	Sector/Cluster	Subsector	Organización	Hombres	Mujeres	País	Departamento / Provincia / Estado	Dato
#record_id	#sector+lang_es	#subsector+lang_es	#org+lang_es	#targeted_num+m	#targeted_num+f	#country	#adm1	#date
0001	WASH	XHigiene	ACNUR	"1	100"	100	Panamá	Los Santos	2015-03-01
0002	Salud	Vacunación	OMS			Colombia	Cauca
0001	XXX	XXX	ACNUR	XXX	XXX	XXX	XXX	XXX
0001	WASH	XHigiene	ACNUR	"1	100"	100	Panamá	Los Santos	2015-03-01

Filtered dataset:

Registro	Sector/Cluster	Subsector	Organización	Hombres	Mujeres	País	Departamento / Provincia / Estado	Dato
#record_id	#sector+lang_es	#subsector+lang_es	#org+lang_es	#targeted_num+m	#targeted_num+f	#country	#adm1	#date
0001	WASH	XHigiene	ACNUR	"1	100"	100	Panamá	Los Santos	2015-03-01
0002	Salud	Vacunación	OMS			Colombia	Cauca
0001	XXX	XXX	ACNUR	XXX	XXX	XXX	XXX	XXX

Note that the bottom dataset isn't filtered out, even though it has the same identifier, because some of the other cells are different.

Example 2: partial matches

Goal: remove duplicate activity reports from a dataset, where duplicates are any two rows with the same identifier (even if the other cells are different).

Original dataset:

Registro	Sector/Cluster	Subsector	Organización	Hombres	Mujeres	País	Departamento / Provincia / Estado	Dato
#record_id	#sector+lang_es	#subsector+lang_es	#org+lang_es	#targeted_num+m	#targeted_num+f	#country	#adm1	#date
0001	WASH	XHigiene	ACNUR	"1	100"	100	Panamá	Los Santos	2015-03-01
0002	Salud	Vacunación	OMS			Colombia	Cauca
0001	XXX	XXX	ACNUR	XXX	XXX	XXX	XXX	XXX
0001	WASH	XHigiene	ACNUR	"1	100"	100	Panamá	Los Santos	2015-03-01

Filtered dataset:

Registro	Sector/Cluster	Subsector	Organización	Hombres	Mujeres	País	Departamento / Provincia / Estado	Dato
#record_id	#sector+lang_es	#subsector+lang_es	#org+lang_es	#targeted_num+m	#targeted_num+f	#country	#adm1	#date
0001	WASH	XHigiene	ACNUR	"1	100"	100	Panamá	Los Santos	2015-03-01
0002	Salud	Vacunación	OMS			Colombia	Cauca

Note that both rows with the identifier '0001' are missing this time.

Usage

Command line

On the command line, use the hxldedup program (hxldup -h for help):

usage: hxldedup [-h] [--sheet [number]] [--strip-tags] [-t tag,tag...]
                [infile] [outfile]

Remove duplicate rows from a HXL dataset.

positional arguments:
  infile                HXL file to read (if omitted, use standard input).
  outfile               HXL file to write (if omitted, use standard output).

optional arguments:
  -h, --help            show this help message and exit
  --sheet [number]      Select sheet from a workbook (1 is first sheet)
  --strip-tags          Strip HXL tags from the CSV output
  -t tag,tag..., --tags tag,tag...
                        Comma-separated list of column tags to use for
                        deduplication (by default, use all values).

Python

In a Python script, use the dedup method:

source = hxl.data(url).dedup()

You can specify one or more tag patterns to use for deduplication:

source = hxl.data(url).dedup('id')

source = hxl.data(url).dedup(['org', 'sector', 'date'])

HXL Proxy

In the HXL Proxy, choose the "Deduplicate rows" filter type. You can optionally enter a comma-separated list of tag patterns for row matching:

Screenshot of deduplicate filter form in the HXL Proxy.

Standard: http://hxlstandard.org | Mailing list: hxlproject@googlegroups.com

Home

For everyone
- Installation
- Command-line tools
For coders
Building blocks
HXL cookbook

Provide feedback

Saved searches

Use saved searches to filter your results more quickly