Skip to content

Deduplicate rows filter

David Megginson edited this page Sep 28, 2015 · 4 revisions

The deduplicate rows HXL filter removes duplicate rows from a dataset. You can either require that all columns in a row match, or specify a list of tags that need to match to count as a duplicate (ignore other columns).

The filter will keep the first matching version of each row.

Options

The filter takes zero or more tag patterns as options, specifying the columns that must match to count as a duplicate. If omitted, all columns must match.

Examples

Example 1: full matches

Goal: remove duplicate activity reports from a dataset. In this example, we consider two rows duplicates only if all cells match.

Original dataset:

Registro Sector/Cluster Subsector Organización Hombres Mujeres País Departamento / Provincia / Estado Dato
#record_id #sector+lang_es #subsector+lang_es #org+lang_es #targeted_num+m #targeted_num+f #country #adm1 #date
0001 WASH XHigiene ACNUR "1 100" 100 Panamá Los Santos 2015-03-01
0002 Salud Vacunación OMS Colombia Cauca
0001 XXX XXX ACNUR XXX XXX XXX XXX XXX
0001 WASH XHigiene ACNUR "1 100" 100 Panamá Los Santos 2015-03-01

Filtered dataset:

Registro Sector/Cluster Subsector Organización Hombres Mujeres País Departamento / Provincia / Estado Dato
#record_id #sector+lang_es #subsector+lang_es #org+lang_es #targeted_num+m #targeted_num+f #country #adm1 #date
0001 WASH XHigiene ACNUR "1 100" 100 Panamá Los Santos 2015-03-01
0002 Salud Vacunación OMS Colombia Cauca
0001 XXX XXX ACNUR XXX XXX XXX XXX XXX

Note that the bottom dataset isn't filtered out, even though it has the same identifier, because some of the other cells are different.

Example 2: partial matches

Goal: remove duplicate activity reports from a dataset, where duplicates are any two rows with the same identifier (even if the other cells are different).

Original dataset:

Registro Sector/Cluster Subsector Organización Hombres Mujeres País Departamento / Provincia / Estado Dato
#record_id #sector+lang_es #subsector+lang_es #org+lang_es #targeted_num+m #targeted_num+f #country #adm1 #date
0001 WASH XHigiene ACNUR "1 100" 100 Panamá Los Santos 2015-03-01
0002 Salud Vacunación OMS Colombia Cauca
0001 XXX XXX ACNUR XXX XXX XXX XXX XXX
0001 WASH XHigiene ACNUR "1 100" 100 Panamá Los Santos 2015-03-01

Filtered dataset:

Registro Sector/Cluster Subsector Organización Hombres Mujeres País Departamento / Provincia / Estado Dato
#record_id #sector+lang_es #subsector+lang_es #org+lang_es #targeted_num+m #targeted_num+f #country #adm1 #date
0001 WASH XHigiene ACNUR "1 100" 100 Panamá Los Santos 2015-03-01
0002 Salud Vacunación OMS Colombia Cauca

Note that both rows with the identifier '0001' are missing this time.

Usage

Command line

On the command line, use the hxldedup program (hxldup -h for help):

usage: hxldedup [-h] [--sheet [number]] [--strip-tags] [-t tag,tag...]
                [infile] [outfile]

Remove duplicate rows from a HXL dataset.

positional arguments:
  infile                HXL file to read (if omitted, use standard input).
  outfile               HXL file to write (if omitted, use standard output).

optional arguments:
  -h, --help            show this help message and exit
  --sheet [number]      Select sheet from a workbook (1 is first sheet)
  --strip-tags          Strip HXL tags from the CSV output
  -t tag,tag..., --tags tag,tag...
                        Comma-separated list of column tags to use for
                        deduplication (by default, use all values).

Python

In a Python script, use the dedup method:

source = hxl.data(url).dedup()

You can specify one or more tag patterns to use for deduplication:

source = hxl.data(url).dedup('id')
source = hxl.data(url).dedup(['org', 'sector', 'date'])

HXL Proxy

In the HXL Proxy, choose the "Deduplicate rows" filter type. You can optionally enter a comma-separated list of tag patterns for row matching:

Screenshot of deduplicate filter form in the HXL Proxy.