Notes on the recordlinkage Python library

General

Data format

import pandas

dataset = pandas.DataFrame(
  {
    'catalog_id': [666, 777, 888],
    'name': ['huey', 'dewey', 'louie'],
    # ... other columns as needed
  }
)
  • remember that values are aligned by position, i.e., 666 -> 'huey';
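
A quick check of that alignment on the DataFrame above (assuming the default integer index):

print(dataset.loc[0, 'catalog_id'], dataset.loc[0, 'name'])  # -> 666 huey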

Cleaning

  • AKA pre-processing AKA normalization AKA standardization;
  • https://recordlinkage.readthedocs.io/en/latest/ref-preprocessing.html;
  • uses pandas.Series, a list-like object;
  • the clean function seems interesting at first glance;
  • by default, it removes text inside brackets. Might be useful, trivial to re-implement;
  • terrible default regex: it removes everything that is not an ASCII letter, so non-ASCII strings are just deleted! Pass a custom regex or None via the replace_by_none= kwarg to avoid this;
  • nice ASCII folding via strip_accents='ascii', not done by default;
  • strip_accents='unicode' keeps some Unicode chars intact, e.g., œ (see the last example below);
  • non-Latin scripts are just not handled;
  • the phonetic function has the same problems as in jellyfish, see #79.
import pandas

from recordlinkage.preprocessing import clean

names = pandas.Series(
  [
    'хартшорн, чарльз',
    'charles hartshorne',
    'チャールズ・ハートショーン',
    'تشارلز هارتشورن',
    '찰스 하츠혼',
    'àáâäæãåāèéêëēėęîïíīįìôöòóœøōõûüùúū'
  ]
)
clean(names)

Output:

0
1    charles hartshorne
2
3
4
5
dtype: object
clean(names, replace_by_none=None, strip_accents='ascii')

Output:

0                                  ,
1                 charles hartshorne
2
3
4
5    aaaaaaaeeeeeeeiiiiiioooooouuuuu
dtype: object
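
The same call with strip_accents='unicode' is sketched below; per the note above, it should keep ligature characters such as œ intact (output not reproduced here):

clean(names, replace_by_none=None, strip_accents='unicode')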

Indexing

  • AKA blocking AKA candidate acquisition;
  • https://recordlinkage.readthedocs.io/en/latest/ref-index.html;
  • make pairs of records to reduce the otherwise quadratic comparison space;
  • a simple call to the Index.block(FIELD) function is not enough for names, as it only pairs records that agree exactly on that field, i.e., it behaves like an exact match; sorted neighbourhood indexing (sketched below) is one alternative;
import recordlinkage

index = recordlinkage.Index()
index.block('name')
candidate_pairs = index.index(source_dataset, target_dataset)
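
Since exact blocking behaves like an exact match, a fuzzier alternative offered by the library is sorted neighbourhood indexing; a minimal sketch, where the window size is an arbitrary assumption:

sn_index = recordlinkage.Index()
sn_index.sortedneighbourhood('name', window=5)  # also pairs records whose sorted keys are close
candidate_pairs = sn_index.index(source_dataset, target_dataset)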

Comparing

comp = recordlinkage.Compare()
# string similarity is normalised to [0, 1], so the threshold must fall in that range
comp.string('name', 'label', threshold=0.75)
feature_vectors = comp.compute(candidate_pairs, source_dataset, target_dataset)
print(feature_vectors.sum(axis=1).value_counts())
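
A naive way to turn these feature vectors into decisions is a cut-off on the row sums printed above; a sketch, where requiring at least one agreeing feature is an assumption:

matches = feature_vectors[feature_vectors.sum(axis=1) >= 1]  # assumed cut-off
print(len(matches))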

Classification

Training workflow

INPUT = training set = existing QIDs with target IDs = dict { QID: target_ID };

  1. get the QID statements from Wikidata;
  2. query MariaDB for target ID data;
  3. load both into 2 pandas.DataFrame;
  4. pre-process;
  5. make the index with blocking -> match_index arg;
  6. feature extraction with comparison -> training_feature_vectors arg (see the sketch below).
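
A minimal sketch of where the two arguments from steps 5 and 6 end up, assuming a Naïve Bayes classifier (as in the next section) and placeholder variable names:

import recordlinkage

classifier = recordlinkage.NaiveBayesClassifier()
# match_index = index of known matching pairs from blocking (step 5)
# training_feature_vectors = comparison output for the training pairs (step 6)
classifier.fit(training_feature_vectors, match_index)
predictions = classifier.predict(feature_vectors)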

Naïve Bayes