---
layout: default
permalink: recipes
description: How matchID supercharges your powers with recipes
title: Recipes
width: is-10
---
A recipe consists of steps of treatments applied to rows and columns.
The dataset is scanned by chunks, which are loaded into a Pandas dataframe.
So a recipe is basically a treatment on a chunk of a dataframe, resulting in a transformed dataframe.
A recipe can call:
- another recipe (and so on, recursively)
- generic recipes (included in the core configuration, but constructed as recipes)
- internals (map, join, eval, replace, keep, delete...): included in the core code, and extensible
- the internal "eval" function, which makes it possible to create or modify columns with a row-level Python expression (e.g. col1+col2) inside recipes; a minimal recipe skeleton is sketched just below
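Here is such a minimal sketch, assembling a few of the internals documented below. The recipe name, the steps: key and the surrounding layout are assumptions about the project configuration (only the individual steps are taken from this page), and the dataset is the illustrative datasetAlpha used below:
my_first_recipe:
  input:
    dataset: datasetAlpha
  steps:
    - map:
        matchid_name_last: datasetAlpha_NOM_PRENOMS
    - eval:
        - matchid_name_last_norm: matchid_name_last.lower() if matchid_name_last else None
    - keep:
        select: matchid_.*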
This recipe creates new columns in the dataframe, simply based on existing ones.
- map:
    matchid_date_birth_src: datasetAlpha_DATE_NAISSANCE
    matchid_date_death_src: datasetAlpha_DATE_DECES
    # location
    matchid_location:                       # in this mapping,
      - datasetAlpha_COMMUNE_NAISSANCE      # the result will be an array:
      - datasetAlpha_CODE_INSEE_NAISSANCE   # [ datasetAlpha_COMMUNE_NAISSANCE, datasetAlpha_CODE_INSEE_NAISSANCE ]
This recipe keeps only columns matching a regex, with an optional `where` condition:
- keep:
    select: matchid_.*           # selection with a regexp
    where: matchid_score > 0.2   # eval-like Python-based expression
or with an explicit list:
- keep:
select: # selection by list
- matchid_name_first
- matchid_name_last
This recipe deletes columns matching a regex:
- delete:
select: datasetAlpha.* # selection with a regexp
or with an explicit list:
- delete:
select: # selection by list
- datasetAlpha_NOM_PRENOMS
- datasetAlpha_DATE_NAISSANCE
Renames columns of a dataframe:
- rename:
source_column1: target_column1
source_column2: target_column2
This is the Swiss-army-knife recipe, which evaluates a treatment row by row. A new column gets its cell value computed from a Python expression. The values of the dataframe are accessible within the `row` array; a particular column value is available as `row['column']` or simply as `column`.
Here's an example:
- eval:
- matchid_name_first: matchid_name_first_src if (type(matchid_name_first_src)==list) else [matchid_name_first_src]
- matchid_name_last: matchid_name_last_src if (type(matchid_name_last_src)==list) else [matchid_name_last_src]
Here are some of the implemented functions.
This method applies regexes to a selection of fields (the selection itself being a regex), Python-style:
- replace:
select: matchid_location_city.* # regex for selection
regex: # ordered list of regex
- ^\s*(lyon|marseille|paris)\s.*$: '\1'
- montreuil s.* bois: montreuil
- (^|\s)ste(\s|$): '\1sainte\2'
- (^|\s)st(\s|$): '\1saint\2'
- ^aix pce$: aix provence
This method transforms text to lowercase and removes accents and special characters on the fields selected by regex:
- normalize:
select: matchid_location_city.*
This recipe is a helper for debugging a recipe: it ends the recipe prematurely, without executing the following steps. Complementary helpers are a selection of fields (like keep) and of top rows (head), to limit the size of the treatment.
- pause:
select: matchid_location_city.*
head: 50
Fully shuffles the data (each column independently) using np.random.permutation. Useful for anonymization.
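A minimal sketch, assuming the step is exposed under the name shuffle and accepts the same selection-by-regex option as the other steps (both are assumptions, since the step name is not shown on this page):
- shuffle:
    select: .*   # shuffle every column, each one independently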
This recipe converts a selection-by-regex of columns from string to integer, filling unavailable values with NaN or a specified value.
- to_integer:
select: ^.*(population|surface).*
fillna: 0
This recipe converts a selection-by-regex of columns from string to float, filling unavailable values with NaN or a specified value.
- to_float:
select: ^.*(frequency|).*
fillna: 0
Converts a list to a tuple, which can then be used for indexing in a dataframe (e.g. in a groupby).
Converts a tuple to a list.
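The exact names of these two converters are not shown on this page; assuming they are exposed as row-level functions named to_tuple and to_list, a typical (hypothetical) use from an eval step could be:
- eval:
    - matchid_location_key: to_tuple(matchid_location)   # hashable key, usable e.g. in a groupby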
Computes n-grams of the selected columns:
- ngram:
select: .*name.*
n: [2, 3, 4] # computes 2-grams, 3-grams and 4-grams
This recipe converts a selection-by-regex of columns from string to a date/time type:
- parsedate:
    select: matchid_date_.*
    format: "%Y%m%d"   # standard Python datetime format for parsing
This recipe acts like a SQL join, executed by chunks (so slower), but tolerant to fuzziness (so better). The fuzzy join is either in-memory (to match against small reference datasets, i.e. fewer than 500k rows) or based on elasticsearch (from 500k up to more than 100M rows).
**fuzzy, in-memory**
In the following example, we try to match the city label (fuzzily), the department code (strictly) and the country ISO code (strictly), in order to recover the citycode history of a city:
- join:
type: in_memory
dataset: french_citycodes_fuzzy # referential dataset to match with
fuzzy:
matchid_location_city_norm: norm_name
strict:
matchid_location_depcode: dep_code
matchid_location_countrycode: CODEISO3
select: # selected columns from outer dataset (right) mapped to current dataframe
matchid_location_citycode_history: insee_code_history
matchid_location_city: name
matchid_location_city_geopoint_2d: geopoint_2d
**simple join, in-memory**
This case is more frequent and simpler, and is useful when you have multiple reference datasets (slower than a SQL join, but it can help limit the number of intermediate datasets):
- join:
dataset: french_citycodes
type: in_memory
strict:
matchid_location_citycode: insee_code
select:
matchid_location_citycode_history: insee_code_history
**large fuzzy match with elasticsearch**
This last example deals with the problem of big fuzzy matching (up to millions against millions). Of course you'll need a big cluster if you want to deal with many millions of matches in less than a week! The fuzzy match simply relies on pure elasticsearch queries, transformed from JSON to YAML:
- join:
type: elasticsearch
dataset: matchid
query:
size: 10
query: # the ugly raw elasticsearch query
bool:
must:
- bool:
should:
- bool:
must:
- match:
matchid_name_match:
query: matchid_name_last_match
fuzziness: auto
- match:
matchid_date_birth_str: matchid_date_birth_str
- bool:
must:
- match:
matchid_name_match: matchid_name_last_match
- match:
matchid_date_birth_str:
query: matchid_date_birth_str
fuzziness: 1
minimum_should_match: 1
The elasticsearch join can accept some extra options:
- `unfold: False` (default: `True`): each row returns a bucket of potential matches; `unfold` splits these buckets into rows, like in a SQL join operation. If `unfold` is `False`, the buckets are returned raw, to enable custom operations.
- `keep_unmatched: True` (default: `False`): if `unfold` is `True`, rows without a match are kept (if `unfold` is `False`, no analysis of the bucket is done, so all rows are kept).
- `unnest: False` (default: `True`): by default, the elasticsearch values are split into columns. If `False`, the raw elasticsearch hits are returned one per row, in a column 'hit' containing the JSON.
- `prefix: myprefix_` (default: `hit_`): customizes the prefix of the keys coming from the elasticsearch hits.
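Putting these options together, here is a sketch of an elasticsearch join step; the dataset name and the query are placeholders reusing the structure of the example above:
- join:
    type: elasticsearch
    dataset: matchid
    unfold: True           # split each bucket of hits into rows, SQL-join style
    keep_unmatched: True   # keep rows even when elasticsearch returns no hit
    unnest: True           # spread the values of each hit into columns...
    prefix: hit_           # ...whose names are prefixed with hit_
    query:
      size: 10
      query:
        match:
          matchid_name_match:
            query: matchid_name_last_match
            fuzziness: auto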
This recipe splits columns (selected by regex) containing arrays into multiple rows, copying the content of the other columns:
- unfold:
select: ^hits
This recipe splits columns (selected by regex) containing JSON objects into multiple columns, one per key of the JSON, and deletes the previous columns:
- unnest:
    select: ^hits
    prefix: hit_   # prefix the keys with 'hit_' to name the new columns; the default prefix is empty
Gathers the selected columns and values into a JSON object stored in the target column:
- nest:
select: .*location.*
target: location
This computes a groupby and redispatches values across the group:
- groupby:
select: matchid_id
transform:
- score: max
- score: min
rank:
- score
This method should be used without chunks, unless you're sure that all members of a group fit within the same chunk.
This method applies only to a full dataset, so the input dataset should have the flag chunked: False, e.g.:
train_rnipp_agrippa_rescoring_model:
  input:
    dataset: rnipp_agrippa
    chunked: False   # <== this is the option to unchunk your dataset
To build a model, here's how:
- build_model:
model:
name: rnipp_agrippa_ml # the name as it will be stored in models/ and callable with "apply_model"
library: sklearn.ensemble # library of ml (scikit learn)
method: RandomForestRegressor # method of ml
parameters: # kwargs of the methods
n_estimators: 20
max_depth: 4
min_samples_leaf: 5
test_size: 0.33 # split size for train and test (here 67% for training and 33% for testing)
tries: 3 # numbers of tries for this splitting (here 3 random splits) to test stability
numerical: .*(hit_score.*|population|surface|matchid_hit_distance)$
categorical: .*(matchid_location_countrycode)$
target: validation_decision # name of the column to predict
To apply a model, you first need to build one or to use a pre-trained one. Models can only be applied to the same kind of data as used in training, i.e. the same columns in the same order.
- apply_model:
name: rnipp_agrippa_ml # name of the model to be applied
numerical: .*(hit_score.*|population|surface|matchid_hit_distance)$ # selection-by-regex of numerical columns
categorical: .*(matchid_location_countrycode)$ # selection-by-regex of categorical columns
target: matchid_hit_score_ml # name of the column to save prediction
Maps a `POINT(lon, lat)` string to a numerical `(lat, lon)` tuple.
Calculates the Vincenty distance (from geopy.distance) between two WGS84 `(lat, lon)` tuples.
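These geographic helpers are row-level functions. Their exact names are not shown on this page, so the sketch below assumes they are exposed as geopoint and distance, and hit_location_geopoint is an illustrative column name:
- eval:
    - matchid_location_geopoint: geopoint(matchid_location_city_geopoint_2d)
    - matchid_hit_distance: distance(matchid_location_geopoint, hit_location_geopoint)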
Replaces all values by key, using the dictionary `dic` (like `{"key1": "value1", "key2": "value2"}`), within `object`, which can be a string, an array or a dictionary (only values which strictly match are replaced).
Replaces all values by regex, using the dictionary `dic` (like `{"regex1": "value1", "regex2": "value2"}`), within `object`, which can be a string, an array or a dictionary.
`object` may be either a string (or unicode) or an array of strings. For each string value, lowercases it and removes accents and special characters.
`object` may be either a string (or unicode) or an array of strings. Each string value is split on the `\s+` separator.
nltk tokenizers could be integrated here later.
Flattens the list, i.e. `[[a, b], [c]]` returns `[a, b, c]`.
Computes the sha1 hash key of `str(object)`.
Computes the Levenshtein distance between string a and string b. The current version is just a wrapper.
Computes the normalized Levenshtein distance between string a and string b, or the minimum of levenshtein_norm between the strings of list a and the strings of list b.
Computes the minimum Jaro-Winkler distance between the strings of list a and the strings of list b.
Computes n-grams of string a (with n given as a list of integers, here [2, 3]), or n-grams of the strings in list a.
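Like the geographic helpers, these string functions are meant to be called row by row from an eval step. A minimal sketch, assuming they are exposed under the names used in their descriptions (levenshtein_norm, sha1, ngrams) and using illustrative column names:
- eval:
    - matchid_hit_score_name: levenshtein_norm(matchid_name_last, hit_matchid_name_last)
    - matchid_name_ngrams: ngrams(matchid_name_last, [2, 3])
    - matchid_id: sha1(row)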