A Python implementation of a variety of text distance and similarity metrics.
Requirements: Python>=3.3 (the f-string examples below need Python 3.6 or later)
Install Command: pip install pytextdist
The functions in this package take two strings as input and return the distance or similarity metric between them. Preprocessing of the strings is built into the functions, with recommended defaults. To change the preprocessing, see Customize Preprocessing.
By default, functions in the edit distance module (pytextdist.edit_distance) treat a single character as the unit of editing.
Levenshtein Distance & Similarity: edit with insertion, deletion, and substitution
from pytextdist.edit_distance import levenshtein_distance, levenshtein_similarity
str_a = 'kitten'
str_b = 'sitting'
dist = levenshtein_distance(str_a,str_b)
simi = levenshtein_similarity(str_a,str_b)
print(f"Levenshtein Distance:{dist:.0f}\nLevenshtein Similarity:{simi:.2f}")
>> Levenshtein Distance:3
>> Levenshtein Similarity:0.57
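To make the figures above concrete, here is a minimal pure-Python sketch of the standard dynamic-programming recurrence for Levenshtein distance. The function names are hypothetical and this is not pytextdist's implementation. Dividing by the length of the longer string is an assumed normalization, chosen because it reproduces the 0.57 shown above.

```python
def levenshtein_sketch(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ''
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity_sketch(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein_sketch(a, b) / longest


print(levenshtein_sketch('kitten', 'sitting'))                       # 3
print(round(levenshtein_similarity_sketch('kitten', 'sitting'), 2))  # 0.57
```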
Longest Common Subsequence Distance & Similarity: edit with insertion and deletion
from pytextdist.edit_distance import lcs_distance, lcs_similarity
str_a = 'kitten'
str_b = 'sitting'
dist = lcs_distance(str_a,str_b)
simi = lcs_similarity(str_a,str_b)
print(f"LCS Distance:{dist:.0f}\nLCS Similarity:{simi:.2f}")
>> LCS Distance:5
>> LCS Similarity:0.62
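With only insertion and deletion allowed, the distance equals the number of characters that fall outside a longest common subsequence. The sketch below uses hypothetical function names and is not the package's code. Normalizing by the combined length of both strings is an assumption that happens to reproduce the 0.62 above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance_sketch(a, b):
    """Delete everything in a outside the LCS, then insert the rest of b."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def lcs_similarity_sketch(a, b):
    """Assumed normalization: 1 - distance / (len(a) + len(b))."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else 1 - lcs_distance_sketch(a, b) / total


print(lcs_distance_sketch('kitten', 'sitting'))                # 5
print(round(lcs_similarity_sketch('kitten', 'sitting'), 2))    # 0.62
```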
Damerau-Levenshtein Distance & Similarity: edit with insertion, deletion, substitution, and transposition of two adjacent units
from pytextdist.edit_distance import damerau_levenshtein_distance, damerau_levenshtein_similarity
str_a = 'kitten'
str_b = 'sitting'
dist = damerau_levenshtein_distance(str_a,str_b)
simi = damerau_levenshtein_similarity(str_a,str_b)
print(f"Damerau-Levenshtein Distance:{dist:.0f}\nDamerau-Levenshtein Similarity:{simi:.2f}")
>> Damerau-Levenshtein Distance:3
>> Damerau-Levenshtein Similarity:0.57
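'kitten' and 'sitting' contain no adjacent swap, so the result matches plain Levenshtein. A pair such as 'ab' and 'ba' shows the difference. The sketch below implements the restricted "optimal string alignment" variant, in which a substring is never edited more than once. Whether pytextdist uses this variant or the unrestricted one is not stated here, and the function name is hypothetical.

```python
def osa_distance_sketch(a, b):
    """Levenshtein distance plus unit-cost transposition of adjacent units."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Adjacent transposition: a ends in "xy" where b ends in "yx".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance_sketch('ab', 'ba'))           # 1 (plain Levenshtein gives 2)
print(osa_distance_sketch('kitten', 'sitting'))  # 3
```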
Hamming Distance & Similarity: edit with substitution; note that the Hamming metric only works for phrases of the same length
from pytextdist.edit_distance import hamming_distance, hamming_similarity
str_a = 'kittens'
str_b = 'sitting'
dist = hamming_distance(str_a,str_b)
simi = hamming_similarity(str_a,str_b)
print(f"Hamming Distance:{dist:.0f}\nHamming Similarity:{simi:.2f}")
>> Hamming Distance:3
>> Hamming Similarity:0.57
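Because only substitution is allowed, the distance is a position-by-position mismatch count. A minimal sketch with hypothetical function names follows. Raising ValueError on unequal lengths is an assumption about reasonable behavior, not a statement of what pytextdist does.

```python
def hamming_distance_sketch(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs inputs of equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_similarity_sketch(a, b):
    """Assumed normalization: 1 - distance / length."""
    return 1.0 if not a else 1 - hamming_distance_sketch(a, b) / len(a)


print(hamming_distance_sketch('kittens', 'sitting'))                # 3
print(round(hamming_similarity_sketch('kittens', 'sitting'), 2))    # 0.57
```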
Jaro & Jaro-Winkler Similarity: edit with transposition
from pytextdist.edit_distance import jaro_similarity, jaro_winkler_similarity
str_a = 'sitten'
str_b = 'sitting'
simi_j = jaro_similarity(str_a,str_b)
simi_jw = jaro_winkler_similarity(str_a,str_b)
print(f"Jaro Similarity:{simi_j:.2f}\nJaro-Winkler Similarity:{simi_jw:.2f}")
>> Jaro Similarity:0.85
>> Jaro-Winkler Similarity:0.91
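The two scores above can be reproduced with the textbook definitions. The sketch below uses hypothetical function names. It assumes the usual matching window of half the longer length minus one, a prefix scaling factor of 0.1, and a prefix cap of 4, and the package may choose differently.

```python
def jaro_sketch(a, b):
    """Jaro similarity from matched characters and half-transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters in original order; out-of-order pairs count as halves.
    a_seq = [ch for ch, hit in zip(a, a_hit) if hit]
    b_seq = [ch for ch, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler_sketch(a, b, p=0.1, max_prefix=4):
    """Boost the Jaro score for a shared prefix of up to max_prefix units."""
    jaro = jaro_sketch(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)


print(round(jaro_sketch('sitten', 'sitting'), 2))          # 0.85
print(round(jaro_winkler_sketch('sitten', 'sitting'), 2))  # 0.91
```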
By default, functions in the vector similarity module (pytextdist.vector_similarity) use unigrams (single words) to build vectors and calculate similarity. Set n to another number to compare n-gram (n contiguous words) vectors instead.
Cosine Similarity
from pytextdist.vector_similarity import cosine_similarity
phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = cosine_similarity(phrase_a, phrase_b, n=1)
simi_2 = cosine_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Cosine Similarity:{simi_1:.2f}\nBigram Cosine Similarity:{simi_2:.2f}")
>> Unigram Cosine Similarity:0.65
>> Bigram Cosine Similarity:0.38
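For readers who want to see where these numbers come from, the sketch below builds word n-gram count vectors by hand and takes their cosine. The helper names are hypothetical. The tokenizer is a simplified stand-in for the package's preprocessing (lowercase, letters only, empty tokens dropped), which is an assumption that reproduces the figures above for these two phrases.

```python
import math
import re
from collections import Counter


def word_ngrams(phrase, n=1):
    """Lowercase, keep letters only, drop empty tokens, count word n-grams."""
    words = [re.sub(r'[^a-z]', '', w) for w in phrase.lower().split()]
    words = [w for w in words if w]
    return Counter(' '.join(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine_sketch(phrase_a, phrase_b, n=1):
    """Cosine of the angle between the two n-gram count vectors."""
    vec_a, vec_b = word_ngrams(phrase_a, n), word_ngrams(phrase_b, n)
    dot = sum(count * vec_b[gram] for gram, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
print(round(cosine_sketch(phrase_a, phrase_b, n=1), 2))  # 0.65
print(round(cosine_sketch(phrase_a, phrase_b, n=2), 2))  # 0.38
```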
Jaccard Similarity
from pytextdist.vector_similarity import jaccard_similarity
phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = jaccard_similarity(phrase_a, phrase_b, n=1)
simi_2 = jaccard_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Jaccard Similarity:{simi_1:.2f}\nBigram Jaccard Similarity:{simi_2:.2f}")
>> Unigram Jaccard Similarity:0.47
>> Bigram Jaccard Similarity:0.22
Sorensen-Dice Similarity
from pytextdist.vector_similarity import sorensen_dice_similarity
phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = sorensen_dice_similarity(phrase_a, phrase_b, n=1)
simi_2 = sorensen_dice_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Sorensen Dice Similarity:{simi_1:.2f}\nBigram Sorensen Dice Similarity:{simi_2:.2f}")
>> Unigram Sorensen Dice Similarity:0.64
>> Bigram Sorensen Dice Similarity:0.36
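The Jaccard and Sorensen-Dice figures above are consistent with treating each phrase as a set of distinct n-grams, so repeated n-grams count once. That set-based reading is an inference from the printed numbers rather than something the package documents here. The helper names are hypothetical and the tokenizer is a simplified stand-in.

```python
import re


def ngram_set(phrase, n=1):
    """Distinct word n-grams after lowercasing and keeping letters only."""
    words = [re.sub(r'[^a-z]', '', w) for w in phrase.lower().split()]
    words = [w for w in words if w]
    return {' '.join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_sketch(phrase_a, phrase_b, n=1):
    """|A & B| / |A | B| over sets of n-grams."""
    set_a, set_b = ngram_set(phrase_a, n), ngram_set(phrase_b, n)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def dice_sketch(phrase_a, phrase_b, n=1):
    """2 * |A & B| / (|A| + |B|) over sets of n-grams."""
    set_a, set_b = ngram_set(phrase_a, n), ngram_set(phrase_b, n)
    total = len(set_a) + len(set_b)
    return 2 * len(set_a & set_b) / total if total else 0.0


phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
print(round(jaccard_sketch(phrase_a, phrase_b, n=1), 2))  # 0.47
print(round(jaccard_sketch(phrase_a, phrase_b, n=2), 2))  # 0.22
print(round(dice_sketch(phrase_a, phrase_b, n=1), 2))     # 0.64
print(round(dice_sketch(phrase_a, phrase_b, n=2), 2))     # 0.36
```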
Q-Gram Similarity
from pytextdist.vector_similarity import qgram_similarity
phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = qgram_similarity(phrase_a, phrase_b, n=1)
simi_2 = qgram_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Q-Gram Similarity:{simi_1:.2f}\nBigram Q-Gram Similarity:{simi_2:.2f}")
>> Unigram Q-Gram Similarity:0.32
>> Bigram Q-Gram Similarity:0.15
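Unlike the set-based measures, q-gram distance compares how often each n-gram occurs: it sums the absolute differences between the two count vectors. The normalization pytextdist applies is not stated here. Dividing by the size of the multiset union (the larger count of every n-gram) is an assumption that reproduces the 0.32 and 0.15 above. The helper names are hypothetical.

```python
import re
from collections import Counter


def ngram_counts(phrase, n=1):
    """Word n-gram counts after lowercasing and keeping letters only."""
    words = [re.sub(r'[^a-z]', '', w) for w in phrase.lower().split()]
    words = [w for w in words if w]
    return Counter(' '.join(words[i:i + n]) for i in range(len(words) - n + 1))


def qgram_similarity_sketch(phrase_a, phrase_b, n=1):
    """1 - (sum of absolute count differences) / (size of multiset union)."""
    cnt_a, cnt_b = ngram_counts(phrase_a, n), ngram_counts(phrase_b, n)
    # Counter subtraction keeps positive counts only, so the two one-sided
    # differences together give the absolute difference for every n-gram.
    distance = sum(((cnt_a - cnt_b) + (cnt_b - cnt_a)).values())
    union = sum((cnt_a | cnt_b).values())  # max count per n-gram
    return 1 - distance / union if union else 1.0


phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
print(round(qgram_similarity_sketch(phrase_a, phrase_b, n=1), 2))  # 0.32
print(round(qgram_similarity_sketch(phrase_a, phrase_b, n=2), 2))  # 0.15
```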
All functions run pytextdist.preprocessing.phrase_preprocessing to clean the input strings and convert them to a list of tokens.
- When grain="char": remove specific characters from the string and convert it to a list of characters
  The following boolean parameters control which characters to remove or change in the string (all True by default):
  - ignore_non_alnumspc: whether to remove all characters that are not alphanumeric or spaces
  - ignore_space: whether to remove all spaces
  - ignore_numeric: whether to remove all numeric characters
  - ignore_case: whether to convert all alphabetic characters to lower case
Example:
from pytextdist.preprocessing import phrase_preprocessing
before = 'AI Top-50'
after = phrase_preprocessing(before, grain='char')
print(after)
>> ['a', 'i', 't', 'o', 'p']
- When grain="word": convert the string to a list of words and remove specific characters from the words
  The string is first split into a list of words, assuming all words are separated by one space. The following boolean parameters then control which characters to remove or change in each word (all True by default):
  - ignore_non_alnumspc: whether to remove all characters that are not alphanumeric or spaces
  - ignore_numeric: whether to remove all numeric characters
  - ignore_case: whether to convert all alphabetic characters to lower case
Example:
from pytextdist.preprocessing import phrase_preprocessing
before = 'AI Top-50'
after = phrase_preprocessing(before, grain='word')
print(after)
>> ['ai', 'top']
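The two grains described above can be approximated in a few lines. This is a sketch of the documented behavior, not the package's source: the function name is hypothetical, the parameter names are borrowed from the description, and dropping words that become empty after cleaning is an assumption.

```python
import re


def preprocess_sketch(phrase, grain='char', ignore_non_alnumspc=True,
                      ignore_space=True, ignore_numeric=True, ignore_case=True):
    """Clean a phrase and split it into characters or words."""
    def clean(text):
        if ignore_non_alnumspc:
            text = re.sub(r'[^A-Za-z0-9 ]', '', text)  # drop punctuation etc.
        if ignore_numeric:
            text = re.sub(r'[0-9]', '', text)
        if ignore_case:
            text = text.lower()
        return text

    if grain == 'char':
        text = clean(phrase)
        if ignore_space:
            text = text.replace(' ', '')
        return list(text)
    if grain == 'word':
        # Words are assumed to be separated by single spaces.
        words = (clean(word) for word in phrase.split(' '))
        return [word for word in words if word]  # assumed: empty words dropped
    raise ValueError("grain must be 'char' or 'word'")


print(preprocess_sketch('AI Top-50', grain='char'))  # ['a', 'i', 't', 'o', 'p']
print(preprocess_sketch('AI Top-50', grain='word'))  # ['ai', 'top']
```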
Functions in the vector similarity module also run pytextdist.preprocessing.ngram_counter on the list returned by pytextdist.preprocessing.phrase_preprocessing.
- Convert a list of tokens to a counter of the n-grams
  The following parameter controls the n to use for n-grams (1 by default):
  - n: number of contiguous items to use to form a sequence
Example:
from pytextdist.preprocessing import phrase_preprocessing, ngram_counter
before = 'AI Top-50 Company'
after = phrase_preprocessing(before, grain='word')
print(after)
ngram_cnt_1 = ngram_counter(after, n=1)
print(ngram_cnt_1)
ngram_cnt_2 = ngram_counter(after, n=2)
print(ngram_cnt_2)
>> ['ai', 'top', 'company']
>> Counter({'ai': 1, 'top': 1, 'company': 1})
>> Counter({'ai top': 1, 'top company': 1})
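The counting step is a sliding window of width n over the token list. A minimal sketch with a hypothetical function name follows. Joining tokens with a single space is inferred from the keys in the output above.

```python
from collections import Counter


def ngram_counter_sketch(tokens, n=1):
    """Count every run of n contiguous tokens, joined by a single space."""
    return Counter(' '.join(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))


tokens = ['ai', 'top', 'company']
print(ngram_counter_sketch(tokens, n=1))  # Counter({'ai': 1, 'top': 1, 'company': 1})
print(ngram_counter_sketch(tokens, n=2))  # Counter({'ai top': 1, 'top company': 1})
```

A token list shorter than n yields an empty counter, since no full window fits.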