python-text-distance

A Python implementation of a variety of text distance and similarity metrics.

Install
How to use
Modules
- Edit Distance
- Vector Similarity
Customize Preprocessing

Install

Requirements: Python >= 3.3. The examples below use f-strings, which need Python 3.6 or later.

Install Command: pip install pytextdist

How to use

The functions in this package take two strings as input and return the distance/similarity metric between them. Preprocessing of the strings is built into the functions, with recommended defaults. To change the preprocessing, see Customize Preprocessing.

Modules

Edit Distance

By default, functions in this module treat a single character as the unit of editing.

Levenshtein Distance & Similarity: edit with insertion, deletion, and substitution

```
from pytextdist.edit_distance import levenshtein_distance, levenshtein_similarity

str_a = 'kitten'
str_b = 'sitting'
dist = levenshtein_distance(str_a, str_b)
simi = levenshtein_similarity(str_a, str_b)
print(f"Levenshtein Distance:{dist:.0f}\nLevenshtein Similarity:{simi:.2f}")

>> Levenshtein Distance:3
>> Levenshtein Similarity:0.57
```
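
To show what the metric computes, here is a minimal pure-Python sketch of the classic dynamic-programming Levenshtein distance. It is an illustration, not pytextdist's actual implementation. The names `levenshtein_sketch` and `similarity_sketch` are hypothetical, and the normalization `1 - distance / max(len)` is an assumption that happens to agree with the 0.57 printed above.

```python
def levenshtein_sketch(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity_sketch(a, b):
    """Assumed normalization: 1 - distance / length of the longer input."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein_sketch(a, b) / longest


print(levenshtein_sketch("kitten", "sitting"))           # 3
print(round(similarity_sketch("kitten", "sitting"), 2))  # 0.57
```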

Longest Common Subsequence Distance & Similarity: edit with insertion and deletion

```
from pytextdist.edit_distance import lcs_distance, lcs_similarity

str_a = 'kitten'
str_b = 'sitting'
dist = lcs_distance(str_a, str_b)
simi = lcs_similarity(str_a, str_b)
print(f"LCS Distance:{dist:.0f}\nLCS Similarity:{simi:.2f}")

>> LCS Distance:5
>> LCS Similarity:0.62
```
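
The relation between the longest common subsequence and an insertion/deletion-only edit distance can be sketched as follows. This is not the package's code: `lcs_length`, `lcs_distance_sketch`, and `lcs_similarity_sketch` are hypothetical names, and dividing by the combined length is an assumed normalization that is consistent with the 0.62 shown above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)            # extend the common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a unit from a or from b
        prev = curr
    return prev[-1]


def lcs_distance_sketch(a, b):
    """Every unit outside the LCS must be deleted from a or inserted from b."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def lcs_similarity_sketch(a, b):
    """Assumed normalization: 1 - distance / (len(a) + len(b))."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else 1 - lcs_distance_sketch(a, b) / total


print(lcs_distance_sketch("kitten", "sitting"))              # 5
print(round(lcs_similarity_sketch("kitten", "sitting"), 2))  # 0.62
```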

Damerau-Levenshtein Distance & Similarity: edit with insertion, deletion, substitution, and transposition of two adjacent units

```
from pytextdist.edit_distance import damerau_levenshtein_distance, damerau_levenshtein_similarity

str_a = 'kitten'
str_b = 'sitting'
dist = damerau_levenshtein_distance(str_a, str_b)
simi = damerau_levenshtein_similarity(str_a, str_b)
print(f"Damerau-Levenshtein Distance:{dist:.0f}\nDamerau-Levenshtein Similarity:{simi:.2f}")

>> Damerau-Levenshtein Distance:3
>> Damerau-Levenshtein Similarity:0.57
```
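
The effect of allowing transpositions is easier to see on a pair that actually contains one. The sketch below implements the restricted "optimal string alignment" variant in plain Python. Whether pytextdist uses this variant or the unrestricted Damerau-Levenshtein distance is not stated here, so treat `osa_distance_sketch` (a hypothetical name) only as an illustration.

```python
def osa_distance_sketch(a, b):
    """Levenshtein distance extended with transposition of two adjacent units."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete everything from a[:i]
    for j in range(cols):
        d[0][j] = j  # insert everything from b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Adjacent transposition: the last two units are swapped between a and b.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance_sketch("kitten", "sitting"))  # 3, no transposition helps here
print(osa_distance_sketch("form", "from"))       # 1, one swap of "or" to "ro"
```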

Hamming Distance & Similarity: edit with substitution; note that the Hamming metric only works for phrases of the same length

```
from pytextdist.edit_distance import hamming_distance, hamming_similarity

str_a = 'kittens'
str_b = 'sitting'
dist = hamming_distance(str_a, str_b)
simi = hamming_similarity(str_a, str_b)
print(f"Hamming Distance:{dist:.0f}\nHamming Similarity:{simi:.2f}")

>> Hamming Distance:3
>> Hamming Similarity:0.57
```
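
The equal-length requirement follows directly from how the metric is defined, as this small pure-Python sketch shows. The function names are hypothetical, raising `ValueError` on unequal lengths is a choice made for this sketch (how pytextdist reacts is not described here), and the normalization by length is an assumption that matches the 0.57 above.

```python
def hamming_distance_sketch(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs inputs of equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_similarity_sketch(a, b):
    """Assumed normalization: share of positions that agree."""
    return 1.0 if len(a) == 0 else 1 - hamming_distance_sketch(a, b) / len(a)


print(hamming_distance_sketch("kittens", "sitting"))              # 3
print(round(hamming_similarity_sketch("kittens", "sitting"), 2))  # 0.57
```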

Jaro & Jaro-Winkler Similarity: edit with transposition

```
from pytextdist.edit_distance import jaro_similarity, jaro_winkler_similarity

str_a = 'sitten'
str_b = 'sitting'
simi_j = jaro_similarity(str_a, str_b)
simi_jw = jaro_winkler_similarity(str_a, str_b)
print(f"Jaro Similarity:{simi_j:.2f}\nJaro-Winkler Similarity:{simi_jw:.2f}")

>> Jaro Similarity:0.85
>> Jaro-Winkler Similarity:0.91
```
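
For reference, the textbook Jaro and Jaro-Winkler definitions can be sketched in plain Python as below. This is not taken from pytextdist: the function names are hypothetical, and the scaling factor `p=0.1` and prefix cap of 4 are the commonly used defaults, assumed here because they reproduce the values printed above.

```python
def jaro_sketch(a, b):
    """Textbook Jaro similarity between two sequences."""
    if not a and not b:
        return 1.0
    # Units match only if they are equal and no farther apart than this window.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    matched_a = [False] * len(a)
    matched_b = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not matched_b[j] and b[j] == ch:
                matched_a[i] = matched_b[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Compare matched units in order; each out-of-order pair is half a transposition.
    seq_a = [ch for ch, hit in zip(a, matched_a) if hit]
    seq_b = [ch for ch, hit in zip(b, matched_b) if hit]
    transpositions = sum(x != y for x, y in zip(seq_a, seq_b)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler_sketch(a, b, p=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to max_prefix units."""
    jaro = jaro_sketch(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)


print(round(jaro_sketch("sitten", "sitting"), 2))          # 0.85
print(round(jaro_winkler_sketch("sitten", "sitting"), 2))  # 0.91
```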

Vector Similarity

By default, functions in this module use unigrams (single words) to build vectors and calculate similarity. Set n to another value to use n-grams (n contiguous words) instead.

Cosine Similarity

```
from pytextdist.vector_similarity import cosine_similarity

phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = cosine_similarity(phrase_a, phrase_b, n=1)
simi_2 = cosine_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Cosine Similarity:{simi_1:.2f}\nBigram Cosine Similarity:{simi_2:.2f}")

>> Unigram Cosine Similarity:0.65
>> Bigram Cosine Similarity:0.38
```
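
To make the vector construction concrete, the sketch below rebuilds the computation in plain Python: tokenize, count n-grams, then take the cosine of the two count vectors. It is an approximation under stated assumptions, not the package's code. The helper names are hypothetical, and the cleanup (lowercase, letters only, empty tokens dropped) is an assumed stand-in for the default preprocessing that happens to reproduce the values above.

```python
import math
import re
from collections import Counter


def tokens_sketch(phrase):
    """Assumed cleanup: lowercase, keep letters only, drop tokens left empty."""
    cleaned = (re.sub(r"[^a-z]", "", word) for word in phrase.lower().split())
    return [word for word in cleaned if word]


def ngram_counts_sketch(tokens, n=1):
    """Count every run of n contiguous tokens, joined by a space."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_sketch(phrase_a, phrase_b, n=1):
    a = ngram_counts_sketch(tokens_sketch(phrase_a), n)
    b = ngram_counts_sketch(tokens_sketch(phrase_b), n)
    dot = sum(count * b[gram] for gram, count in a.items())  # missing grams count as 0
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


phrase_a = ('For Paperwork Reduction Act Notice, see your tax return instructions. '
            'For Paperwork Reduction Act Notice, see your tax return instructions.')
phrase_b = ('For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, '
            'see separate instructions. Form 1040')
print(round(cosine_sketch(phrase_a, phrase_b, n=1), 2))  # 0.65
print(round(cosine_sketch(phrase_a, phrase_b, n=2), 2))  # 0.38
```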

Jaccard Similarity

```
from pytextdist.vector_similarity import jaccard_similarity

phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = jaccard_similarity(phrase_a, phrase_b, n=1)
simi_2 = jaccard_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Jaccard Similarity:{simi_1:.2f}\nBigram Jaccard Similarity:{simi_2:.2f}")

>> Unigram Jaccard Similarity:0.47
>> Bigram Jaccard Similarity:0.22
```

Sorensen Dice Similarity

```
from pytextdist.vector_similarity import sorensen_dice_similarity

phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = sorensen_dice_similarity(phrase_a, phrase_b, n=1)
simi_2 = sorensen_dice_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Sorensen Dice Similarity:{simi_1:.2f}\nBigram Sorensen Dice Similarity:{simi_2:.2f}")

>> Unigram Sorensen Dice Similarity:0.64
>> Bigram Sorensen Dice Similarity:0.36
```
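
The Jaccard and Sorensen-Dice values printed above are consistent with the usual set-based definitions applied to the distinct n-grams of each phrase. The sketch below shows both side by side in plain Python. It is not the package's code: all names are hypothetical, the cleanup step is an assumed stand-in for the default preprocessing, and returning 0.0 for two empty inputs is an arbitrary choice.

```python
import re


def ngram_set_sketch(phrase, n=1):
    """Distinct n-grams after an assumed cleanup (lowercase, letters only)."""
    cleaned = (re.sub(r"[^a-z]", "", word) for word in phrase.lower().split())
    tokens = [word for word in cleaned if word]
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_sketch(phrase_a, phrase_b, n=1):
    """Size of the intersection divided by size of the union."""
    a, b = ngram_set_sketch(phrase_a, n), ngram_set_sketch(phrase_b, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice_sketch(phrase_a, phrase_b, n=1):
    """Twice the intersection size divided by the sum of both set sizes."""
    a, b = ngram_set_sketch(phrase_a, n), ngram_set_sketch(phrase_b, n)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


phrase_a = ('For Paperwork Reduction Act Notice, see your tax return instructions. '
            'For Paperwork Reduction Act Notice, see your tax return instructions.')
phrase_b = ('For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, '
            'see separate instructions. Form 1040')
print(round(jaccard_sketch(phrase_a, phrase_b, 1), 2))  # 0.47
print(round(jaccard_sketch(phrase_a, phrase_b, 2), 2))  # 0.22
print(round(dice_sketch(phrase_a, phrase_b, 1), 2))     # 0.64
print(round(dice_sketch(phrase_a, phrase_b, 2), 2))     # 0.36
```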

Q-Gram Similarity

```
from pytextdist.vector_similarity import qgram_similarity

phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = qgram_similarity(phrase_a, phrase_b, n=1)
simi_2 = qgram_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Q-Gram Similarity:{simi_1:.2f}\nBigram Q-Gram Similarity:{simi_2:.2f}")

>> Unigram Q-Gram Similarity:0.32
>> Bigram Q-Gram Similarity:0.15
```
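
Unlike the set-based metrics, a q-gram comparison takes n-gram counts into account. The values above are consistent with dividing the summed per-gram minimum counts by the summed per-gram maximum counts, which equals 1 minus the q-gram distance (sum of absolute count differences) over the summed maximum counts. The sketch below assumes that normalization. It is a guess that matches the printed numbers, not a statement of how pytextdist computes them, and every name in it is hypothetical.

```python
import re
from collections import Counter


def ngram_counts_sketch(phrase, n=1):
    """N-gram counts after an assumed cleanup (lowercase, letters only)."""
    cleaned = (re.sub(r"[^a-z]", "", word) for word in phrase.lower().split())
    tokens = [word for word in cleaned if word]
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def qgram_similarity_sketch(phrase_a, phrase_b, n=1):
    a, b = ngram_counts_sketch(phrase_a, n), ngram_counts_sketch(phrase_b, n)
    shared = sum((a & b).values())  # Counter & keeps the minimum count per gram
    total = sum((a | b).values())   # Counter | keeps the maximum count per gram
    return shared / total if total else 0.0


phrase_a = ('For Paperwork Reduction Act Notice, see your tax return instructions. '
            'For Paperwork Reduction Act Notice, see your tax return instructions.')
phrase_b = ('For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, '
            'see separate instructions. Form 1040')
print(round(qgram_similarity_sketch(phrase_a, phrase_b, 1), 2))  # 0.32
print(round(qgram_similarity_sketch(phrase_a, phrase_b, 2), 2))  # 0.15
```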

Customize Preprocessing

All functions call pytextdist.preprocessing.phrase_preprocessing to clean the input strings and convert them to a list of tokens.

When grain="char" - remove specific characters from the string and convert it to a list of characters

The following boolean parameters control which characters are removed from or changed in the string (all True by default):

- ignore_non_alnumspc: whether to remove all characters that are not alphabetic, numeric, or spaces
- ignore_space: whether to remove all spaces
- ignore_numeric: whether to remove all numeric characters
- ignore_case: whether to convert all alphabetic characters to lower case

Example:
```
from pytextdist.preprocessing import phrase_preprocessing

before = 'AI Top-50'
after = phrase_preprocessing(before, grain='char')
print(after)

>> ['a', 'i', 't', 'o', 'p']
```
When grain="word" - convert the string to a list of words and remove specific characters from the words

The string is first converted to a list of words, assuming all words are separated by one space. The following boolean parameters then control which characters are removed from or changed in each word (all True by default):

- ignore_non_alnumspc: whether to remove all characters that are not alphabetic, numeric, or spaces
- ignore_numeric: whether to remove all numeric characters
- ignore_case: whether to convert all alphabetic characters to lower case

Example:
```
from pytextdist.preprocessing import phrase_preprocessing

before = 'AI Top-50'
after = phrase_preprocessing(before, grain='word')
print(after)

>> ['ai', 'top']
```
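
To see how the flags interact, here is a plain-Python re-creation of the behavior described above for both grains. It reuses the documented flag names but is only a sketch: `preprocess_sketch` is a hypothetical function, the ASCII-only character classes are a simplification, and dropping words that end up empty is an assumption.

```python
import re


def preprocess_sketch(phrase, grain="word", ignore_non_alnumspc=True,
                      ignore_space=True, ignore_numeric=True, ignore_case=True):
    """Illustrative re-creation of the documented flags, not pytextdist's code."""
    def clean(text):
        if ignore_non_alnumspc:
            text = re.sub(r"[^A-Za-z0-9 ]", "", text)  # keep letters, digits, spaces
        if ignore_numeric:
            text = re.sub(r"[0-9]", "", text)
        if ignore_case:
            text = text.lower()
        return text

    if grain == "char":
        text = clean(phrase)
        if ignore_space:
            text = text.replace(" ", "")
        return list(text)
    # grain == "word": split on single spaces first, then clean each word.
    words = (clean(word) for word in phrase.split(" "))
    return [word for word in words if word]  # assumption: empty words are dropped


print(preprocess_sketch("AI Top-50", grain="char"))  # ['a', 'i', 't', 'o', 'p']
print(preprocess_sketch("AI Top-50", grain="word"))  # ['ai', 'top']
print(preprocess_sketch("AI Top-50", grain="word", ignore_numeric=False))  # ['ai', 'top50']
```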

Functions in the vector similarity module also call pytextdist.preprocessing.ngram_counter on the list returned by pytextdist.preprocessing.phrase_preprocessing.

Convert a list of tokens to a counter of the n-grams

The following parameter controls the n to use for n-grams (1 by default):

- n: number of contiguous items to use to form a sequence

Example:

```
from pytextdist.preprocessing import phrase_preprocessing, ngram_counter

before = 'AI Top-50 Company'
after = phrase_preprocessing(before, grain='word')
print(after)
ngram_cnt_1 = ngram_counter(after, n=1)
print(ngram_cnt_1)
ngram_cnt_2 = ngram_counter(after, n=2)
print(ngram_cnt_2)

>> ['ai', 'top', 'company']
>> Counter({'ai': 1, 'top': 1, 'company': 1})
>> Counter({'ai top': 1, 'top company': 1})
```
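
The counting step itself is small enough to sketch with the standard library. The function below is a hypothetical stand-in (`ngram_counter_sketch`), not the package's implementation. It slides a window of n tokens over the list and joins each window with a space, which gives keys in the same form as the output above.

```python
from collections import Counter


def ngram_counter_sketch(tokens, n=1):
    """Count every run of n contiguous tokens, joined by a single space."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


tokens = ['ai', 'top', 'company']
print(ngram_counter_sketch(tokens, n=1) == Counter({'ai': 1, 'top': 1, 'company': 1}))  # True
print(ngram_counter_sketch(tokens, n=2) == Counter({'ai top': 1, 'top company': 1}))    # True
```

A list shorter than n yields an empty counter in this sketch, because no full window fits.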
