Skip to content

ecomp-shONgit/string-distance

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

String Distance Implementation JS / PYTHON / PHP

A set of (string) distance-messures functions written in JavaScript / Python / PHP. Every function could be called with two representations of text to compute a distance, no matter if a bag of words represenations or a ngram represenation, string represenation is used. The implementation is derived from

Elena Deza / Michel-Marie Deza "Encyclopedia of Distances", Elsevier Science, 2009 (http://www.uco.es/users/ma1fegan/Comunes/asignaturas/vision/Encyclopedia-of-distances-2009.pdf)

C´edric Villani, "Optimal transport, old and new", Springer, 2008 (http://elenaher.dinauz.org/B07D.StFlour.pdf)

Examples

http://ecomparatio.net/~khk/string-distance-master/test_strdist.php

http://ecomparatio.net/~khk/string-distance-master/test_strdist_rulesets.php

Functions

1. Weighted Levenshtein

INPUT:

  • s1 and s2 as representations,
  • Wv a weight for pairs in A and B,
  • Ws a list of 4 weights related to the operations substitution, insertion, deletion, exchange,

RETURN: Number of edited Letters / sum of editweights,

CALL: WLEV( A, B, Wv, Ws )

2. Damerau Levenshtein

INPUT:

  • a text representation s1 and s2,
  • Ws a list of 4 weights related to the operations substitution, insertion, deletion, exchange,

RETURN: sum of editweights,

CALL: LEVDAM( s1, s2, Wv )

3. Levenshtein

INPUT:

  • s1 and s2 text representations,
  • Ws a list of 4 weights related to the operations substitution, insertion, deletion,

RETURN: number of edits,

CALL: levenshtein( s1, s2, Wv )

4. Longest Common Subsequence

NOTE: sequnece is not substring, it is like sequencial but not next to eachother,

INPUT: vecA and vecB text representations,

RETURN: 0 (distant) and max(len(A),len(B)) (not distant),

CALL: LCS( vecA, vecB )

5. Longest Common Substring

NOTE: factor, sequential and next to each other members of a vector,

INPUT: vecA and vecB text representations,

RETURN: 0 (distant, nothing in common) and max(len(A),len(B)) (not distant),

CALL: LCF( vecA, vecB )

6. containednessLCS

NOTE: according to LCS the containedness of a in b or b in a,

INPUT: a and b text representations,

RETURN: 1 (contained) and 0 (not contained),

CALL: containednessLCS( a, b )

7. containednessLCF

NOTE: according to LCF the containedness of a in b or b in a,

INPUT: a and b text representations,

RETURN: 1 (contained) and 0 (not contained),

CALL: containednessLCF( a, b )

8. Longest Commen Prefix,

INPUT: vecA and vecB text representations,

RETURN: 0 (distant) and max(len(A),len(B)) (not distant),

CALL: LCP( vecA, vecB )

9. Bag Distance

NOTE: vecA/vecB is a bag is a sequencial, and next to eachother, redundant vector, aproximation of levensthein,

INPUT: vecA and vecB text representations,

RETURN: max(len(A),len(B)) (distant) and 0 (not distant),

CALL: bagdist( vecA, vecB )

10. Jaro Distance

INPUT: vecA, vecB text represenations,

RETURN: Inf (distant) and 0.0 (not distant) (?),

CALL: JA( vecA, vecB )

11. Jaro-Winkler Distance

INPUT: vecA, vecB text represenations,

RETURN: Inf (distant) and 0.0 (not distant) (?),

CALL: JAWI( vecA, vecB )

12. Baire Distance

INPUT: vecA, vecB text represenations,

RETURN: Inf/1 (distant) and 0.0 (not distant),

CALL: baire( vecA, vecB )

13. not Baire Distance

INPUT: vecA, vecB text represenations,

RETURN: Inf/1 (distant) and 0.0 (not distant),

CALL: notbaire( vecA, vecB )

NOTE: same as baire but uses LCF

14. Gen. Cantor Distance

INPUT: vecA, vecB text represenations,

RETURN: Inf/1 (distant) and 0.0 (not distant),

CALL: generalizedcantor( vecA, vecB )

15. not Gen. Cantor Distance

INPUT: vecA, vecB text represenations,

RETURN: Inf/1 (distant) and 0.0 (not distant),

CALL: notgeneralizedcantor( vecA, vecB )

NOTE: same as generalizedcantor but uses LPC

16. JaccardZwei

NOTE: inverted Jaccard

INPUT: vecA, vecB text represenations,

RETURN: Inf/1 (distant) and 0.0 (not distant),

CALL: jaccardMASZzwei( vecA, vecB )

17. Jaccard Distance

INPUT: vecA, vecB text represenations,

RETURN: Inf/1 (distant) and 0.0 (not distant),

CALL: jaccardMASZ( vecA, vecB )

18. Cosine Distance

INPUT: vecA, vecB text represenations,

RETURN: Inf/1 (distant) and 0.0 (not distant),

CALL: cosineMASZ( vecA, vecB )

19. Quadratic Difference Distance

NOTE: vec A and B are arrays of ngrams, quadra diff is a messure taken from the haufigkeitsvektor of A and B

INPUT: vecA, vecB text represenations,

RETURN: Inf (distant) and 0.0 (not distant) (?),

CALL: quadradiffMASZ( vecA, vecB )

20. Dice Coefficent Distance

INPUT: vecA, vecB text represenations,

RETURN: Inf/1 (distant) and 0.0 (not distant),

CALL: diceMASZ( vecA, vecB )

21. Marking Distance

INPUT: vecA, vecB text represenations,

RETURN: Inf (distant) and 0.0 (not distant) (?),

CALL: markingmetric( vecA, vecB )

22. Set Difference Distance

INPUT: vecA, vecB text represenations,

RETURN: Inf (distant) and 0.0 (not distant) (?),

CALL: setdiffmetric( vecA, vecB )

Minimal Example

1 2 WLEV LEVDAM levenshtein LCS LCF containednessLCS containednessLCF LCP bagdist JA JAWI baire generalizedcantor jaccardMASZzwei jaccardMASZ cosineMASZ quadradiffMASZ diceMASZ markingmetric setdiffmetric
abcdefg abcdefg 0 0 0 7 7 1 1 7 0 0.4523809523809524 0.8357142857142857 0.125 0.0003354626279025119 1 0 1.1102230246251565e-16 0 0 0 0
abcdefg hijklm 7 7 7 0 0 0 0 0 7 7 4.6 1 0.36787944117144233 0 1 1 3.605551275463989 1 4.962844630259907 4.02535169073515
abcdefg Infinity Infinity Infinity 0 0 0 0 0 7 Infinity Infinity Infinity Infinity Infinity Infinity Infinity Infinity Infinity Infinity Infinity
abcdefg abc 4 4 4 3 3 0.42857142857142855 0.42857142857142855 3 4 0.6428571428571429 0.7857142857142858 0.25 0.018315638888734182 0.4285714285714286 0.5714285714285714 0.3453463292920228 2 0.4 2.0794415416798357 1.6094379124341003
abcdefg dfg 4 4 4 3 2 0.42857142857142855 0.2857142857142857 0 4 0.15873015873015872 0.49523809523809526 1 0.36787944117144233 0.4285714285714286 0.5714285714285714 0.3453463292920228 2 0.4 2.0794415416798357 1.6094379124341003
abcdefg acg 4 4 4 2 1 0.2857142857142857 0.14285714285714285 1 4 0.40079365079365076 0.6404761904761904 0.5 0.1353352832366127 0.4285714285714286 0.5714285714285714 0.3453463292920228 2 0.4 2.1972245773362196 1.6094379124341003
abcdefg cdef 3 3 3 4 4 0.5714285714285714 0.5714285714285714 0 3 0.39285714285714285 0.6357142857142857 1 0.36787944117144233 0.5714285714285714 0.4285714285714286 0.2440710539815456 1.7320508075688772 0.2727272727272727 1.6094379124341003 1.3862943611198906
abcdefg cdeui 4 4 4 3 3 0.42857142857142855 0.42857142857142855 0 4 0.3428571428571428 0.6057142857142856 1 0.36787944117144233 0.33333333333333337 0.6666666666666667 0.49290744716289014 2.449489742783178 0.5 3.332204510175204 2.70805020110221
abcdefg acihef 4 4 4 3 2 0.42857142857142855 0.2857142857142857 1 3 0.2896825396825397 0.5738095238095238 0.5 0.1353352832366127 0.4444444444444444 0.5555555555555556 0.38278660015163235 2.23606797749979 0.3846153846153846 3.4011973816621555 2.4849066497880004
TIME ING: 1.4200000017881393 0.905000003054738 0.6749999988824129 0.7299999948590994 0.5699999984353781 0.4899999927729368 0.41000000573694706 0.21000000089406967 0.35500000230968 0.6250000074505806 0.5399999991059303 0.25499999709427357 0.27000000327825546 0.8100000023841858 0.5399999972432852 0.9249999988824129 0.6900000013411045 0.41499999538064003 0.4449999947100878 0.4950000047683716

NOTE

PHP Version

If mbstring extension is installed, every string is treated as array. That gives better distance results than PHP levenshtein.

Python Version

Use Python3.x!

JS Version

Just tested with the latest Chrome and Forefox, not on server environment, no other interpreters tested. Drop a issue if needed.