Comparing Damerau-Levenshtein Implementations

Introduction

This repository contains three implementations of Damerau-Levenshtein string-distance computation in C++17.

The Program

This package builds a program imaginatively named dl. It is installed via the usual autotools incantations. dl will compute Damerau-Levenshtein distance over a corpus of input strings provided on the command line with one of three algorithms (on which more below). For performance benchmarking, you can specify that the program should run over the corpus repeatedly and/or print timing information on completion. Say dl --help for full usage.
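For instance (the second invocation below is hypothetical, inferred only from the description above that the corpus is passed on the command line; consult dl --help for the actual flags):

#+BEGIN_SRC sh
$> dl --help           # full usage
$> dl act cat cart     # hypothetical: a three-string corpus
#+END_SRC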

Discussion

The Damerau-Levenshtein distance between two strings A & B is the minimal number of insertions, deletions, single-character substitutions & transpositions of adjacent characters needed to transform A into B (e.g. “act” -> “cat” -> “cart” is one transposition followed by one insertion, so the D-L distance between “act” & “cart” is two). In his original paper [1], Damerau claimed that 80% of the errors in the system which gave rise to his work could be accounted for by one of these four error types.

While Damerau-Levenshtein is a popular metric for fuzzy searching, there doesn’t seem to be a “go-to” implementation out there in C/C++ (at the time of this writing, it is an open feature request for the Boost string library). There are several implementations on GitHub, but the ones I examined lacked both documentation of the algorithms used and test suites.

This repo contains C++ implementations of what I understand to be the three major algorithms for computing Damerau-Levenshtein distance.

Lowrance & Wagner’s Recursive Relation

This is the algorithm laid out on Wikipedia (again, at the time of this writing). It is due to Lowrance & Wagner [2] and seems to be the most widely known. In their paper, they prove that under a few (reasonable) conditions, a simple |A| x |B| recurrence relation suffices to compute the D-L distance between strings A & B. Their algorithm runs in O(|A|*|B|) time & space.
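To make the recurrence concrete, here is a minimal sketch of the Lowrance & Wagner computation with unit costs (my illustration, not the code this repo ships; the table is stored flat with manual two-dimensional indexing, per the note under “Results” below):

#+BEGIN_SRC c++
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Unrestricted Damerau-Levenshtein distance via the Lowrance & Wagner
// recurrence, unit costs, O(|A|*|B|) time & space. Indices are shifted by
// one so that row/column "-1" of the textbook formulation maps to
// row/column 0 here.
std::size_t dl_lowrance_wagner(const std::string& A, const std::string& B) {
    const std::size_t m = A.size(), n = B.size();
    const std::size_t INF = m + n;  // exceeds any attainable distance
    std::vector<std::size_t> d((m + 2) * (n + 2));
    auto at = [&](std::size_t i, std::size_t j) -> std::size_t& {
        return d[i * (n + 2) + j];  // flat storage, manual 2-D indexing
    };
    at(0, 0) = INF;
    for (std::size_t i = 0; i <= m; ++i) { at(i + 1, 0) = INF; at(i + 1, 1) = i; }
    for (std::size_t j = 0; j <= n; ++j) { at(0, j + 1) = INF; at(1, j + 1) = j; }

    std::array<std::size_t, 256> last_row{};  // last row where each byte occurred in A
    for (std::size_t i = 1; i <= m; ++i) {
        std::size_t last_col = 0;  // last column in this row where A[i-1] matched B
        for (std::size_t j = 1; j <= n; ++j) {
            const std::size_t i1 = last_row[static_cast<unsigned char>(B[j - 1])];
            const std::size_t j1 = last_col;
            const std::size_t cost = (A[i - 1] == B[j - 1]) ? 0 : 1;
            if (cost == 0) last_col = j;
            at(i + 1, j + 1) = std::min({
                at(i, j) + cost,      // match / substitute
                at(i + 1, j) + 1,     // insert
                at(i, j + 1) + 1,     // delete
                // transpose, plus the edits between the transposed pair
                at(i1, j1) + (i - i1 - 1) + 1 + (j - j1 - 1)});
        }
        last_row[static_cast<unsigned char>(A[i - 1])] = i;
    }
    return at(m + 1, n + 1);
}
#+END_SRC

For the “act”/“cart” pair above, this returns 2, as expected.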

Ukkonen

Ten years later, Ukkonen substantially improved the performance of this computation [3]. His paper contained two major advances. First, he proved (again, under conditions) that in order to compute the D-L distance, one need only evaluate the recurrence relation in a fairly tight band around its main diagonal, substantially decreasing the number of operations required. Second, he moved from the primary recurrence relation of Lowrance & Wagner to a dual problem: computing f(k,p), defined as the maximal index i on diagonal k for which the edit distance between A(1..i) & B(1..j) is p (where j = i + k); this doesn’t reduce the time complexity, but it does reduce the space. His algorithm runs in time O(s*min(|A|,|B|)) (where s is the D-L distance between A & B) and space O(min(s,|A|,|B|)).
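To illustrate just the banding half of the idea, here is a sketch of the cutoff applied to the plain Levenshtein recurrence (my simplification: no transpositions, and the full table is allocated for clarity, whereas Ukkonen evaluates f(k,p) over only the band):

#+BEGIN_SRC c++
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Ukkonen's cutoff: guess a threshold t, evaluate only the cells with
// |i - j| <= t, and double t until the result is certified. Any alignment
// of cost <= t can never leave that band, since every step off the
// diagonal costs at least 1, so a result <= t is exact.
std::size_t banded_distance(const std::string& A, const std::string& B) {
    const std::size_t m = A.size(), n = B.size();
    const std::size_t INF = m + n + 1;
    for (std::size_t t = 1;; t *= 2) {
        std::vector<std::size_t> d((m + 1) * (n + 1), INF);
        auto at = [&](std::size_t i, std::size_t j) -> std::size_t& {
            return d[i * (n + 1) + j];
        };
        for (std::size_t i = 0; i <= m; ++i) {
            for (std::size_t j = 0; j <= n; ++j) {
                if ((i > j ? i - j : j - i) > t) continue;  // outside the band
                if (i == 0)      at(i, j) = j;
                else if (j == 0) at(i, j) = i;
                else {
                    const std::size_t cost = (A[i - 1] == B[j - 1]) ? 0 : 1;
                    std::size_t best = at(i - 1, j - 1) + cost;  // diagonal is always in band
                    if (at(i - 1, j) != INF) best = std::min(best, at(i - 1, j) + 1);
                    if (at(i, j - 1) != INF) best = std::min(best, at(i, j - 1) + 1);
                    at(i, j) = best;
                }
            }
        }
        if (at(m, n) <= t) return at(m, n);  // within the band => exact
    }
}
#+END_SRC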

Berghel & Roach

Eleven years on, Berghel & Roach improved Ukkonen’s algorithm by deriving still tighter bounds on the region of the recurrence relation that needs to be computed, leading to (in their application) a 42% speedup over Ukkonen. If s is the edit distance from A to B, m & n their respective string lengths, and (WLOG) n >= m, define p = (s - (n - m))/2 (equivalently, s = (n - m) + 2p). Then the worst-case running time of this algorithm is O(n*p) and the expected running time is O(n + p*s).
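To put concrete (if tiny) numbers on that: for the “act”/“cart” example above, s = 2, n = 4 & m = 3, so p = (2 - 1)/2 = 1/2; the worst case is then on the order of n*p = 2 evaluations, versus the n*m = 12 cells the full Lowrance & Wagner table would compute (constants aside).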

After reading this paper, I spent some time searching the literature for references to it (looking for further improvements). I didn’t find any directly-related improvements. Later references either cited it as the best known to the author(s) for computing Damerau-Levenshtein distance, or as a point of interest in developing different string metrics.

Results

#+BEGIN_EXAMPLE
>$: uname -a
Linux saradoc 5.3.0-7625-generic #27~1576337002~19.10~bc3488b-Ubuntu SMP Sat Dec 14 18:31:03 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>$: cd test && make timing-tests
...
Lowrance & Wagner: processing took 5500ms
Ukkonen: processing took 1931ms
Berghel & Roach: processing took 1175ms
#+END_EXAMPLE

What do you know? The speed-up from Ukkonen to Berghel & Roach was 39.15%, very close to the promised 42% all those years ago.

More recent results:

#+BEGIN_EXAMPLE
$> date
Sun Jul 18 16:21:21 PM PDT 2021
$> uname -a
Linux bree 5.12.14-arch1-1 #1 SMP PREEMPT Thu, 01 Jul 2021 07:26:06 +0000 x86_64 GNU/Linux
$> cd test && make timing-tests
...
Lowrance & Wagner: processing took 7361ms
Ukkonen: processing took 1957ms
Berghel & Roach: processing took 1049ms
#+END_EXAMPLE

In my benchmarking, I’ve found it important to avoid the use of std::vector, which surprises me. I’ve instead just used flat arrays & done the two-dimensional indexing manually.
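For reference, the idiom is just the following (an illustrative sketch, not the repo’s code):

#+BEGIN_SRC c++
#include <cstddef>
#include <memory>

// One contiguous allocation; cell (i, j) of a conceptual R x C table is
// located by row-major arithmetic rather than through nested std::vectors.
int main() {
    const std::size_t R = 4, C = 5;
    std::unique_ptr<std::size_t[]> d(new std::size_t[R * C]);
    for (std::size_t i = 0; i < R; ++i)
        for (std::size_t j = 0; j < C; ++j)
            d[i * C + j] = 0;  // d[i * C + j] plays the role of d[i][j]
    return 0;
}
#+END_SRC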

All the other references check for common prefixes before starting the algorithm proper, which I haven’t tested yet; a sketch of the idea follows.
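A minimal sketch of that check (mine, and untested here, as noted above): trim the longest common prefix, then hand the shorter remainders to any of the three algorithms.

#+BEGIN_SRC c++
#include <cstddef>
#include <string_view>

// Drop the longest common prefix of a & b in place; matching leading
// characters contribute nothing to the distance, so the DP can run on the
// (shorter) remainders.
void trim_common_prefix(std::string_view& a, std::string_view& b) {
    std::size_t pre = 0;
    while (pre < a.size() && pre < b.size() && a[pre] == b[pre]) ++pre;
    a.remove_prefix(pre);
    b.remove_prefix(pre);
}
#+END_SRC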

References

  1. Fred J. Damerau, A Technique for Computer Detection and Correction of Spelling Errors, Communications of the ACM, 7 (1964), No. 3, 171-176.
  2. Roy Lowrance and Robert A. Wagner, An Extension of the String-to-String Correction Problem, Journal of the Association for Computing Machinery, 22 (1975), No. 2, 177-183.
  3. Esko Ukkonen, Algorithms for Approximate String Matching, Information and Control, 64 (1985), 100-118.
  4. Hal Berghel and David Roach, An Extension of Ukkonen’s Enhanced Dynamic Programming ASM Algorithm, ACM Transactions on Information Systems, 14 (1996), No. 1, 94-106.