Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement banded version of Levenshtein distance #178

Closed
maxbachmann opened this issue Jan 5, 2022 · 1 comment
Closed

implement banded version of Levenshtein distance #178

maxbachmann opened this issue Jan 5, 2022 · 1 comment

Comments

@maxbachmann
Copy link
Member

maxbachmann commented Jan 5, 2022

A banded version of the Levenshtein distance algorithm should be implemented as described in https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142.1245&rep=rep1&type=pdf.
This would reduce the runtime of string_metric.levenshtein from O(N/64 * M) to O(score_cutoff/64 * M which could significantly improve the performance on long sequences. This could be used the reduce the memory usage of string_metric.levenshtein_editops as well, since it is only require to store the relevant band of the Levenshtein matrix instead of storing the whole matrix. E.g. for the case described in qurator-spk/dinglehopper#62 this could reduce the memory usage from around 110k**2 * 3bit = 4.2GB to 110k * 4k * 3bit = 160MB.

@maxbachmann
Copy link
Member Author

maxbachmann commented Jan 9, 2022

@mikegerber as a first step I updated the algorithm to use the alignment algorithm by Heikki Hirrö (A Note on Bit-Parallel Alignment Computation.

This improves memory usage by 33% from len(s1) * len(s2) * 3 bit to len(s1) * len(s2) * 2 bit. In addition this makes the algorithm around 10-20% faster.

/usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:22.97 real,	16.55 user,	7.33 sys,	5013484 mmem

improved to

/usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:20.78 real,	14.23 user,	7.30 sys,	3353820 mmem

@maxbachmann maxbachmann added this to the v2.1 milestone Jan 24, 2022
@maxbachmann maxbachmann modified the milestones: v2.1, v2.2.0 Jun 29, 2022
@maxbachmann maxbachmann removed this from the v2.5.0 milestone Aug 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant