Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Horrible failure with large documents #62

Closed
3 tasks
mikegerber opened this issue Dec 8, 2021 · 23 comments
Closed
3 tasks

Horrible failure with large documents #62

mikegerber opened this issue Dec 8, 2021 · 23 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@mikegerber
Copy link
Member

mikegerber commented Dec 8, 2021

@stweil reported in Gitter:

Improvements of dinglehopper are very welcome. The old version took more than 4 hours to process two text files with 1875 lines each and required about 30 GB RAM. The new version terminates after 2 minutes, but with out of memory: it was killed by the Linux kernel after using more than 60 GB RAM. :-(

@cneud also submitted a large document (a newspaper page).

  • Investigate why the new version uses even more memory
  • Consider falling back to more efficient algorithms if necessary
  • Consider a regression test for this
@mikegerber mikegerber self-assigned this Dec 8, 2021
@mikegerber mikegerber added the bug Something isn't working label Dec 8, 2021
@mikegerber
Copy link
Member Author

I've asked @stweil to submit the text, as I am curious why it's using more memory.

@stweil
Copy link
Contributor

stweil commented Dec 8, 2021

I'll add them here soon.

@stweil
Copy link
Contributor

stweil commented Dec 8, 2021

I used this command (use links to get texts and result):

dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt frak2021_0.905_1587027_9141630

@stweil
Copy link
Contributor

stweil commented Dec 8, 2021

Unrelated: in the result the lines from GT and OCR result are side by side at the beginning, but that synchronization gets lost later. Why?

@maxbachmann
Copy link
Contributor

@mikegerber I am not familiar with dinglehopper, but I assume the editops calculation requires pretty much memory. For long texts I currently create a levenshtein matrix of len(s1) * len(s2) * 32 Bit. Since the text files both have around 110k characters this alone should use around 45 Gb of memory.

The previous implementation used np.zeros((m + 1, n + 1), np.int), which as far as I know should require the same amount of memory. So I think this should not change the memory usage, but it could certainly be improved. This had no big pritority so far (mostly because nobody complained), but if thats the cause I could work on this.

An improved version could make use of bitparallelismus to improve the performance and only store bitvectors of the deltas between the matrix cells based on https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142.1245&rep=rep1&type=pdf.
Performance wise this would allow the implementation to calculate the matrix cells for 64 characters at once.
Memory wise this should only require me to store the vertical positive delta vector, the horizontal positive delta vector
and the diagonal zero delta vector and therefore only require around len(s1) * len(s2) * 3 Bit to store the matrix. So in the example above only around 4.2 Gb instead of 45 Gb.

Additionally it is possible to ignore some parts of the matrix which are not relevant, since they can not be on the optimal path, which in the best case (two strings of mostly similar length, which is the case in this usage) should allow the implementation to skip 25% of the matrix. This could further reduce the memory usage to 3.15 Gb.

@mikegerber
Copy link
Member Author

The previous implementation used np.zeros((m + 1, n + 1), np.int), which as far as I know should require the same amount of memory. So I think this should not change the memory usage, but it could certainly be improved.
This had no big pritority so far (mostly because nobody complained), but if thats the cause I could work on this.

I will have a look into this in the next days, I have a hunch that it takes twice at much memory because of a memory leak. Before, I had reused the matrix for two runs (1. the distance aka CER calculation and 2. the character alignment) and now I believe I was careless and just let rapidfuzz calculate it twice because it is much faster. OTOH, rapidfuzz should free the memory when it's done with the processing.

Your other suggestions sound promising! I'm going to have to read the paper. If the improved version still returns the¹ shortest aligment/distance I don't see why not to use it.

¹ or one of the alignments with the shortest possible distance, to be more precise

@mikegerber
Copy link
Member Author

Unrelated: in the result the lines from GT and OCR result are side by side at the beginning, but that synchronization gets lost later. Why?

I've opened #63 for this!

@mikegerber
Copy link
Member Author

I used this command (use links to get texts and result):

dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt frak2021_0.905_1587027_9141630

Mirrored here:

@maxbachmann
Copy link
Contributor

maxbachmann commented Dec 8, 2021

Before, I had reused the matrix for two runs (1. the distance aka CER calculation and 2. the character alignment) and now I believe I was careless and just let rapidfuzz calculate it twice because it is much faster. OTOH, rapidfuzz should free the memory when it's done with the processing.

Yes rapidfuzz should free the matrix (and as far as I am aware it does). Note that the CER calculation does not require much memory, since it only has to store the last matrix row. Only the character alignment requires the whole matrix

Your other suggestions sound promising! I'm going to have to read the paper. If the improved version still returns the¹ shortest aligment/distance I don't see why not to use it.

I already use this implementation to calculate the edit distance, just not when retrieving the editops (to keep the initial implementation simple).

@mikegerber
Copy link
Member Author

@stweil Do you know why there are duplicates lines in the texts? It could be another bug in dinglehopper's text extraction. Would it be possible to get the PAGE versions of GT and OCR to check? (If necessary, DM me on Gitter and send it privately if that's a concern.)

@stweil
Copy link
Contributor

stweil commented Dec 8, 2021

The duplicate lines are fine. The texts were produced from single line text files (not from PAGE / ALTO / hOCR) which we use for finetuning of existing OCR models. And we create synthetic line images for the GT text lines, sometimes several images for the same text line.

@mikegerber
Copy link
Member Author

mikegerber commented Dec 9, 2021

The duplicate lines are fine. The texts were produced from single line text files (not from PAGE / ALTO / hOCR) which we use for finetuning of existing OCR models. And we create synthetic line images for the GT text lines, sometimes several images for the same text line.

While it is a problem that dinglehopper has issues with larger texts (i.e. newspaper pages), it is also best to feed it smaller texts if possible like it seems in this case. Alignment is O(length(GT) x length(OCR)).

Is this your use case?

  • Compute CER and give a visualization for all text lines (line GT vs line OCR)
  • Aggregate the CER over all text lines

If so, there are better ways to do it than concatenating the input. I can imagine implementing a special mode to read directories of line texts and summarize the result into one page and one global CER. Could you describe your input before the concatenation? (i.e. "one directory with *.gt.txt line texts and one directory with *.ocr.txt line texts with the same prefix")

@stweil
Copy link
Contributor

stweil commented Dec 9, 2021

Here I have one directory with *.gt.txt lines and several directories with OCR results *.txt which where produced with different software / models / process parameters. Each dinglehopper run should compare the GT directory with one of the OCR results directories. And yes, the file prefixes are the same.

But we also have the newspaper page use case.

@mikegerber
Copy link
Member Author

But we also have the newspaper page use case.

Yes two problems with two distinct solutions :)

@maxbachmann
Copy link
Contributor

maxbachmann commented Dec 11, 2021

@mikegerber I implemented the concept I described above (appears to work, but still needs more testing and some cleanup): rapidfuzz/rapidfuzz-cpp#58

  • It reduces the memory usage from 32 bit to 3 bit per cell (around 10x improvement)
  • It significantly improves the performance for all strings especially for long ones, since it reduces the time complexity from O(N*M) to O([N/64]*M). E.g. when using two strings with 20k characters I achieved around a 20x improvement in runtime.

Edit: I successfully fuzz tested the new implementation.
The improved version is available in v1.9.0: https://github.com/maxbachmann/RapidFuzz/releases/tag/v1.9.0

@maxbachmann
Copy link
Contributor

maxbachmann commented Dec 13, 2021

@mikegerber I tested @stweil's original files with dinglehopper on my laptop which has only 8gb ram:

(base) [max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:22.97 real,	16.55 user,	7.33 sys,	5013484 mmem

So it is down to around 5gb of memory usage and less than 25 seconds of runtime with the new version of rapidfuzz.

@mikegerber
Copy link
Member Author

@mikegerber I tested @stweil's original files with dinglehopper on my laptop which has only 8gb ram:

(base) [max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:22.97 real,	16.55 user,	7.33 sys,	5013484 mmem

So it is down to around 5gb of memory usage and less than 25 seconds of runtime with the new version of rapidfuzz.

This is great news 😍 I think it was a great decision to use rapidfuzz as the backend library for dinglehopper - all the features I had wished for and with great support and improvements from you!

I'll be on vacation starting Thursday and I'll keep this issue open until I have tested this thoroughly (after the vacation). But for now I've bumped the dependency to rapidfuzz >= 1.9.1.

@maxbachmann
Copy link
Contributor

maxbachmann commented Jul 24, 2022

@mikegerber I finally came around to implement editops for long sequences using a combination of Hirschbergs algorithm and the current algorithm. It splits the problems into subproblems until it is relatively small (around 2k characters) and then solves it using the existing bitparallel algorithm.
This reduces memory usage to O(N) and since it jumps around less in memory it improves performance for long sequences as well.

/usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:06.40 real,	3.65 user,	2.73 sys,	3371976 mmem

improves to

/usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:05.78 real,	3.86 user,	1.91 sys,	92228 mmem

@mikegerber
Copy link
Member Author

improves to

/usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:05.78 real,	3.86 user,	1.91 sys,	92228 mmem

Absolutely fantastic! I'll depend on the new version when it gets a release version!

@maxbachmann
Copy link
Contributor

I think now that the memory consumption has been reduced from around 45Gb to less than 100mb and is no longer quadratic to the text length I think this issue has been resolved 😉

@mikegerber
Copy link
Member Author

I think now that the memory consumption has been reduced from around 45Gb to less than 100mb and is no longer quadratic to the text length I think this issue has been resolved 😉

Yeah I think so too, just need to test it again!

@mikegerber mikegerber added this to the 1.0 milestone Mar 2, 2023
@mikegerber
Copy link
Member Author

Using @stweil's example:

% /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
        0:04.15 real,   5.00 user,      1.35 sys,       76916 mmem

I've also tested files that @cneud gave which exploded to 40 GB memory usage and the system would swap. Now they run in less than a second!! I'll add them to the test suite.

@mikegerber
Copy link
Member Author

0fd4ea1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants