Horrible failure with large documents #62
I've asked @stweil to submit the text, as I am curious why it's using more memory.
I'll add them here soon.
I used this command (use links to get texts and result): `dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt frak2021_0.905_1587027_9141630`
Unrelated: in the result, the lines from GT and OCR are side by side at the beginning, but that synchronization gets lost later. Why?
@mikegerber I am not familiar with dinglehopper, but I assume the editops calculation requires pretty much memory: for long texts I currently create a full Levenshtein matrix. An improved version could make use of bit-parallelism to improve the performance and only store bit vectors of the deltas between the matrix cells, based on https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142.1245&rep=rep1&type=pdf. Additionally it is possible to ignore some parts of the matrix which are not relevant, since they cannot be on the optimal path. In the best case (two strings of mostly similar length, which is the case in this usage) this should allow the implementation to skip 25% of the matrix, which could further reduce the memory usage to 3.15 GB.
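For illustration, here is a minimal sketch of the bit-parallel idea in the spirit of Myers' 1999 bit-vector algorithm (not rapidfuzz's actual code; Python's arbitrary-precision integers stand in for the fixed-size machine words a C++ implementation would block the computation into):

```python
def bitparallel_levenshtein(a: str, b: str) -> int:
    # The Levenshtein matrix is never stored; we only keep bit vectors
    # of the +1/-1 deltas between vertically adjacent cells.
    if not a:
        return len(b)
    m = len(a)
    mask = (1 << m) - 1
    top = 1 << (m - 1)
    # peq[c] has bit i set iff a[i] == c
    peq = {}
    for i, ch in enumerate(a):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    vp, vn = mask, 0  # vertical positive/negative deltas, initially all +1
    score = m
    for ch in b:
        eq = peq.get(ch, 0)
        d0 = ((((eq & vp) + vp) & mask) ^ vp) | eq | vn
        hp = vn | (~(d0 | vp) & mask)   # horizontal +1 deltas
        hn = vp & d0                    # horizontal -1 deltas
        if hp & top:
            score += 1
        if hn & top:
            score -= 1
        x = ((hp << 1) | 1) & mask
        vp = ((hn << 1) & mask) | (~(d0 | x) & mask)
        vn = x & d0
    return score
```

This computes only the distance; recovering the editops additionally needs the per-column delta vectors (or a divide-and-conquer scheme), which is where the remaining memory cost comes from.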
I will have a look into this in the next days; I have a hunch that it takes twice as much memory because of a memory leak. Before, I had reused the matrix for two runs (1. the distance aka CER calculation and 2. the character alignment), and now I believe I was careless and just let rapidfuzz calculate it twice because it is much faster. OTOH, rapidfuzz should free the memory when it's done with the processing. Your other suggestions sound promising! I'm going to have to read the paper. If the improved version still returns the¹ shortest alignment/distance, I don't see why not to use it. ¹ or one of the alignments with the shortest possible distance, to be more precise
I've opened #63 for this!
Mirrored here: |
Yes, rapidfuzz should free the matrix (and as far as I am aware, it does). Note that the CER calculation does not require much memory, since it only has to store the last matrix row; only the character alignment requires the whole matrix.
I already use this implementation to calculate the edit distance, just not when retrieving the editops (to keep the initial implementation simple).
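For reference, a minimal sketch of the distance-only calculation keeping a single matrix row (plain dynamic programming here, not the bit-parallel variant rapidfuzz actually uses):

```python
def levenshtein_distance(a: str, b: str) -> int:
    # Keep only the current row of the Levenshtein matrix: O(len(b)) memory.
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, row[0] = row[0], i  # prev holds the diagonal cell
        for j, cb in enumerate(b, 1):
            cur = min(row[j] + 1,         # deletion
                      row[j - 1] + 1,     # insertion
                      prev + (ca != cb))  # substitution / match
            prev, row[j] = row[j], cur
    return row[-1]
```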
@stweil Do you know why there are duplicate lines in the texts? It could be another bug in dinglehopper's text extraction. Would it be possible to get the PAGE versions of GT and OCR to check? (If necessary, DM me on Gitter and send it privately if that's a concern.)
The duplicate lines are fine. The texts were produced from single-line text files (not from PAGE / ALTO / hOCR) which we use for fine-tuning existing OCR models. And we create synthetic line images for the GT text lines, sometimes several images for the same text line.
While it is a problem that dinglehopper has issues with larger texts (i.e. newspaper pages), it is also best to feed it smaller texts if possible, as seems possible in this case. Alignment is O(length(GT) × length(OCR)). Is this your use case?
If so, there are better ways to do it than concatenating the input. I can imagine implementing a special mode that reads directories of line texts and summarizes the result into one page and one global CER; see the sketch below. Could you describe your input before the concatenation? (e.g. "one directory with *.gt.txt line texts and one directory with *.ocr.txt line texts with the same prefix")
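Such a mode could look roughly like this hypothetical sketch (the file layout, function name, and use of the rapidfuzz.distance API are illustrative assumptions, not dinglehopper's actual interface):

```python
from pathlib import Path

from rapidfuzz.distance import Levenshtein  # available in rapidfuzz >= 2.0


def directory_cer(gt_dir: str, ocr_dir: str) -> float:
    """Aggregate CER over matching *.gt.txt / *.txt line file pairs."""
    total_dist = 0
    total_len = 0
    for gt_file in sorted(Path(gt_dir).glob("*.gt.txt")):
        ocr_file = Path(ocr_dir) / gt_file.name.replace(".gt.txt", ".txt")
        if not ocr_file.exists():
            continue  # no OCR counterpart for this line
        gt = gt_file.read_text(encoding="utf-8").strip()
        ocr = ocr_file.read_text(encoding="utf-8").strip()
        total_dist += Levenshtein.distance(gt, ocr)
        total_len += len(gt)
    return total_dist / total_len if total_len else 0.0
```

Because each line is compared separately, the memory cost stays bounded by the longest single line instead of growing quadratically with the concatenated text.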
Here I have one directory with *.gt.txt lines and several directories with OCR results *.txt which were produced with different software / models / process parameters. Each dinglehopper run should compare the GT directory with one of the OCR result directories. And yes, the file prefixes are the same. But we also have the newspaper page use case.
Yes, two problems with two distinct solutions :)
@mikegerber I implemented the concept I described above (appears to work, but still needs more testing and some cleanup): rapidfuzz/rapidfuzz-cpp#58
Edit: I successfully fuzz-tested the new implementation.
@mikegerber I tested @stweil's original files with dinglehopper on my laptop, which has only 8 GB RAM:
So it is down to around 5 GB of memory usage and less than 25 seconds of runtime with the new version of rapidfuzz.
This is great news 😍 I think it was a great decision to use rapidfuzz as the backend library for dinglehopper: all the features I had wished for, and with great support and improvements from you! I'll be on vacation starting Thursday and I'll keep this issue open until I have tested this thoroughly (after the vacation). But for now I've bumped the dependency to rapidfuzz >= 1.9.1.
@mikegerber I finally came around to implementing editops for long sequences using a combination of Hirschberg's algorithm and the current algorithm. It splits the problem into subproblems until they are relatively small (around 2k characters) and then solves them using the existing bit-parallel algorithm. The memory usage improves accordingly (see the numbers below).
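The splitting idea can be sketched roughly like this (illustrative only, not rapidfuzz's actual implementation; `small_editops` stands in for the existing bit-parallel base-case solver):

```python
def last_row(a: str, b: str) -> list[int]:
    # Final row of the Levenshtein matrix in O(len(b)) memory; rapidfuzz
    # would compute this with the bit-parallel algorithm instead.
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, row[0] = row[0], i
        for j, cb in enumerate(b, 1):
            cur = min(row[j] + 1, row[j - 1] + 1, prev + (ca != cb))
            prev, row[j] = row[j], cur
    return row


def editops(a: str, b: str):
    if len(a) <= 2048:                  # small enough: solve directly
        return small_editops(a, b)      # hypothetical base-case solver
    mid = len(a) // 2
    left = last_row(a[:mid], b)               # forward pass, first half of a
    right = last_row(a[mid:][::-1], b[::-1])  # backward pass, second half
    # left[j] + right[len(b) - j] is the total cost when b is split at j;
    # pick the split that minimizes it, then recurse on the two halves.
    split = min(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    return editops(a[:mid], b[:split]) + editops(a[mid:], b[split:])
```

The key property is that only two linear-memory rows are ever alive at once, so the quadratic matrix is never materialized.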
Absolutely fantastic! I'll depend on the new version once it gets a release!
Now that the memory consumption has been reduced from around 45 GB to less than 100 MB and is no longer quadratic in the text length, I think this issue has been resolved 😉
Yeah, I think so too, just need to test it again!
Using @stweil's example:
I've also tested files that @cneud gave, which exploded to 40 GB of memory usage and made the system swap. Now they run in less than a second!! I'll add them to the test suite.
@stweil reported in Gitter:
@cneud also submitted a large document (a newspaper page).