Prevent pymupdf4llm from removing hyphens at the end of lines #141
Replies: 2 comments 1 reply
-
Hm, actually, quite a lot of effort has been invested in recognizing and resolving hyphenation.
-
Sorry, the problem is the other way around. Hyphens are normally removed correctly, so that the separated word parts are joined as they should be. In the multi-column document, however, words at the end of a line that do not end with a hyphen are merged with the first word of the next line instead of being separated by a space. I tried replacing the line responsible for this, but that did not solve the problem: spaces then appear where the hyphens used to be. What can be done?
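To make the expected behaviour concrete, here is a minimal sketch of the joining rule described above. It is not pymupdf4llm's actual implementation; `join_lines` is a hypothetical helper, and it assumes the extracted lines are already available as a list of strings.

```python
def join_lines(lines):
    """Join extracted text lines into a single string.

    Hypothetical helper, not part of pymupdf4llm.
    A trailing hyphen is treated as a word break: the hyphen is dropped and
    the next line is appended without a space. Any other line ending is
    followed by exactly one space.
    """
    out = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if out.endswith("-"):
            # Previous line ended with a hyphen: rejoin the split word.
            out = out[:-1] + line
        elif out:
            # Previous line ended with a complete word: keep the words apart.
            out += " " + line
        else:
            out = line
    return out


text = join_lines(["The hyphen-", "ation is resolved", "and words stay apart"])
print(text)
```

Note that this simple rule also joins genuine compounds that happen to break at their hyphen (for example "multi-" followed by "column"), so a real implementation would need additional checks.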
-
I tried pymupdf4llm on multi-column PDF documents, and it did a good job. However, the hyphens at the end of lines are missing from the extracted text. I need them in order to remove the paragraph breaks and rejoin the word parts that belong together, so that the text document can be searched properly.
What would I have to change in pymupdf4llm so that hyphens are not removed?