-
So I am using the get_text("blocks") function to get the coordinates of the different text blocks on a page. My problem is that sometimes the blocks are not precise and PyMuPDF returns the whole page as one text block. Can anyone explain why that happens? What is the algorithm or design decision behind how blocks are created? Knowing this would help me decide whether I need to find something else or just tweak this. I looked into the code and I think the actual implementation of block creation is in compiled C++ binaries. Please help. Thank you.
Replies: 1 comment 9 replies
-
You are right:

All `.get_text()` variants are wrappers of `TextPage` methods, e.g. `TextPage.extractBlocks()`. A TextPage is created by C code of our base library, MuPDF.

MuPDF contains an algorithm that creates a hierarchy of blocks -> lines -> spans -> characters based on heuristics which look at things like font, font size, rotation, vertical and horizontal proximity and more. You should look at the documentation here and here for more detail.

If you think you can do better (either in general or in special cases), you can always go down to single characters using `page.get_text("rawdict")` or `page.get_texttrace()` and synthesize the words, lines, paragraphs yourself.
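
To make the "synthesize it yourself" route a bit more concrete, here is a minimal sketch of regrouping characters into lines and blocks. It only assumes the nested layout that `page.get_text("rawdict")` returns (blocks -> lines -> spans -> chars, each char carrying `"c"` and `"bbox"`). The helper names (`iter_chars`, `group_into_lines`, `group_into_blocks`) and the thresholds `y_tol` and `gap_factor` are made up for illustration and are not part of PyMuPDF. This is a simple vertical-gap heuristic, not the algorithm MuPDF uses, and it ignores columns, rotation and fonts.

```python
from statistics import median


def iter_chars(rawdict):
    """Yield (character, bbox) for every character in a rawdict-shaped dict."""
    for block in rawdict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block; image blocks are skipped
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                for ch in span["chars"]:
                    yield ch["c"], tuple(ch["bbox"])


def group_into_lines(chars, y_tol=2.0):
    """Group characters whose vertical centers are within y_tol (hypothetical
    threshold, in points) of the first character of the current line."""
    def center_y(item):
        return (item[1][1] + item[1][3]) / 2

    groups = []
    for item in sorted(chars, key=lambda it: (center_y(it), it[1][0])):
        if groups and abs(center_y(item) - groups[-1]["cy"]) <= y_tol:
            groups[-1]["chars"].append(item)
        else:
            groups.append({"cy": center_y(item), "chars": [item]})

    lines = []
    for group in groups:
        items = sorted(group["chars"], key=lambda it: it[1][0])  # left to right
        text = "".join(c for c, _ in items)
        bbox = (
            min(b[0] for _, b in items),
            min(b[1] for _, b in items),
            max(b[2] for _, b in items),
            max(b[3] for _, b in items),
        )
        lines.append((text, bbox))
    return lines


def group_into_blocks(lines, gap_factor=1.5):
    """Start a new block whenever the vertical gap between consecutive lines
    exceeds gap_factor (hypothetical) times the median line height."""
    if not lines:
        return []
    med_height = median(b[3] - b[1] for _, b in lines)
    blocks = [[lines[0]]]
    for prev, cur in zip(lines, lines[1:]):
        gap = cur[1][1] - prev[1][3]  # top of current minus bottom of previous
        if gap > gap_factor * med_height:
            blocks.append([cur])
        else:
            blocks[-1].append(cur)
    return blocks
```

With a real page you would feed it something like `group_into_blocks(group_into_lines(iter_chars(page.get_text("rawdict"))))` and then tune the two thresholds for your documents. Multi-column layouts would need an additional horizontal split before the vertical grouping.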