How does extract blocks works internally? #3826

samyak112 · 2024-08-30T05:16:22Z


I am using get_text("blocks") to get the coordinates of the text blocks on a page. My problem is that the blocks are sometimes imprecise, and PyMuPDF returns the whole page as one text block. Can anyone explain why that happens? What is the algorithm or design decision behind how blocks are created? Knowing this would help me decide whether I need to find something else or can just tweak this. I looked into the code, and I think the actual block-building implementation lives in compiled C++ binaries. Please help.

Thank you


JorjMcKie · 2024-08-30T08:46:45Z

JorjMcKie
Aug 30, 2024
Maintainer

You are right:
All .get_text() variants are wrappers of TextPage methods, e.g. TextPage.extractBlocks(). A TextPage is created by C code in our base library, MuPDF.
MuPDF contains an algorithm that creates a hierarchy of blocks -> lines -> spans -> characters, based on heuristics that look at things like font, font size, rotation, vertical and horizontal proximity, and more. You should look at the documentation here and here for more detail.

If you think you can do better (either in general or in special cases), you can always go down to single characters using page.get_text("rawdict") or page.get_texttrace() and synthesize the words, lines, and paragraphs yourself.
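
To illustrate the "go down to single characters" route, here is a minimal sketch that rebuilds words from a "rawdict"-style dictionary. It assumes the layout described in the PyMuPDF documentation (keys "blocks", "type", "lines", "spans", "chars", and per character "c" and "bbox"). The helper name words_from_rawdict is hypothetical, and the splitting rule (break on whitespace within a line) is just one possible choice, not what MuPDF does internally.

```python
def words_from_rawdict(page_dict):
    """Rebuild (x0, y0, x1, y1, text) tuples from a "rawdict"-style dict.

    Hypothetical helper: a word is a run of non-whitespace characters
    inside one line; its bbox is the union of its character bboxes.
    """
    words = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # type 0 = text block, skip images
            continue
        for line in block["lines"]:
            text, box = "", None
            for span in line["spans"]:
                for ch in span["chars"]:
                    if ch["c"].isspace():
                        # whitespace ends the current word, if any
                        if text:
                            words.append((*box, text))
                        text, box = "", None
                        continue
                    x0, y0, x1, y1 = ch["bbox"]
                    if box is None:
                        box = [x0, y0, x1, y1]
                    else:
                        box = [min(box[0], x0), min(box[1], y0),
                               max(box[2], x1), max(box[3], y1)]
                    text += ch["c"]
            # flush the last word of the line
            if text:
                words.append((*box, text))
    return words
```

With a real page this would be called as words_from_rawdict(page.get_text("rawdict")). From there you can apply your own rules for joining words into lines and paragraphs.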

9 replies

JorjMcKie Aug 30, 2024
Maintainer

If you do words = page.get_text("words"), then you get a list of tuples like this one:
(x0, y0, x1, y1, "word1", ...).
The first 4 values are the bounding box coordinates of the string "word1" on the page. You can convert this to a formal pymupdf rectangle via wrect = pymupdf.Rect(words[i][:4]) for the i-th tuple.

In my terminology, a "word" is any string not containing spaces. So if extraction encounters something like " something², ", the respective tuple will contain the bbox of that string, and item 4 will contain "something2,".

Hope I am clear.
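
As a small illustration of working with these tuples, the sketch below joins words back into lines. It assumes the items after the word string are block_no, line_no and word_no, as described in the PyMuPDF documentation for the "words" variant. The helper name lines_from_words is made up for this example.

```python
def lines_from_words(words):
    """Group "words" tuples into (x0, y0, x1, y1, text) lines.

    Assumes each tuple is
    (x0, y0, x1, y1, word, block_no, line_no, word_no).
    """
    lines = {}  # (block_no, line_no) -> [x0, y0, x1, y1, [words]]
    for x0, y0, x1, y1, text, block_no, line_no, _word_no in words:
        key = (block_no, line_no)
        if key not in lines:
            lines[key] = [x0, y0, x1, y1, [text]]
        else:
            entry = lines[key]
            # grow the line bbox to cover the new word
            entry[0] = min(entry[0], x0)
            entry[1] = min(entry[1], y0)
            entry[2] = max(entry[2], x1)
            entry[3] = max(entry[3], y1)
            entry[4].append(text)
    # dicts keep insertion order, so lines come out in extraction order
    return [(e[0], e[1], e[2], e[3], " ".join(e[4])) for e in lines.values()]
```

With a real page this would be called as lines_from_words(page.get_text("words")). The first four items of each result can be passed to pymupdf.Rect in the same way as for a single word.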

JorjMcKie Aug 30, 2024
Maintainer

This special "words" extraction variant also supports a parameter "delimiters=", which lets you define additional word separators when needed. So if you use "delimiters=string.punctuation", then only the bbox and string for "something2" are returned.

samyak112 Aug 30, 2024
Author

So just to be clear, so that I am not just getting "lucky" next time: get_text("words") returns the coordinates of each word, and it won't have accuracy problems like "blocks", which sometimes returns the entire page as one block. And this is because a word is defined as a sequence of characters without spaces. Correct?

Follow-up question:
Do you think it would be worth contributing if I wrote a new function that creates blocks more precisely? It would use get_text("words") under the hood instead of the C++ implementation to create blocks. Or is that something people have already tried?

Thanks for such great support.

JorjMcKie Aug 30, 2024
Maintainer

Yes - you got it right.

As for the second part, I am not sure it would be worth the effort.
In current times, we are more interested in determining "natural" reading sequences (whatever this may mean in non-left-to-right, top-to-bottom languages 🤷‍♂️). This is a major problem in PDF - and we do have almost complete solutions.
A second problem field is recognizing the page layout (with respect to e.g. multi-column text) - without using resource hogs like OCR.
A third problem is table recognition. We have a solution - which is imperfect, as are all other solutions on the market.

The standard block determination of MuPDF does not ignore white space text. On rare occasions, this can lead to blocks that look surprising at first sight. But mostly this can easily be overcome - few people really need the blocks; most rather look at text spans to build up a usable page layout.
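
For anyone who still wants to experiment with custom blocks built from words, here is a rough sketch of one naive approach: start a new block whenever the vertical gap to the previous block exceeds a threshold. This is not MuPDF's algorithm. The helper name blocks_from_words and the max_gap value are assumptions for illustration, and it only makes sense for single-column pages whose words arrive in top-to-bottom reading order.

```python
def blocks_from_words(words, max_gap=5.0):
    """Merge "words" tuples into (x0, y0, x1, y1, text) blocks.

    Naive rule (an assumption, not MuPDF's heuristic): a word joins the
    current block unless its top edge lies more than max_gap below the
    block's bottom edge. Input order is kept as reading order.
    """
    blocks = []  # each entry: [x0, y0, x1, y1, [words]]
    for x0, y0, x1, y1, text, *_ in words:
        last = blocks[-1] if blocks else None
        if last is not None and y0 - last[3] <= max_gap:
            # close enough vertically: extend the current block
            last[0] = min(last[0], x0)
            last[1] = min(last[1], y0)
            last[2] = max(last[2], x1)
            last[3] = max(last[3], y1)
            last[4].append(text)
        else:
            blocks.append([x0, y0, x1, y1, [text]])
    return [(b[0], b[1], b[2], b[3], " ".join(b[4])) for b in blocks]
```

With a real page this would be called as blocks_from_words(page.get_text("words")). The threshold would need tuning per document, for example relative to the font size, and multi-column layouts need a column split first.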

samyak112 Aug 30, 2024
Author

Okay, wait: regarding your second problem, are you talking about layouts like those of research papers, where the text at the bottom of one column continues at the top of the second column? I was actually able to overcome that, as I needed to get the text in continuous reading order. So if you are talking about the same thing, I would be happy to share that solution.

And regarding the table recognition problem, can you tell me what the ideal output is? I know I am just a new guy and it's not really worth explaining all that to me, so if you can just point me to some documentation where I can read about the ideal output, I would love to at least try to solve it.
