How does extract blocks work internally? #3826

Answered by JorjMcKie
samyak112 asked this question in Q&A

You are right:
All .get_text() variants are wrappers around TextPage methods, e.g. TextPage.extractBlocks(). A TextPage is created by the C code of our base library, MuPDF.
MuPDF contains an algorithm that creates a hierarchy of blocks -> lines -> spans -> characters based on heuristics that look at things like font, font size, rotation, vertical and horizontal proximity, and more. You should look at the documentation here and here for more detail.
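
To make the hierarchy concrete, here is a minimal sketch that walks a page dictionary shaped like the result of page.get_text("dict") and lists every span with its block and line number. The function name `flatten_spans` and the hand-made `SAMPLE_PAGE` are illustrative assumptions, not part of PyMuPDF; only the keys "blocks", "type", "lines", "spans", "font", "size" and "text" are taken from the documented dictionary layout.

```python
def flatten_spans(page_dict):
    """Return (block_no, line_no, font, size, text) for every text span.

    `page_dict` is assumed to have the layout of page.get_text("dict"):
    blocks -> lines -> spans. Image blocks (type 1) carry no lines and
    are skipped.
    """
    rows = []
    for block_no, block in enumerate(page_dict.get("blocks", [])):
        if block.get("type", 0) != 0:  # 0 = text block, 1 = image block
            continue
        for line_no, line in enumerate(block.get("lines", [])):
            for span in line.get("spans", []):
                rows.append(
                    (block_no, line_no, span["font"], span["size"], span["text"])
                )
    return rows


# Hand-made stand-in for a real page dictionary (hypothetical values).
SAMPLE_PAGE = {
    "blocks": [
        {
            "type": 0,
            "bbox": (72, 72, 300, 110),
            "lines": [
                {"spans": [{"font": "Helvetica-Bold", "size": 14.0, "text": "Title"}]},
                {
                    "spans": [
                        {"font": "Helvetica", "size": 11.0, "text": "Body "},
                        {"font": "Helvetica-Oblique", "size": 11.0, "text": "text"},
                    ]
                },
            ],
        },
        {"type": 1, "bbox": (72, 120, 300, 400)},  # image block, ignored
    ]
}

for row in flatten_spans(SAMPLE_PAGE):
    print(row)
```

With a real document, the same function should accept the result of page.get_text("dict") directly. Note how a change of font inside one line produces a new span, which is one of the heuristics mentioned above.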

If you think you can do better (either in general or in special cases), you can always go down to single characters using page.get_text("rawdict") or page.get_texttrace() and synthesize the words, lines, and paragraphs yourself.
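
As a rough illustration of synthesizing words yourself, the sketch below groups the characters of one line into words. It assumes each character is a dictionary with the keys "c" and "bbox", as found in the "chars" list of a "rawdict" span. The splitting rule (break on whitespace, or when the horizontal gap exceeds a fraction of the character height) and the names `chars_to_words` and `gap_factor` are my own assumptions for the example; this is not the algorithm MuPDF uses.

```python
def chars_to_words(chars, gap_factor=0.3):
    """Group one line's characters into (word, bbox) tuples.

    `chars` is a list of {"c": str, "bbox": (x0, y0, x1, y1)} in reading
    order. A new word starts at a whitespace character or when the gap to
    the previous character is wider than gap_factor * character height.
    """
    words = []
    current, bbox, prev_x1 = "", None, None
    for ch in chars:
        x0, y0, x1, y1 = ch["bbox"]
        is_space = ch["c"].isspace()
        wide_gap = prev_x1 is not None and (x0 - prev_x1) > gap_factor * (y1 - y0)
        if current and (is_space or wide_gap):
            words.append((current, tuple(bbox)))
            current, bbox = "", None
        if not is_space:
            current += ch["c"]
            if bbox is None:
                bbox = [x0, y0, x1, y1]
            else:  # grow the word rectangle to include this character
                bbox = [min(bbox[0], x0), min(bbox[1], y0),
                        max(bbox[2], x1), max(bbox[3], y1)]
        prev_x1 = x1
    if current:
        words.append((current, tuple(bbox)))
    return words


# Hand-made line (hypothetical coordinates): "Hi yo" followed by a wide gap and "u".
SAMPLE_CHARS = [
    {"c": c, "bbox": (x0, 0, x1, 10)}
    for c, x0, x1 in [("H", 0, 6), ("i", 6, 8), (" ", 8, 11),
                      ("y", 11, 16), ("o", 16, 21), ("u", 30, 35)]
]

print(chars_to_words(SAMPLE_CHARS))
```

On a real page, the character lists would come from page.get_text("rawdict")["blocks"][i]["lines"][j]["spans"][k]["chars"]. Tuning `gap_factor` per document, or replacing the rule entirely, is where special-case knowledge can beat the general heuristics.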
