import pymupdf
# simulate font effects
doc = pymupdf.open()
page = doc.new_page()
text1 = "This is normal text."
text2 = "This text simulates italic."
text3 = "This text simulates bold."
mat = pymupdf.Matrix(1, 0, 0.5, 1, 0, 0)
p = pymupdf.Point(100, 100)
page.insert_text(p, text1)
p += (0, 20)
page.insert_text(p, text2, morph=(p, mat)) # use a matrix to morph the text
p += (0, 20)
page.insert_text(p, text3)
page.insert_text(p + (0.5, 0), text3) # write a second time with horizontal shift
doc.ez_save(__file__.replace(".py", ".pdf"))
-
@JorjMcKie Thanks for your immediate reply. I've tried various tools on the bug.pdf file, including pdf2docx, PyMuPDF, and PDF readers like PDFgear, but they all hit the same issue: the original text 'Bất động sản đầu tư' is incorrectly extracted as 'B t ñng s n ñu tư', although some of the other words are extracted with the correct characters. The Aspose-Words package, on the other hand, detects 'Bất đống sản đầu tư', which is closer to the original, though the font is slightly off. Do you have any suggestions for recognizing the characters accurately while still using PyMuPDF?
-
Description of the bug
As far as I know, the page.get_text("dict") API gives access to span bounding boxes (a span being a run of text with the same font style). However, some PDFs are encoded in unusual ways, and PyMuPDF does not seem to detect spans correctly for these files.
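For readers who have not used this API, here is a minimal sketch of walking the blocks, lines, and spans that page.get_text("dict") returns. It is not the snippet from the original report. `iter_spans` and `spans_from_pdf` are hypothetical helper names, and the file path is whatever PDF you want to inspect.

```python
def iter_spans(page_dict):
    """Yield (text, font, size, bbox) for every span in a get_text("dict") result."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield span["text"], span["font"], span["size"], tuple(span["bbox"])


def spans_from_pdf(path, page_number=0):
    """Hypothetical helper: open a PDF and list the spans on one page."""
    import pymupdf  # imported here so iter_spans also works on plain dicts

    doc = pymupdf.open(path)
    try:
        return list(iter_spans(doc[page_number].get_text("dict")))
    finally:
        doc.close()
```

Calling `spans_from_pdf("no_bug.pdf")` and `spans_from_pdf("bug.pdf")` and comparing the two lists should make it easier to see where the spans are being split.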
How to reproduce the bug
When I run span detection with the code below on the no_bug.pdf file, it works fine and the spans are detected fairly accurately (as shown in this image).
But when I switch to this PDF file (bug.pdf), it breaks: the bounding boxes are split in odd ways, and the bold/italic font styles are also inaccurate.
Can someone tell me why this happens?
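One way to check how PyMuPDF classifies a span's style is to decode the integer in the span's "flags" field. The sketch below assumes the bit layout given in the PyMuPDF documentation (bit 0 superscript, bit 1 italic, bit 2 serifed, bit 3 monospaced, bit 4 bold). `FLAG_BITS` and `style_names` are hypothetical names used only for illustration.

```python
# Assumed bit meanings for a span's "flags" value, per the PyMuPDF docs.
FLAG_BITS = {
    1: "superscript",
    2: "italic",
    4: "serifed",
    8: "monospaced",
    16: "bold",
}


def style_names(flags):
    """Return the names of all style bits that are set in a span's flags."""
    return [name for bit, name in FLAG_BITS.items() if flags & bit]


# Example: a span whose flags value is 18 has the italic and bold bits set.
print(style_names(18))
```

These flags describe the font itself. If a PDF simulates bold or italic by drawing text twice or shearing it, as in the reply's example, the flags would most likely not show it.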
PyMuPDF version
1.24.9
Operating system
Linux
Python version
3.10