Find and remove watermarks in PDF file #1855
-
I am currently trying to use PyMuPDF to remove watermarks in PDF files. For example, I have a file like this:

The code I used for extracting is like this:

    import fitz  # PyMuPDF

    document = fitz.open(self.input)
    for each_page in document:
        image_list = each_page.getImageList()
        for image_info in image_list:
            pix = fitz.Pixmap(document, image_info[0])
            png = pix.tobytes()  # return picture in png format
            if png == watermark_image:
                document._deleteObject(image_info[0])
    document.save(out_filename)

I also tried to check the page's other attributes: no annotations, no links, no widgets. I have no idea how the mark is stored.
-
The watermarks in your example file are stored as so-called marked-content /Artifacts.

1. Determine presence of marked-content watermarks

First standardize the page's /Contents objects, then confirm the presence of this watermark type:

    page.clean_contents()  # standardize /Contents; only one such object is left over
    xref = page.get_contents()[0]  # get xref of resulting /Contents object
    cont = bytearray(page.read_contents())  # read the contents source as a (modifiable) bytearray
    if cont.find(b"/Subtype/Watermark") > 0:  # this will confirm a marked-content watermark is present
        print("marked-content watermark present")

2. Remove marked-content watermarks

After confirmation in the previous step, we "edit" the source and remove all such definitions. Because of source standardization, we can rely on a predictable layout. Every watermark in your example looks like this:

    q
    /Artifact <</Subtype/Watermark/Type/Pagination>> BDC
    .573 .816 .314 rg
    /Fm1 Do
    Q
    EMC

"Fm1" is the first of those 10 Chinese characters in the green diagonal text. The green color is coded as ".573 .816 .314 rg". Remove these definitions as follows:

    while True:
        i1 = cont.find(b"/Artifact")  # start of definition
        if i1 < 0:
            break  # none left: done
        i2 = cont.find(b"EMC", i1)  # end of definition
        cont[i1 - 2 : i2 + 3] = b""  # remove the full definition source "q ... EMC"
    doc.update_stream(xref, cont)  # replace the original source
    doc.ez_save("x.pdf")  # save to new file
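
As a rough sketch of the removal step in isolation, the byte-level edit can be written as a pure function that works on the content stream alone, so it can be tried without opening a PDF. The function name `strip_watermark_blocks` is hypothetical and not part of PyMuPDF. It assumes the standardized layout shown above, where each watermark starts with a "q" line directly before "/Artifact" and ends with "EMC". Unlike the loop above, it skips artifacts that are not watermarks or that lack the leading "q", which is an extra precaution of this sketch.

```python
from typing import Tuple


def strip_watermark_blocks(stream: bytes) -> Tuple[bytes, int]:
    """Remove 'q /Artifact <<.../Subtype/Watermark...>> BDC ... EMC' blocks.

    Hypothetical helper; assumes the standardized layout described above.
    Returns the edited stream and the number of blocks removed.
    """
    cont = bytearray(stream)
    removed = 0
    pos = 0
    while True:
        i1 = cont.find(b"/Artifact", pos)  # start of next artifact
        if i1 < 0:
            break  # none left: done
        i2 = cont.find(b"EMC", i1)  # end of this artifact
        if i2 < 0:
            break  # unterminated block: leave the rest untouched
        is_watermark = b"/Subtype/Watermark" in cont[i1:i2]
        has_leading_q = i1 >= 2 and cont[i1 - 2 : i1] == b"q\n"
        if is_watermark and has_leading_q:
            cont[i1 - 2 : i2 + 3] = b""  # drop "q ... EMC" as one unit
            removed += 1
            pos = i1 - 2  # continue at the cut position
        else:
            pos = i2 + 3  # some other artifact: skip over it
    return bytes(cont), removed
```

With PyMuPDF, the input would be the result of `page.read_contents()` after `page.clean_contents()`, and the returned bytes would be written back with `doc.update_stream(xref, ...)` as in the steps above. Note that a plain search for "EMC" could in principle match inside a text string, so this remains a sketch for files with the layout shown here.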
-
This new example is indeed not a watermark at all. It technically is so-called "line art": elementary drawings of lines and curves forming Chinese characters. Your previous example also had these things, but there the drawings were coded inside separate PDF objects (Form XObjects) and then referenced by the … Here, the drawings are made directly on the page. You can extract them (via …
-
@Jason-XII Could you share your complete code to remove a watermark?
The watermarks in your example file are stored as so-called marked-content /Artifacts. There is no direct, dedicated high-level function in PyMuPDF to deal with these object types. But you can use PyMuPDF's low-level interface to locate and remove them if you follow a strict procedure.

1. Determine presence of marked-content watermarks

First standardize the page's /Contents objects. This will produce a predictable source code structure and also repair any potential issues. Only one such object will be left over. Then confirm the presence of this watermark type.
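
The presence check can also be sketched on the raw content bytes alone. The helper name `has_marked_content_watermark` is hypothetical, not a PyMuPDF function, and the tolerance for optional whitespace inside the dictionary is an assumption of this sketch; the standardized source in this thread has none.

```python
import re

# Matches "/Artifact << ... /Subtype /Watermark", allowing optional whitespace.
# Assumption: the artifact's property dictionary contains no nested ">" characters.
_WATERMARK_RE = re.compile(rb"/Artifact\s*<<[^>]*/Subtype\s*/Watermark")


def has_marked_content_watermark(stream: bytes) -> bool:
    """Return True if the content stream holds a marked-content watermark."""
    return _WATERMARK_RE.search(stream) is not None
```

The argument would be the bytes returned by `page.read_contents()` after the page's /Contents objects have been standardized.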