Find and remove watermarks in PDF file #1855

Jason-XII · 2022-08-04T01:29:06Z

I am currently trying to use PyMuPDF to remove watermarks from PDF files. For example, I have a file like this:
The PDF File Link or This Link if the previous link doesn't work
The green shape in the center of the page is the watermark. It is not stored in the PDF as text, because I can't find it by searching in the Edge browser (which can read PDF files). It does not seem to be stored as an image either: I extracted all images from the PDF using PyMuPDF, and the watermark (which should appear on every page) was not among them.

The code I used for extracting is like this:

        document = fitz.open(self.input)
        for each_page in document:
            image_list = each_page.getImageList()
            for image_info in image_list:
                pix = fitz.Pixmap(document, image_info[0])
                png = pix.tobytes()  # return picture in png format
                if png == watermark_image:
                    document._deleteObject(image_info[0])
        document.save(out_filename)

Also I tried to check the page's other attributes: no annotations, no links, no widgets. I have no idea how the mark is stored.
So how do I find and remove the watermark using PyMuPDF?

JorjMcKie · 2022-08-04T10:14:52Z

Maintainer

The watermarks in your example file are stored as so-called marked-content /Artifacts.
There is no direct, dedicated high-level function in PyMuPDF to deal with these object types.
But you can use PyMuPDF's low-level interface to locate and remove them if you follow a strict procedure.

1. Determine presence of marked-content watermarks

First standardize the page's /Contents objects. This will produce a predictable source code structure and also repair any potential issues. Afterwards, only one such object will remain.
Then confirm the presence of this watermark type.

page.clean_contents()
xref = page.get_contents()[0]  # get xref of resulting /Contents object
cont = bytearray(page.read_contents())  # read the contents source as a (modifiable) bytearray
if cont.find(b"/Subtype/Watermark") >= 0:  # confirms a marked-content watermark is present
    print("marked-content watermark present")
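Other files may tag their artifacts differently, so it can help to look at the actual tag lines before deciding what to remove. The following is only a minimal sketch: `artifact_tags` is a hypothetical helper (not part of PyMuPDF) that works on the bytes returned by `page.read_contents()` and assumes the stream was already standardized with `page.clean_contents()`, so that each operator sits on its own line.

```python
def artifact_tags(cont: bytes) -> list[bytes]:
    """Return every '/Artifact ... BDC' line of a cleaned content stream."""
    return [
        line.strip()
        for line in cont.splitlines()
        if line.lstrip().startswith(b"/Artifact") and line.rstrip().endswith(b"BDC")
    ]


# Sample bytes shaped like the watermark definition quoted in this thread.
sample = (
    b"q\n"
    b"/Artifact <</Subtype/Watermark/Type/Pagination>> BDC\n"
    b".573 .816 .314 rg\n"
    b"/Fm1 Do\n"
    b"Q\n"
    b"EMC\n"
)

for tag in artifact_tags(sample):
    print(tag.decode("latin-1"))
```

With a real document, the call would be `artifact_tags(page.read_contents())` after `page.clean_contents()`.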

2. Remove marked-content watermarks

After confirmation in the previous step, we "edit" the source and remove all such definitions. Because of source standardization, we can rely on a predictable layout. Every watermark in your example looks like this:

q
/Artifact <</Subtype/Watermark/Type/Pagination>> BDC
.573 .816 .314 rg
/Fm1 Do
Q
EMC

"Fm1" is the first of those 10 Chinese characters in the green diagonal text. The green color is coded as .573 .816 .314 rg.
You can use the following algorithm to remove each of these characters:

while True:
    i1 = cont.find(b"/Artifact")  # start of definition
    if i1 < 0:
        break  # none left: done
    i2 = cont.find(b"EMC", i1)  # end of definition
    if i2 < 0:
        break  # no closing "EMC" found: stop instead of looping forever
    cont[i1 - 2 : i2 + 3] = b""  # remove the full definition source "q ... EMC"
doc.update_stream(xref, cont)  # replace the original source
doc.ez_save("x.pdf")  # save to new file
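The same idea can be sketched for a whole document, restricted to artifacts whose tag mentions /Watermark so that other pagination artifacts (headers, footers) are left alone. This is an illustration under assumptions, not a definitive implementation: `strip_watermark_artifacts` and `remove_watermarks` are hypothetical names, the stream is assumed to be standardized by `page.clean_contents()`, and nested marked-content sections are not handled. The byte-level helper has no PyMuPDF dependency, so it can be tried on sample data.

```python
def strip_watermark_artifacts(cont: bytes) -> tuple[bytes, int]:
    """Drop '/Artifact <<.../Watermark...>> BDC ... EMC' sections.

    Returns the edited stream and the number of sections removed.
    """
    lines = cont.splitlines()
    out = []
    removed = 0
    i = 0
    while i < len(lines):
        line = lines[i]
        if (line.startswith(b"/Artifact") and b"/Watermark" in line
                and line.endswith(b"BDC") and b"EMC" in lines[i:]):
            j = lines.index(b"EMC", i)  # line that closes the section
            inner = lines[i + 1:j]
            # In the layout shown above, "q" precedes BDC and "Q" precedes EMC.
            # Remove that leading "q" too, so q/Q stay balanced.
            if inner.count(b"Q") - inner.count(b"q") == 1 and out and out[-1] == b"q":
                out.pop()
            removed += 1
            i = j + 1
            continue
        out.append(line)
        i += 1
    return b"\n".join(out), removed


def remove_watermarks(in_path: str, out_path: str) -> int:
    """Apply the helper to every page; returns the number of removed sections."""
    import fitz  # PyMuPDF, imported here so the helper above stays dependency-free

    doc = fitz.open(in_path)
    total = 0
    for page in doc:
        page.clean_contents()  # standardize the page's /Contents
        xrefs = page.get_contents()
        if not xrefs:  # page without any content stream
            continue
        new_cont, n = strip_watermark_artifacts(page.read_contents())
        if n:
            doc.update_stream(xrefs[0], new_cont)
            total += n
    doc.ez_save(out_path)
    return total
```

A call such as `remove_watermarks("input.pdf", "x.pdf")` would then process all pages; both file names are placeholders.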

9 replies

JorjMcKie Sep 26, 2023
Maintainer

@MoritzImendoerffer can you give an idea of what that bytestring looks like in your case please?

val-fatale Sep 30, 2023

@JorjMcKie this saved my bacon, thank you so much for the detailed answer.

MoritzImendoerffer Oct 22, 2023

Hi @JorjMcKie ,

thank you for the quick response. I was able to solve it by detecting rotated text with a certain threshold for rotation and font size. I do detect rotation of each line like so:

import math


def pymupdf_infer_angles(block):
    """
    Infer the rotation angle of a text block using the positions of its lines.
    """
    angles = []
    if block["type"] == 0:  # text block
        # For each line in the block
        for line in block["lines"]:
            x1, y1, x2, y2 = line["bbox"]
            # angle of the bbox diagonal, in degrees
            angle = math.atan2(y2 - y1, x2 - x1) * 180 / math.pi
            angles.append(angle)
    else:  # image block: it has no text lines
        angles.append(None)

    return angles

I obtain the blocks like so: page.get_text("dict", sort=sort_text)["blocks"]. Additionally, I detect titles based on font sizes and convert them to markdown by counting all font sizes and using a rule-based approach. While doing so, it occurred to me that it might be worthwhile to train a classification model on the output of page.get_text("dict", sort=sort_text)["blocks"] for all pages. I am pretty sure I am not the first one to try that. Do you think that might be a good idea?
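The font-size counting mentioned here could look roughly like the sketch below. It is an assumption about the approach, not the code actually used: `font_size_histogram` and `heading_sizes` are hypothetical helpers, and the rule "anything larger than the most common size is a heading candidate" is only one possible heuristic. The input is the "blocks" list of a "dict" extraction, where each span carries "size" and "text" keys.

```python
from collections import Counter


def font_size_histogram(blocks):
    """Count characters per (rounded) font size over "dict" text blocks."""
    sizes = Counter()
    for block in blocks:
        if block.get("type") != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                sizes[round(span["size"], 1)] += len(span["text"])
    return sizes


def heading_sizes(sizes):
    """Sizes larger than the most common one (assumed to be body text)."""
    if not sizes:
        return []
    body = sizes.most_common(1)[0][0]
    return sorted((s for s in sizes if s > body), reverse=True)


# Hand-made sample in the shape of page.get_text("dict")["blocks"].
sample_blocks = [
    {"type": 0, "lines": [
        {"spans": [{"size": 18.0, "text": "Title"}]},
        {"spans": [{"size": 10.0, "text": "Body text here"}]},
    ]},
    {"type": 1},  # image block
]
print(heading_sizes(font_size_histogram(sample_blocks)))
```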

JorjMcKie Oct 22, 2023
Maintainer

Yes, sure, there are AI / ML models out there being trained to do this.
IMHO, however, using them is a waste of resources if bread-and-butter tools have not been exploited yet.
I may look at this from a more conservative point of view: I detest having to use regular expressions when simple Python string manipulation hasn't even been looked at.

Anyway:

A page contains its rotation value in page.rotation.
A line within the stacked dictionaries of "dict"/"rawdict" text extractions has the "dir" key: line["dir"] = (cos, sin), a tuple of floats containing the cosine and sine of the angle between the line and the x-axis. The angle can be computed fairly easily from these, for example with math.acos and math.asin.
Whenever line["dir"] != (1, 0) (= horizontal / parallel to x-axis), the line's bbox should be regarded as wrapping the quadrilateral within which the line's text occurs (and no longer directly the text itself). Recovering that quad is especially essential if text marker annotations are to be used, because these need to know what is top, bottom, left and right of the text. Otherwise they will look awkward or wrong. For recovering the quad, use the fitz.recover_quad() and fitz.recover_line_quad() functions.
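As a rough sketch of the angle computation described above, the writing direction can be turned into degrees with `math.atan2`, which takes both components at once and therefore avoids combining `math.acos` and `math.asin` by hand. `line_angle` and `rotated_lines` are hypothetical helper names, the tolerance is an arbitrary choice, and the sign of the result depends on the coordinate system (y grows downwards in PyMuPDF page coordinates), so treat the sign as something to verify on a real file.

```python
import math


def line_angle(direction):
    """Angle in degrees for a (cos, sin) writing direction."""
    cos, sin = direction
    return math.degrees(math.atan2(sin, cos))


def rotated_lines(blocks, tol=1e-3):
    """Collect (angle, bbox) for every text line that is not horizontal."""
    hits = []
    for block in blocks:
        if block.get("type") != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            cos, sin = line["dir"]
            if abs(cos - 1.0) > tol or abs(sin) > tol:
                hits.append((line_angle((cos, sin)), line["bbox"]))
    return hits


# Hand-made sample in the shape of page.get_text("dict")["blocks"].
sample_blocks = [
    {"type": 0, "lines": [
        {"dir": (1.0, 0.0), "bbox": (72, 72, 300, 86)},
        {"dir": (0.7071, -0.7071), "bbox": (100, 200, 400, 500)},
    ]},
]
for angle, bbox in rotated_lines(sample_blocks):
    print(round(angle, 1), bbox)
```

On a real page, `rotated_lines(page.get_text("dict")["blocks"])` would be the corresponding call.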

dezoito Feb 5, 2024

@MoritzImendoerffer ,

Were you able to remove diagonal text using the technique you described above?
I can use it to find blocks that are angled (say, near 45 degrees).

If I use the code below, however, a lot of the text gets deleted, and not just the diagonal watermarks:

    blocks = page.get_text("dict")["blocks"]
    for block in blocks:
        angles = pymupdf_infer_angles(block)
        # blocks can have more than one angle... consider only those
        # with exactly one angle in a range
        if len(angles) == 1:
            print(angles)
            if angles[0] and (30 <= angles[0] <= 60):
                # remove this
                for k, v in block.items():
                    print("----")
                    print(f"{k}: , {v}")
                changed += 1

                page.add_redact_annot(block["bbox"])
        page.apply_redactions()
        ...

I would appreciate any insights you, or anyone else reading this, might have.

Thanks

JorjMcKie · 2022-08-06T06:24:02Z

Maintainer

This new example indeed is no watermark at all. It technically is so-called "line art": elementary drawings of lines and curves forming Chinese letters.

Your previous example also had these things, but there the drawings were coded inside separate PDF objects (Form XObjects) and then referenced by the /Artifact mechanism. My script then removed those references to the Form XObjects.

Here, the drawings are made directly on the page. You can extract them (via page.get_drawings()), but you cannot remove them.
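Locating such line art can still be useful for inspection. The sketch below filters the list returned by `page.get_drawings()` by fill color; each entry is a dictionary whose "fill" value is an RGB tuple of floats or None. `drawings_with_fill` is a hypothetical helper, and the color and tolerance used here are placeholders (the green value is borrowed from the earlier example in this thread, not from the new file). It only finds the drawings and does not remove anything.

```python
def drawings_with_fill(drawings, rgb, tol=0.01):
    """Return the drawings whose fill color matches rgb within a tolerance."""
    hits = []
    for d in drawings:
        fill = d.get("fill")
        if fill is None or len(fill) != len(rgb):  # unfilled or other color space
            continue
        if all(abs(a - b) <= tol for a, b in zip(fill, rgb)):
            hits.append(d)
    return hits


# Hand-made sample in the shape of page.get_drawings() output (only the keys used here).
sample_drawings = [
    {"fill": (0.573, 0.816, 0.314), "rect": (100, 100, 150, 150)},
    {"fill": None, "rect": (0, 0, 10, 10)},
    {"fill": (0.0, 0.0, 0.0), "rect": (20, 20, 30, 30)},
]
green = drawings_with_fill(sample_drawings, (0.573, 0.816, 0.314))
print(len(green), "matching drawing(s)")
```

With a real page, the first argument would be `page.get_drawings()`.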

1 reply

Jason-XII Aug 10, 2022
Author

OK, thank you for your work anyway 😊

sisrfeng · 2023-11-05T11:38:27Z

@Jason-XII Could you share your complete code to remove a watermark?

3 replies

Jason-XII Nov 26, 2023
Author

I think @JorjMcKie has already provided a .py file somewhere, maybe in another issue. I'm sure you can find it if you want, but the code snippet here is good enough for me :)

Jason-XII Dec 1, 2023
Author

I found a copy of that file on my computer. Here's the code:

"""
PyMuPDF demo utility
--------------------
Remove certain types of page watermarks from a PDF

Watermarks typically are used to declare the status of a page, like "DRAFT", "PRELIMINARY", "For Internal Use only", etc.

PDF supports multiple ways of applying watermarks to a page. Among them are
special annotation types and so-called pagination artifacts.

Removal of annotation-based watermarks is no problem with PyMuPDF: just
delete the respective annotation.
Pagination artifacts in contrast require using PyMuPDF's low-level features.

This script reads a PDF and removes any watermark artifacts on its pages,
that depend on images or Form XObjects. This happens by locating and deleting
the "Do" command within the watermark artifact declaration.

Usage: python remove-watermarks.py file.pdf

If watermarks were successfully removed, a new PDF 'file-nowm.pdf' is created
next to the input file; otherwise a message is printed to the console.
"""
import sys
import fitz


def process_page(page):
    """Process one page."""
    doc = page.parent  # the page's owning document
    page.clean_contents()  # clean page painting syntax
    xref = page.get_contents()[0]  # get xref of resulting /Contents
    changed = 0  # this will be returned
    # read sanitized contents, split by line
    cont_lines = page.read_contents().splitlines()
    for i in range(len(cont_lines)):  # iterate over the lines
        line = cont_lines[i]
        if not (line.startswith(b"/Artifact") and b"/Watermark" in line):
            continue  # this was not for us
        # line number i starts the definition, j ends it:
        j = cont_lines.index(b"EMC", i)
        for k in range(i, j):
            # look for image / xobject invocations in this line range
            do_line = cont_lines[k]
            if do_line.endswith(b"Do"):  # this invokes an image / xobject
                cont_lines[k] = b""  # remove / empty this line
                changed += 1
    if changed > 0:  # if we did anything, write back modified /Contents
        doc.update_stream(xref, b"\n".join(cont_lines))
    return changed


if __name__ == "__main__":
    doc = fitz.open(sys.argv[1])
    changed = 0  # indicates successful removals
    for page in doc:
        changed += process_page(page)  # increase number of changes
    if changed > 0:
        x = "s" if doc.page_count > 1 else ""
        print(f"{changed} watermarks have been removed on {doc.page_count} page{x}.")
        doc.ez_save(doc.name.replace(".pdf", "-nowm.pdf"))
    else:
        print("Nothing to change")

Hope this is helpful! :D

marcodkts Aug 22, 2024

Just sharing the result that I got with your help.

In my case, the document has diagonal red text in the middle of the page, which is rendered by a few instructions in the page's content stream. For my needs, I just removed those instructions from the content bytearray.

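import fitz

# pdf_bytes: the PDF file content as bytes (defined elsewhere)
# watermark: list of strings identifying the watermark's instructions (defined elsewhere)
document = fitz.open("pdf", pdf_bytes)
for page in document:
    xref = page.get_contents()[0]  # first /Contents stream (assumes the page has only one)

    decoded_content = page.read_contents().decode("latin-1")

    # keep every line that contains none of the watermark markers
    modified_content = [
        line
        for line in decoded_content.splitlines()
        if not any(wm in line for wm in watermark)
    ]

    document.update_stream(xref, "\n".join(modified_content).encode("latin-1"))
document.save("result.pdf", garbage=4, deflate=True)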

It took me some time to get this working, so I wanted to share it to help any other dev in the same situation.
