Reproducible "builds" #1040

hudvin · 2021-04-30T13:21:32Z

I am trying to get some sort of reproducible builds. I have a pipeline that processes PDF documents: it splits each document into separate pages, processes them one by one, and merges everything back into a single document.
But if I send the same file through the pipeline several times, I get output files with different hashes.

Some observations:

1. The ID in the trailer section is different.
2. Probably a timestamp in the metadata.
3. Random suffixes in FormXob sections.

Is there any way to get exactly the same files after each run?

JorjMcKie · 2021-04-30T14:34:30Z

Maintainer

First of all, it's not a bug, but a feature.
I am going to convert your post to a Discussions item.
By PDF specification, the second array entry of the trailer /ID key must be updated on every save. MuPDF uses a random number generator to compute both /ID items when saving to a new file. For incremental saves, just the second item is updated.
Observations 2. & 3. are probably caused by your own app.
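
To see the /ID behaviour for yourself, a minimal sketch like the following can pull the two entries out of the raw bytes of two output files so they can be compared. The helper name `extract_trailer_ids` is hypothetical, and the pattern assumes the IDs are written as hex strings (`<...>`); files that store them as literal strings in parentheses would need a different pattern.

```python
import re

# Assumption: both /ID entries are hex strings, e.g. /ID[<0A1B...><2C3D...>]
_ID_PATTERN = re.compile(
    rb"/ID\s*\[\s*<([0-9A-Fa-f]*)>\s*<([0-9A-Fa-f]*)>\s*\]"
)


def extract_trailer_ids(pdf_bytes):
    """Return (first_id, second_id) as hex text from the last /ID array, or None."""
    matches = _ID_PATTERN.findall(pdf_bytes)
    if not matches:
        return None
    # An incrementally saved file can contain several trailers; use the last one.
    first, second = matches[-1]
    return first.decode("ascii"), second.decode("ascii")
```

Running this on the outputs of two pipeline runs should show whether only the /ID entries differ or whether something else is changing as well.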

> Is there any way to get exactly the same files after each run?

The short answer is "no".
But you can approximate an equality check along these lines:

    # old_output = previous document version
    # new_output = current document
    assert old_output.xref_length() == new_output.xref_length()  # same number of PDF objects
    for xref in range(1, old_output.xref_length()):  # same object definition for each xref
        assert old_output.xref_object(xref, compressed=True) == new_output.xref_object(
            xref, compressed=True
        )
    assert old_output.xref_get_keys(-1) == new_output.xref_get_keys(-1)  # same PDF keys in trailer

If all of the above assertions pass, there is an overwhelming probability that it is the same file.

JorjMcKie · 2021-04-30T14:59:40Z

Maintainer

Another option might be to locate the start of the PDF trailer in old and new versions (via Python .find()) and compute hashes up to these positions in the files.
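
A rough sketch of that idea, using only the standard library. The function name `hash_before_trailer` is hypothetical, and it uses `.rfind()` rather than `.find()` so that the last trailer is taken if the file contains several. It assumes the file has a classic `trailer` keyword; files written with a cross-reference stream do not have one, and the sketch raises an error for those.

```python
import hashlib


def hash_before_trailer(pdf_bytes: bytes) -> str:
    """Return the SHA-256 hex digest of everything before the last 'trailer' keyword."""
    pos = pdf_bytes.rfind(b"trailer")
    if pos == -1:
        # No classic trailer found, e.g. the file uses a cross-reference stream.
        raise ValueError("no 'trailer' keyword found in the PDF data")
    return hashlib.sha256(pdf_bytes[:pos]).hexdigest()
```

Two outputs whose bytes differ only inside the trailer (for example in /ID) would then produce the same digest:

```python
with open("old.pdf", "rb") as f_old, open("new.pdf", "rb") as f_new:  # example paths
    same = hash_before_trailer(f_old.read()) == hash_before_trailer(f_new.read())
```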

hudvin · 2021-04-30T17:59:55Z

Author

I managed to fix the issue with the ID in the trailer. But in some documents I also have structures like
/FormXob.541847dbcfabf322fd53ec1ad48a68ca Do
and it looks like the part after the dot is also random: after two runs I have two documents with different /FormXob.<some_id> names.
I will check everything step by step to provide additional details.
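
Until the source of those names is found, one possible workaround for comparisons is to rewrite the random names to stable placeholders before hashing. This is only a sketch: `normalize_formxob_names` is a hypothetical helper, the pattern assumes the names always look like `/FormXob.` followed by hex digits as in the example above, and it only helps when the names are visible in the bytes being compared (content streams are often compressed).

```python
import re

# Assumption: the random names are "/FormXob." followed by hex digits.
_FORMXOB = re.compile(rb"/FormXob\.[0-9a-fA-F]+")


def normalize_formxob_names(data: bytes) -> bytes:
    """Replace each distinct /FormXob.<hex> name with a numbered placeholder.

    Names are numbered in order of first appearance, so two runs that use
    different random suffixes in the same positions normalize identically.
    """
    seen = {}

    def repl(match):
        name = match.group(0)
        if name not in seen:
            seen[name] = b"/FormXob.%d" % len(seen)
        return seen[name]

    return _FORMXOB.sub(repl, data)
```

Applying the function to the uncompressed content of both outputs and comparing the results ignores the suffix values while still catching any other difference.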

JorjMcKie · 2021-04-30T19:09:27Z

Maintainer

> /FormXob.541847dbcfabf322fd53ec1ad48a68ca Do

This is not done in PyMuPDF code. I remember I did that sort of thing years ago, but not in any recent version.

burakcank · 2024-09-16T19:56:30Z

Hopefully it's OK to bump this old thread, but it looks like we are having the same issue.

I started using PyMuPDF for sorting PDFs and wrote a unit test to check that the output matches my expected page order. I realized that the library generates different IDs on each save, so I had to strip them from my test files via:

    import re

    import pytest

    # sort_pdf_by_postcode is the function under test; import it from your own module.

    # Matches the /ID array in the raw PDF bytes, e.g. /ID[<...><...>]
    ID_PATTERN = re.compile(rb"/ID\s*\[.*?\]", re.DOTALL)


    @pytest.mark.parametrize(
        "test_file,sorted_test_file,test_file_type",
        [
            ("example1", "example1_sorted", 1),
        ],
    )
    def test_sort_pdf_by_postcode(test_file, sorted_test_file, test_file_type):
        # Use distinct names for the file handles so the parameters are not shadowed.
        with (
            open(f"tests/resources/{test_file}.pdf", "rb") as input_pdf,
            open(f"tests/resources/{sorted_test_file}.pdf", "rb") as expected_pdf,
        ):
            sorted_pdf = sort_pdf_by_postcode(input_pdf, test_file_type)

            # PyMuPDF adds an ID to the PDF file, which is random and changes every time the file is saved.
            # Remove that and compare the files as bytes (not as the str() repr of bytes).
            actual = ID_PATTERN.sub(b"", sorted_pdf.read())
            expected = ID_PATTERN.sub(b"", expected_pdf.read())

            assert actual == expected

2 replies

JorjMcKie Sep 17, 2024
Maintainer

You can prevent creation of new IDs via a Document.save() option! Why don't you use that instead?
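
For readers landing here: the option is not named in this reply, but it is presumably the `no_new_id` parameter of `Document.save()`. A minimal sketch under that assumption follows. The helper name `save_keeping_id` is hypothetical, and the import falls back to `fitz` because older PyMuPDF releases are not importable as `pymupdf`.

```python
def save_keeping_id(src_path, dst_path):
    """Open src_path and save it to dst_path without generating a new /ID."""
    try:
        import pymupdf  # name used by recent PyMuPDF releases
    except ImportError:
        import fitz as pymupdf  # legacy import name

    with pymupdf.open(src_path) as doc:
        # no_new_id=True asks PyMuPDF not to update (or create) the trailer /ID.
        doc.save(dst_path, no_new_id=True)
```

This removes one source of differences between runs. Whether the resulting files are byte-identical still depends on everything else the pipeline writes, such as timestamps in the metadata.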

burakcank Sep 17, 2024

Well, today I learned :)
