-
First of all, it's not a bug, but a feature.
The short answer is "no".

    # old_output = previous document version
    # new_output = current document
    assert old_output.xref_length() == new_output.xref_length()  # same number of PDF objects
    for xref in range(1, old_output.xref_length()):  # same object definition for each xref
        assert old_output.xref_object(xref, compressed=True) == new_output.xref_object(
            xref, compressed=True
        )
    assert old_output.xref_get_keys(-1) == new_output.xref_get_keys(-1)  # same PDF keys in trailer

If all of the above assertions pass, there is an overwhelming probability that it is the same file.
-
Another option might be to locate the start of the PDF trailer in old and new versions (via Python
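A minimal sketch of the trailer idea, under stated assumptions: both files use a classic cross-reference table, so the literal keyword `trailer` appears in the raw bytes (files saved with cross-reference streams do not contain it), and everything before the last `trailer` keyword is what should match. The helper names `body_before_trailer` and `same_up_to_trailer` are hypothetical and not part of PyMuPDF.

```python
def body_before_trailer(pdf_bytes: bytes) -> bytes:
    """Return everything before the last 'trailer' keyword.

    Assumption: a classic xref table is used. If the keyword is
    missing, the whole input is returned unchanged.
    """
    pos = pdf_bytes.rfind(b"trailer")
    return pdf_bytes if pos == -1 else pdf_bytes[:pos]


def same_up_to_trailer(old_bytes: bytes, new_bytes: bytes) -> bool:
    """Compare two PDFs while ignoring the trailer and what follows it."""
    return body_before_trailer(old_bytes) == body_before_trailer(new_bytes)


# Tiny stand-ins for real file contents; only the trailer ID differs.
old = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\ntrailer\n<</ID[<aa><bb>]>>\n%%EOF\n"
new = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\ntrailer\n<</ID[<cc><dd>]>>\n%%EOF\n"
print(same_up_to_trailer(old, new))  # True
```

With real files, the bytes would come from `open(path, "rb").read()`. This only ignores differences located in the trailer; anything that differs earlier in the file still makes the comparison fail.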
-
I managed to fix the issue with the ID in the trailer. But in some documents I also have structures like
-
This is not done in PyMuPDF code. I remember I did that sort of thing years ago, but not in any recent version.
-
Hopefully I'm not bumping this thread too much, but it looks like we are having the same issue. I started using PyMuPDF for sorting PDFs and wrote a unit test to check that the output matches my expected page order. I realized that the library generates a different ID on each save, so I had to remove it from my test files via:

    import re

    import pytest

    # sort_pdf_by_postcode is the function under test, defined elsewhere in my project.


    @pytest.mark.parametrize(
        "test_name,sorted_name,test_file_type",
        [
            ("example1", "example1_sorted", 1),
        ],
    )
    def test_sort_pdf_by_postcode(test_name, sorted_name, test_file_type):
        # Parameter names differ from the file handles so nothing is shadowed.
        with (
            open(f"tests/resources/{test_name}.pdf", "rb") as test_file,
            open(f"tests/resources/{sorted_name}.pdf", "rb") as expected_file,
        ):
            sorted_file = sort_pdf_by_postcode(test_file, test_file_type)
            # PyMuPDF adds an ID to the PDF file, which is random and changes every time the file is saved.
            # Remove that and compare the files.
            sorted_bytes = re.sub(rb"ID\[.*?\]", b"[]", sorted_file.read())
            expected_bytes = re.sub(rb"ID\[.*?\]", b"[]", expected_file.read())
            assert sorted_bytes == expected_bytes
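Since the original question was about file hashes, the same ID-stripping idea can be sketched as a hash helper. This is only an illustration under assumptions: the `/ID` array is stored uncompressed in the trailer, it consists of hex strings (so it contains no `]` before its end), and the pattern does not happen to match inside binary stream data. `hash_without_id` is a hypothetical name, not a PyMuPDF function.

```python
import hashlib
import re

# Matches "/ID[...]" up to the first closing bracket (non-greedy).
ID_PATTERN = re.compile(rb"/ID\s*\[.*?\]", re.DOTALL)


def hash_without_id(pdf_bytes: bytes) -> str:
    """Return a SHA-256 hex digest of the PDF bytes with the /ID entry removed."""
    normalized = ID_PATTERN.sub(b"", pdf_bytes)
    return hashlib.sha256(normalized).hexdigest()


# Stand-ins for two saves of the same document with different IDs.
first = b"trailer\n<</Size 5/ID[<aa11><bb22>]>>\n%%EOF\n"
second = b"trailer\n<</Size 5/ID[<cc33><dd44>]>>\n%%EOF\n"
print(hash_without_id(first) == hash_without_id(second))  # True
```

Two saves that differ only in the trailer ID then produce the same digest, while any other difference in the bytes still changes it.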
-
I am trying to get some sort of reproducible builds. I have a pipeline to process PDF documents (split them into separate pages, process the pages one by one, merge everything back into a single document).
But if I send the same file through the pipeline several times, I get output files with different hashes.
Some observations:
Is there any way to get exactly the same file after each run?