made fixes for really broken but readable files. #32

Open
wants to merge 5 commits into
base: master

@mlaukala mlaukala commented Nov 6, 2017

Can now open and parse files with an incorrect startxref offset and incorrect stream lengths. Note that Adobe Acrobat X 10.0.0 opens these files and prompts to save when they are closed.

When the startxref offset is incorrect, the parser looks for the 'trailer' keyword and uses that trailer. If no trailer is found, the xref table is not rebuilt. When the trailer is found, the parser scans the entire file, records the location of each object, and places a new PdfReference for it inside the PdfCrossReferenceTable.
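
The rebuild described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the code in this PR: the names `find_last_trailer` and `rebuild_xref` are hypothetical, choosing the last 'trailer' in the file is an assumption, and a plain byte scan like this can be fooled by data inside streams that happens to look like an object header.

```python
import re

# Matches "<object number> <generation> obj" at a token boundary.
OBJ_HEADER = re.compile(rb"(?<![0-9A-Za-z])(\d+)\s+(\d+)\s+obj\b")


def find_last_trailer(data: bytes) -> int:
    """Return the byte offset of the last 'trailer' keyword, or -1 if absent.

    Picking the last occurrence is an assumption made for this sketch.
    """
    return data.rfind(b"trailer")


def rebuild_xref(data: bytes) -> dict:
    """Map (object number, generation) -> byte offset of the object header.

    Later definitions overwrite earlier ones, so the last copy in the file wins.
    Returns an empty table when no trailer is found (no trailer, no rebuild).
    """
    if find_last_trailer(data) < 0:
        return {}
    table = {}
    for match in OBJ_HEADER.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        table[key] = match.start(1)
    return table
```

Each entry in the returned table plays the role that a PdfReference in the PdfCrossReferenceTable plays in the PR.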

It also attempts to correct invalid stream lengths. After the stream length is read from the object, we first check whether the 'endstream' keyword is where that length says it should be. If it is not, we look for the next valid 'endstream' keyword after the 'stream' keyword that opens the stream, and use its index to set the stream length.
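
As a rough illustration of that check (again a Python sketch, not the PR's code): `corrected_stream_length` is a hypothetical name, `stream_start` is assumed to be the offset of the first data byte after the opening keyword and its end-of-line, and trimming one end-of-line before 'endstream' is an assumption of the sketch.

```python
def corrected_stream_length(data: bytes, stream_start: int, declared_length: int) -> int:
    """Return the declared length if 'endstream' follows it, else a recomputed one."""

    def at_endstream(pos: int) -> bool:
        # Allow an optional end-of-line between the data and 'endstream'.
        for eol in (b"", b"\r\n", b"\n", b"\r"):
            if data.startswith(eol + b"endstream", pos):
                return True
        return False

    expected_end = stream_start + declared_length
    if declared_length >= 0 and expected_end <= len(data) and at_endstream(expected_end):
        return declared_length  # the declared length is consistent with the file

    # Fall back to the next 'endstream' after the start of the stream data.
    found = data.find(b"endstream", stream_start)
    if found < 0:
        raise ValueError("no 'endstream' keyword after stream start")

    end = found
    # Assumption: drop a single trailing end-of-line (LF, CR, or CRLF).
    if end > stream_start and data[end - 1:end] == b"\n":
        end -= 1
    if end > stream_start and data[end - 1:end] == b"\r":
        end -= 1
    return end - stream_start
```

If the stream data itself contains the bytes 'endstream', this fallback picks the first occurrence and truncates the stream, so the result is best treated as a guess.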

Note 1: PDF files with a compressed trailer object are not handled yet.
Note 2: Not tested with versioned files; these will probably still fail.
Note 3: On an invalid stream length, the search should probably check 1k chunks of data for 'endstream'. Currently it checks within the invalid length and, if 'endstream' is not found there, loads the rest of the file and checks again.
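
The chunked search suggested in Note 3 might look something like this sketch (Python, for illustration only). The function name `find_endstream_chunked` and the use of a seekable file-like object are assumptions; the part that matters is the overlap kept between chunks, so that a keyword split across a chunk boundary is still found.

```python
import io

KEYWORD = b"endstream"


def find_endstream_chunked(stream, start: int, chunk_size: int = 1024) -> int:
    """Return the absolute offset of the next 'endstream' at or after start, or -1."""
    stream.seek(start)
    overlap = len(KEYWORD) - 1
    tail = b""      # last bytes of the previous window, kept to catch split keywords
    offset = start  # absolute offset of the first byte of tail + chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return -1  # reached end of file without finding the keyword
        window = tail + chunk
        index = window.find(KEYWORD)
        if index >= 0:
            return offset + index
        tail = window[-overlap:]
        offset += len(window) - len(tail)
```

For example, `find_endstream_chunked(io.BytesIO(data), stream_start)` reads only as far as the first match instead of loading the rest of the file.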

mlaukala commented Nov 6, 2017

ScanNextToken() will usually crash if it finds itself inside a stream, because of the unexpected characters. Currently, my added function IsValidXref() uses ScanNextToken() to determine whether the next token is an 'xref' token, which in turn determines whether a fix needs to be made. However, if startxref is wrong, _lexer.Position could be inside a stream, which will more than likely cause an exception.
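
One possible way around this, shown only as a sketch and not as what IsValidXref() currently does, is to compare raw bytes at the offset rather than tokenizing, so that arbitrary stream data cannot raise an exception. The name `looks_like_xref` is hypothetical, and the check deliberately rejects anything other than a classic 'xref' table (so a compressed cross-reference object, see Note 1, is reported as invalid).

```python
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def looks_like_xref(data: bytes, offset: int) -> bool:
    """Return True if the 'xref' keyword starts at offset, ignoring leading whitespace."""
    if offset < 0 or offset >= len(data):
        return False
    pos = offset
    # Skip any whitespace bytes before the keyword.
    while pos < len(data) and data[pos] in PDF_WHITESPACE:
        pos += 1
    if not data.startswith(b"xref", pos):
        return False
    after = pos + len(b"xref")
    # The keyword must end at end of data or be followed by whitespace.
    return after == len(data) or data[after] in PDF_WHITESPACE
```

A check like this only decides whether a fix is needed; it does not validate the table entries that follow the keyword.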
