metadata #6

Open
sandro-pasquali opened this issue Sep 12, 2024 · 2 comments

@sandro-pasquali

Hi, great library!

Is there a way to configure PDF reading behavior? I'd like to be able to get metadata on a file -- number of chapters, whether a page is an image or not, for instance -- prior to processing it, along with any other metadata available for a PDF.

More generally, I'd like to determine the pages and go through them one by one, with custom processing for each. Is there maybe an interface to your internal PDF reader that can be exposed?

Thanks!

@Balearica
Contributor

> More generally, I'd like to determine the pages and go through them one by one, with custom processing for each. Is there maybe an interface to your internal PDF reader that can be exposed?

Can you provide more detail regarding your use-case? It is probably possible to provide more control over PDF text extraction, but I'm not sure what you mean by custom processing for each page. There is no pre-existing generic JavaScript interface for the PDF reader--the PDF reader build is specific to this project, so adding new features would require making changes.

> I'd like to be able to get metadata on a file -- number of chapters, whether a page is an image or not, for instance -- prior to processing it, along with any other metadata available for a PDF.

If "whether the page is an image" refers to our categorization of "text native" and "image native" PDFs, this is not a metadata field, but rather something that is determined after reading the text content of the document. These categories are more nuanced than simply looking for whether images or text exist. For example, PDFs that contain no images may be categorized as "image native" if the document contains no valid encoding to map between glyphs and characters (so text cannot be extracted directly), which is surprisingly common.
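
The point that the category depends on extracted text rather than on the presence of images can be illustrated with a small sketch. This is not scribe.js's actual logic: the function name `classifyPdf`, the input shape (`glyphCount`, `mappedCharCount` per page), and the 0.5 threshold are all assumptions invented for illustration.

```javascript
// Illustrative only: NOT the library's real classification code.
// Assumed input: one entry per page, where `glyphCount` is the number of
// glyphs drawn on the page and `mappedCharCount` is how many of them could
// be mapped to valid characters (hypothetical fields).
function classifyPdf(pages, minMappedRatio = 0.5) {
  const totalGlyphs = pages.reduce((sum, p) => sum + p.glyphCount, 0);
  const mappedChars = pages.reduce((sum, p) => sum + p.mappedCharCount, 0);

  // No glyphs at all: nothing to extract, so OCR would be needed.
  if (totalGlyphs === 0) return 'image native';

  // Glyphs exist, but too few map to characters: text cannot be extracted
  // directly, even though the document may contain no images.
  return mappedChars / totalGlyphs >= minMappedRatio
    ? 'text native'
    : 'image native';
}

// A document with plenty of glyphs but no usable glyph-to-character mapping.
console.log(classifyPdf([{ glyphCount: 1200, mappedCharCount: 0 }]));
```

Under these assumptions, the category can only be computed after the text content has been read, which is why it is not available as a metadata field.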

@sandro-pasquali
Author

Thanks. The use case is really having the ability to 1) know, prior to processing, how many pages will be processed, and 2) process pages one by one, something like for (const page of pages) { scribe(page) }.

I would like to be able to time the processing per page (e.g. for logging), and I want to know whether a page is an image or not, as images take a lot more time to process. So that's what I'm aiming for: essentially, the ability to parse page by page in a controlled way.

Thanks again.
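
The per-page loop with timing requested above might look roughly like this. It is a minimal sketch and not an existing scribe.js interface: `processPagesWithTiming`, `processPage`, and `onPageDone` are hypothetical names, and the page count is assumed to be known up front from whatever reader is in use.

```javascript
// Hypothetical helper, not part of scribe.js.
// `processPage(pageIndex)` is any caller-supplied function (sync or async)
// that handles a single page; `onPageDone` is an optional logging callback.
async function processPagesWithTiming(pageCount, processPage, onPageDone = () => {}) {
  const timings = [];
  for (let pageIndex = 0; pageIndex < pageCount; pageIndex++) {
    const start = performance.now();
    // Pages are processed strictly one at a time, in order.
    const result = await processPage(pageIndex);
    const elapsedMs = performance.now() - start;

    const entry = { pageIndex, elapsedMs, result };
    timings.push(entry);
    onPageDone(entry);
  }
  return timings;
}

// Usage with a stand-in for the real per-page work.
const fakeProcessPage = async (pageIndex) => `text of page ${pageIndex + 1}`;
processPagesWithTiming(3, fakeProcessPage, ({ pageIndex, elapsedMs }) => {
  console.log(`page ${pageIndex + 1} took ${elapsedMs.toFixed(1)} ms`);
});
```

Because each page is awaited before the next one starts, the recorded times are per page rather than overlapping, at the cost of giving up any parallelism.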
