metadata #6

sandro-pasquali · 2024-09-12T12:51:29Z

Hi, great library!

Is there a way to configure PDF reading behavior? I'd like to be able to get metadata on a file -- number of chapters, whether a page is an image or not, for instance -- prior to processing it. And all other metadata possible around a PDF.

And generally I'd like to determine pages, and go page by page with custom processing for each. Is there maybe an interface to your internal PDF reader that can be exposed?

Thanks!

Balearica · 2024-09-13T02:48:02Z

> And generally I'd like to determine pages, and go page by page with custom processing for each. Is there maybe an interface to your internal PDF reader that can be exposed?

Can you provide more detail regarding your use-case? It is probably possible to provide more control over PDF text extraction, but I'm not sure what you mean by custom processing for each page. There is no pre-existing generic JavaScript interface for the PDF reader--the PDF reader build is specific to this project, so adding new features would require making changes.

> I'd like to be able to get metadata on a file -- number of chapters, whether a page is an image or not, for instance -- prior to processing it. And all other metadata possible around a PDF.

If "whether the page is an image" refers to our categorization of "text native" and "image native" PDFs, this is not a metadata field, but rather something that is determined after reading the text content of the document. These categories are more nuanced than simply looking for whether images or text exist. For example, PDFs that contain no images may be categorized as "image native" if the document contains no valid encoding to map between glyphs and characters (so text cannot be extracted directly), which is surprisingly common.
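To make the distinction concrete, here is a minimal sketch of a classification that runs after text extraction rather than reading a metadata field. It is not this project's actual logic: the `pages`/`glyphs`/`unicode` data shape, the `classifyPdf` name, and the 0.5 threshold are all assumptions made for illustration.

```javascript
// Illustrative only: NOT the project's real implementation.
// Assumed (hypothetical) input shape: [{ glyphs: [{ unicode: string | null }, ...] }, ...]
// where `unicode` is null/empty when the PDF gives no usable glyph-to-character mapping.
function classifyPdf(pages, minMappedFraction = 0.5) {
  let total = 0;
  let mapped = 0;
  for (const page of pages) {
    for (const glyph of page.glyphs) {
      total++;
      const u = glyph.unicode;
      // Count the glyph as extractable only if it maps to a real character
      // (U+FFFD is the Unicode replacement character).
      if (typeof u === "string" && u !== "" && u !== "\uFFFD") mapped++;
    }
  }
  // No glyphs at all: nothing can be extracted directly.
  if (total === 0) return "image native";
  // The threshold is arbitrary and chosen only for this sketch.
  return mapped / total >= minMappedFraction ? "text native" : "image native";
}
```

Under these assumptions, a document made up entirely of vector text with no images would still come out as "image native" if none of its glyphs carry a usable character mapping, which matches the case described above.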

sandro-pasquali · 2024-09-16T13:00:45Z

Thanks. The use case is really the ability to 1) know, prior to processing, how many pages will be processed, and 2) process pages one by one, something like for (const page of pages) { scribe(page) }.

I would like to be able to time the processing per page (e.g. for logging), and I want to know whether a page is an image, since images take much longer to process. So what I'm aiming for is essentially the ability to parse page by page in a controlled way.
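A rough sketch of the kind of loop described here, in case it helps pin down the request. Both `processPagesWithTiming` and the `processPage` callback are hypothetical: `processPage` stands in for whatever per-page entry point the library might expose, and nothing below is an existing API of this project.

```javascript
// Hypothetical sketch: `processPage(pageIndex)` is a placeholder for a
// per-page processing call; it is not an existing function of the library.
async function processPagesWithTiming(pageCount, processPage, log = console.log) {
  const timings = [];
  for (let i = 0; i < pageCount; i++) {
    const start = performance.now();
    // Pages are processed strictly one at a time, in order.
    const result = await processPage(i);
    const ms = performance.now() - start;
    timings.push({ page: i, ms, result });
    log(`page ${i + 1}/${pageCount}: ${ms.toFixed(1)} ms`);
  }
  return timings;
}
```

This assumes the page count is available before processing starts (point 1 above) and that each page can be processed independently (point 2).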

Thanks again.
