scanned pdf #95

88arvin · 2023-02-02T07:34:11Z

How can the tabular data in the scanned PDF be converted to a csv?

JorjMcKie · 2023-02-02T13:12:55Z

Maintainer

You must of course OCR the page first. After that, you can extract text (almost) like normal.
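A minimal sketch of this first step, assuming PyMuPDF is installed and can find a Tesseract installation with its language data. The helper name `ocr_words` and the defaults for `language` and `dpi` are illustrative choices, not something stated in this thread.

```python
def ocr_words(pdf_path, page_number=0, language="eng", dpi=300):
    """OCR one page and return its words as (x0, y0, x1, y1, text, ...) tuples."""
    import fitz  # PyMuPDF; imported lazily so this module loads without it

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_number]
        # full=True OCRs the whole rendered page; this needs Tesseract.
        textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
        # Reuse the OCR textpage for ordinary word extraction.
        return page.get_text("words", textpage=textpage)
    finally:
        doc.close()
```

The word tuples carry coordinates, which is what the position-finding steps depend on.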

You still need external information to find the position (wrapping rectangle) of the table. This includes, but is certainly not limited to, things like:

You may know that the table starts below some specific text string and ends before some other specific text string.
You may know that no text exists to the left or right of the table.
...
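The first hint above can be sketched as follows. This is an illustration only: it assumes the anchors are single words and that `words` holds tuples shaped like the items returned by PyMuPDF's `get_text("words")`, that is `(x0, y0, x1, y1, text, ...)`. The function name `table_rect` is hypothetical.

```python
def table_rect(words, start_word, end_word):
    """Return (x0, y0, x1, y1) enclosing the words between two anchor words."""
    # Bottom edge of the anchor that precedes the table.
    top = next((w[3] for w in words if w[4] == start_word), None)
    if top is None:
        raise ValueError("start anchor not found")
    # Top edge of the first matching anchor below the table start.
    bottom = next(
        (w[1] for w in words if w[4] == end_word and w[1] >= top), None
    )
    if bottom is None:
        raise ValueError("end anchor not found below the start anchor")
    # Words lying completely between the two anchors.
    inside = [w for w in words if w[1] >= top and w[3] <= bottom]
    if not inside:
        raise ValueError("no words between the anchors")
    return (min(w[0] for w in inside), top, max(w[2] for w in inside), bottom)
```

The left and right edges are simply the extremes of the enclosed words, which relies on the second hint (no other text beside the table).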

Be aware that any drawings, like lines around single table cells, will not be recognized by (most of) the OCR programs - they don't exist in the OCR output.
So you must find a way to identify column borders yourself. This is not trivial, especially if cell content is not aligned in a particular way.
All this becomes much easier if you can provide "external" knowledge to the script you must write; something like "column 3 is text, left-aligned; column 4 is numeric, right-aligned" and so on.

I hope my reasoning has become clear.
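One possible heuristic for the column borders, shown only as a sketch: treat any horizontal gap wider than `min_gap` between the words of the table as a column separator. This assumes columns are separated by consistent whitespace, which the discussion above notes is not always the case. The name `column_borders` and the default `min_gap` are made up for this example.

```python
def column_borders(words, min_gap=10.0):
    """Guess vertical column borders from horizontal gaps between words.

    `words` are (x0, y0, x1, y1, text, ...) tuples from inside the table.
    """
    intervals = sorted((w[0], w[2]) for w in words)
    if not intervals:
        return []
    merged = [list(intervals[0])]
    for x0, x1 in intervals[1:]:
        if x0 - merged[-1][1] < min_gap:
            # Close to (or overlapping) the current span: same column.
            merged[-1][1] = max(merged[-1][1], x1)
        else:
            # A wide empty gap: a new column starts here.
            merged.append([x0, x1])
    borders = [merged[0][0]]
    for left, right in zip(merged, merged[1:]):
        borders.append((left[1] + right[0]) / 2)  # middle of the gap
    borders.append(merged[-1][1])
    return borders
```

Row borders could be guessed the same way from the vertical extents `(y0, y1)` of the words.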

Once you have identified column and row borders, things are easy: both are lists of float values, which can be used to make rectangles, each representing a table cell. You can extract the text inside a rectangle by using text extraction with the "clip" parameter.
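A sketch of this last step, under some assumptions: `page` is a PyMuPDF page and `textpage` its OCR textpage, both supplied by the caller. To read a cell from an existing textpage, the sketch calls `Page.get_textbox(rect, textpage=...)`, which is a substitute chosen here for the "clip" parameter mentioned above. The helper names are hypothetical.

```python
import csv
import io


def cell_rects(col_borders, row_borders):
    """Build one (x0, y0, x1, y1) rectangle per cell, row by row."""
    return [
        [(x0, y0, x1, y1) for x0, x1 in zip(col_borders, col_borders[1:])]
        for y0, y1 in zip(row_borders, row_borders[1:])
    ]


def rows_to_csv(rows):
    """Serialize a list of rows (lists of strings) to CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def table_to_csv(page, textpage, col_borders, row_borders):
    """Read the text of every cell rectangle and return the table as CSV."""
    rows = [
        [page.get_textbox(rect, textpage=textpage).strip() for rect in row]
        for row in cell_rects(col_borders, row_borders)
    ]
    return rows_to_csv(rows)
```

With n column borders and m row borders this yields (m - 1) rows of (n - 1) cells each.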

9 replies

chanpreet90 Feb 4, 2023

Still getting the same error :(

JorjMcKie Feb 4, 2023
Maintainer

@chanpreet90 let me have the exception output please

chanpreet90 Feb 4, 2023

JorjMcKie Feb 4, 2023
Maintainer

You did not show the complete output (the right-hand side is missing), but apparently the package paddleocr uses PyMuPDF and contains the deleted attribute name.

Confirmed:
This package uses PyMuPDF and contains wrong dependency information in its requirements text file: PyMuPDF<1.21.0. It should be <1.20.0.

So this is my recommendation:

Submit a bug issue to the package.
Either downgrade your PyMuPDF to the latest 1.19.* release, or go into your PyMuPDF installation folder, open __init__.py and replace lines 481/482 with just one line: restore_aliases().

This will cause PyMuPDF to accept the old names again.
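To see which situation applies, the installed version can be compared with the threshold named above. This is only a sketch: the `(1, 20, 0)` boundary is taken from this thread, and the function names are invented for the example.

```python
import re


def parse_version(text):
    """Turn a version string such as '1.19.6' into a tuple of up to 3 ints."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:3])


def below_threshold(version_text, threshold=(1, 20, 0)):
    """True if the version is lower than the threshold from this thread."""
    return parse_version(version_text) < threshold


def installed_pymupdf_below_threshold():
    """Check the locally installed PyMuPDF (needs Python 3.8+)."""
    from importlib.metadata import version

    return below_threshold(version("PyMuPDF"))
```

If `installed_pymupdf_below_threshold()` returns False, the recommendation above applies.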

88arvin Feb 6, 2023
Author

Now, it's functional. Much appreciation
