-
You must of course OCR the page first. After that, you can extract text (almost) like normal. You still need external information to find the position (wrapping rectangle) of the table. This includes but is certainly not limited to things like:
Be aware that drawings such as lines around individual table cells will not be recognized by (most) OCR programs: they do not exist in the OCR output. I hope my reasoning has become clear. Once you have identified the column and row borders, things are easy: both are lists of float values, which can be used to build rectangles, each representing a table cell. You can extract the text inside a rectangle by using text extraction with the "clip" parameter.
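To make that last step concrete, here is a rough sketch, not a definitive implementation. It assumes you already have the column and row borders as lists of floats (the numbers in the usage note are made up) and that Tesseract is installed so PyMuPDF's `get_textpage_ocr` can run. The helper names `cell_rects`, `table_to_csv` and `ocr_table_to_csv` are hypothetical, chosen only for this illustration.

```python
import csv
import io
from typing import Callable, List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def cell_rects(col_borders: Sequence[float],
               row_borders: Sequence[float]) -> List[List[Rect]]:
    """Turn border coordinates into one rectangle per table cell.

    Neighbouring x values delimit a column, neighbouring y values a row.
    """
    xs = sorted(col_borders)
    ys = sorted(row_borders)
    return [
        [(x0, y0, x1, y1) for x0, x1 in zip(xs, xs[1:])]
        for y0, y1 in zip(ys, ys[1:])
    ]


def table_to_csv(rects: List[List[Rect]],
                 get_cell_text: Callable[[Rect], str]) -> str:
    """Read every cell with get_cell_text and return the table as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rects:
        # Collapse line breaks and repeated spaces inside a cell.
        writer.writerow([" ".join(get_cell_text(r).split()) for r in row])
    return buf.getvalue()


def ocr_table_to_csv(pdf_path: str, page_index: int,
                     col_borders: Sequence[float],
                     row_borders: Sequence[float]) -> str:
    """Hypothetical glue code: OCR one page, then clip the text cell by cell."""
    import fitz  # PyMuPDF, imported here so the helpers above work without it

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_index]
        # OCR the whole page once and reuse the result for every cell.
        textpage = page.get_textpage_ocr(full=True)
        return table_to_csv(
            cell_rects(col_borders, row_borders),
            lambda r: page.get_text("text", clip=fitz.Rect(r), textpage=textpage),
        )
    finally:
        doc.close()
```

A call could look like `ocr_table_to_csv("scan.pdf", 0, [50, 150, 300], [100, 120, 140])`, which would describe a table of two rows and two columns; the file name and coordinates are placeholders. Finding the real border values is still up to you, as described above.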
-
How can the tabular data in the scanned PDF be converted to a csv?