find_tables() 'cells' attributes format #3629

isaac-peter · 2024-06-26T18:18:25Z

Hello,

The extract() method of a table returned by find_tables() seems to use the following format: each element is a table row, and each subelement is a column in that row.

The cells attribute, on the other hand, produces a flat list where each element is a cell, given as its bbox.

How do I relate a given element in the cells attribute back to the corresponding cell in the extract() output? Put differently, how do I find the bbox of a given cell from extract()?

I don't believe the "Page" doc describes the structure of the extract() output or of the cells attribute.

Thanks!

Answered by JorjMcKie

isaac-peter · 2024-06-26T18:21:06Z

Author

After reviewing the docs again, I'm thinking that maybe I can use the rows attribute to get the cell bboxes, and then relate these back to the extract() structure I described above?
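
That idea can be sketched as a reverse lookup: take one bbox from the flat cells list, find its [row][col] position by scanning rows, and use the same indices on the extract() output. Note that locate_cell is a hypothetical helper, not a PyMuPDF function, and the stand-in table with made-up coordinates only exists so the snippet runs without a PDF. The sketch also assumes that the bboxes in cells and in rows compare equal as tuples.

```python
from types import SimpleNamespace


def locate_cell(tab, bbox):
    """Return (row, col) of the cell whose bbox equals `bbox`, or None."""
    for r, row in enumerate(tab.rows):
        for c, cell in enumerate(row.cells):
            if cell is not None and tuple(cell) == tuple(bbox):
                return r, c
    return None


# Stand-in for a table object; all coordinates and texts are made up.
rows = [
    SimpleNamespace(cells=[(50, 100, 150, 120), (150, 100, 200, 120)]),
    SimpleNamespace(cells=[(50, 120, 150, 140), (150, 120, 200, 140)]),
]
fake_tab = SimpleNamespace(
    rows=rows,
    cells=[cell for row in rows for cell in row.cells],  # flat list of bboxes
    extract=lambda: [["Name", "Qty"], ["Bolt", "12"]],
)

bbox = fake_tab.cells[3]               # some element of the flat list
r, c = locate_cell(fake_tab, bbox)
print(r, c, fake_tab.extract()[r][c])  # 1 1 12
```

With a real page, fake_tab would be replaced by one of the tables in page.find_tables().tables.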

JorjMcKie · 2024-06-27T12:53:05Z

Maintainer

This has been answered in #3587. E.g.

import pymupdf

# page is assumed to be a pymupdf.Page that contains at least one table
tab = page.find_tables().tables[0]

# copy of the table's text content:
tab_text = tab.extract()[:]
# the table's cell bboxes as Rect objects:
tab_cells = [[pymupdf.Rect(c) for c in r.cells] for r in tab.rows]

These are two lists of lists with the same dimensions, both indexed as [row][col]. So the text in tab_text[row][col] has the cell coordinates tab_cells[row][col] (which is a Rect object).
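
As a minimal sketch of that correspondence, the helper below pairs each cell's text with its bbox in a single [row][col] grid. Here cell_map is a hypothetical name, not part of PyMuPDF; it relies only on extract(), rows and cells as used above. The stand-in table and its coordinates are made up so the snippet runs without a PDF, and the sketch assumes that a missing cell appears as None at the same position in both structures.

```python
from types import SimpleNamespace


def cell_map(tab):
    """Return a [row][col] grid of (text, bbox) pairs for a table-like object."""
    texts = tab.extract()                       # [row][col] -> text (or None)
    boxes = [list(r.cells) for r in tab.rows]   # [row][col] -> bbox (or None)
    return [list(zip(t, b)) for t, b in zip(texts, boxes)]


# Stand-in for a table found by page.find_tables(); coordinates are made up.
fake_tab = SimpleNamespace(
    extract=lambda: [["Name", "Qty"], ["Bolt", "12"]],
    rows=[
        SimpleNamespace(cells=[(50, 100, 150, 120), (150, 100, 200, 120)]),
        SimpleNamespace(cells=[(50, 120, 150, 140), (150, 120, 200, 140)]),
    ],
)

grid = cell_map(fake_tab)
text, bbox = grid[1][0]
print(text, bbox)  # Bolt (50, 120, 150, 140)
```

With a real table, a non-None bbox from the grid can be passed to pymupdf.Rect() to get a Rect object, as in the answer above.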

1 reply

isaac-peter Jun 27, 2024
Author

Thanks!
