data_utils.py has a bug in extract_pdf_content #1034

cynthiajiangatl · 2024-08-01T22:02:26Z

Describe the bug
When a PDF document contains an empty table and the layout model is used, extract_pdf_content fails with a "list index out of range" error.

Expected behavior
Empty tables should be skipped.

Code fix needed
Add a try/except around the span lookup and skip the empty table.

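for table in form_recognizer_results.tables:
    try:
        # An empty table has no spans, so indexing raises IndexError.
        first_span = table.spans[0]
    except IndexError:
        # Skip the empty table instead of failing the whole extraction.
        continue
    table_offset = first_span.offset
    table_length = first_span.length
    # Keep only tables whose first span lies inside the current page.
    if page_offset <= table_offset and table_offset + table_length < page_offset + page_length:
        tables_on_page.append(table)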
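
For anyone who wants to try the guard outside the ingestion script, here is a minimal, self-contained sketch. The `Span` and `Table` classes and the `tables_within_page` helper are hypothetical stand-ins written for this example, not names from data_utils.py or the Form Recognizer SDK. The sketch checks `if not table.spans` instead of using try/except, which should have the same effect for an empty span list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    # Hypothetical stand-in for a result span (offset and length in characters).
    offset: int
    length: int


@dataclass
class Table:
    # Hypothetical stand-in for a recognized table; an empty table has no spans.
    spans: List[Span] = field(default_factory=list)


def tables_within_page(tables, page_offset, page_length):
    """Return tables whose first span lies inside the page, skipping empty tables."""
    tables_on_page = []
    for table in tables:
        if not table.spans:
            # Empty table: nothing to index, so skip it.
            continue
        first_span = table.spans[0]
        table_offset = first_span.offset
        table_length = first_span.length
        if page_offset <= table_offset and table_offset + table_length < page_offset + page_length:
            tables_on_page.append(table)
    return tables_on_page


tables = [Table([Span(10, 20)]), Table([]), Table([Span(500, 30)])]
selected = tables_within_page(tables, page_offset=0, page_length=100)
print(len(selected))  # prints 1
```

Only the first table is kept: the second has no spans and is skipped, and the third starts beyond the end of the page.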

vkrd · 2024-08-05T18:42:45Z

Thanks for pointing this out. This is fixed in #1040.

cynthiajiangatl added the "bug" label Aug 1, 2024

vkrd mentioned this issue Aug 5, 2024

Prevent ingestion failure on empty tables #1040

Merged

4 tasks

vkrd closed this as completed Aug 5, 2024
