Skipping OCR processing based on logical `mets:structMap` #192

mikegerber · 2022-02-22T12:58:40Z

From my and @bertsky's discussion at qurator-spk/eynollah#67:

Yes, it should be possible to skip pages marked as certain types in the logical structmap – not just in any one processor, but as a general mechanism for workflows in OCR-D.

For the concrete set of supported page types, we should stick to DFG Strukturdatenset, which is strangely missing colour_checker.

100% agree! Should we take this to an OCR-D core or spec issue? I have some additional thoughts to discuss (like: What happens with skipped pages in the output?)

It should be possible to skip pages with structMap types like spine or colour_checker. (@maria-federbusch supplied us at SBB with a list, I'll copy it in here.)
What should happen with skipped pages in the output? Empty PAGE or just omitted? What are the drawbacks of each approach?

mikegerber · 2022-02-22T14:14:05Z

> For the concrete set of supported page types, we should stick to DFG Strukturdatenset, which is strangely missing colour_checker.

That's a custom type used at SBB, invented by @maria-federbusch.

bertsky · 2022-02-22T15:21:33Z

> missing colour_checker.
>
> That's a custom type used at SBB, invented by @maria-federbusch.

I was surprised to see it in mets-mods2tei, but not in kitodo.presentation or dfg-viewer. Maybe you want to open a PR for that?

> What should happen with skipped pages in the output? Empty PAGE or just omitted? What are the drawbacks of each approach?

Again, see the previous discussion.

bertsky · 2022-08-16T14:43:27Z

One might think of an additional CLI option, say -G, --page-type, matching mets:structMap[@TYPE="LOGICAL"]//mets:div/@TYPE of pages in that range of the mets:structLink (if any), perhaps even with //-prefixed regexes.

But practically, there are too many positive cases to include, and only a few fixed negative ones: cover_front, cover_back, binding, spine, privileges, note.

So maybe we should just recommend ignoring all physical pages belonging to these page ranges in the implementation (and implement that behaviour for all Pythonic and bashlib processors in core)?
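The implicit filtering suggested above can be sketched as follows. This is a minimal illustration, not how OCR-D core does or would implement it: the function name `skipped_page_ids`, the constant `SKIP_TYPES` and the sample METS are all hypothetical, and the sketch assumes that logical divs are connected to physical pages through `mets:smLink` elements with `xlink:from` and `xlink:to`.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK = "{http://www.w3.org/1999/xlink}"

# Negative list from the comment above; treating it as a constant is an assumption.
SKIP_TYPES = frozenset(
    {"cover_front", "cover_back", "binding", "spine", "privileges", "note"}
)


def skipped_page_ids(mets_xml, skip_types=SKIP_TYPES):
    """Return IDs of physical divs linked to a logical div of a skipped type."""
    root = ET.fromstring(mets_xml)
    # IDs of all logical divs whose @TYPE is on the negative list
    logical_ids = {
        div.get("ID")
        for div in root.iterfind("mets:structMap[@TYPE='LOGICAL']//mets:div", NS)
        if div.get("TYPE") in skip_types
    }
    # Follow mets:structLink/mets:smLink from those logical divs to physical divs
    return {
        link.get(XLINK + "to")
        for link in root.iterfind("mets:structLink/mets:smLink", NS)
        if link.get(XLINK + "from") in logical_ids
    }


# Hypothetical three-page METS, reduced to the parts relevant here
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="LOG_0000" TYPE="monograph">
      <mets:div ID="LOG_0001" TYPE="cover_front"/>
      <mets:div ID="LOG_0002" TYPE="chapter"/>
      <mets:div ID="LOG_0003" TYPE="cover_back"/>
    </mets:div>
  </mets:structMap>
  <mets:structLink>
    <mets:smLink xlink:from="LOG_0000" xlink:to="PHYS_0000"/>
    <mets:smLink xlink:from="LOG_0001" xlink:to="PHYS_0001"/>
    <mets:smLink xlink:from="LOG_0002" xlink:to="PHYS_0002"/>
    <mets:smLink xlink:from="LOG_0003" xlink:to="PHYS_0003"/>
  </mets:structLink>
</mets:mets>"""

print(sorted(skipped_page_ids(SAMPLE)))  # ['PHYS_0001', 'PHYS_0003']
```

Note that in this sketch a physical page is skipped as soon as any one of its linked logical divs has a skipped type, even if it also belongs to other logical divs. Whether that is the desired behaviour is one of the open questions in this thread.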

M3ssman · 2022-10-07T06:55:21Z

Additionally, I do use the information from physical containers.
We often have custom-labeled containers like Leerseite or Colorchecker (🙂) there.

M3ssman · 2022-10-07T07:08:24Z

If an image has been skipped due to a logical/physical mismatch, no FULLTEXT exists for it, and nothing is linked in the physical container either.

bertsky · 2023-04-14T08:57:35Z

@M3ssman on the OCR-D Forum you said that you have a workflow to do page selection based on logical structmap externally (independent of OCR-D) – could you elaborate here?

M3ssman · 2023-04-14T10:50:03Z

It analyzes the METS and filters images by defined labels, like the logical ones from the DFG structset (e.g. cover_front and cover_back) and custom physical annotations like Colorchecker, Leerseite, Illustration and so on. For the latter, this information has to be present already, for example because your digitization colleagues have enriched it.
For rather small prints (<100 pages) this means saving 10% or more.

This relies on the fact that each page is afterwards processed in its own ocrd-workspace.
Those workspaces are only created for images that do not match a blacklisted label.
Afterwards only the existing OCR is enriched as FULLTEXT, leaving some pages empty. I have not experienced any drawbacks of this approach in the last half year.

This also works when creating new PDFs from the resulting ALTO data using the derivans tool.

To extend this to complete ocrd-workspaces, one could probably even combine it with lazy loading, to avoid downloading these images locally.
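The label-based selection described in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions, not the actual external workflow: the `Page` class, the `select_pages` function, the `BLACKLIST` constant and the example data are hypothetical, and how the labels are read from the METS is left out.

```python
from dataclasses import dataclass, field

# Hypothetical blacklist mixing DFG logical types and custom physical labels
BLACKLIST = frozenset(
    {"cover_front", "cover_back", "Colorchecker", "Leerseite", "Illustration"}
)


@dataclass
class Page:
    page_id: str
    image: str
    labels: set = field(default_factory=set)  # logical types and physical labels


def select_pages(pages, blacklist=BLACKLIST):
    """Split pages into those that get a workspace and those that are skipped."""
    keep, skip = [], []
    for page in pages:
        # A page is skipped as soon as one of its labels is blacklisted
        (skip if page.labels & blacklist else keep).append(page)
    return keep, skip


# Hypothetical example data
pages = [
    Page("PHYS_0001", "0001.tif", {"cover_front"}),
    Page("PHYS_0002", "0002.tif", {"Leerseite"}),
    Page("PHYS_0003", "0003.tif", {"chapter"}),
    Page("PHYS_0004", "0004.tif"),
]
keep, skip = select_pages(pages)
print([p.page_id for p in keep])  # ['PHYS_0003', 'PHYS_0004']
```

Only the pages in `keep` would get a single-page workspace. The pages in `skip` stay without FULLTEXT, which matches the "leaving some pages empty" outcome described above.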

bertsky · 2023-04-14T13:37:57Z

ok, so in principle it's clear that if you use the split recipe (dividing up the METS into single-page workspaces to be processed in parallel), then it is easy to filter by logical page type. (Still, I was hoping for some concrete technical details.)

Getting back to the question of how to do this with OCR-D: @kba, can you please weigh in (esp. on whether we should do this with positive/negative filters on a new CLI option, or rather by implicit filtering in core, perhaps even configurable...)?

bertsky mentioned this issue Dec 6, 2023

idea: add page filtering OCR-D/ocrd-demo-mets-server#2
