Skipping OCR processing based on logical mets:structMap #192

Open
mikegerber opened this issue Feb 22, 2022 · 8 comments

@mikegerber

mikegerber commented Feb 22, 2022

From my and @bertsky's discussion at qurator-spk/eynollah#67:

Yes, it should be possible to skip pages marked as certain types in the logical structmap – not just in any one processor, but as a general mechanism for workflows in OCR-D.

For the concrete set of supported page types, we should stick to DFG Strukturdatenset, which is strangely missing colour_checker.

100% agree! Should we take this to an OCR-D core or spec issue? I have some additional thoughts to discuss (like: What happens with skipped pages in the output?)

  • It should be possible to skip pages with structMap types like spine or colour_checker. (@maria-federbusch supplied us at SBB with a list, I'll copy it in here.)
  • What should happen with skipped pages in the output? Empty PAGE or just omitted? What are the drawbacks of each approach?
@mikegerber
Author

For the concrete set of supported page types, we should stick to DFG Strukturdatenset, which is strangely missing colour_checker.

That's a custom type used at SBB, invented by @maria-federbusch.

@bertsky
Collaborator

bertsky commented Feb 22, 2022

missing colour_checker.

That's a custom type used at SBB, invented by @maria-federbusch.

I was surprised to see it in mets-mods2tei, but not in kitodo.presentation or dfg-viewer. Maybe you want to open a PR for that?

  • What should happen with skipped pages in the output? Empty PAGE or just omitted? What are the drawbacks of each approach?

Again, see the previous discussion.

@bertsky
Collaborator

bertsky commented Aug 16, 2022

One might think of an additional CLI option, say -G, --page-type, matching mets:structMap[@TYPE="LOGICAL"]//mets:div/@TYPE of pages in that range of the mets:structLink (if any), perhaps even with //-prefixed regexes.

But practically, there are too many positive cases to include, and only a few fixed negative ones: cover_front, cover_back, binding, spine, privileges, note.

So maybe we should just recommend ignoring all physical pages belonging to these page ranges in the implementation (and implement that behaviour for all Pythonic and bashlib processors in core)?
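
The lookup described above (logical div types, resolved to physical pages through the mets:structLink) can be sketched in Python. This is only a minimal sketch under stated assumptions, not an OCR-D core implementation: the function name `physical_pages_to_skip`, the sample METS and all of its IDs are hypothetical, and it assumes that logical divs are linked to their pages by `mets:smLink` entries with `xlink:from` and `xlink:to`.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
NS = {"mets": METS_NS}

# Negative list proposed in this thread.
SKIP_TYPES = {"cover_front", "cover_back", "binding", "spine", "privileges", "note"}

# Hypothetical sample document, for illustration only.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="PHYS_0000" TYPE="physSequence">
      <mets:div ID="PHYS_0001" TYPE="page"/>
      <mets:div ID="PHYS_0002" TYPE="page"/>
      <mets:div ID="PHYS_0003" TYPE="page"/>
    </mets:div>
  </mets:structMap>
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="LOG_0000" TYPE="monograph">
      <mets:div ID="LOG_0001" TYPE="cover_front"/>
      <mets:div ID="LOG_0002" TYPE="chapter"/>
    </mets:div>
  </mets:structMap>
  <mets:structLink>
    <mets:smLink xlink:from="LOG_0000" xlink:to="PHYS_0000"/>
    <mets:smLink xlink:from="LOG_0001" xlink:to="PHYS_0001"/>
    <mets:smLink xlink:from="LOG_0002" xlink:to="PHYS_0002"/>
    <mets:smLink xlink:from="LOG_0002" xlink:to="PHYS_0003"/>
  </mets:structLink>
</mets:mets>"""


def physical_pages_to_skip(mets_xml, skip_types=SKIP_TYPES):
    """Return the IDs of physical divs linked to logical divs of a skipped type."""
    root = ET.fromstring(mets_xml)
    logical = root.find("mets:structMap[@TYPE='LOGICAL']", NS)
    if logical is None:
        return set()  # no logical structMap, so nothing to skip

    # Collect the IDs of all logical divs whose TYPE is on the negative list.
    skip_logical_ids = {
        div.get("ID")
        for div in logical.iter(f"{{{METS_NS}}}div")
        if div.get("TYPE") in skip_types
    }

    # Follow the structLink from those logical divs to physical divs.
    skipped = set()
    for link in root.iterfind("mets:structLink/mets:smLink", NS):
        if link.get(f"{{{XLINK_NS}}}from") in skip_logical_ids:
            skipped.add(link.get(f"{{{XLINK_NS}}}to"))
    return skipped


print(sorted(physical_pages_to_skip(SAMPLE)))
```

Keeping the negative list as a parameter leaves both options from this thread open: a fixed default for implicit filtering, or a user-supplied set for an explicit CLI option.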

@M3ssman

M3ssman commented Oct 7, 2022

Additionally, I use the information from the physical containers.
We often have custom-labeled containers there, such as Leerseite or Colorchecker ( 🙂 ).

@M3ssman

M3ssman commented Oct 7, 2022

If an image has been skipped due to a logical/physical mismatch, no FULLTEXT exists for it, and nothing is linked in the physical container either.

@bertsky
Collaborator

bertsky commented Apr 14, 2023

@M3ssman on the OCR-D Forum you said that you have a workflow to do page selection based on logical structmap externally (independent of OCR-D) – could you elaborate here?

@M3ssman

M3ssman commented Apr 14, 2023

It analyzes the METS and filters images by defined labels, such as the logical ones from the DFG-structset (cover_front, cover_back) and custom physical annotations like Colorchecker, Leerseite, Illustration and so on. For the latter, this information must be present, for example because it has been added by your digitization colleagues.
For rather small prints (<100 pages) this means saving 10% or more.

This relies on the fact that each page is afterwards processed in its own separate ocrd-workspace.
Those workspaces are created only for images which do not match a blacklisted label.
Afterwards only the existing OCR is enriched as FULLTEXT, leaving some pages empty. I have not experienced any drawbacks of this approach in the last half year.

This also works when creating new PDFs from the resulting ALTO data using the derivans tool.

To enhance this for complete ocrd-workspaces, one could probably even combine this with lazy loading, so that these images are not downloaded locally.
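
The label-based filtering of physical containers described in this comment could look roughly like the following Python sketch. It is not the workflow's actual code: the function name `pages_to_process` and the sample METS are hypothetical, and it is an assumption that the custom labels sit in the `TYPE` or `LABEL` attribute of the page-level `mets:div` elements.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Custom labels named in this thread; the attribute carrying them is an assumption.
SKIP_LABELS = {"Colorchecker", "Leerseite", "Illustration"}

# Hypothetical sample document, for illustration only.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="PHYS_0000" TYPE="physSequence">
      <mets:div ID="PHYS_0001" TYPE="page" LABEL="Colorchecker"/>
      <mets:div ID="PHYS_0002" TYPE="page"/>
      <mets:div ID="PHYS_0003" TYPE="page" LABEL="Leerseite"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def pages_to_process(mets_xml, skip_labels=SKIP_LABELS):
    """Return physical page IDs, in document order, that carry no blacklisted label."""
    root = ET.fromstring(mets_xml)
    physical = root.find("mets:structMap[@TYPE='PHYSICAL']", NS)
    if physical is None:
        return []
    sequence = physical.find("mets:div", NS)  # the physSequence container
    if sequence is None:
        return []
    keep = []
    for page in sequence.findall("mets:div", NS):
        # Assumption: a blacklisted label may appear in either attribute.
        if {page.get("TYPE"), page.get("LABEL")} & skip_labels:
            continue
        keep.append(page.get("ID"))
    return keep


print(pages_to_process(SAMPLE))
```

Only the returned IDs would then get their own workspace, so skipped pages simply end up without FULLTEXT, as described above.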

@bertsky
Collaborator

bertsky commented Apr 14, 2023

ok, so in principle it's clear that if you use the split recipe (dividing up the METS into single-page workspaces to be processed in parallel), then it is easy to filter by logical page type. (Still, I was hoping for some concrete technical details.)

Getting back to the question of how to do this with OCR-D: @kba, can you please weigh in (esp. on whether we should do this with positive/negative filters on a new CLI option, or rather by implicit filtering in core, perhaps even configurable...)?
