Skipping OCR processing based on logical mets:structMap
#192
Comments
That's a custom type used at SBB, invented by @maria-federbusch.
I was surprised to see it in mets-mods2tei, but not in kitodo.presentation or dfg-viewer. Maybe you want to open a PR for that?
Again, see the previous discussion.
One might think of an additional CLI option for this. But practically, there are too many positive cases to include, and only a few fixed negative ones. So maybe we should just recommend ignoring all physical pages belonging to these page ranges in the implementation (and implement that behaviour for all Pythonic and bashlib processors in core)?
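To make the proposed implicit filtering concrete, here is a rough sketch of how physical pages could be resolved from the logical `structMap` via `mets:structLink` and then excluded. This is not core's actual API — the function name, the negative-type set, and the minimal METS layout are assumptions for illustration only:

```python
import xml.etree.ElementTree as ET

METS_NS = 'http://www.loc.gov/METS/'
XLINK_NS = 'http://www.w3.org/1999/xlink'
NS = {'mets': METS_NS, 'xlink': XLINK_NS}

def physical_pages_to_skip(mets_root, skip_types):
    """Return the IDs of physical page divs that are linked (via
    mets:structLink/mets:smLink) to logical divs whose TYPE is in
    skip_types (e.g. {'spine', 'colour_checker'})."""
    # 1. collect IDs of logical divs with an unwanted TYPE
    bad_logical = set()
    for smap in mets_root.findall('mets:structMap', NS):
        if smap.get('TYPE') != 'LOGICAL':
            continue
        for div in smap.iter('{%s}div' % METS_NS):
            if div.get('TYPE') in skip_types:
                bad_logical.add(div.get('ID'))
    # 2. resolve them to physical page IDs via the structLink section
    skip = set()
    for link in mets_root.findall('mets:structLink/mets:smLink', NS):
        if link.get('{%s}from' % XLINK_NS) in bad_logical:
            skip.add(link.get('{%s}to' % XLINK_NS))
    return skip
```

A processor (or core itself) could then simply subtract this set from the page IDs it iterates over, which is exactly the "implicit filtering" alternative to a user-facing CLI option.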
Additionally, I do use the information from the physical containers.
If an image has been skipped due to a logical/physical mismatch, there is no FULLTEXT, and nothing is linked in the physical container either.
@M3ssman on the OCR-D Forum you said that you have a workflow to do page selection based on logical structMap externally (independent of OCR-D) – could you elaborate here?
It analyzes the METS and filters images by defined labels, such as the logical ones from the DFG-structset. This relies on the fact that each page is afterwards processed in a separate OCR-D workspace. It also works when creating new PDFs from the resulting ALTO data using the Derivans tool. To extend this to complete OCR-D workspaces, one could probably even combine it with lazy loading, so as not to download these images locally.
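To illustrate the lazy-loading idea mentioned above: once a physical page is excluded, its file references can be looked up in the `mets:fileSec`, so the corresponding images need never be fetched. A minimal stdlib sketch (the helper name and the METS layout are assumptions for illustration, not M3ssman's actual tool):

```python
import xml.etree.ElementTree as ET

METS_NS = 'http://www.loc.gov/METS/'
XLINK_NS = 'http://www.w3.org/1999/xlink'
NS = {'mets': METS_NS, 'xlink': XLINK_NS}

def page_file_hrefs(mets_root, page_id):
    """Return the xlink:href of every file referenced (via mets:fptr)
    by the physical page div with the given ID. A downloader can use
    this to decide which URLs to skip for excluded pages."""
    # 1. collect the FILEIDs referenced by the requested page div
    file_ids = set()
    for smap in mets_root.findall('mets:structMap', NS):
        if smap.get('TYPE') != 'PHYSICAL':
            continue
        for div in smap.iter('{%s}div' % METS_NS):
            if div.get('ID') == page_id:
                for fptr in div.findall('mets:fptr', NS):
                    file_ids.add(fptr.get('FILEID'))
    # 2. resolve the FILEIDs to FLocat hrefs in the fileSec
    hrefs = []
    for f in mets_root.iter('{%s}file' % METS_NS):
        if f.get('ID') in file_ids:
            for flocat in f.findall('mets:FLocat', NS):
                hrefs.append(flocat.get('{%s}href' % XLINK_NS))
    return hrefs
```

Combined with a set of pages to skip, this would let a split-style workflow build single-page workspaces without ever touching the excluded images.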
OK, so in principle it's clear that if you use the split recipe (dividing the METS into single-page workspaces to be processed in parallel), then it is easy to filter by logical page type. (Still, I was hoping for some concrete technical details.) Getting back to the question of how to do this with OCR-D: @kba, can you please weigh in (especially on whether we should do this with positive/negative filters on a new CLI option, or rather by implicit filtering in core, perhaps even configurable)?
From my and @bertsky's discussion at qurator-spk/eynollah#67:
`structMap` `TYPE`s like `spine` or `colour_checker`. (@maria-federbusch supplied us at SBB with a list, I'll copy it in here.)