auto-invalidate derived images when changing coords #639

Merged (3 commits) on Nov 3, 2020

@bertsky bertsky commented Oct 30, 2020

Fixes the issue described in chat for all processors based on core Python. (Not sure about bashlib processors...)

(I added a warning to the default GdsCollector about the AlternativeImage removal. If and when #576 lands, these should appear in OCR-D logging somehow. 🤞)

Thus, some workflows will work, while others will need to be re-defined.

As I said above, this still does not expose all bad workflows, because that depends on whether processors make their image requirements explicit (`feature_selector` and `feature_filter`) or not. If they do, we'll at least get exceptions ("Found no AlternativeImage that satisfies all requirements") when running invalid workflows. But if they don't, there might be strange errors (like when 3-dim integer arrays from raw RGB images are passed to algorithms that need 1-dim boolean arrays from binarization) or no errors at all (like when implicit internal binarization happens or denoising is simply missing).
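
To illustrate why explicit requirements surface bad workflows, here is a minimal sketch of feature-based image selection. It is not the actual core implementation: `select_image`, its arguments, and the file names are hypothetical, and it assumes each derived image carries a comma-separated feature string. Only the exception message is taken from the description above.

```python
def select_image(alternatives, feature_selector="", feature_filter=""):
    """Pick the most recent derived image that meets the stated requirements.

    Hypothetical helper: `alternatives` is a list of (filename, comments)
    pairs, where comments is a comma-separated feature string such as
    "cropped,binarized".
    """
    required = {f for f in feature_selector.split(",") if f}
    forbidden = {f for f in feature_filter.split(",") if f}
    # Later entries are assumed to be later workflow steps, so search backwards.
    for filename, comments in reversed(alternatives):
        features = {f for f in comments.split(",") if f}
        if required <= features and not (forbidden & features):
            return filename
    raise Exception("Found no AlternativeImage that satisfies all requirements")
```

With this sketch, a processor asking for `feature_selector="binarized"` fails loudly once the binarized image has been invalidated, whereas a processor that states no requirements silently falls back to whatever image is left.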

@bertsky bertsky requested a review from kba October 30, 2020 23:41
- whenever overwriting `Border`'s `Coords` or `Coords/@points`,
  remove all the `Page`'s derived images with `cropped`
- whenever overwriting `Region`'s or `TextLine`'s or `Word`'s
  or `Glyph`'s `Coords` or `Coords/@points`,
  remove all its derived images
- whenever overwriting `Page`'s or `Region`'s `@orientation`,
  remove all its derived images with `deskewed`
- add a warning to the GdsCollector each time
@bertsky bertsky force-pushed the fix-page-invalidate-alternativeimages branch from bc81f87 to a8cd848 Compare October 30, 2020 23:48
- whenever overwriting `Page`'s `Border`,
  remove all its derived images with `cropped`
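
The rules in these commit messages can be sketched on a simplified in-memory model. This is only an illustration under assumptions: the `Segment` class and its method names are hypothetical stand-ins, not the generated PAGE classes touched by this PR, and the warning goes through Python's `warnings` module rather than the GdsCollector.

```python
import warnings
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical stand-in for a PAGE element (Page, Region, TextLine, ...)."""
    kind: str
    orientation: float = 0.0
    coords: str = ""
    border: str = ""
    images: list = field(default_factory=list)  # (filename, comments) pairs

    def _invalidate(self, feature=None):
        # Keep an image only if a feature was named and the image lacks it;
        # with feature=None, every derived image is dropped.
        kept = [(name, comments) for name, comments in self.images
                if feature is not None and feature not in comments.split(",")]
        removed = len(self.images) - len(kept)
        if removed:
            warnings.warn(f"removing {removed} derived image(s) from {self.kind}")
        self.images = kept

    def set_coords(self, points):
        # New outline: all derived images of this segment are stale.
        self.coords = points
        self._invalidate()

    def set_orientation(self, angle):
        # New angle: only images that were deskewed are stale.
        self.orientation = angle
        self._invalidate("deskewed")

    def set_border(self, points):
        # New page border: only images that were cropped are stale.
        self.border = points
        self._invalidate("cropped")
```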

@kba kba left a comment

LGTM, we should add unit tests though.

bertsky commented Nov 2, 2020

> we should add unit tests though.

Indeed. This is actually not that hard to do.

For example, in repo/assets/data/kant_aufklaerung_1784-complex/data,

  • take OCR-D-SEG-BLOCK-tesseract, parse a page and modify its Border or Border/Coords or Border/Coords/@points and verify that all /PcGts/Page/AlternativeImage are gone (because cropping was first in the workflow).
  • Or take OCR-D-SEG-PAGE-anyocr-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-tesseract-DESKEW-ocropy and modify its Page/@orientation and verify that all derived images with deskewed are gone.
  • Or take OCR-D-SEG-BLOCK-tesseract-CLIP-DESKEW-tesseract and modify some TextRegion/@orientation and verify that all derived images with deskewed are gone in that segment.
  • Or take OCR-D-SEG-BLOCK-tesseract-CLIP-DESKEW-tesseract and modify some TextRegion/Coords or TextRegion/Coords/@points and verify that all derived images are gone in that segment.
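
The shape of such tests could look roughly like the sketch below. It is self-contained on purpose, so it does not load the asset files or call the real parser: the inline PAGE fragment, the file names in it, and the helpers `set_orientation` and `set_points` are hypothetical stand-ins for the setters changed in this PR. A real test would parse one of the file groups above instead.

```python
import unittest
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Made-up miniature PAGE document with derived images on page and region level.
PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.png" imageWidth="100" imageHeight="100">
    <AlternativeImage filename="page-crop.png" comments="cropped"/>
    <TextRegion id="r1" orientation="0.0">
      <AlternativeImage filename="r1-clip.png" comments="cropped,clipped"/>
      <AlternativeImage filename="r1-deskew.png" comments="cropped,clipped,deskewed"/>
      <Coords points="0,0 50,0 50,50 0,50"/>
    </TextRegion>
  </Page>
</PcGts>"""


def images(element):
    """Return the AlternativeImage children directly below element."""
    return element.findall("pc:AlternativeImage", NS)


def drop_images(element, feature=None):
    """Remove derived images; restrict to those with feature if given."""
    for image in images(element):
        if feature is None or feature in image.get("comments", "").split(","):
            element.remove(image)


def set_orientation(element, angle):
    """Stand-in setter: overwrite @orientation, invalidate deskewed images."""
    element.set("orientation", str(angle))
    drop_images(element, "deskewed")


def set_points(element, points):
    """Stand-in setter: overwrite Coords/@points, invalidate all images."""
    element.find("pc:Coords", NS).set("points", points)
    drop_images(element)


class InvalidationTest(unittest.TestCase):
    def setUp(self):
        self.page = ET.fromstring(PAGE_XML).find("pc:Page", NS)
        self.region = self.page.find("pc:TextRegion", NS)

    def test_orientation_drops_only_deskewed(self):
        set_orientation(self.region, 2.0)
        names = [image.get("filename") for image in images(self.region)]
        self.assertEqual(names, ["r1-clip.png"])
        # Images of the parent page are untouched.
        self.assertEqual(len(images(self.page)), 1)

    def test_points_drop_all_in_segment(self):
        set_points(self.region, "0,0 40,0 40,40 0,40")
        self.assertEqual(images(self.region), [])
        self.assertEqual(len(images(self.page)), 1)
```

Saved as a module, this can be run with `python -m unittest`. Both tests also check that invalidation stays local to the modified segment, which matches the "in that segment" wording above.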

@kba kba merged commit a081107 into OCR-D:master Nov 3, 2020
@bertsky bertsky deleted the fix-page-invalidate-alternativeimages branch June 6, 2024 14:10