New processor API #1240

bertsky · 2024-06-24T14:31:22Z

This is a first attempt at implementing #322 – without the actual error handling or parallel processing (so more or less refactoring and deprecation). It also enables fixing #579 in the process.

- add method `process_workspace(workspace)` as a replacement for passing `workspace` in the constructor and then calling `process` (implemented by subclasses): implement in the superclass
  - loop over input files
  - delegate processing to new method `process_page_file()` if possible
  - otherwise fall back to old `process()` outside of loop
  - download input files when needed if `self.download`
- add method `process_page_file()` as single-page processing procedure on OcrdFiles: implement in the superclass for the most frequent/default use-case of
  - (multi-) image/PAGE input files
  - (single) PAGE output files
  - delegate to new method `process_page_pcgts()` if available
  - add PAGE processing metadata
  - set PAGE PcGtsId
  - handle `make_file_id` and `workspace.add_file`
- add method `process_page_pcgts()` as single-page processing function on OcrdPage: to be implemented only by subclasses
- constructor: add kwarg `download_files` controlling `self.download` (see above)
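The delegation chain in the list above can be sketched with toy classes. This is a minimal illustration only: `Page`, `Workspace` and `UppercaseProcessor` are hypothetical stand-ins and not the actual OCR-D classes, and both downloading and the fallback to the old `process()` are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Page:
    """Hypothetical stand-in for a parsed page (not the real OcrdPage)."""
    page_id: str
    text: str
    metadata: List[str] = field(default_factory=list)


class Workspace:
    """Hypothetical stand-in: maps file IDs to pages."""

    def __init__(self, inputs: Dict[str, Page]):
        self.inputs = dict(inputs)
        self.outputs: Dict[str, Page] = {}

    def add_file(self, file_id: str, page: Page) -> None:
        self.outputs[file_id] = page


class Processor:
    def process_workspace(self, workspace: Workspace) -> None:
        # Superclass: loop over input files and delegate per page.
        for file_id in workspace.inputs:
            self.process_page_file(workspace, file_id)

    def process_page_file(self, workspace: Workspace, file_id: str) -> None:
        # Default single-file procedure: fetch, delegate, annotate, store.
        page = workspace.inputs[file_id]
        result = self.process_page_pcgts(page)
        result.metadata.append(type(self).__name__)
        workspace.add_file(file_id + "_OUT", result)

    def process_page_pcgts(self, page: Page) -> Page:
        # Pure page function: to be implemented by subclasses only.
        raise NotImplementedError


class UppercaseProcessor(Processor):
    def process_page_pcgts(self, page: Page) -> Page:
        # Return a modified copy; the input page stays untouched.
        return Page(page.page_id, page.text.upper(), list(page.metadata))


ws = Workspace({"FILE_0001": Page("P1", "hello")})
UppercaseProcessor().process_workspace(ws)
```

In this sketch a subclass only supplies the page-level function, while the loop, the metadata annotation and the output bookkeeping stay in the superclass.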


bertsky · 2024-06-24T14:51:05Z

A couple of points worth discussing:

In contrast to the formulation in #322 (change API to get page-level parallelization everywhere), there is no single `process_page` now. There are too many cases of what processors may need to do.
I decided to approach this by splitting it up into `process_page_file` (taking an `OcrdFile` as input and adding the result to the workspace, with nothing in return) and `process_page_pcgts` (taking a parsed `OcrdPage` as input and returning the modified `OcrdPage`): the latter must be overridden, the former can be (but uses the latter in its default implementation).
That should be sufficiently general to be useful in adopting the new API, but let's see if we can indeed bring all processors into that paradigm. (For now I just provided the API adaptation for the builtin dummy processor.)
We have a design choice between providing a stub definition in the superclass that throws `NotImplementedError` and not having the method at all. The latter can quickly be tested via `hasattr`, the former must be caught. For the pure PAGE processing function, I think it makes most sense to use the former.
To avoid the whole `chdir` business once and for all, I made `process_workspace` take the workspace as a mandatory argument, and deprecated passing a workspace in the constructor.
I did not sufficiently separate processing from non-processing context in the constructor yet. The goal here was to first find a viable path in that direction.
Thus, I did add a `setup` method already, but the constructor still determines non-processing or processing context. I believe we should just use methods for that, e.g. `show_resources`, instead of taking that as a kwarg in the constructor. But that is an even larger change. (Hence the fixme comments about deprecating this.)
All of our actual (external) processors use `workspace.download_file()` on their input files, so that's worth including in the superclass behaviour. However, since most of our processor tests (in core) do not process any actual files, and I did not want to rewrite them completely, I had to add an option `download_files=False` to avoid attempting that.
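The stub-versus-`hasattr` choice discussed above can be illustrated with a small sketch. The classes and the `run` helper are hypothetical and only model the dispatch: the superclass stub raises `NotImplementedError`, which the caller catches in order to fall back to a legacy `process()` if one exists.

```python
class Base:
    def process_page_pcgts(self, page):
        # Stub in the superclass: callers must catch NotImplementedError.
        raise NotImplementedError


class NewProcessor(Base):
    # Adopts the new API by overriding the page-level function.
    def process_page_pcgts(self, page):
        return page.upper()


class LegacyProcessor(Base):
    # Old style: only offers a whole-workspace method.
    def process(self, pages):
        return [page[::-1] for page in pages]


def run(processor, pages):
    try:
        # New API: page by page.
        return [processor.process_page_pcgts(page) for page in pages]
    except NotImplementedError:
        # The stub was hit; test for the legacy method via hasattr.
        if hasattr(processor, "process"):
            return processor.process(pages)
        raise


new_result = run(NewProcessor(), ["a", "b"])
legacy_result = run(LegacyProcessor(), ["ab"])
```

One caveat of the stub approach, as modelled here, is that a `NotImplementedError` raised for unrelated reasons inside an overriding method would also trigger the fallback.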

MehmedGIT

Looks like a great start.

src/ocrd/decorators/__init__.py

src/ocrd/processor/base.py

MehmedGIT · 2024-06-25T11:13:17Z

src/ocrd/processor/base.py

+        with pushd_popd(workspace.directory):
+            self.workspace = workspace
+            try:
+                # FIXME: add page parallelization by running multiprocessing.Pool (#322)


I hope the Pool workers would also be controllable by an environment variable to ease resource distribution in HPC environments.

bertsky · 2024-06-25

That's the idea, exactly.
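A minimal sketch of what such a control could look like, assuming a hypothetical environment variable `PAGE_WORKERS` (the name is invented for illustration and does not come from this PR). `process_page` is a placeholder for the per-page work.

```python
import os
from multiprocessing import Pool


def pool_size(env=None, default=1):
    """Read the worker count from the (hypothetical) PAGE_WORKERS variable."""
    env = os.environ if env is None else env
    try:
        workers = int(env.get("PAGE_WORKERS", default))
    except ValueError:
        # Unparsable value: fall back to the default.
        workers = default
    # Never start a pool with fewer than one worker.
    return max(1, workers)


def process_page(page_id):
    # Placeholder for the actual per-page processing.
    return page_id.upper()


def process_pages(page_ids):
    # One task per page; the pool size comes from the environment.
    # On platforms that spawn workers, call this under an
    # `if __name__ == "__main__":` guard.
    with Pool(processes=pool_size()) as pool:
        return pool.map(process_page, page_ids)
```

With this sketch, a job scheduler could export `PAGE_WORKERS` to match the number of cores it has allocated to the job.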



bertsky · 2024-07-05T14:54:05Z

BTW, I consider it bad practice that we always load the ocrd-tool.json in the inheriting constructor.

It should really be a class attribute. And assuming it is always placed in the same spot of the Python distribution (which is currently the case but not strictly required as our specs only require it in the repository root), one can even load it in the superclass (so nothing would have to be done in the inheriting code).
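As a rough sketch of that idea, the superclass could load the JSON once per class instead of once per instance. The attribute `metadata_path` and the method `metadata()` are hypothetical names chosen for this illustration. The sketch also assumes that each subclass declares the file location explicitly, whereas the comment above suggests the superclass could even derive it from a fixed place in the distribution.

```python
import json
import tempfile
from pathlib import Path


class Processor:
    # Subclasses only declare where their tool description lives.
    metadata_path = None
    _metadata_cache = None

    @classmethod
    def metadata(cls):
        # Look in the class's own __dict__ so that a subclass does not
        # accidentally reuse the cache of its parent class.
        if cls.__dict__.get("_metadata_cache") is None:
            text = Path(cls.metadata_path).read_text(encoding="utf-8")
            cls._metadata_cache = json.loads(text)
        return cls._metadata_cache


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ocrd-tool.json"
    path.write_text(json.dumps({"version": "0.0.1"}), encoding="utf-8")

    class DummyProcessor(Processor):
        metadata_path = path

    first = DummyProcessor.metadata()
    second = DummyProcessor.metadata()  # served from the class-level cache
```

Nothing has to be done in the inheriting constructor in this sketch; the subclass body contains a single class attribute.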


bertsky · 2024-09-12T11:54:48Z

Another backwards compatibility problem is behaviour for --overwrite: our v2 processors use their own Workspace.add_file calls, without setting force=ocrd_utils.config.OCRD_EXISTING_OUTPUT == 'OVERWRITE'. They expect the workspace to implicitly handle that via Workspace.overwrite_mode, which I removed in favour of the config mechanism.

Any ideas how we can best get a workaround for that (at least for some grace period)?

bertsky · 2024-09-12T13:03:52Z

> Another backwards compatibility problem is behaviour for --overwrite: our v2 processors use their own Workspace.add_file calls, without setting force=ocrd_utils.config.OCRD_EXISTING_OUTPUT == 'OVERWRITE'. They expect the workspace to implicitly handle that via Workspace.overwrite_mode, which I removed in favour of the config mechanism.

Perhaps, in v3, instead of passing `force=ocrd_utils.config.OCRD_EXISTING_OUTPUT == 'OVERWRITE'` wherever we control the call, we should rather check `if force or ocrd_utils.config.OCRD_EXISTING_OUTPUT == 'OVERWRITE'` within `add_file` generally?

EDIT: I did exactly that – in hindsight, it also seems natural.
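The check described above can be sketched in isolation. `Config` and `Workspace` here are simplified stand-ins, not the real `ocrd_utils.config` or `Workspace` classes, and the non-overwriting value `"ABORT"` is only a placeholder for "anything other than `OVERWRITE`".

```python
class Config:
    # Stand-in for the global config; only this one setting is modelled.
    OCRD_EXISTING_OUTPUT = "ABORT"


config = Config()


class Workspace:
    def __init__(self):
        self.files = {}

    def add_file(self, file_id, content, force=False):
        # Overwrite if the caller asks for it OR the global config says so,
        # so callers that never pass force still honour the config.
        overwrite = force or config.OCRD_EXISTING_OUTPUT == "OVERWRITE"
        if file_id in self.files and not overwrite:
            raise FileExistsError(f"file {file_id} already exists")
        self.files[file_id] = content


ws = Workspace()
ws.add_file("OUT_0001", "first")
try:
    ws.add_file("OUT_0001", "second")
    refused = False
except FileExistsError:
    refused = True

ws.add_file("OUT_0001", "forced", force=True)
after_force = ws.files["OUT_0001"]

config.OCRD_EXISTING_OUTPUT = "OVERWRITE"
ws.add_file("OUT_0001", "configured")  # no force kwarg needed
```

Because the decision sits inside `add_file`, processors that make their own `add_file` calls without a `force` argument would pick up the configured behaviour as well.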


bertsky added 14 commits June 10, 2024 23:49

deprecate Processor.process()

833dac7

fix OCR-D#274: no default -I / -O

3f4c7f9

workspace.download: fix typo in exception

d2b5df3

Processor: factor-out show_resource(), delegate to resolve_resource()

9827c4d

Processor: add setup(), run once in get_processor()

38fd4aa

ocrd_cli_wrap_processor: fix workspace arg (not a kwarg)

580988a

DummyProcessor: re-implement via new process_page_*

9714aab

- implement `process_page_pcgts` with behaviour for `copy_files=False`
- override superclass `process_page_file` with behaviour for `copy_files=True`
- remove old `process` implementation

run_processor: adapt to process→process_workspace

e5d4736

test DummyProcessor: adapt to new download default by setting `download_files=False` in tests (because they are not actually in the filesystem)

809a01b

test DummyProcessor: override process_workspace() by delegating to process() directly

dfe7f8e

test builtin ocrd-dummy: adapt to consistent filename

1550668

test processor: adapt to input_file_grp required

75809b1

test processor: adapt to self.workspace only during run_processor

c429da5

bertsky requested review from kba, MehmedGIT and joschrew and removed request for kba and MehmedGIT June 24, 2024 14:51

MehmedGIT reviewed Jun 25, 2024


bertsky added 5 commits June 26, 2024 14:02

Workspace.save_image_file: add kwarg file_path for predetermined local_filename

295cdb6

Processor.process_page_pcgts: add kwargs and allow returning derived images

e2cbcb9

Workspace.save_image_file: save DPI metadata, too

20a6a1c

Workspace.image_from_*: annotate 'DPI' in result dict and ensure it's used in meta-data of resulting image

679ad85

test_workspace: adapt to image_from_* DPI and add assertions

565a3d9

bertsky mentioned this pull request Jul 5, 2024

Fix processor resolve preset #1256

Merged

kba and others added 3 commits September 3, 2024 12:27

add typing, extend docs

65ab63c

Processor.verify: revert 5819c81 (we still have no defaults in json loaded from v2)

73a395e

Processor.process_page_file / OcrdPageResultImage: allow None instead of AlternativeImageType

3382ad9

bertsky added 19 commits September 16, 2024 01:29

PcGts.Page.id / make_xml_id: replace '/' with '_'

cad4777

ocrd.cli.ocrd-tool resolve-resource: fix (forgot to print result)

10b2abc

processor CLI: delegate --resolve-resource, too

bd64444

METS Server: also export+delegate physical_pages

71e9841

ocrd.cli.workspace: consistently pass on --mets-server-url and --backup (also, simplify)

01ccdf1

ocrd.cli.workspace server: add 'reload' and 'save'

3301f9c

ocrd.cli.bashlib input-files: pass on --mets-server-url, too

dc2c758

ocrd.cli.validate tasks: pass on --mets-server-url, too

42af6a3

Processor.process_workspace(): do not show NotImplementedError context if fallback process() raises anything itself

7ea8d57

Processor.verify: check output fileGrps as well (or OCRD_EXISTING_OUTPUT=OVERWRITE|SKIP or disjoint --page-id)

9751256

run_processor: be robust if ocrd_tool is missing steps

f66753a

lib.bash: fix errexit

eb12a80

lib.bash input-files: pass on --mets-server-url, --overwrite, and parameters (necessary for required params)

3355ea4

lib.bash input-files: do not try to validate tasks here (impossible to get right with required parameters, and now covered by wrapped Processor.verify() already)

f05f840

Processor / Workspace.add_file: always force if config.OCRD_EXISTING_OUTPUT==OVERWRITE

b5c1191

test processors: no need for 'force' kwarg anymore

cbe465a

tests: make sure ocrd_utils.config gets reset whenever changing it globally

3e214ca

OcrdPage: add PageType.get_ReadingOrderGroups()

c549c42

update OcrdPage from generateds

53b880f

bertsky force-pushed the new-processor-api branch from 0fb234b to 53b880f September 15, 2024 23:54

bertsky mentioned this pull request Sep 16, 2024

v3 API: general XPath 2.0 mechanism, generateDS true reverse mapping, ocrd-filter bertsky/core#21

Open

kba and others added 3 commits September 16, 2024 13:29

📦 v3.0.0b5

687b06f

📝 improve b5 changelog

a43098e

ocrd.cli.workspace: assert non-server in cmds mutating METS

d2cb0fb

bertsky mentioned this pull request Sep 24, 2024

#1240 backports for v2 #1275

Open
