Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New processor API #1240

Open
wants to merge 218 commits into
base: master
Choose a base branch
from
Open

New processor API #1240

wants to merge 218 commits into from

Conversation

bertsky
Copy link
Collaborator

@bertsky bertsky commented Jun 24, 2024

This is a first attempt at implementing #322 – without the actual error handling or parallel processing (so more or less refactoring and deprecation). It also enables fixing #579 in the process.

- add method `process_workspace(workspace)`
  as a replacement for passing `workspace`
  in the constructor and then calling `process`
  (implemented by subclasses): implement in the
  superclass
  - loop over input files
  - delegate processing to new method
    `process_page_file()` if possible
  - otherwise fall back to old `process()`
    outside of loop
  - download input files when needed
    if `self.download`
- add method `process_page_file()` as single-page
  processing procedure on OcrdFiles: implement
  in the superclass for the most frequent/default
  use-case of
  - (multi-) image/PAGE input files
  - (single) PAGE output files
  - delegate to new method `process_page_pcgts()`
    if available
  - add PAGE processing metadata
  - set PAGE PcGtsId
  - handle `make_file_id` and `workspace.add_file`
- add method `process_page_pcgts()` as single-page
  processing function on OcrdPage: to be implemented
  only by subclasses
- constructor: add kwarg `download_files` controlling
  `self.download` (see above)
- implement `process_page_pcgts` with behaviour for `copy_files=False`
- override superclass `process_page_file` with behaviour for `copy_files=True`
- remove old `process` implementation
…load_files=False` in tests (because they are not actually in the filesystem)
@bertsky
Copy link
Collaborator Author

bertsky commented Jun 24, 2024

A couple of points worth discussing:

  • In contrast to the formulation in change API to get page-level parallelization everywhere #322, there is no single process_page now. We have too many cases to consider of what processors may need to do.
    I decided to approach this by splitting up into process_page_file (taking an OcrdFile as input and adding the result to the workspace with nothing in return) and process_page_pcgts (taking a parsed OcrdPage as input and returning the modified OcrdPage): The latter must be overriden, the former can (but uses the latter in its default implementation).
    That should be sufficiently general to be useful in adopting the new API, but let's see if we can indeed bring all processors into that paradigm. (For now I just provided the API adaptation for the builtin dummy processor.)
  • We have a design choice of providing a trunk definition in the superclass throwing NotImplementedError vs. not having the method at all. The latter can quickly be tested via hasattr, the former must be caught. For the pure PAGE processing function, I think it makes most sense to use the former.
  • To avoid the whole chdir business once and for all, I made process_workspace take the workspace as a mandatory argument, and deprecate passing a workspace in the constructor.
  • I did not sufficiently separate processing from non-processing context in the constructor, yet. The goal here was to first find a viable path in that direction.
    Thus, I did add a setup method already, but the constructor still determines non-processing or processing context. I believe we should just use methods for that e.g. show_resources instead of taking that as a kwarg in the constructor. But it's an even larger change. (Hence the fixme comments about deprecating this.)
  • All of our actual (external) processors use workspace.download_file() on their input files, so that's worth including in the superclass behaviour. However, since most of our processor tests (in core) do not process any actual files, which I did not want to rewrite completely, I had to add an option download_files=False to avoid attempting that.

@bertsky bertsky requested review from kba, MehmedGIT and joschrew and removed request for kba and MehmedGIT June 24, 2024 14:51
Copy link
Contributor

@MehmedGIT MehmedGIT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a great start.

src/ocrd/decorators/__init__.py Outdated Show resolved Hide resolved
src/ocrd/processor/base.py Outdated Show resolved Hide resolved
with pushd_popd(workspace.directory):
self.workspace = workspace
try:
# FIXME: add page parallelization by running multiprocessing.Pool (#322)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope the Pool workers would be also controllable by an environment variable to ease resource distribution in HPC environments.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the idea, exactly.

src/ocrd/processor/base.py Outdated Show resolved Hide resolved
src/ocrd/processor/base.py Outdated Show resolved Hide resolved
@bertsky
Copy link
Collaborator Author

bertsky commented Jul 5, 2024

BTW, I consider it bad practice that we always load the ocrd-tool.json in the inheriting constructor.

It should really be a class attribute. And assuming it is always placed in the same spot of the Python distribution (which is currently the case but not strictly required as our specs only require it in the repository root), one can even load it in the superclass (so nothing would have to be done in the inheriting code).

@bertsky
Copy link
Collaborator Author

bertsky commented Sep 12, 2024

Another backwards compatibility problem is behaviour for --overwrite: our v2 processors use their own Workspace.add_file calls, without setting force=ocrd_utils.config.OCRD_EXISTING_OUTPUT == 'OVERWRITE'. They expect the workspace to implicitly handle that via Workspace.overwrite_mode, which I removed in favour of the config mechanism.

Any ideas how we can best get a workaround for that (at least for some grace period)?

@bertsky
Copy link
Collaborator Author

bertsky commented Sep 12, 2024

Another backwards compatibility problem is behaviour for --overwrite: our v2 processors use their own Workspace.add_file calls, without setting force=ocrd_utils.config.OCRD_EXISTING_OUTPUT == 'OVERWRITE'. They expect the workspace to implicitly handle that via Workspace.overwrite_mode, which I removed in favour of the config mechanism.

Perhaps, in v3, instead of using force=ocrd_utils.config.OCRD_EXISTING_OUTPUT == 'OVERWRITE' whereever we can control, we should rather do if force or ocrd_utils.config.OCRD_EXISTING_OUTPUT == 'OVERWRITE' within add_file generally?

EDIT: I did exactly that – in hindsight, it also seems natural.

…t if fallback process() raises anything itself
…o get right with required parameters, and now covered by wrapped Processor.verify() already)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants