Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adapt to decentralized resources #801

Closed
6 tasks done
bertsky opened this issue Feb 14, 2022 · 6 comments
Closed
6 tasks done

adapt to decentralized resources #801

bertsky opened this issue Feb 14, 2022 · 6 comments

Comments

@bertsky
Copy link
Collaborator

bertsky commented Feb 14, 2022

To implement OCR-D/spec#181, AFAICS we need:

  • code to mix the Processor.ocrd_tool['resources'] with the preinstalled centralized database in OcrdResourceManager's constructor and .load_resource_list
  • code to adhere to Processor.ocrd_tool['resource_locations'] in .resolve_resource and .list_all_resources
  • changes to the ocrd_validators.resource_list_validator, specifically its resource_list.schema.yml (subtypes of type)
  • changes to the ocrd_validators.ocrd_tool_validator, specifically its ocrd_tool.schema.yml by updating repo/spec

Did I forget anything?

EDIT

  • aid in migrating user ocrd/resources.yml from old syntax to new one
  • update documentation (spec, guides, wiki)
@kba
Copy link
Member

kba commented Feb 14, 2022

Looks fairly complete for changes in core, thank you. I'll extend #800 for the latter three points, the first one is already in place (ocrd resmgr discover).

@bertsky
Copy link
Collaborator Author

bertsky commented Feb 14, 2022

the first one is already in place (ocrd resmgr discover).

That's not quite what I meant though. I would expect the new OcrdResourceManager.discover to be called by OcrdResourceManager.__init__ (right before merging with the user database), so there would be no install-time need for things like ocrd resmgr discover.

@kba
Copy link
Member

kba commented Feb 14, 2022

That's not quite what I meant though. I would expect the new OcrdResourceManager.discover to be called by OcrdResourceManager.__init__ (right before merging with the user database), so there would be no install-time need for things like ocrd resmgr discover.

Doing this every time the OcrdResourceManager is instantiated is a huge performance penalty. E.g. I currently have about 70 processors installed. A lot of them don't separate __init__ and setup and it takes longer than a second to get the ocrd-tool.json, so this takes at least two minutes to run --dump-json on all the processors.

@bertsky
Copy link
Collaborator Author

bertsky commented Feb 14, 2022

Doing this every time the OcrdResourceManager is instantiated is a huge performance penalty. E.g. I currently have about 70 processors installed. A lot of them don't separate __init__ and setup and it takes longer than a second to get the ocrd-tool.json, so this takes at least two minutes to run --dump-json on all the processors.

Good point. But the distinction between constructor and setup for processing will become the norm (and if done right, could also be used to keep heavy-toll dependencies like Tensorflow imports out of the non-processing contexts). And in the context of a processor (i.e. outside of ocrd.cli.resmgr), at least for the Pythonic ones, there is no need for a --dump-json, as one can directly pick it up from the pkg_resources.

And even in the resmgr CLI, 1s delay would still be tolerable, so it does not hurt if you query a single processor.

So it is only ocrd resmgr list-available * case where things become unacceptablely slow. But is that worth sacrificing no extra post-installation step necessary?

@kba
Copy link
Member

kba commented Feb 14, 2022

And even in the resmgr CLI, 1s delay would still be tolerable, so it does not hurt if you query a single processor.

I've tested it and including all the spurious results like ocrd-make, ocrd-import etc., ocrd resmgr discover takes at most 90 seconds for a fairly complete ocrd_all installation. Since we can model the code in such a way that it is conservative with what is searched.

So it is only ocrd resmgr list-available * case where things become unacceptablely slow. But is that worth sacrificing no extra post-installation step necessary?

If we do it completely dynamically then yes, that would be the only case. And we save ourselves future headaches, if we don't try to serialize the resource descriptions to resource_list.yml (deduplication, updated processors etc. it's likely a nightmare).

@bertsky
Copy link
Collaborator Author

bertsky commented Aug 17, 2022

I took the liberty of checking tasks that have been solved by now above.

  • code to mix the Processor.ocrd_tool['resources'] with the preinstalled centralized database in OcrdResourceManager's constructor and .load_resource_list

seems complete:

if dynamic:
for exec_dir in environ['PATH'].split(':'):
for exec_path in Path(exec_dir).glob(f'{executable}'):
self.log.info(f"Inspecting '{exec_path} --dump-json' for resources")
ocrd_tool = get_ocrd_tool_json(exec_path)
for resdict in ocrd_tool.get('resources', ()):
if exec_path.name not in self.database:
self.database[exec_path.name] = []
for res_remove in (res for res in self.database.get(executable, []) if res['name'] == resdict['name']):
self.database.get(executable).remove(res_remove)
self.database[exec_path.name].append(resdict)

  • code to adhere to Processor.ocrd_tool['resource_locations'] in .resolve_resource and .list_all_resources

resolve_resource / list_resource_candidates does not adhere to resource_locations yet:

candidates = []
candidates.append(join(cwd, fname))
xdg_data_home = XDG_DATA_HOME if not xdg_data_home else xdg_data_home
processor_path_var = '%s_PATH' % executable.replace('-', '_').upper()
if processor_path_var in environ:
candidates += [join(x, fname) for x in environ[processor_path_var].split(':')]
candidates.append(join(xdg_data_home, 'ocrd-resources', executable, fname))
candidates.append(join('/usr/local/share/ocrd-resources', executable, fname))
if moduled:
candidates.append(join(moduled, fname))
return candidates

Also, as discussed in ocrd_tesserocr, we still need to handle resource location module correctly.

list_all_resources seems complete though:

try:
resource_locations = get_ocrd_tool_json(executable)['resource_locations']
except FileNotFoundError:
# processor we're looking for ressource_locations of is not installed.
# Assume the default
resource_locations = ['data', 'cwd', 'system', 'module']

  • changes to the ocrd_validators.resource_list_validator, specifically its resource_list.schema.yml (subtypes of type)

seems complete:

RESOURCE_LIST_SCHEMA = {
'type': 'object',
'additionalProperties': False,
'patternProperties': {
'^ocrd-.*': OCRD_TOOL_SCHEMA['properties']['tools']['patternProperties']['ocrd-.*']['properties']['resources']
}
}
and https://github.com/OCR-D/core/blob/master/ocrd_validators/ocrd_validators/resource_list_validator.py and
report = OcrdResourceListValidator.validate(list_loaded)

  • changes to the ocrd_validators.ocrd_tool_validator, specifically its ocrd_tool.schema.yml by updating repo/spec

seems complete:

resources:
type: array
description: Resources for this processor
items:
type: object
additionalProperties: false
required:
- url
- description
- name
- size
properties:
url:
type: string
description: URLs of all components of this resource
description:
type: string
description: A description of the resource
name:
type: string
description: Name to store the resource as
type:
type: string
enum: ['file', 'directory', 'archive']
default: file
description: Type of the URL
parameter_usage:
type: string
description: Defines how the parameter is to be used
enum: ['as-is', 'without-extension']
default: 'as-is'
path_in_archive:
type: string
description: if type is archive, the resource is at this location in the archive
default: '.'
version_range:
type: string
description: Range of supported versions, syntax like in PEP 440
default: '>= 0.0.1'
size:
type: number
description: Size of the resource in bytes

  • aid in migrating user ocrd/resources.yml from old syntax to new one

seems complete:

@resmgr_cli.command('migrate')
@click.argument('migration', type=click.Choice(['2.37.0']))
def migrate(migration):
"""
Update the configuration after updating core to MIGRATION
"""
resmgr = OcrdResourceManager(skip_init=True)
log = getLogger('ocrd.resmgr.migrate')
if not resmgr.user_list.exists():
log.info(f'No configuration file found at {resmgr.user_list}, nothing to do')
if migration == '2.37.0':
backup_file = resmgr.user_list.with_suffix(f'.yml.before-{migration}')
yaml_in_str = resmgr.user_list.read_text()
log.info(f'Backing {resmgr.user_list} to {backup_file}')
backup_file.write_text(yaml_in_str)
log.info(f'Applying migration {migration} to {resmgr.user_list}')
yaml_in = safe_load(yaml_in_str)
yaml_out = {}
for executable, reslist_in in yaml_in.items():
yaml_out[executable] = []
for resdict_in in reslist_in:
resdict_out = {}
for k_in, v_in in resdict_in.items():
k_out, v_out = k_in, v_in
if k_in == 'type' and v_in in ['github-dir', 'tarball']:
if v_in == 'github-dir':
v_out = 'directory'
elif v_in == 'tarball':
v_out = 'directory'
resdict_out[k_out] = v_out
yaml_out[executable].append(resdict_out)
resmgr.user_list.write_text(RESOURCE_USER_LIST_COMMENT +
'\n# migrated with ocrd resmgr migrate {migration}\n' +
safe_dump(yaml_out))
log.info(f'Applied migration {migration} to {resmgr.user_list}')

  • update documentation (spec, guides, wiki)

already tracked by OCR-D/spec#193

Not sure about the guides and Wiki yet. I did find the section about resmgr download syntax to be out of date though.

@bertsky bertsky closed this as completed Nov 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants