adapt to decentralized resources #801

bertsky · 2022-02-14T13:08:10Z

To implement OCR-D/spec#181, AFAICS we need:

code to mix the Processor.ocrd_tool['resources'] with the preinstalled centralized database in OcrdResourceManager's constructor and .load_resource_list
code to adhere to Processor.ocrd_tool['resource_locations'] in .resolve_resource and .list_all_resources
changes to the ocrd_validators.resource_list_validator, specifically its resource_list.schema.yml (subtypes of type)
changes to the ocrd_validators.ocrd_tool_validator, specifically its ocrd_tool.schema.yml by updating repo/spec

Did I forget anything?

EDIT

aid in migrating user ocrd/resources.yml from old syntax to new one
update documentation (spec, guides, wiki)

The text was updated successfully, but these errors were encountered:

kba · 2022-02-14T13:25:05Z

Looks fairly complete for changes in core, thank you. I'll extend #800 for the latter three points, the first one is already in place (ocrd resmgr discover).

bertsky · 2022-02-14T13:33:51Z

the first one is already in place (ocrd resmgr discover).

That's not quite what I meant though. I would expect the new OcrdResourceManager.discover to be called by OcrdResourceManager.__init__ (right before merging with the user database), so there would be no install-time need for things like ocrd resmgr discover.

kba · 2022-02-14T13:53:54Z

That's not quite what I meant though. I would expect the new OcrdResourceManager.discover to be called by OcrdResourceManager.__init__ (right before merging with the user database), so there would be no install-time need for things like ocrd resmgr discover.

Doing this every time the OcrdResourceManager is instantiated is a huge performance penalty. E.g. I currently have about 70 processors installed. A lot of them don't separate __init__ and setup and it takes longer than a second to get the ocrd-tool.json, so this takes at least two minutes to run --dump-json on all the processors.

bertsky · 2022-02-14T14:05:33Z

Doing this every time the OcrdResourceManager is instantiated is a huge performance penalty. E.g. I currently have about 70 processors installed. A lot of them don't separate __init__ and setup and it takes longer than a second to get the ocrd-tool.json, so this takes at least two minutes to run --dump-json on all the processors.

Good point. But the distinction between constructor and setup for processing will become the norm (and if done right, could also be used to keep heavy-toll dependencies like Tensorflow imports out of the non-processing contexts). And in the context of a processor (i.e. outside of ocrd.cli.resmgr), at least for the Pythonic ones, there is no need for a --dump-json, as one can directly pick it up from the pkg_resources.

And even in the resmgr CLI, 1s delay would still be tolerable, so it does not hurt if you query a single processor.

So it is only ocrd resmgr list-available * case where things become unacceptablely slow. But is that worth sacrificing no extra post-installation step necessary?

kba · 2022-02-14T15:10:36Z

And even in the resmgr CLI, 1s delay would still be tolerable, so it does not hurt if you query a single processor.

I've tested it and including all the spurious results like ocrd-make, ocrd-import etc., ocrd resmgr discover takes at most 90 seconds for a fairly complete ocrd_all installation. Since we can model the code in such a way that it is conservative with what is searched.

So it is only ocrd resmgr list-available * case where things become unacceptablely slow. But is that worth sacrificing no extra post-installation step necessary?

If we do it completely dynamically then yes, that would be the only case. And we save ourselves future headaches, if we don't try to serialize the resource descriptions to resource_list.yml (deduplication, updated processors etc. it's likely a nightmare).

bertsky · 2022-08-17T13:54:11Z

I took the liberty of checking tasks that have been solved by now above.

code to mix the Processor.ocrd_tool['resources'] with the preinstalled centralized database in OcrdResourceManager's constructor and .load_resource_list

seems complete:

core/ocrd/ocrd/resource_manager.py

Lines 105 to 115 in ceb9992

    
           if dynamic: 
        
               for exec_dir in environ['PATH'].split(':'): 
        
                   for exec_path in Path(exec_dir).glob(f'{executable}'): 
        
                       self.log.info(f"Inspecting '{exec_path} --dump-json' for resources") 
        
                       ocrd_tool = get_ocrd_tool_json(exec_path) 
        
                       for resdict in ocrd_tool.get('resources', ()): 
        
                           if exec_path.name not in self.database: 
        
                               self.database[exec_path.name] = [] 
        
                           for res_remove in (res for res in self.database.get(executable, []) if res['name'] == resdict['name']): 
        
                               self.database.get(executable).remove(res_remove) 
        
                           self.database[exec_path.name].append(resdict)

code to adhere to Processor.ocrd_tool['resource_locations'] in .resolve_resource and .list_all_resources

resolve_resource / list_resource_candidates does not adhere to resource_locations yet:

core/ocrd_utils/ocrd_utils/os.py

Lines 97 to 107 in ceb9992

    
           candidates = [] 
        
           candidates.append(join(cwd, fname)) 
        
           xdg_data_home = XDG_DATA_HOME if not xdg_data_home else xdg_data_home 
        
           processor_path_var = '%s_PATH' % executable.replace('-', '_').upper() 
        
           if processor_path_var in environ: 
        
               candidates += [join(x, fname) for x in environ[processor_path_var].split(':')] 
        
           candidates.append(join(xdg_data_home, 'ocrd-resources', executable, fname)) 
        
           candidates.append(join('/usr/local/share/ocrd-resources', executable, fname)) 
        
           if moduled: 
        
               candidates.append(join(moduled, fname)) 
        
           return candidates

Also, as discussed in ocrd_tesserocr, we still need to handle resource location module correctly.

list_all_resources seems complete though:

core/ocrd_utils/ocrd_utils/os.py

Lines 115 to 120 in ceb9992

    
           try: 
        
               resource_locations = get_ocrd_tool_json(executable)['resource_locations'] 
        
           except FileNotFoundError: 
        
               # processor we're looking for ressource_locations of is not installed. 
        
               # Assume the default 
        
               resource_locations = ['data', 'cwd', 'system', 'module']

changes to the ocrd_validators.resource_list_validator, specifically its resource_list.schema.yml (subtypes of type)

seems complete:

core/ocrd_validators/ocrd_validators/constants.py

Lines 22 to 28 in ceb9992

    
           RESOURCE_LIST_SCHEMA = { 
        
               'type': 'object', 
        
               'additionalProperties': False, 
        
               'patternProperties': { 
        
                   '^ocrd-.*': OCRD_TOOL_SCHEMA['properties']['tools']['patternProperties']['ocrd-.*']['properties']['resources'] 
        
               } 
        
           }

and https://github.com/OCR-D/core/blob/master/ocrd_validators/ocrd_validators/resource_list_validator.py and

core/ocrd/ocrd/resource_manager.py

Line 88 in 6196e3b

report = OcrdResourceListValidator.validate(list_loaded)

changes to the ocrd_validators.ocrd_tool_validator, specifically its ocrd_tool.schema.yml by updating repo/spec

seems complete:

core/ocrd_validators/ocrd_validators/ocrd_tool.schema.yml

Lines 168 to 209 in ceb9992

    
           resources: 
        
             type: array 
        
             description: Resources for this processor 
        
             items: 
        
               type: object 
        
               additionalProperties: false 
        
               required: 
        
                 - url 
        
                 - description 
        
                 - name 
        
                 - size 
        
               properties: 
        
                 url: 
        
                   type: string 
        
                   description: URLs of all components of this resource 
        
                 description: 
        
                   type: string 
        
                   description: A description of the resource 
        
                 name: 
        
                   type: string 
        
                   description: Name to store the resource as 
        
                 type: 
        
                   type: string 
        
                   enum: ['file', 'directory', 'archive'] 
        
                   default: file 
        
                   description: Type of the URL 
        
                 parameter_usage: 
        
                   type: string 
        
                   description: Defines how the parameter is to be used 
        
                   enum: ['as-is', 'without-extension'] 
        
                   default: 'as-is' 
        
                 path_in_archive: 
        
                   type: string 
        
                   description: if type is archive, the resource is at this location in the archive 
        
                   default: '.' 
        
                 version_range: 
        
                   type: string 
        
                   description: Range of supported versions, syntax like in PEP 440 
        
                   default: '>= 0.0.1' 
        
                 size: 
        
                   type: number 
        
                   description: Size of the resource in bytes

aid in migrating user ocrd/resources.yml from old syntax to new one

seems complete:

core/ocrd/ocrd/cli/resmgr.py

Lines 161 to 195 in 4148a88

    
           @resmgr_cli.command('migrate') 
        
           @click.argument('migration', type=click.Choice(['2.37.0'])) 
        
           def migrate(migration): 
        
               """ 
        
               Update the configuration after updating core to MIGRATION 
        
               """ 
        
               resmgr = OcrdResourceManager(skip_init=True) 
        
               log = getLogger('ocrd.resmgr.migrate') 
        
               if not resmgr.user_list.exists(): 
        
                   log.info(f'No configuration file found at {resmgr.user_list}, nothing to do') 
        
               if migration == '2.37.0': 
        
                   backup_file = resmgr.user_list.with_suffix(f'.yml.before-{migration}') 
        
                   yaml_in_str = resmgr.user_list.read_text() 
        
                   log.info(f'Backing {resmgr.user_list} to {backup_file}') 
        
                   backup_file.write_text(yaml_in_str) 
        
                   log.info(f'Applying migration {migration} to {resmgr.user_list}') 
        
                   yaml_in = safe_load(yaml_in_str) 
        
                   yaml_out = {} 
        
                   for executable, reslist_in in yaml_in.items(): 
        
                       yaml_out[executable] = [] 
        
                       for resdict_in in reslist_in: 
        
                           resdict_out = {} 
        
                           for k_in, v_in in resdict_in.items(): 
        
                               k_out, v_out = k_in, v_in 
        
                               if k_in == 'type' and v_in in ['github-dir', 'tarball']: 
        
                                   if v_in == 'github-dir': 
        
                                       v_out = 'directory' 
        
                                   elif v_in == 'tarball': 
        
                                       v_out = 'directory' 
        
                               resdict_out[k_out] = v_out 
        
                           yaml_out[executable].append(resdict_out) 
        
                   resmgr.user_list.write_text(RESOURCE_USER_LIST_COMMENT + 
        
                           '\n# migrated with ocrd resmgr migrate {migration}\n' + 
        
                           safe_dump(yaml_out)) 
        
                   log.info(f'Applied migration {migration} to {resmgr.user_list}')

update documentation (spec, guides, wiki)

already tracked by OCR-D/spec#193

Not sure about the guides and Wiki yet. I did find the section about resmgr download syntax to be out of date though.

bertsky closed this as completed Nov 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

adapt to decentralized resources #801

adapt to decentralized resources #801

bertsky commented Feb 14, 2022 •

edited

Loading

kba commented Feb 14, 2022

bertsky commented Feb 14, 2022

kba commented Feb 14, 2022

bertsky commented Feb 14, 2022 •

edited

Loading

kba commented Feb 14, 2022

bertsky commented Aug 17, 2022

adapt to decentralized resources #801

adapt to decentralized resources #801

Comments

bertsky commented Feb 14, 2022 • edited Loading

kba commented Feb 14, 2022

bertsky commented Feb 14, 2022

kba commented Feb 14, 2022

bertsky commented Feb 14, 2022 • edited Loading

kba commented Feb 14, 2022

bertsky commented Aug 17, 2022

bertsky commented Feb 14, 2022 •

edited

Loading

bertsky commented Feb 14, 2022 •

edited

Loading