QuiVer Benchmarks

QuiVer Benchmarks is a tool that helps you decide which OCR-D workflows are most suitable for your data. It executes preset workflows on different kinds of Ground Truth and evaluates the results. The results for the most recent version of ocrd_all can be viewed at https://ocr-d.de/quiver-frontend.

This repository holds everything needed to automatically execute different OCR-D workflows on images and evaluate the outcomes. It creates benchmarks for OCR-D data in a containerized environment. QuiVer Benchmarks currently runs in an automated workflow (CI/CD).

QuiVer Benchmarks is based on ocrd/all:maximum and has all OCR-D processors at hand that a workflow might use.

Requirements

To speed up QuiVer Benchmarks, you can mount text recognition models that you have already downloaded to /usr/local/share/ocrd-resources/ in docker-compose.yml by adding

- path/to/your/models:/usr/local/share/ocrd-resources/

to the volumes section. Otherwise, the tool will download all ocrd-tesserocr-recognize models as well as the qurator-gt4histocr-1.0 model for ocrd-calamari-recognize on each run.

Usage (For Development)

  • clone this repository and switch to the cloned directory
  • build the image with make build
  • spin up a container with make start
  • run make prepare-default-gt
  • run make run
  • the benchmarks and the evaluation results will be available at data/workflows.json on your host system
  • when finished, run make stop to shut down and remove the Docker container you created previously
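
Once make run has finished, the result file can be inspected from the host. The following is a minimal sketch in Python; the internal structure of data/workflows.json is not described here, so the sketch only loads the file and reports its top-level shape. The function names load_results and describe are hypothetical helpers, not part of QuiVer Benchmarks.

```python
import json
from pathlib import Path


def load_results(path="data/workflows.json"):
    """Load the benchmark results written by `make run` (path as stated above)."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def describe(results):
    """Report only the top-level shape; the schema itself is not assumed."""
    kind = type(results).__name__
    size = len(results) if isinstance(results, (list, dict)) else 1
    return f"{kind} with {size} top-level entries"


if __name__ == "__main__":
    print(describe(load_results()))
```

Run the script from the repository root so that the relative path resolves.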

Benchmarks Considered

The relevant benchmarks gathered by QuiVer Benchmarks are defined in OCR-D's Quality Assurance specification and comprise

  • CER (per page and document wide), including
    • median
    • minimum and maximum CER
    • standard deviation
  • WER (per page and document wide)
  • CPU time
  • wall time
  • processed pages per minute
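
To illustrate what these metrics measure, here is a simplified sketch in Python. It assumes CER and WER are computed as the Levenshtein distance divided by the length of the Ground Truth (in characters or whitespace-separated words), and it uses the population standard deviation. The authoritative definitions, including any text normalization, are those in OCR-D's Quality Assurance specification; all function names below are hypothetical.

```python
from statistics import median, pstdev


def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate relative to the Ground Truth length."""
    if not ref:
        return float(len(hyp) > 0)
    return levenshtein(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate on whitespace-separated tokens (an assumption)."""
    ref_words, hyp_words = ref.split(), hyp.split()
    if not ref_words:
        return float(len(hyp_words) > 0)
    return levenshtein(ref_words, hyp_words) / len(ref_words)


def cer_summary(pages):
    """Per-page CER statistics; pages is a list of (ground_truth, ocr_text)."""
    rates = [cer(gt, ocr) for gt, ocr in pages]
    return {"median": median(rates), "min": min(rates),
            "max": max(rates), "stdev": pstdev(rates)}


def pages_per_minute(n_pages, wall_seconds):
    """Throughput derived from the wall time."""
    return n_pages / (wall_seconds / 60)
```

For example, cer("abcd", "abed") yields 0.25 because one of four characters was substituted.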

Ground Truth Used

QuiVer Benchmarks currently uses the following Ground Truth:

A detailed list of images used for the Reichsanzeiger GT sets can be found in the data_src directory.

Adding New OCR-D Workflows (For Development)

Add new OCR-D workflows to the directory workflows/ocrd_workflows according to the following conventions:

  • workflows have to be TXT files
  • file names of OCR workflows have to end with _ocr.txt, those of evaluation workflows with _eval.txt
  • all workflows have to use ocrd process
  • after the container has started, the files are converted to Nextflow files by OtoN
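
The conventions above can be checked before rebuilding the image. The following Python sketch is one possible pre-flight check; check_workflow is a hypothetical helper, and testing for the literal string "ocrd process" anywhere in the file is a simplifying assumption about how a workflow invokes it.

```python
from pathlib import Path

VALID_SUFFIXES = ("_ocr.txt", "_eval.txt")


def check_workflow(path):
    """Return a list of convention violations for one workflow file."""
    path = Path(path)
    problems = []
    if not path.name.endswith(VALID_SUFFIXES):
        problems.append("file name must end with _ocr.txt or _eval.txt")
    # Missing files are treated as empty so that the name check still runs.
    text = path.read_text(encoding="utf-8") if path.is_file() else ""
    if "ocrd process" not in text:
        problems.append("workflow does not use 'ocrd process'")
    return problems


if __name__ == "__main__":
    for wf in sorted(Path("workflows/ocrd_workflows").glob("*")):
        for problem in check_workflow(wf):
            print(f"{wf.name}: {problem}")
```

An empty list means the file satisfies both checks; the script prints nothing when every workflow passes.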

You can then either rebuild the Docker image via docker compose build or mount the directory to the container via

- ./workflows/ocrd_workflows:/app/workflows/ocrd_workflows

in the volumes section and spin up a new run with docker compose up.

Removing OCR-D Workflows

Delete the respective TXT files from workflows/ocrd_workflows and either rebuild the image or mount the directory as a volume as described above.

Outlook

  • enable users to use their own Ground Truth and workflows

License

See LICENSE
