Pretrained models for the ANN-based post-correction module.
All models use `cor-asv-ann-train --width 512 --depth 2` and were initialised with weights from a language model via `--init-model`, then pretrained on 200k lines of clean text (input=output) from DTA, and then retrained via `--load-model` on GT4HistOCR and OCR-D GT, processed by various OCR models (input=OCR with confidence, output=GT). The latter step is allowed to change all weights (i.e. not just fine-tuning) and does not reset the encoder layer weights (i.e. `--reset-encoder` was not given).
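For illustration, the two training stages might be invoked as follows. This is a minimal sketch: only `--width`, `--depth`, `--init-model`, `--load-model` and `--reset-encoder` are actual options named above; `--save-model` and all file arguments are hypothetical placeholders (consult `cor-asv-ann-train --help` for the real interface):

```sh
# Stage 1: pretrain on clean DTA text (input = output), initialising
# weights from a language model.
# NOTE: --save-model and the data arguments are hypothetical placeholders.
cor-asv-ann-train --width 512 --depth 2 \
    --init-model lm.h5 --save-model pretrained.h5 dta-clean/*.txt

# Stage 2: retrain on OCR/GT line pairs (input = OCR with confidence,
# output = GT). All weights may change, and --reset-encoder is NOT
# passed, so the pretrained encoder weights are kept.
cor-asv-ann-train --width 512 --depth 2 \
    --load-model pretrained.h5 --save-model dta19.Fraktur4.h5 \
    gt4histocr-dta19/*.Fraktur4.txt
```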
- `dta19.Fraktur4`: on 19th century Fraktur texts (GT4HistOCR/corpus/dta19) for the Tesseract 4 model `script/Fraktur`
- `pre19.Fraktur4`: on 15-18th century blackletter texts (GT4HistOCR others, OCR-D) for the Tesseract 4 model `script/Fraktur`
- `pre19.deu-frak3`: on 15-18th century blackletter texts (GT4HistOCR others, OCR-D) for the Tesseract 3 model `deu-frak`
- `pre19.Latin4`: on 15-18th century blackletter texts (GT4HistOCR others, OCR-D) for the Tesseract 4 model `script/Latin`
- `pre19.deu4`: on 15-18th century blackletter texts (GT4HistOCR others, OCR-D) for the Tesseract 4 model `deu`
- `pre19.incunabula`: on 15-18th century blackletter texts (GT4HistOCR others, OCR-D) for the Ocropus 1 model `incunabula.pyrnn` included in GT4HistOCR
- `pre19.latinhist`: on 15-18th century blackletter texts (GT4HistOCR others, OCR-D) for the Ocropus 1 model `latinhist.pyrnn` included in GT4HistOCR
- `gt4histocr.s-ſ`: on GT4HistOCR ground truth degraded by replacing `ſ` with `s` in the input (encouraging the network to learn its reconstruction)
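The degradation for `gt4histocr.s-ſ` amounts to a global substitution on the input side only, e.g. (file layout hypothetical):

```sh
# Replace every long s (ſ) with round s in the network's input copy,
# keeping the GT target text untouched:
sed 's/ſ/s/g' gt/lines.txt > input/lines.txt
```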
The above models were run against OCR results from various models on combinations of the above datasets:
- `GT4HistOCR/dta19` signifies the `dta19` subcorpus of GT4HistOCR
- `GT4HistOCR/!dta19` signifies the `Kallimachos` and `Incunabula` subcorpora (i.e. everything but the `dta19` subset) of GT4HistOCR
- `OCR-D<19` signifies the 15-18th century bags of the OCR-D ground truth repository (i.e. `ocrd_data_structur_text_*_1[4567]*`; see the glob sketch below)
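For example, assuming a local checkout of the ground truth repository, those bags can be selected with a shell glob (the `1[4567]` part matches years 14xx-17xx, i.e. the 15th-18th centuries; the repository path is a hypothetical placeholder):

```sh
# List the 15-18th century bags of the OCR-D GT repository:
ls -d ocrd-gt/ocrd_data_structur_text_*_1[4567]*
```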
In each case, evaluation was done on a validation subset (2-10%) not seen during training (its size is given in the `#lines` column below).
The OCR model names are the same as for post-correction training (see above):
- `Fraktur4` is the Tesseract 4 model `script/Fraktur`
- `Latin4` is the Tesseract 4 model `script/Latin`
- `deu4` is the Tesseract 4 model `deu`
- `frk4` is the Tesseract 4 model `frk`
- `deu+frk4` is Tesseract with the combined `deu+frk` models (see the example after this list)
- `deu-frak3` is the Tesseract 3 model `deu-frak`
- `ocrofraktur` is the Ocropus Fraktur model
- `ocrofraktur-jze` is the Ocropus Fraktur model by jze
- `incunabula` is the Ocropus model by that name included in GT4HistOCR
- `latinhist` is the Ocropus model by that name included in GT4HistOCR
- `GT4HistOCR` is the Tesseract model trained on GT4HistOCR (for which unfortunately no post-correction model could be trained during the project runtime, because it came too late)
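For instance, `deu+frk4` uses Tesseract's standard multi-language syntax to run both models combined at recognition time (image and output base name are placeholders):

```sh
# Recognise with the deu and frk models combined:
tesseract line.png line -l deu+frk
```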
The rejection threshold (RT) has been set to various levels. Note the asymmetry: 0.0 disables rejection completely (i.e. the input hypothesis, if identifiable at all, keeps its predicted score), but 1.0 does not disable correction completely (because the input hypothesis might not be found if the alignment is too bad). We usually set it to 0.9 for out-domain and 0.2 for in-domain tasks, but vary this where other settings yield better results.
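With the OCR-D wrapper, this could look as follows. A sketch only, assuming a recent OCR-D core with `-P` parameter overrides and that the processor exposes the threshold as `rejection_threshold`; file group names are placeholders:

```sh
# Out-domain: high threshold, i.e. mostly keep the OCR input:
ocrd-cor-asv-ann-process -I OCR-D-OCR -O OCR-D-COR -P rejection_threshold 0.9
# In-domain: low threshold, i.e. allow more corrections:
ocrd-cor-asv-ann-process -I OCR-D-OCR -O OCR-D-COR -P rejection_threshold 0.2
```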
Inductive-deductive post-correction expects to see data during inference that is similar to both the dataset and the OCR model used during training. When this (narrow) condition is met, we speak of the in-domain case; otherwise, of the out-domain case.
Quality is measured by aligning output and GT lines, computing the (unweighted) Levenshtein distance along the best path, and dividing it by the length of that path (not by the length of the GT sequence). Rates are aggregated across lines and files as a micro-average. Character here does not mean Unicode codepoint but glyph (i.e. combining characters are applied before any edit operations); this corresponds to the `metric=Levenshtein` parameter of `cor-asv-ann-eval` and `ocrd-cor-asv-ann-evaluate`.
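For example, if one line aligns GT "Teſt" against output "Test" (distance 1 over a path of length 4) and another aligns GT "ab" against output "abc" (distance 1 over a path of length 3, since the insertion extends the path), the micro-averaged CER is (1+1)/(4+3) ≈ 28.6%: distances and path lengths are summed before dividing, rather than averaging the per-line rates.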
COR model | OCR model | dataset | #lines | CER OCR [%] | CER COR [%] | RT | comment |
---|---|---|---|---|---|---|---|
dta19.Fraktur4 | Fraktur4 | GT4HistOCR/dta19 | 19302 | 7.5 | 4.7 | 0.5 | in-domain |
dta19.Fraktur4 | Fraktur4 | GT4HistOCR/dta19 | 4907 | 6.2 | 3.7 | 0.2 | in-domain |
dta19.Fraktur4 | Fraktur4 | GT4HistOCR/dta19 | 4907 | 6.2 | 3.4 | 0.5 | in-domain |
dta19.Fraktur4 | deu-frak3 | GT4HistOCR/dta19 | 4907 | 9.0 | 8.9 | 0.9 | out-domain |
dta19.Fraktur4 | deu-frak3 | GT4HistOCR/dta19 | 4907 | 9.0 | 8.4 | 0.5 | out-domain |
dta19.Fraktur4 | frk4 | GT4HistOCR/dta19 | 4907 | 6.5 | 5.6 | 0.9 | out-domain |
dta19.Fraktur4 | frk4 | GT4HistOCR/dta19 | 4907 | 6.5 | 4.3 | 0.5 | out-domain |
dta19.Fraktur4 | deu+frk4 | GT4HistOCR/dta19 | 4907 | 6.1 | 5.2 | 0.9 | out-domain |
dta19.Fraktur4 | deu+frk4 | GT4HistOCR/dta19 | 4907 | 6.1 | 4.3 | 0.5 | out-domain |
dta19.Fraktur4 | ocrofraktur | GT4HistOCR/dta19 | 4907 | 10.4 | 9.2 | 0.5 | out-domain |
dta19.Fraktur4 | ocrofraktur-jze | GT4HistOCR/dta19 | 4907 | 7.4 | 6.6 | 0.5 | out-domain |
dta19.Fraktur4 | GT4HistOCR | GT4HistOCR/dta19 | 4907 | 0.5 | 0.7 | 0.9 | out-domain |
dta19.Fraktur4 | GT4HistOCR | GT4HistOCR/dta19 | 4907 | 0.5 | 1.0 | 0.5 | out-domain |
dta19.Fraktur4 | Fraktur4 | GT4HistOCR/!dta19 + OCR-D<19 | 6071 | 15.7 | 15.8 | 0.9 | out-domain |
pre19.Fraktur4 | Fraktur4 | GT4HistOCR/!dta19 + OCR-D<19 | 6071 | 15.7 | 15.1 | 0.2 | in-domain |
pre19.Fraktur4 | Fraktur4 | GT4HistOCR/!dta19 | 5844 | 15.4 | 14.9 | 0.2 | in-domain |
pre19.Fraktur4 | Fraktur4 | OCR-D<19 | 227 | 22.9 | 21.1 | 0.2 | in-domain |
pre19.Fraktur4 | Fraktur4 | GT4HistOCR/dta19 | 4907 | 6.2 | 6.5 | 0.9 | out-domain |
pre19.Fraktur4 | deu-frak3 | GT4HistOCR/!dta19 + OCR-D<19 | 6071 | 27.2 | 27.3 | 0.9 | out-domain |
pre19.deu-frak3 | deu-frak3 | GT4HistOCR/!dta19 + OCR-D<19 | 6071 | 27.2 | 22.2 | 0.2 | in-domain |
pre19.deu-frak3 | deu-frak3 | GT4HistOCR/dta19 | 4907 | 9.0 | 9.2 | 0.9 | out-domain |
pre19.deu-frak3 | Fraktur4 | GT4HistOCR/!dta19 + OCR-D<19 | 6071 | 15.7 | 16.1 | 0.9 | out-domain |
pre19.incunabula | incunabula | GT4HistOCR/!dta19 + OCR-D<19 | 6071 | 3.8 | 7.7 | 0.2 | in-domain |
pre19.incunabula | incunabula | GT4HistOCR/!dta19 + OCR-D<19 | 6071 | 3.8 | 4.1 | 0.9 | in-domain |
pre19.incunabula | incunabula | GT4HistOCR/!dta19 | 5844 | 1.8 | 2.0 | 0.9 | in-domain |
pre19.latinhist | latinhist | GT4HistOCR/!dta19 + OCR-D<19 | 6071 | 34.2 | 26.8 | 0.2 | in-domain |
gt4histocr.s-ſ | GT with s/ſ/s/g | GT4HistOCR | 56591 | 2.7 | 0.3 | 0.9 | in-domain |
gt4histocr.s-ſ | deu-frak3 | GT4HistOCR | 10978 | 19.1 | 16.7 | 0.9 | out-domain |
As can be seen, performance on out-domain data is very poor (despite adapting the rejection threshold). Rejection is of limited use here, because the model can fail to find the rejection hypothesis when the alignment is already too bad, and of course the model is blind to its own errors.
Moreover, the in-domain results show that the text can only be reconstructed up to a certain extent (i.e. if the OCR results are bad, then the post-correction results will be bad as well, and vice versa; the improvement is usually "absolute", not "relative"). Also, the `pre19` models are not nearly as good as the `dta19` models (on their respective tasks).
This remains to be investigated further, but it seems plausible that the key reason is the quality and quantity of the training data. During the runtime of the project (2018-2019), there was:
- too little textual ground truth (and not representative), especially in the `pre19` category,
- no good baseline OCR model freely available (as there is now; cf. the current Tesseract and Calamari models),
- no OCR engine that could (correctly) share alternative hypotheses (besides `deu-frak3`).
The latter point could also be overcome by adding multi-OCR alignment. Fully utilising OCR hypotheses (from multiple engines or a single one) would also open the prospect of learning large generic models.