A collection of ground truth for training handwritten text recognition (HTR) models.
This collection draws on several manuscript editions from the digital humanities and provides their edited texts (transcriptions) as ground truth for training HTR models.
All ground truth is provided as PAGE XML. All transcriptions are based on the OCR-D transcription guidelines Level 2.
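Since every page is delivered as PAGE XML, the line transcriptions can be pulled out with the Python standard library. The following is a minimal sketch, not part of this repository: the function names `text_lines` and `read_page` are hypothetical, and it assumes Python 3.8 or later (for the `{*}` namespace wildcard) and that each `TextLine` carries its transcription in a direct `TextEquiv/Unicode` child. The wildcard avoids hard-coding a PAGE schema version, which is not stated here.

```python
import xml.etree.ElementTree as ET


def text_lines(root):
    """Return (line id, transcription) pairs for every TextLine below root."""
    lines = []
    # "{*}" matches any namespace, so the PAGE schema version does not matter.
    for line in root.findall(".//{*}TextLine"):
        unicode_el = line.find("{*}TextEquiv/{*}Unicode")
        # Lines without a transcription are kept with an empty string.
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        lines.append((line.get("id"), text))
    return lines


def read_page(path):
    """Parse one PAGE XML file and return its line transcriptions."""
    return text_lines(ET.parse(path).getroot())
```

Calling `read_page` on one of the PAGE XML files would yield one tuple per text line, in document order.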
See sections below for individual data set descriptions.
Folder | Source | Pages | Lines | License |
---|---|---|---|---|
gsa_389889 | faustedition | 8 | 230 | CC BY-NC-SA 4.0 |
gsa_390028 | faustedition | 94 | 2493 | CC BY-NC-SA 4.0 |
gsa_390825 | faustedition | 30 | 743 | CC BY-NC-SA 4.0 |
gsa_391098 | faustedition | 414 | 10178 | CC BY-NC-SA 4.0 |
gsa_391511 | faustedition | 6 | 168 | CC BY-NC-SA 4.0 |
gsa_391347 | faustedition | 35 | 955 | CC BY-NC-SA 4.0 |
gsa_391247 | faustedition | 68 | 1698 | CC BY-NC-SA 4.0 |
Total | | 671 | 16816 | |
Download the images using the bash script `download_imgs.sh` in each data set folder.
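To fetch the images for several data sets in one go, the per-folder scripts can be driven from a small wrapper. This is only a sketch under assumptions: `download_all` and `find_download_scripts` are hypothetical helpers, and it is assumed that `download_imgs.sh` takes no arguments and is meant to be run from inside its own folder. With `dry_run=True` (the default) nothing is executed.

```python
import subprocess
from pathlib import Path


def find_download_scripts(root):
    """Return every download_imgs.sh found below root, sorted by path."""
    return sorted(Path(root).rglob("download_imgs.sh"))


def download_all(root, dry_run=True):
    """Run each script from inside its own folder; only list them on a dry run."""
    scripts = find_download_scripts(root)
    for script in scripts:
        if dry_run:
            print(f"would run: bash {script.name} (cwd={script.parent})")
        else:
            # Assumption: the script needs no arguments and writes next to itself.
            subprocess.run(["bash", script.name], cwd=script.parent, check=True)
    return scripts
```

Running `download_all(".", dry_run=False)` from the repository root would then execute each script in turn and stop at the first one that fails.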
Transcription guidelines: The following normalisations of the edition were resolved in accordance with the OCR-D transcription guidelines Level 2:
- Round brackets: `(` and `)` (edition) → `/:` and `:/` (ground truth)
- Hyphens: `-` (edition) → `=` (ground truth)
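The two substitutions listed above can be illustrated as a character mapping. This is a sketch only: the function name `edition_to_ground_truth` is hypothetical, and applying the mapping to every occurrence is an assumption, since the actual conversion may have been done by hand or with regard to context.

```python
# Edition character -> ground-truth character(s), as listed above.
EDITION_TO_GT = str.maketrans({"(": "/:", ")": ":/", "-": "="})


def edition_to_ground_truth(text):
    """Replace the edition's normalised characters with their ground-truth forms."""
    return text.translate(EDITION_TO_GT)


print(edition_to_ground_truth("(Faust-Skizze)"))  # prints /:Faust=Skizze:/
```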
Folder | Source | Pages | Lines | License |
---|---|---|---|---|
A01 | Fontane Edition | 67 | 1046 | CC BY-NC-ND 4.0 |
C13 | Fontane Edition | 53 | 879 | CC BY-NC-ND 4.0 |
Total | | 120 | 1925 | |
Download the images using the bash script `download_imgs.sh` in each data set folder.
Transcription guidelines: The following normalisations of the edition were resolved in accordance with the OCR-D transcription guidelines Level 2:
- `Sammlung` (edition) → `Sam̄lung` (ground truth)
Folder | Source | Pages | Lines | License |
---|---|---|---|---|
GT_PAGE | Schlegel Briefe | 40 | 788 | CC BY-NC-SA 3.0 |
Total | | 40 | 788 | |
Download the images using the bash script `download_imgs.sh` in each data set folder.
Transcription guidelines: The following normalisations of the edition were resolved in accordance with the OCR-D transcription guidelines Level 2:
- Round `s` (edition) → long `ſ` (ground truth)