Ground truth for the German newspaper "Deutscher Reichsanzeiger und Preußischer Staatsanzeiger" (German Imperial Gazette and Prussian Official Gazette), which was published under changing names from 1819 to 1945 (https://digi.bib.uni-mannheim.de/periodika/reichsanzeiger/ausgaben).
The ground truth is provided as PAGE-XML files together with URLs for the corresponding newspaper scans. To download the images, run the provided Bash script:
./download_images.sh
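Once the PAGE-XML files are at hand, the line transcriptions can be read with the Python standard library. The following is a minimal sketch, not part of this repository: it assumes that each `TextLine` element carries its transcription in a direct `TextEquiv/Unicode` child, as is usual for PAGE-XML. The function name `extract_lines` and the file name `page.xml` are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_lines(root):
    """Return the transcription of every TextLine below a PAGE-XML root element."""
    # The root tag looks like "{namespace}PcGts"; reuse whatever namespace the
    # file declares instead of hard-coding a particular PAGE schema version.
    prefix = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    lines = []
    for line in root.iter(f"{prefix}TextLine"):
        # Only the TextEquiv that is a direct child of the line is read,
        # so word-level transcriptions are not returned twice.
        unicode_el = line.find(f"{prefix}TextEquiv/{prefix}Unicode")
        if unicode_el is not None and unicode_el.text:
            lines.append(unicode_el.text)
    return lines


if __name__ == "__main__":
    # "page.xml" is a placeholder for one of the ground truth files.
    for text in extract_lines(ET.parse("page.xml").getroot()):
        print(text)
```

Reading the namespace from the root element keeps the sketch independent of the PAGE schema version used in the export.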
- 197 single newspaper pages
- 119 429 ground truth lines
- Period covered: 1820–1939
- Scripts: Fraktur, Latin
- Languages: German, English, French, Portuguese, Italian, Latin
All transcriptions were created using Transkribus. The transcription rules are based on the OCR-D transcription guidelines Level 2 with some exceptions (see below):
Special characters:
- Long s (ſ)
- Currency symbols: German Mark (ℳ) and Pfennig (₰), $, £
- Fractions (¼ ½ ¾ ⅐ ⅑ ⅒ ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞)
- Fraction slash (⁄, U+2044), used if
  - the fraction cannot be transcribed with a single Unicode fraction character
  - numerator and denominator are not on the same baseline height
- R rotunda (ꝛ)
- Combining Latin Small Letter E for old German Umlaut ( ͤ )
- Dagger (†)
- Black Right Pointing Index (☛)
- Black Left Pointing Index (☚)
- White square (□)
- Superscript Numbers 0-9 (⁰¹²³⁴⁵⁶⁷⁸⁹)
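Several of the characters above are easy to confuse with look-alikes, so it can help to check their code points. The following is a small sketch using Python's `unicodedata` module; the list `SPECIAL_CHARACTERS` and the helper `describe` are hypothetical names and cover only a selection of the characters listed above.

```python
import unicodedata

# A selection of the characters listed above, written as escapes so that the
# combining letter and the fraction slash stay visible in the source.
SPECIAL_CHARACTERS = [
    "\u017F",  # long s
    "\u2133",  # Mark sign
    "\u20B0",  # Pfennig sign
    "\u2044",  # fraction slash
    "\uA75B",  # r rotunda
    "\u0364",  # combining small letter e
    "\u2020",  # dagger
    "\u261B",  # black right pointing index
    "\u261A",  # black left pointing index
    "\u25A1",  # white square
]


def describe(char: str) -> str:
    """Return the code point and Unicode name, e.g. 'U+2044 FRACTION SLASH'."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


if __name__ == "__main__":
    for char in SPECIAL_CHARACTERS:
        print(describe(char))
```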
Normalizations:
- Roman numerals Ⅰ Ⅴ Ⅹ Ⅼ Ⅽ Ⅾ Ⅿ --> I V X L C D M
- Em dash (—) instead of En dash (–)
- Asterisk (*) used for both standard asterisk (*) and tear-drop asterisk (✽)
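The normalizations above can be sketched as a character translation table, for example to bring other transcriptions in line with this ground truth. This is an illustration under assumptions, not the tooling used to create the data: the names `NORMALIZATION_TABLE` and `normalize_line` are hypothetical, the dash rule is read as "replace en dash by em dash", and only the seven Roman numeral characters listed above are handled.

```python
# Hypothetical mapping that mirrors the normalization list above.
NORMALIZATION_TABLE = str.maketrans({
    "\u2160": "I",       # Roman numeral one
    "\u2164": "V",       # Roman numeral five
    "\u2169": "X",       # Roman numeral ten
    "\u216C": "L",       # Roman numeral fifty
    "\u216D": "C",       # Roman numeral one hundred
    "\u216E": "D",       # Roman numeral five hundred
    "\u216F": "M",       # Roman numeral one thousand
    "\u2013": "\u2014",  # en dash becomes em dash (assumed reading of the rule)
    "\u273D": "*",       # tear-drop asterisk becomes standard asterisk
})


def normalize_line(text: str) -> str:
    """Apply the character-level normalizations to one transcribed line."""
    return text.translate(NORMALIZATION_TABLE)
```

For example, `normalize_line("\u2169\u2164 \u2013 \u273D")` returns `"XV \u2014 *"`. Characters that are not in the table, such as the double oblique hyphen, pass through unchanged.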
Additional characters transcribed true to original (contrary to OCR-D Level 2):
- Double oblique hyphen (⸗)
This revision is predominantly funded by the German Research Foundation (DFG).
- A digital edition of the Reichsanzeiger is provided by Mannheim University Library.
- See the Reichsanzeiger wiki for more information.