This corpus includes Ground Truth (GT) data compiled with the following features in mind:
- Classification into font groups: Gothic/Blackletter, Antiqua and FontMix (Antiqua and Blackletter)
  distinction by the print type used, or by combinations of print types
- Classification into simple and complex
  complexity of the layout (columns, footnotes, ...)
The data are also divided according to the time of creation or production.
The data were created according to the OCR-D Ground Truth Guideline (https://ocr-d.de/en/gt-guidelines/trans/).
Gothic/Blackletter
simple
Antiqua
FontMix (Antiqua and Blackletter)
The GT data has been labeled. The labeling is based on an ontology defined by the Pattern Recognition and Image Analysis Research Lab (PRImA-Research-Lab) at the University of Salford. The labeling metadata is created for each available page. The following labeling metadata is available for the different collections.
see: gt-labelling : semantic-labelling OCR ground truth data (https://github.com/OCR-D/gt-labelling)
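The label names in the lists that follow are slash-separated paths, so they can be grouped by their first path segment. This is a minimal Python sketch, assuming the labels are already available as plain strings. The function name `group_by_category` is hypothetical and not part of gt-labelling.

```python
from collections import defaultdict


def group_by_category(labels):
    """Group slash-separated label paths by their first path segment."""
    groups = defaultdict(list)
    for label in labels:
        # "condition/ageing/warping" -> top="condition", rest="ageing/warping"
        top, _, rest = label.partition("/")
        groups[top].append(rest)
    return dict(groups)


# A few label paths copied from the lists in this document.
labels = [
    "condition/ageing/warping",
    "condition/wear/medium-damage/stains",
    "content-encoding/structured",
    "granularity/physical/document-related/page",
]

grouped = group_by_category(labels)
for category, paths in grouped.items():
    print(category, paths)
```

Grouping by the first segment is only one possible view. The same split could be applied recursively to rebuild the full hierarchy.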
simple
-
activityDomain/computing/visual/analysisRecognition/layoutAnalysis
In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.
Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)
Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.
-
activityDomain/computing/visual/analysisRecognition/ocr
-
activityDomain/computing/visual/analysisRecognition/text
Translation of any kind of depicted symbols to a machine-readable format
Examples: OCR, mathematical equation recognition
Related: Text processing (separate category), table recognition, map reading
-
condition/acquisition/method-flaws/imaging/uneven-illumination
Uneven illumination leading to brightness or contrast variations
-
condition/production-related/document-characteristics/low-contrast
The contrast between the paper and the page content is very low
-
condition/production-related/document-faults/ink-from-facing
Ink from facing page was transferred to this page
-
condition/wear/additions/informative/annotations
Annotations regarding the content
-
content-encoding/structured
E.g. XML
-
content-type/corpus
Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.
Examples: A text corpus, An image database
-
contentOfInterest/visual/graphical
Description coming soon.
-
contentOfInterest/visual/graphical/separator
Description coming soon.
-
contentOfInterest/visual/text
Description coming soon.
-
data-attributes/document-related/structural/running-titles
Titles repeated on each page
-
data-attributes/document-related/visual/text/drop-caps
Drop capitals (large capitals at beginning of paragraph)
-
data-attributes/document-related/visual/text/font/multi-font/font-sizes
More than one font size used
-
data-attributes/document-related/visual/text/font/multi-font/typefaces
More than one typeface used
-
data-attributes/document-related/visual/text/font/typeface/antiqua
Antiqua font (more modern)
-
data-attributes/document-related/visual/text/font/typeface/blackletter
Blackletter, gothic, Fraktur
-
data-attributes/language/mixed
More than one language used
-
granularity/logical/document-related/paragraph
Description coming soon.
-
granularity/physical/document-related/page
Description coming soon.
-
granularity/physical/document-related/region
Region, zone, block
-
granularity/physical/document-related/text-line
Description coming soon.
-
granularity/physical/document-related/word
Word or partial word, if separated by line break, for example
-
platform/platform-independent
Description coming soon.
complex
-
activityDomain/computing/visual/analysisRecognition/layoutAnalysis
In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.
Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)
Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.
-
activityDomain/computing/visual/analysisRecognition/ocr
-
activityDomain/computing/visual/analysisRecognition/text
Translation of any kind of depicted symbols to a machine-readable format
Examples: OCR, mathematical equation recognition
Related: Text processing (separate category), table recognition, map reading
-
condition/acquisition/content-or-background/included-objects/preceeding-or-proceeding
Part of the preceding or succeeding object included (e.g. another page)
-
condition/acquisition/geometric/page-curl
Visible page curl (e.g. book scanning)
-
condition/acquisition/geometric/perspective-distortions
Perspective distortions (e.g. due to camera-based acquisition)
-
condition/acquisition/method-flaws/imaging/uneven-illumination
Uneven illumination leading to brightness or contrast variations
-
condition/production-related/document-characteristics/low-contrast
The contrast between the paper and the page content is very low
-
condition/production-related/document-faults/ink-from-facing
Ink from facing page was transferred to this page
-
content-encoding/structured
E.g. XML
-
content-type/corpus
Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.
Examples: A text corpus, An image database
-
contentOfInterest/visual/graphical/separator
Description coming soon.
-
contentOfInterest/visual/text
Description coming soon.
-
data-attributes/document-related/structural/footnote-continued
-
data-attributes/document-related/structural/footnotes
Footnotes at bottom of page
-
data-attributes/document-related/structural/running-titles
Titles repeated on each page
-
data-attributes/document-related/visual/text/drop-caps
Drop capitals (large capitals at beginning of paragraph)
-
data-attributes/document-related/visual/text/font/multi-font/font-sizes
More than one font size used
-
data-attributes/document-related/visual/text/font/multi-font/typefaces
More than one typeface used
-
data-attributes/document-related/visual/text/font/typeface/antiqua
Antiqua font (more modern)
-
data-attributes/document-related/visual/text/font/typeface/blackletter
Blackletter, gothic, Fraktur
-
data-attributes/language/mixed
More than one language used
-
granularity/logical/document-related/paragraph
Description coming soon.
-
granularity/physical/document-related/page
Description coming soon.
-
granularity/physical/document-related/region
Region, zone, block
-
granularity/physical/document-related/text-line
Description coming soon.
-
granularity/physical/document-related/word
Word or partial word, if separated by line break, for example
-
platform/platform-independent
Description coming soon.
simple
-
activityDomain/computing/visual/analysisRecognition/layoutAnalysis
In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.
Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)
Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.
-
activityDomain/computing/visual/analysisRecognition/ocr
-
activityDomain/computing/visual/analysisRecognition/text
Translation of any kind of depicted symbols to a machine-readable format
Examples: OCR, mathematical equation recognition
Related: Text processing (separate category), table recognition, map reading
-
condition/acquisition/geometric/page-curl
Visible page curl (e.g. book scanning)
-
condition/acquisition/geometric/perspective-distortions
Perspective distortions (e.g. due to camera-based acquisition)
-
condition/ageing/warping
Arbitrary warping (e.g. due to moisture)
-
condition/production-related/document-faults/ink-from-facing
Ink from facing page was transferred to this page
-
condition/wear/additions/informative/annotations
Annotations regarding the content
-
condition/wear/medium-damage/stains
Noticeable stains on medium
-
content-encoding/structured
E.g. XML
-
content-type/corpus
Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.
Examples: A text corpus, An image database
-
contentOfInterest/visual/graphical
Description coming soon.
-
contentOfInterest/visual/graphical/separator
Description coming soon.
-
contentOfInterest/visual/text
Description coming soon.
-
data-attributes/document-related/structural/running-titles
Titles repeated on each page
-
data-attributes/document-related/visual/text/drop-caps
Drop capitals (large capitals at beginning of paragraph)
-
data-attributes/document-related/visual/text/font/multi-font/font-sizes
More than one font size used
-
data-attributes/document-related/visual/text/font/multi-font/typefaces
More than one typeface used
-
data-attributes/document-related/visual/text/font/typeface/antiqua
Antiqua font (more modern)
-
data-attributes/document-related/visual/text/font/typeface/blackletter
Blackletter, gothic, Fraktur
-
granularity/logical/document-related/paragraph
Description coming soon.
-
granularity/physical/document-related/page
Description coming soon.
-
granularity/physical/document-related/region
Region, zone, block
-
granularity/physical/document-related/text-line
Description coming soon.
-
granularity/physical/document-related/word
Word or partial word, if separated by line break, for example
-
platform/platform-independent
Description coming soon.
complex
-
activityDomain/computing/visual/analysisRecognition/layoutAnalysis
In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.
Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)
Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.
-
activityDomain/computing/visual/analysisRecognition/ocr
-
activityDomain/computing/visual/analysisRecognition/text
Translation of any kind of depicted symbols to a machine-readable format
Examples: OCR, mathematical equation recognition
Related: Text processing (separate category), table recognition, map reading
-
condition/acquisition/content-or-background/included-objects/preceeding-or-proceeding
Part of the preceding or succeeding object included (e.g. another page)
-
condition/acquisition/geometric/page-curl
Visible page curl (e.g. book scanning)
-
condition/acquisition/geometric/perspective-distortions
Perspective distortions (e.g. due to camera-based acquisition)
-
condition/acquisition/method-flaws/imaging/uneven-illumination
Uneven illumination leading to brightness or contrast variations
-
condition/ageing/warping
Arbitrary warping (e.g. due to moisture)
-
condition/production-related/document-characteristics/low-contrast
The contrast between the paper and the page content is very low
-
condition/production-related/document-faults/ink-from-facing
Ink from facing page was transferred to this page
-
condition/wear/additions/informative/annotations
Annotations regarding the content
-
condition/wear/additions/informative/stamps
The medium was stamped
-
condition/wear/medium-damage/stains
Noticeable stains on medium
-
content-encoding/structured
E.g. XML
-
content-type/corpus
Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.
Examples: A text corpus, An image database
-
contentOfInterest/visual/composite/music
Description coming soon.
-
contentOfInterest/visual/graphical
Description coming soon.
-
contentOfInterest/visual/graphical/separator
Description coming soon.
-
contentOfInterest/visual/text
Description coming soon.
-
data-attributes/document-related/structural/footnotes
Footnotes at bottom of page
-
data-attributes/document-related/structural/running-titles
Titles repeated on each page
-
data-attributes/document-related/visual/decorations
Decorations of some kind
-
data-attributes/document-related/visual/illustrations
Illustrations in content
-
data-attributes/document-related/visual/illustrations/multi-colour
Multi-colour illustrations in content
-
data-attributes/document-related/visual/text/drop-caps
Drop capitals (large capitals at beginning of paragraph)
-
data-attributes/document-related/visual/text/font/multi-font/font-sizes
More than one font size used
-
data-attributes/document-related/visual/text/font/multi-font/typefaces
More than one typeface used
-
data-attributes/document-related/visual/text/font/typeface/antiqua
Antiqua font (more modern)
-
data-attributes/document-related/visual/text/font/typeface/blackletter
Blackletter, gothic, Fraktur
-
data-attributes/language/mixed
More than one language used
-
granularity/logical/document-related/paragraph
Description coming soon.
-
granularity/physical/document-related/page
Description coming soon.
-
granularity/physical/document-related/region
Region, zone, block
-
granularity/physical/document-related/text-line
Description coming soon.
-
granularity/physical/document-related/word
Word or partial word, if separated by line break, for example
-
platform/platform-independent
Description coming soon.
simple
-
activityDomain/computing/visual/analysisRecognition/layoutAnalysis
In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.
Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)
Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.
-
activityDomain/computing/visual/analysisRecognition/ocr
-
activityDomain/computing/visual/analysisRecognition/text
Translation of any kind of depicted symbols to a machine-readable format
Examples: OCR, mathematical equation recognition
Related: Text processing (separate category), table recognition, map reading
-
condition/production-related/document-faults/ink-from-facing
Ink from facing page was transferred to this page
-
condition/wear/medium-damage/stains
Noticeable stains on medium
-
content-encoding/structured
E.g. XML
-
content-type/corpus
Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.
Examples: A text corpus, An image database
-
contentOfInterest/visual/graphical/separator
Description coming soon.
-
contentOfInterest/visual/text
Description coming soon.
-
data-attributes/document-related/visual/text/drop-caps
Drop capitals (large capitals at beginning of paragraph)
-
data-attributes/document-related/visual/text/font/multi-font/font-sizes
More than one font size used
-
data-attributes/document-related/visual/text/font/typeface/antiqua
Antiqua font (more modern)
-
data-attributes/document-related/visual/text/font/typeface/blackletter
Blackletter, gothic, Fraktur
-
granularity/logical/document-related/paragraph
Description coming soon.
-
granularity/physical/document-related/page
Description coming soon.
-
granularity/physical/document-related/region
Region, zone, block
-
granularity/physical/document-related/text-line
Description coming soon.
-
granularity/physical/document-related/word
Word or partial word, if separated by line break, for example
-
platform/platform-independent
Description coming soon.
complex
-
activityDomain/computing/visual/analysisRecognition/layoutAnalysis
In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.
Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)
Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.
-
activityDomain/computing/visual/analysisRecognition/ocr
-
activityDomain/computing/visual/analysisRecognition/text
Translation of any kind of depicted symbols to a machine-readable format
Examples: OCR, mathematical equation recognition
Related: Text processing (separate category), table recognition, map reading
-
condition/production-related/document-faults/ink-from-facing
Ink from facing page was transferred to this page
-
condition/wear/additions/informative/annotations
Annotations regarding the content
-
condition/wear/medium-damage/stains
Noticeable stains on medium
-
content-encoding/structured
E.g. XML
-
content-type/corpus
Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.
Examples: A text corpus, An image database
-
contentOfInterest/visual/text
Description coming soon.
-
data-attributes/document-related/structural/footnote-continued
-
data-attributes/document-related/structural/footnotes
Footnotes at bottom of page
-
data-attributes/document-related/structural/running-titles
Titles repeated on each page
-
data-attributes/document-related/visual/text/drop-caps
Drop capitals (large capitals at beginning of paragraph)
-
data-attributes/document-related/visual/text/font/multi-font/font-sizes
More than one font size used
-
data-attributes/document-related/visual/text/font/multi-font/typefaces
More than one typeface used
-
data-attributes/document-related/visual/text/font/typeface/antiqua
Antiqua font (more modern)
-
data-attributes/document-related/visual/text/font/typeface/blackletter
Blackletter, gothic, Fraktur
-
data-attributes/language/mixed
More than one language used
-
granularity/logical/document-related/paragraph
Description coming soon.
-
granularity/physical/document-related/page
Description coming soon.
-
granularity/physical/document-related/region
Region, zone, block
-
granularity/physical/document-related/text-line
Description coming soon.
-
granularity/physical/document-related/word
Word or partial word, if separated by line break, for example
-
platform/platform-independent
Description coming soon.
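Because each collection carries its own label list, a set comparison shows which labels separate a simple collection from a complex one. This is a rough Python sketch that uses shortened excerpts of the first simple and complex lists in this document, not the full lists. The function name `compare_label_sets` is an assumption made for illustration.

```python
def compare_label_sets(simple, complex_):
    """Return labels unique to each collection and labels shared by both."""
    simple, complex_ = set(simple), set(complex_)
    return {
        "only_simple": sorted(simple - complex_),
        "only_complex": sorted(complex_ - simple),
        "shared": sorted(simple & complex_),
    }


# Shortened excerpts of the first "simple" and "complex" label lists.
simple_excerpt = [
    "condition/wear/additions/informative/annotations",
    "data-attributes/document-related/structural/running-titles",
    "data-attributes/document-related/visual/text/drop-caps",
]
complex_excerpt = [
    "condition/acquisition/geometric/page-curl",
    "data-attributes/document-related/structural/footnotes",
    "data-attributes/document-related/structural/running-titles",
    "data-attributes/document-related/visual/text/drop-caps",
]

result = compare_label_sets(simple_excerpt, complex_excerpt)
for key, labels in result.items():
    print(key, labels)
```

In these excerpts the footnote and page-curl labels appear only on the complex side, which matches the layout-complexity criterion stated at the top of the document.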