This repo provides a collection of ground truth data. The collection was compiled under different aspects (complexity of the layouts and use of the fonts). The individual data are also characterized by metadata. The metadata is based on the labeling scheme of OCR-D/PrimaLab.

tboenig/gt_corpus_benchmark

📚 Corpus

This corpus includes Ground Truth (GT) data compiled with the following features in mind:

  1. Classification into font groups: Gothic/Blackletter, Antiqua and FontMix (Antiqua and Blackletter)
    distinction by the selected print type or combination of print types
  2. Classification into simple and complex
    complexity of the layout (columns, footnotes, ...)

The data are also divided according to the time of creation or production.

🖉 Creation

The data were created according to the OCR-D Ground Truth Guideline (https://ocr-d.de/en/gt-guidelines/trans/).

💻 Repositories

Analyzed collection

The GT data has been labeled. The labeling is based on an ontology defined by the Pattern Recognition and Image Analysis Research Lab (PRImA-Research-Lab) at the University of Salford. The labeling metadata is created for each available page. The following labeling metadata is available for the different collections.

see: gt-labelling : semantic-labelling OCR ground truth data (https://github.com/OCR-D/gt-labelling)
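
Each label is a slash-separated path through the ontology, so labels can be grouped by their top-level category. The following is a minimal sketch in Python, assuming the labels are available as plain strings; the function name `group_by_category` is hypothetical and not part of gt-labelling.

```python
from collections import defaultdict


def group_by_category(labels):
    """Group slash-separated label paths by their first path segment."""
    groups = defaultdict(list)
    for label in labels:
        # The first segment is the top-level category,
        # the remainder is the path below it.
        category, _, rest = label.partition("/")
        groups[category].append(rest)
    return dict(groups)


# A few labels copied from the lists in this README.
labels = [
    "condition/ageing/warping",
    "condition/wear/medium-damage/stains",
    "content-encoding/structured",
    "granularity/physical/document-related/page",
]

grouped = group_by_category(labels)
for category, paths in grouped.items():
    print(category, paths)
```

Grouping by the first segment keeps related labels such as the `condition/...` entries together without needing the full ontology definition.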

FontMix (Antiqua and Blackletter)

simple
  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.

  • activityDomain/computing/visual/analysisRecognition/ocr

  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols into a machine-readable format

    Examples: OCR, mathematical equation recognition

    Related: text processing (separate category), table recognition, map reading

  • condition/acquisition/method-flaws/imaging/uneven-illumination

    Uneven illumination leading to brightness or contrast variations

  • condition/production-related/document-characteristics/low-contrast

    The contrast between the paper and the page content is very low

  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page

  • condition/wear/additions/informative/annotations

    Annotations regarding the content

  • content-encoding/structured

    E.g. XML

  • content-type/corpus

    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples: A text corpus, An image database

  • contentOfInterest/visual/graphical

    Description coming soon.

  • contentOfInterest/visual/graphical/separator

    Description coming soon.

  • contentOfInterest/visual/text

    Description coming soon.

  • data-attributes/document-related/structural/running-titles

    Titles repeated each page

  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)

  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used

  • data-attributes/document-related/visual/text/font/multi-font/typefaces

    More than one typeface used

  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)

  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur

  • data-attributes/language/mixed

    More than one language used

  • granularity/logical/document-related/paragraph

    Description coming soon.

  • granularity/physical/document-related/page

    Description coming soon.

  • granularity/physical/document-related/region

    Region, zone, block

  • granularity/physical/document-related/text-line

    Description coming soon.

  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example

  • platform/platform-independent

    Description coming soon.

complex
  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.

  • activityDomain/computing/visual/analysisRecognition/ocr

  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols into a machine-readable format

    Examples: OCR, mathematical equation recognition

    Related: text processing (separate category), table recognition, map reading

  • condition/acquisition/content-or-background/included-objects/preceeding-or-proceeding

    Part of preceding or succeeding object included (e.g. other page)

  • condition/acquisition/geometric/page-curl

    Visible page curl (e.g. book scanning)

  • condition/acquisition/geometric/perspective-distortions

    Perspective distortions (e.g. due to camera-based acquisition)

  • condition/acquisition/method-flaws/imaging/uneven-illumination

    Uneven illumination leading to brightness or contrast variations

  • condition/production-related/document-characteristics/low-contrast

    The contrast between the paper and the page content is very low

  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page

  • content-encoding/structured

    E.g. XML

  • content-type/corpus

    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples: A text corpus, An image database

  • contentOfInterest/visual/graphical/separator

    Description coming soon.

  • contentOfInterest/visual/text

    Description coming soon.

  • data-attributes/document-related/structural/footnote-continued

  • data-attributes/document-related/structural/footnotes

    Footnotes at bottom of page

  • data-attributes/document-related/structural/running-titles

    Titles repeated each page

  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)

  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used

  • data-attributes/document-related/visual/text/font/multi-font/typefaces

    More than one typeface used

  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)

  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur

  • data-attributes/language/mixed

    More than one language used

  • granularity/logical/document-related/paragraph

    Description coming soon.

  • granularity/physical/document-related/page

    Description coming soon.

  • granularity/physical/document-related/region

    Region, zone, block

  • granularity/physical/document-related/text-line

    Description coming soon.

  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example

  • platform/platform-independent

    Description coming soon.

Gothic/Blackletter

simple
  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.

  • activityDomain/computing/visual/analysisRecognition/ocr

  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols into a machine-readable format

    Examples: OCR, mathematical equation recognition

    Related: text processing (separate category), table recognition, map reading

  • condition/acquisition/geometric/page-curl

    Visible page curl (e.g. book scanning)

  • condition/acquisition/geometric/perspective-distortions

    Perspective distortions (e.g. due to camera-based acquisition)

  • condition/ageing/warping

    Arbitrary warping (e.g. due to moisture)

  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page

  • condition/wear/additions/informative/annotations

    Annotations regarding the content

  • condition/wear/medium-damage/stains

    Noticeable stains on medium

  • content-encoding/structured

    E.g. XML

  • content-type/corpus

    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples: A text corpus, An image database

  • contentOfInterest/visual/graphical

    Description coming soon.

  • contentOfInterest/visual/graphical/separator

    Description coming soon.

  • contentOfInterest/visual/text

    Description coming soon.

  • data-attributes/document-related/structural/running-titles

    Titles repeated each page

  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)

  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used

  • data-attributes/document-related/visual/text/font/multi-font/typefaces

    More than one typeface used

  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)

  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur

  • granularity/logical/document-related/paragraph

    Description coming soon.

  • granularity/physical/document-related/page

    Description coming soon.

  • granularity/physical/document-related/region

    Region, zone, block

  • granularity/physical/document-related/text-line

    Description coming soon.

  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example

  • platform/platform-independent

    Description coming soon.

complex
  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.

  • activityDomain/computing/visual/analysisRecognition/ocr

  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols into a machine-readable format

    Examples: OCR, mathematical equation recognition

    Related: text processing (separate category), table recognition, map reading

  • condition/acquisition/content-or-background/included-objects/preceeding-or-proceeding

    Part of preceding or succeeding object included (e.g. other page)

  • condition/acquisition/geometric/page-curl

    Visible page curl (e.g. book scanning)

  • condition/acquisition/geometric/perspective-distortions

    Perspective distortions (e.g. due to camera-based acquisition)

  • condition/acquisition/method-flaws/imaging/uneven-illumination

    Uneven illumination leading to brightness or contrast variations

  • condition/ageing/warping

    Arbitrary warping (e.g. due to moisture)

  • condition/production-related/document-characteristics/low-contrast

    The contrast between the paper and the page content is very low

  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page

  • condition/wear/additions/informative/annotations

    Annotations regarding the content

  • condition/wear/additions/informative/stamps

    The medium was stamped

  • condition/wear/medium-damage/stains

    Noticeable stains on medium

  • content-encoding/structured

    E.g. XML

  • content-type/corpus

    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples: A text corpus, An image database

  • contentOfInterest/visual/composite/music

    Description coming soon.

  • contentOfInterest/visual/graphical

    Description coming soon.

  • contentOfInterest/visual/graphical/separator

    Description coming soon.

  • contentOfInterest/visual/text

    Description coming soon.

  • data-attributes/document-related/structural/footnotes

    Footnotes at bottom of page

  • data-attributes/document-related/structural/running-titles

    Titles repeated each page

  • data-attributes/document-related/visual/decorations

    Decorations of some kind

  • data-attributes/document-related/visual/illustrations

    Illustrations in content

  • data-attributes/document-related/visual/illustrations/multi-colour

    Multi-colour illustrations in content

  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)

  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used

  • data-attributes/document-related/visual/text/font/multi-font/typefaces

    More than one typeface used

  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)

  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur

  • data-attributes/language/mixed

    More than one language used

  • granularity/logical/document-related/paragraph

    Description coming soon.

  • granularity/physical/document-related/page

    Description coming soon.

  • granularity/physical/document-related/region

    Region, zone, block

  • granularity/physical/document-related/text-line

    Description coming soon.

  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example

  • platform/platform-independent

    Description coming soon.

Antiqua

simple
  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.

  • activityDomain/computing/visual/analysisRecognition/ocr

  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols into a machine-readable format

    Examples: OCR, mathematical equation recognition

    Related: text processing (separate category), table recognition, map reading

  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page

  • condition/wear/medium-damage/stains

    Noticeable stains on medium

  • content-encoding/structured

    E.g. XML

  • content-type/corpus

    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples: A text corpus, An image database

  • contentOfInterest/visual/graphical/separator

    Description coming soon.

  • contentOfInterest/visual/text

    Description coming soon.

  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)

  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used

  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)

  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur

  • granularity/logical/document-related/paragraph

    Description coming soon.

  • granularity/physical/document-related/page

    Description coming soon.

  • granularity/physical/document-related/region

    Region, zone, block

  • granularity/physical/document-related/text-line

    Description coming soon.

  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example

  • platform/platform-independent

    Description coming soon.

complex
  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.

  • activityDomain/computing/visual/analysisRecognition/ocr

  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols into a machine-readable format

    Examples: OCR, mathematical equation recognition

    Related: text processing (separate category), table recognition, map reading

  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page

  • condition/wear/additions/informative/annotations

    Annotations regarding the content

  • condition/wear/medium-damage/stains

    Noticeable stains on medium

  • content-encoding/structured

    E.g. XML

  • content-type/corpus

    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples: A text corpus, An image database

  • contentOfInterest/visual/text

    Description coming soon.

  • data-attributes/document-related/structural/footnote-continued

  • data-attributes/document-related/structural/footnotes

    Footnotes at bottom of page

  • data-attributes/document-related/structural/running-titles

    Titles repeated each page

  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)

  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used

  • data-attributes/document-related/visual/text/font/multi-font/typefaces

    More than one typeface used

  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)

  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur

  • data-attributes/language/mixed

    More than one language used

  • granularity/logical/document-related/paragraph

    Description coming soon.

  • granularity/physical/document-related/page

    Description coming soon.

  • granularity/physical/document-related/region

    Region, zone, block

  • granularity/physical/document-related/text-line

    Description coming soon.

  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example

  • platform/platform-independent

    Description coming soon.
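
Because every collection is described by a set of label paths, the difference between two collections can be read off with set operations. The following is a minimal sketch, assuming an excerpt of the Antiqua labels listed above has been copied into Python sets by hand; the variable names are illustrative only.

```python
# Excerpt of the Antiqua labels listed above (not the full lists).
antiqua_simple = {
    "condition/wear/medium-damage/stains",
    "contentOfInterest/visual/graphical/separator",
    "contentOfInterest/visual/text",
}
antiqua_complex = {
    "condition/wear/medium-damage/stains",
    "contentOfInterest/visual/text",
    "data-attributes/document-related/structural/footnotes",
    "data-attributes/document-related/structural/running-titles",
}

# Labels that occur in only one of the two collections, and in both.
only_complex = sorted(antiqua_complex - antiqua_simple)
only_simple = sorted(antiqua_simple - antiqua_complex)
shared = sorted(antiqua_simple & antiqua_complex)

print("only in complex:", only_complex)
print("only in simple:", only_simple)
print("shared:", shared)
```

In this excerpt, the structural labels (footnotes, running titles) appear only in the complex collection, which matches the layout-complexity criterion described under Corpus.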
