Structure of ALTO files
This document contains a listing of elements and their related attributes in ALTO version 2.0 with values or value sources where applicable. It is an "outline" of the schema, detailed by:
Root <alto> element
Top-level ALTO elements
<Description> elements
<Styles> elements
<Layout> elements
ALTO attributes
ALTO requires use of the <Layout> element as a child under the root <alto> element. The <Layout> element requires use of a child <Page> element, which must carry a valid ID attribute value and a PHYSICAL_IMG_NR attribute value.
The 2.0 schema now has a target namespace URI: http://www.loc.gov/standards/alto/ns-v2#, to reflect that the standard is now maintained by the Library of Congress. The previous namespace URI reflected maintenance by CCS.
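The minimum structure described above (a Layout child under the root, holding at least one Page with ID and PHYSICAL_IMG_NR) can be checked with a short script. The following is a minimal sketch using Python's standard xml.etree.ElementTree. The sample document and the function name check_required_structure are hypothetical, and the script does not validate against the ALTO XSD.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"  # namespace URI quoted above

# Hypothetical minimal document: only the required Layout and Page elements.
MINIMAL_ALTO = f"""<alto xmlns="{ALTO_NS}">
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1"/>
  </Layout>
</alto>"""


def check_required_structure(xml_text: str) -> list:
    """Return a list of problems; an empty list means the basic checks passed."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != f"{{{ALTO_NS}}}alto":
        problems.append("root element is not alto in the v2 namespace")
    layout = root.find(f"{{{ALTO_NS}}}Layout")
    if layout is None:
        problems.append("missing Layout")
        return problems
    pages = layout.findall(f"{{{ALTO_NS}}}Page")
    if not pages:
        problems.append("Layout has no Page")
    for page in pages:
        # Both attributes are described as mandatory on Page.
        for attr in ("ID", "PHYSICAL_IMG_NR"):
            if not page.get(attr):
                problems.append(f"Page is missing {attr}")
    return problems


print(check_required_structure(MINIMAL_ALTO))
```

A document that still uses the older CCS namespace would be reported as having a wrong root element, which is one quick way to spot files that predate version 2.0.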
alto
- Required: Yes.
- Usage: Root Element for bundling text layout technical metadata.
- Attributes: None.
- Contains AS SEQUENCE: Description, Styles, Layout.
- Contained by: None.
These elements are direct children of the root element. The sorting is based on the accepted sequence in which they may be used.
Description
- Required: No.
- Usage: Describes general settings of the ALTO file, such as measurement units and metadata.
- Attributes: None.
- Contains AS SEQUENCE: MeasurementUnit, sourceImageInformation, OCRProcessing.
- Contained by: alto.
Styles
- Required: No.
- Usage: Styles define properties of layout elements. A style defined in a parent element is used as default style for all related children elements.
- Attributes: None.
- Contains AS SEQUENCE: TextStyle, ParagraphStyle.
- Contained by: alto.
Layout
- Required: Yes.
- Usage: The root Layout element.
- Attributes: STYLEREFS.
- Contains AS SEQUENCE: Page.
- Contained by: alto.
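Because the three top-level elements form a sequence, their order matters: Description and Styles are optional, Layout is required, and each may appear only in that order. A rough sketch of that check follows. The helper names local_name and top_level_order_ok are hypothetical, and treating each element as appearing at most once is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"
EXPECTED_ORDER = ["Description", "Styles", "Layout"]


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def top_level_order_ok(root: ET.Element) -> bool:
    names = [local_name(child.tag) for child in root]
    # Every child must be a known element and appear at most once (assumption).
    if any(n not in EXPECTED_ORDER for n in names) or len(set(names)) != len(names):
        return False
    positions = [EXPECTED_ORDER.index(n) for n in names]
    # Children must follow the listed order, and Layout must be present.
    return positions == sorted(positions) and "Layout" in names
```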
These elements are contained by the <Description> element underneath <alto>. The sorting is based on the accepted sequence in which they may be used.
MeasurementUnit
- Required: No.
- Usage: All measurement values inside the ALTO file, except font size, are expressed in this unit. The default is 1/10 of mm.
- Attributes: None.
- Contains ENUMERATED VALUES: dpi, pixel, mm10, inch1200.
- Contained by: Description.
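To compare coordinates from files that declare different units, values can be normalised to millimetres. The sketch below assumes that mm10 means tenths of a millimetre (as stated above) and that inch1200 means 1/1200 of an inch, which is how the value name reads. Pixel values cannot be converted without an image resolution, so the hypothetical pixels_per_inch parameter must be supplied by the caller. The dpi value is left unsupported here because its meaning as a length unit is not specified in this outline.

```python
from typing import Optional

MM_PER_INCH = 25.4


def to_millimetres(value: float, unit: str = "mm10",
                   pixels_per_inch: Optional[float] = None) -> float:
    """Convert an ALTO measurement to millimetres (illustrative helper)."""
    if unit == "mm10":
        return value / 10
    if unit == "inch1200":
        # Assumption: one unit is 1/1200 of an inch.
        return value / 1200 * MM_PER_INCH
    if unit == "pixel":
        if pixels_per_inch is None:
            raise ValueError("pixel values need the image resolution")
        return value / pixels_per_inch * MM_PER_INCH
    raise ValueError(f"unsupported unit: {unit!r}")
```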
sourceImageInformation
- Required: No.
- Usage: Information to identify the image file from which the OCR text was created.
- Attributes: None.
- Contains SEQUENCE: fileName, fileIdentifier.
- Contained by: Description.
OCRProcessing
- Required: No.
- Usage: Information on how the text was created, including preprocessing, OCR processing, and postprocessing steps. Where possible, this draws from MIX's change history.
- Attributes: ID.
- Contains: preProcessingStep, ocrProcessingStep, postProcessingStep.
- Contained by: Description.
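As an illustration of how the Description children might be read, the sketch below pulls the measurement unit, the source image file name, and the IDs of any OCRProcessing elements. The function name read_description and the returned dictionary keys are hypothetical. Falling back to mm10 when MeasurementUnit is absent is an assumption based on the stated default.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"
NS = {"a": ALTO_NS}  # prefix used only inside this script's path expressions


def read_description(root: ET.Element) -> dict:
    """Collect a few Description values; missing elements yield defaults."""
    unit = root.findtext("a:Description/a:MeasurementUnit",
                         default="mm10", namespaces=NS)
    file_name = root.findtext(
        "a:Description/a:sourceImageInformation/a:fileName", namespaces=NS)
    ocr_ids = [el.get("ID")
               for el in root.findall("a:Description/a:OCRProcessing", NS)]
    return {
        "unit": unit.strip(),
        "file_name": file_name.strip() if file_name else None,
        "ocr_processing_ids": ocr_ids,
    }
```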
These elements are contained by the <Styles> element underneath <alto>. The sorting is based on the accepted sequence in which they may be used.
TextStyle
- Required: No.
- Usage: A text style defines font properties of text.
- Attributes: ID, FONTWIDTH, FONTTYPE, FONTSTYLE, FONTFAMILY, FONTCOLOR, FONTSIZE.
- Contains: EMPTY ELEMENT.
- Contained by: Styles.
ParagraphStyle
- Required: No.
- Usage: A paragraph style defines formatting properties of text blocks.
- Attributes: ID, RIGHT, LEFT, ALIGN, LINESPACE, FIRSTLINE.
- Contains: EMPTY ELEMENT.
- Contained by: Styles.
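Since TextStyle and ParagraphStyle are empty elements that carry all their information in attributes, a lookup table keyed by ID is a convenient in-memory form. The following sketch builds one. The function name collect_styles and the (kind, attributes) tuple layout are illustrative choices, not part of ALTO.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"


def collect_styles(root: ET.Element) -> dict:
    """Map each style ID to (element name, remaining attributes)."""
    styles = {}
    styles_el = root.find(f"{{{ALTO_NS}}}Styles")
    if styles_el is None:  # Styles is optional
        return styles
    for el in styles_el:
        kind = el.tag.rsplit("}", 1)[-1]
        if kind in ("TextStyle", "ParagraphStyle") and el.get("ID"):
            attrs = dict(el.attrib)
            style_id = attrs.pop("ID")
            styles[style_id] = (kind, attrs)
    return styles
```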
These elements are contained by the <Layout> element underneath <alto>. The sorting is based on the accepted sequence in which they may be used.
Page
- Required: Yes.
- Usage: One page of a book or journal.
- Attributes: ID, PHYSICAL_IMG_NR, PRINTED_IMG_NR, PAGECLASS, PROCESSING, STYLEREFS, HEIGHT, WIDTH, QUALITY, POSITION.
- Contains SEQUENCE: TopMargin, LeftMargin, RightMargin, BottomMargin, PrintSpace.
- Contained by: Layout.
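The Page attributes and child regions listed above can be gathered into a small record per page. This is a sketch only: the PageInfo class and read_pages function are hypothetical, PHYSICAL_IMG_NR is assumed to hold an integer, and WIDTH and HEIGHT are treated as optional numbers in whatever unit MeasurementUnit declares.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"


@dataclass
class PageInfo:
    page_id: str
    physical_img_nr: int
    width: Optional[float]
    height: Optional[float]
    regions: list  # names of child elements, in document order


def _optional_float(value: Optional[str]) -> Optional[float]:
    return float(value) if value is not None else None


def read_pages(root: ET.Element) -> list:
    pages = []
    for page in root.findall(f"{{{ALTO_NS}}}Layout/{{{ALTO_NS}}}Page"):
        pages.append(PageInfo(
            page_id=page.attrib["ID"],  # KeyError if a required attribute is absent
            physical_img_nr=int(page.attrib["PHYSICAL_IMG_NR"]),
            width=_optional_float(page.get("WIDTH")),
            height=_optional_float(page.get("HEIGHT")),
            regions=[child.tag.rsplit("}", 1)[-1] for child in page],
        ))
    return pages
```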
These attributes may appear on given elements within ALTO. The sorting is alphabetical.
ID
- Usage: A valid identifier as defined by the XML Schema specification.
- Contained by: OCRProcessing, TextStyle, ParagraphStyle, Page.
STYLEREFS
- Usage: A list of IDREFs that binds an element to the IDs of TextStyle and ParagraphStyle elements.
- Contained by: Layout, Page.
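A common consistency check is that every reference in a STYLEREFS attribute points at a style that is actually defined. The sketch below assumes STYLEREFS holds whitespace-separated IDs that refer to TextStyle or ParagraphStyle elements. The function name unresolved_stylerefs is hypothetical.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"


def _local(tag: str) -> str:
    return tag.rsplit("}", 1)[-1]


def unresolved_stylerefs(root: ET.Element) -> dict:
    """Map each referring element (by ID, else by name) to its dangling refs."""
    known = {
        el.get("ID")
        for el in root.iter()
        if _local(el.tag) in ("TextStyle", "ParagraphStyle") and el.get("ID")
    }
    dangling = {}
    for el in root.iter():
        refs = el.get("STYLEREFS", "").split()
        missing = [ref for ref in refs if ref not in known]
        if missing:
            dangling[el.get("ID") or _local(el.tag)] = missing
    return dangling
```

An empty result means every style reference resolved. Anything else lists the referring element together with the IDs that were not found under Styles.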