This API (Application Programming Interface) document has pages corresponding to the items in the navigation bar, described as follows.
+
+
+
+
+
Overview
+
The Overview page is the front page of this API document and provides a list of all packages with a summary for each. This page can also contain an overall description of the set of packages.
+
+
+
Package
+
Each package has a page that contains a list of its classes and interfaces, with a summary for each. This page can contain six categories:
+
+
Interfaces (italic)
+
Classes
+
Enums
+
Exceptions
+
Errors
+
Annotation Types
+
+
+
+
Class/Interface
+
Each class, interface, nested class and nested interface has its own separate page. Each of these pages has three sections consisting of a class/interface description, summary tables, and detailed member descriptions:
+
+
Class inheritance diagram
+
Direct Subclasses
+
All Known Subinterfaces
+
All Known Implementing Classes
+
Class/interface declaration
+
Class/interface description
+
+
+
Nested Class Summary
+
Field Summary
+
Constructor Summary
+
Method Summary
+
+
+
Field Detail
+
Constructor Detail
+
Method Detail
+
+
Each summary entry contains the first sentence from the detailed description for that item. The summary entries are alphabetical, while the detailed descriptions are in the order they appear in the source code. This preserves the logical groupings established by the programmer.
+
+
+
Annotation Type
+
Each annotation type has its own separate page with the following sections:
+
+
Annotation Type declaration
+
Annotation Type description
+
Required Element Summary
+
Optional Element Summary
+
Element Detail
+
+
+
+
Enum
+
Each enum has its own separate page with the following sections:
+
+
Enum declaration
+
Enum description
+
Enum Constant Summary
+
Enum Constant Detail
+
+
+
+
Use
+
Each documented package, class and interface has its own Use page. This page describes what packages, classes, methods, constructors and fields use any part of the given class or package. Given a class or interface A, its Use page includes subclasses of A, fields declared as A, methods that return A, and methods and constructors with parameters of type A. You can access this page by first going to the package, class or interface, then clicking on the "Use" link in the navigation bar.
+
+
+
Tree (Class Hierarchy)
+
There is a Class Hierarchy page for all packages, plus a hierarchy for each package. Each hierarchy page contains a list of classes and a list of interfaces. The classes are organized by inheritance structure starting with java.lang.Object. The interfaces do not inherit from java.lang.Object.
+
+
When viewing the Overview page, clicking on "Tree" displays the hierarchy for all packages.
+
When viewing a particular package, class or interface page, clicking "Tree" displays the hierarchy for only that package.
+
+
+
+
Deprecated API
+
The Deprecated API page lists all of the APIs that have been deprecated. A deprecated API is not recommended for use, generally due to improvements, and a replacement API is usually given. Deprecated APIs may be removed in future implementations.
+
+
+
Index
+
The Index contains an alphabetic list of all classes, interfaces, constructors, methods, and fields.
+
+
+
Prev/Next
+
These links take you to the next or previous class, interface, package, or related page.
+
+
+
Frames/No Frames
+
These links show and hide the HTML frames. All pages are available with or without frames.
+
+
+
All Classes
+
The All Classes link shows all classes and interfaces except non-static nested types.
+
+
+
Serialized Form
+
Each serializable or externalizable class has a description of its serialization fields and methods. This information is of interest to re-implementors, not to developers using the API. While there is no link in the navigation bar, you can get to this information by going to any serialized class and clicking "Serialized Form" in the "See also" section of the class description.
Post-processes the given page object using the specified image dimension and resolution.
+ This could include scaling all coordinates if they are not measured in pixels.
Saves the given document page to an XML file at the specified
+ location, using the latest PAGE XML format or the format the
+ page object has been loaded with.
public static class Page.MeasurementUnit
+extends java.lang.Object
+
Measurement unit for coordinates.
+ Introduced to support ALTO XML files. Use PageXmlInputOutput.postProcessPage(...)
+ to scale all coordinates using the image information.
PageXmlInputOutput.postProcessPage(Page page,
+ int imageWidth,
+ int imageHeight,
+ double dpiHor,
+ double dpiVert)
+
Post-processes the given page object using the specified image dimension and resolution.
+ This could include scaling all coordinates if they are not measured in pixels.
Saves the given document page to an XML file at the specified
+ location, using the latest PAGE XML format or the format the
+ page object has been loaded with.
public class PageXmlInputOutput
+extends java.lang.Object
+implements org.primaresearch.io.FormatModelSource
+
Central access point for reading and writing PAGE XML.
+
+ Note: Page objects can only be saved using the XML format they are set to.
+ Call Page.setFormatVersion to convert the page object to another version
+ if necessary.
+
+ To validate a page object without writing a file, call the validate() method
+ of a PageWriter.
Creates and returns an XML writer for PAGE using the latest schema version.
+
+
+
+
static void
+
postProcessPage(Page page,
+ int imageWidth,
+ int imageHeight,
+ double dpiHor,
+ double dpiVert)
+
Post-processes the given page object using the specified image dimension and resolution.
+ This could include scaling all coordinates if they are not measured in pixels.
Saves the given document page to an XML file at the specified
+ location, using the latest PAGE XML format or the format the
+ page object has been loaded with.
+
Parameters:
page - Page object
filePath - Target file
+
Throws:
+
org.primaresearch.io.xml.XmlModelAndValidatorProvider.UnsupportedSchemaVersionException - Schema file could not be found.
public static void postProcessPage(Page page,
+ int imageWidth,
+ int imageHeight,
+ double dpiHor,
+ double dpiVert)
+
Post-processes the given page object using the specified image dimension and resolution.
+ This could include scaling all coordinates if they are not measured in pixels.
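The scaling mentioned above can be illustrated with a minimal sketch. This is not the library's implementation: it assumes that a coordinate stored in a non-pixel unit is converted as pixels = value * dpi / unitsPerInch, where unitsPerInch is the number of discrete unit steps per inch (254 for tenths of a millimetre, 1200 for 1200ths of an inch). The names CoordinateScalingSketch and toPixels are hypothetical.

```java
/** Hypothetical helper for illustration; not part of the PRImA library. */
final class CoordinateScalingSketch {

    /**
     * Converts a coordinate from a physical unit to pixels.
     *
     * @param value        coordinate in the source unit
     * @param unitsPerInch number of source units per inch (must be positive)
     * @param dpi          image resolution along the same axis
     * @return the coordinate rounded to the nearest pixel
     */
    static int toPixels(double value, double unitsPerInch, double dpi) {
        if (unitsPerInch <= 0.0) {
            // Assumption: a non-positive value means "already in pixels" or invalid input.
            throw new IllegalArgumentException("unitsPerInch must be positive");
        }
        return (int) Math.round(value * dpi / unitsPerInch);
    }

    public static void main(String[] args) {
        // 254 tenths of a millimetre are one inch, so at 300 dpi this is 300 pixels.
        System.out.println(toPixels(254.0, 254.0, 300.0));
        // 600 units of 1/1200 inch are half an inch, so at 300 dpi this is 150 pixels.
        System.out.println(toPixels(600.0, 1200.0, 300.0));
    }
}
```

Horizontal and vertical coordinates would be converted separately in this sketch, each with its own resolution (dpiHor and dpiVert in the signature above).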
public class DefaultAttributeFactory
+extends java.lang.Object
+implements AttributeFactory
+
Attribute factory for the default layout content types of PAGE (static/dynamic).
+
+ In static mode, types and attributes are hard-coded. In dynamic mode, only the types are hard-coded; the attributes are generated dynamically from a schema.
+
+
diff --git a/java/PrimaDla/apidoc/stylesheet.css b/java/PrimaDla/apidoc/stylesheet.css
new file mode 100644
index 00000000..0aeaa97f
--- /dev/null
+++ b/java/PrimaDla/apidoc/stylesheet.css
@@ -0,0 +1,474 @@
+/* Javadoc style sheet */
+/*
+Overall document style
+*/
+body {
+ background-color:#ffffff;
+ color:#353833;
+ font-family:Arial, Helvetica, sans-serif;
+ font-size:76%;
+ margin:0;
+}
+a:link, a:visited {
+ text-decoration:none;
+ color:#4c6b87;
+}
+a:hover, a:focus {
+ text-decoration:none;
+ color:#bb7a2a;
+}
+a:active {
+ text-decoration:none;
+ color:#4c6b87;
+}
+a[name] {
+ color:#353833;
+}
+a[name]:hover {
+ text-decoration:none;
+ color:#353833;
+}
+pre {
+ font-size:1.3em;
+}
+h1 {
+ font-size:1.8em;
+}
+h2 {
+ font-size:1.5em;
+}
+h3 {
+ font-size:1.4em;
+}
+h4 {
+ font-size:1.3em;
+}
+h5 {
+ font-size:1.2em;
+}
+h6 {
+ font-size:1.1em;
+}
+ul {
+ list-style-type:disc;
+}
+code, tt {
+ font-size:1.2em;
+}
+dt code {
+ font-size:1.2em;
+}
+table tr td dt code {
+ font-size:1.2em;
+ vertical-align:top;
+}
+sup {
+ font-size:.6em;
+}
+/*
+Document title and Copyright styles
+*/
+.clear {
+ clear:both;
+ height:0px;
+ overflow:hidden;
+}
+.aboutLanguage {
+ float:right;
+ padding:0px 21px;
+ font-size:.8em;
+ z-index:200;
+ margin-top:-7px;
+}
+.legalCopy {
+ margin-left:.5em;
+}
+.bar a, .bar a:link, .bar a:visited, .bar a:active {
+ color:#FFFFFF;
+ text-decoration:none;
+}
+.bar a:hover, .bar a:focus {
+ color:#bb7a2a;
+}
+.tab {
+ background-color:#0066FF;
+ background-image:url(resources/titlebar.gif);
+ background-position:left top;
+ background-repeat:no-repeat;
+ color:#ffffff;
+ padding:8px;
+ width:5em;
+ font-weight:bold;
+}
+/*
+Navigation bar styles
+*/
+.bar {
+ background-image:url(resources/background.gif);
+ background-repeat:repeat-x;
+ color:#FFFFFF;
+ padding:.8em .5em .4em .8em;
+ height:auto;/*height:1.8em;*/
+ font-size:1em;
+ margin:0;
+}
+.topNav {
+ background-image:url(resources/background.gif);
+ background-repeat:repeat-x;
+ color:#FFFFFF;
+ float:left;
+ padding:0;
+ width:100%;
+ clear:right;
+ height:2.8em;
+ padding-top:10px;
+ overflow:hidden;
+}
+.bottomNav {
+ margin-top:10px;
+ background-image:url(resources/background.gif);
+ background-repeat:repeat-x;
+ color:#FFFFFF;
+ float:left;
+ padding:0;
+ width:100%;
+ clear:right;
+ height:2.8em;
+ padding-top:10px;
+ overflow:hidden;
+}
+.subNav {
+ background-color:#dee3e9;
+ border-bottom:1px solid #9eadc0;
+ float:left;
+ width:100%;
+ overflow:hidden;
+}
+.subNav div {
+ clear:left;
+ float:left;
+ padding:0 0 5px 6px;
+}
+ul.navList, ul.subNavList {
+ float:left;
+ margin:0 25px 0 0;
+ padding:0;
+}
+ul.navList li{
+ list-style:none;
+ float:left;
+ padding:3px 6px;
+}
+ul.subNavList li{
+ list-style:none;
+ float:left;
+ font-size:90%;
+}
+.topNav a:link, .topNav a:active, .topNav a:visited, .bottomNav a:link, .bottomNav a:active, .bottomNav a:visited {
+ color:#FFFFFF;
+ text-decoration:none;
+}
+.topNav a:hover, .bottomNav a:hover {
+ text-decoration:none;
+ color:#bb7a2a;
+}
+.navBarCell1Rev {
+ background-image:url(resources/tab.gif);
+ background-color:#a88834;
+ color:#FFFFFF;
+ margin: auto 5px;
+ border:1px solid #c9aa44;
+}
+/*
+Page header and footer styles
+*/
+.header, .footer {
+ clear:both;
+ margin:0 20px;
+ padding:5px 0 0 0;
+}
+.indexHeader {
+ margin:10px;
+ position:relative;
+}
+.indexHeader h1 {
+ font-size:1.3em;
+}
+.title {
+ color:#2c4557;
+ margin:10px 0;
+}
+.subTitle {
+ margin:5px 0 0 0;
+}
+.header ul {
+ margin:0 0 25px 0;
+ padding:0;
+}
+.footer ul {
+ margin:20px 0 5px 0;
+}
+.header ul li, .footer ul li {
+ list-style:none;
+ font-size:1.2em;
+}
+/*
+Heading styles
+*/
+div.details ul.blockList ul.blockList ul.blockList li.blockList h4, div.details ul.blockList ul.blockList ul.blockListLast li.blockList h4 {
+ background-color:#dee3e9;
+ border-top:1px solid #9eadc0;
+ border-bottom:1px solid #9eadc0;
+ margin:0 0 6px -8px;
+ padding:2px 5px;
+}
+ul.blockList ul.blockList ul.blockList li.blockList h3 {
+ background-color:#dee3e9;
+ border-top:1px solid #9eadc0;
+ border-bottom:1px solid #9eadc0;
+ margin:0 0 6px -8px;
+ padding:2px 5px;
+}
+ul.blockList ul.blockList li.blockList h3 {
+ padding:0;
+ margin:15px 0;
+}
+ul.blockList li.blockList h2 {
+ padding:0px 0 20px 0;
+}
+/*
+Page layout container styles
+*/
+.contentContainer, .sourceContainer, .classUseContainer, .serializedFormContainer, .constantValuesContainer {
+ clear:both;
+ padding:10px 20px;
+ position:relative;
+}
+.indexContainer {
+ margin:10px;
+ position:relative;
+ font-size:1.0em;
+}
+.indexContainer h2 {
+ font-size:1.1em;
+ padding:0 0 3px 0;
+}
+.indexContainer ul {
+ margin:0;
+ padding:0;
+}
+.indexContainer ul li {
+ list-style:none;
+}
+.contentContainer .description dl dt, .contentContainer .details dl dt, .serializedFormContainer dl dt {
+ font-size:1.1em;
+ font-weight:bold;
+ margin:10px 0 0 0;
+ color:#4E4E4E;
+}
+.contentContainer .description dl dd, .contentContainer .details dl dd, .serializedFormContainer dl dd {
+ margin:10px 0 10px 20px;
+}
+.serializedFormContainer dl.nameValue dt {
+ margin-left:1px;
+ font-size:1.1em;
+ display:inline;
+ font-weight:bold;
+}
+.serializedFormContainer dl.nameValue dd {
+ margin:0 0 0 1px;
+ font-size:1.1em;
+ display:inline;
+}
+/*
+List styles
+*/
+ul.horizontal li {
+ display:inline;
+ font-size:0.9em;
+}
+ul.inheritance {
+ margin:0;
+ padding:0;
+}
+ul.inheritance li {
+ display:inline;
+ list-style:none;
+}
+ul.inheritance li ul.inheritance {
+ margin-left:15px;
+ padding-left:15px;
+ padding-top:1px;
+}
+ul.blockList, ul.blockListLast {
+ margin:10px 0 10px 0;
+ padding:0;
+}
+ul.blockList li.blockList, ul.blockListLast li.blockList {
+ list-style:none;
+ margin-bottom:25px;
+}
+ul.blockList ul.blockList li.blockList, ul.blockList ul.blockListLast li.blockList {
+ padding:0px 20px 5px 10px;
+ border:1px solid #9eadc0;
+ background-color:#f9f9f9;
+}
+ul.blockList ul.blockList ul.blockList li.blockList, ul.blockList ul.blockList ul.blockListLast li.blockList {
+ padding:0 0 5px 8px;
+ background-color:#ffffff;
+ border:1px solid #9eadc0;
+ border-top:none;
+}
+ul.blockList ul.blockList ul.blockList ul.blockList li.blockList {
+ margin-left:0;
+ padding-left:0;
+ padding-bottom:15px;
+ border:none;
+ border-bottom:1px solid #9eadc0;
+}
+ul.blockList ul.blockList ul.blockList ul.blockList li.blockListLast {
+ list-style:none;
+ border-bottom:none;
+ padding-bottom:0;
+}
+table tr td dl, table tr td dl dt, table tr td dl dd {
+ margin-top:0;
+ margin-bottom:1px;
+}
+/*
+Table styles
+*/
+.contentContainer table, .classUseContainer table, .constantValuesContainer table {
+ border-bottom:1px solid #9eadc0;
+ width:100%;
+}
+.contentContainer ul li table, .classUseContainer ul li table, .constantValuesContainer ul li table {
+ width:100%;
+}
+.contentContainer .description table, .contentContainer .details table {
+ border-bottom:none;
+}
+.contentContainer ul li table th.colOne, .contentContainer ul li table th.colFirst, .contentContainer ul li table th.colLast, .classUseContainer ul li table th, .constantValuesContainer ul li table th, .contentContainer ul li table td.colOne, .contentContainer ul li table td.colFirst, .contentContainer ul li table td.colLast, .classUseContainer ul li table td, .constantValuesContainer ul li table td{
+ vertical-align:top;
+ padding-right:20px;
+}
+.contentContainer ul li table th.colLast, .classUseContainer ul li table th.colLast,.constantValuesContainer ul li table th.colLast,
+.contentContainer ul li table td.colLast, .classUseContainer ul li table td.colLast,.constantValuesContainer ul li table td.colLast,
+.contentContainer ul li table th.colOne, .classUseContainer ul li table th.colOne,
+.contentContainer ul li table td.colOne, .classUseContainer ul li table td.colOne {
+ padding-right:3px;
+}
+.overviewSummary caption, .packageSummary caption, .contentContainer ul.blockList li.blockList caption, .summary caption, .classUseContainer caption, .constantValuesContainer caption {
+ position:relative;
+ text-align:left;
+ background-repeat:no-repeat;
+ color:#FFFFFF;
+ font-weight:bold;
+ clear:none;
+ overflow:hidden;
+ padding:0px;
+ margin:0px;
+}
+caption a:link, caption a:hover, caption a:active, caption a:visited {
+ color:#FFFFFF;
+}
+.overviewSummary caption span, .packageSummary caption span, .contentContainer ul.blockList li.blockList caption span, .summary caption span, .classUseContainer caption span, .constantValuesContainer caption span {
+ white-space:nowrap;
+ padding-top:8px;
+ padding-left:8px;
+ display:block;
+ float:left;
+ background-image:url(resources/titlebar.gif);
+ height:18px;
+}
+.overviewSummary .tabEnd, .packageSummary .tabEnd, .contentContainer ul.blockList li.blockList .tabEnd, .summary .tabEnd, .classUseContainer .tabEnd, .constantValuesContainer .tabEnd {
+ width:10px;
+ background-image:url(resources/titlebar_end.gif);
+ background-repeat:no-repeat;
+ background-position:top right;
+ position:relative;
+ float:left;
+}
+ul.blockList ul.blockList li.blockList table {
+ margin:0 0 12px 0px;
+ width:100%;
+}
+.tableSubHeadingColor {
+ background-color: #EEEEFF;
+}
+.altColor {
+ background-color:#eeeeef;
+}
+.rowColor {
+ background-color:#ffffff;
+}
+.overviewSummary td, .packageSummary td, .contentContainer ul.blockList li.blockList td, .summary td, .classUseContainer td, .constantValuesContainer td {
+ text-align:left;
+ padding:3px 3px 3px 7px;
+}
+th.colFirst, th.colLast, th.colOne, .constantValuesContainer th {
+ background:#dee3e9;
+ border-top:1px solid #9eadc0;
+ border-bottom:1px solid #9eadc0;
+ text-align:left;
+ padding:3px 3px 3px 7px;
+}
+td.colOne a:link, td.colOne a:active, td.colOne a:visited, td.colOne a:hover, td.colFirst a:link, td.colFirst a:active, td.colFirst a:visited, td.colFirst a:hover, td.colLast a:link, td.colLast a:active, td.colLast a:visited, td.colLast a:hover, .constantValuesContainer td a:link, .constantValuesContainer td a:active, .constantValuesContainer td a:visited, .constantValuesContainer td a:hover {
+ font-weight:bold;
+}
+td.colFirst, th.colFirst {
+ border-left:1px solid #9eadc0;
+ white-space:nowrap;
+}
+td.colLast, th.colLast {
+ border-right:1px solid #9eadc0;
+}
+td.colOne, th.colOne {
+ border-right:1px solid #9eadc0;
+ border-left:1px solid #9eadc0;
+}
+table.overviewSummary {
+ padding:0px;
+ margin-left:0px;
+}
+table.overviewSummary td.colFirst, table.overviewSummary th.colFirst,
+table.overviewSummary td.colOne, table.overviewSummary th.colOne {
+ width:25%;
+ vertical-align:middle;
+}
+table.packageSummary td.colFirst, table.overviewSummary th.colFirst {
+ width:25%;
+ vertical-align:middle;
+}
+/*
+Content styles
+*/
+.description pre {
+ margin-top:0;
+}
+.deprecatedContent {
+ margin:0;
+ padding:10px 0;
+}
+.docSummary {
+ padding:0;
+}
+/*
+Formatting effect styles
+*/
+.sourceLineNo {
+ color:green;
+ padding:0 30px 0 0;
+}
+h1.hidden {
+ visibility:hidden;
+ overflow:hidden;
+ font-size:.9em;
+}
+.block {
+ display:block;
+ margin:3px 0 0 0;
+}
+.strong {
+ font-weight:bold;
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/PrimaDla.gwt.xml b/java/PrimaDla/src/org/primaresearch/dla/PrimaDla.gwt.xml
new file mode 100644
index 00000000..649a8a19
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/PrimaDla.gwt.xml
@@ -0,0 +1,26 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/MetaData.java b/java/PrimaDla/src/org/primaresearch/dla/page/MetaData.java
new file mode 100644
index 00000000..5d380aba
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/MetaData.java
@@ -0,0 +1,119 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page;
+
+import java.io.Serializable;
+import java.text.DateFormat;
+import java.text.SimpleDateFormat;
+import java.util.Date;
+
+/**
+ * Class for document metadata such as creation date, comments, ...
+ *
+ * @author Christian Clausner
+ *
+ */
+public class MetaData implements Serializable {
+
+ private static final long serialVersionUID = 1L;
+
+ public static DateFormat DATE_FORMAT = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss"); //TODO cannot be used in GWT projects!
+
+ private String creator = null;
+ private long created = 0L;
+ private long lastModified = 0L;
+ private String comments = null;
+
+ /**
+ * Returns the creating person, institution, and/or tool
+ * @return Creator description
+ */
+ public String getCreator() {
+ return creator;
+ }
+
+ /**
+ * Sets the creating person, institution, and/or tool
+ * @param creator Creator description
+ */
+ public void setCreator(String creator) {
+ this.creator = creator;
+ }
+
+ /**
+ * Returns comments (generic)
+ * @return Comments text
+ */
+ public String getComments() {
+ return comments;
+ }
+
+ /**
+ * Sets generic comments
+ * @param comments Comments text
+ */
+ public void setComments(String comments) {
+ this.comments = comments;
+ }
+
+ /**
+ * Returns the creation date/time
+ * @return Date and time formatted according to the DATE_FORMAT constant
+ */
+ public String getFormattedCreationTime() {
+ return DATE_FORMAT.format(new Date(created));
+ }
+
+ /**
+ * Returns the creation date/time
+ * @return Date object
+ */
+ public Date getCreationTime() {
+ return new Date(created);
+ }
+
+ /**
+ * Returns the modification date/time
+ * @return Date and time formatted according to the DATE_FORMAT constant
+ */
+ public String getFormattedLastModificationTime() {
+ return DATE_FORMAT.format(new Date(lastModified));
+ }
+
+ /**
+ * Returns the modification date/time
+ * @return Date object
+ */
+ public Date getLastModificationTime() {
+ return new Date(lastModified);
+ }
+
+ /**
+ * Sets the creation date/time
+ * @param d Date object
+ */
+ public void setCreationTime(Date d) {
+ created = d.getTime();
+ }
+
+ /**
+ * Sets the modification date/time
+ * @param d Date object
+ */
+ public void setLastModifiedTime(Date d) {
+ lastModified = d.getTime();
+ }
+}
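The formatted getters in MetaData apply the pattern yyyy-MM-dd'T'HH:mm:ss to a timestamp stored as epoch milliseconds. The following minimal sketch shows what that pattern produces. It differs from MetaData in two labelled ways: it pins the time zone to UTC and the locale to Locale.ROOT so the output is reproducible, and it creates a formatter per call because SimpleDateFormat is not thread-safe. The class and method names are hypothetical.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical illustration of the timestamp pattern; not part of the PRImA library. */
final class MetaDataTimestampSketch {

    /** Formats epoch milliseconds with the same pattern as MetaData.DATE_FORMAT, but in UTC. */
    static String formatUtc(long epochMillis) {
        // A fresh formatter per call avoids sharing a SimpleDateFormat between threads.
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        // 0 is the default value of the 'created' and 'lastModified' fields.
        System.out.println(formatUtc(0L));          // 1970-01-01T00:00:00
        // One day later, in milliseconds.
        System.out.println(formatUtc(86_400_000L)); // 1970-01-02T00:00:00
    }
}
```

Because MetaData initialises both timestamps to 0L, a freshly created instance reports the start of the epoch (in the JVM's default time zone) until setCreationTime or setLastModifiedTime is called.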
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/Page.java b/java/PrimaDla/src/org/primaresearch/dla/page/Page.java
new file mode 100644
index 00000000..310119bc
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/Page.java
@@ -0,0 +1,268 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import org.primaresearch.dla.page.io.xml.PageXmlInputOutput;
+import org.primaresearch.dla.page.layout.PageLayout;
+import org.primaresearch.dla.page.layout.converter.ConversionMessage;
+import org.primaresearch.dla.page.layout.converter.ConverterHub;
+import org.primaresearch.dla.page.layout.physical.AttributeContainer;
+import org.primaresearch.dla.page.layout.physical.AttributeFactory;
+import org.primaresearch.dla.page.layout.physical.ContentFactory;
+import org.primaresearch.dla.page.layout.physical.DefaultAttributeFactory;
+import org.primaresearch.dla.page.layout.physical.shared.ContentType;
+import org.primaresearch.ident.Id;
+import org.primaresearch.ident.IdRegister;
+import org.primaresearch.ident.IdRegister.InvalidIdException;
+import org.primaresearch.ident.XmlIdRegister;
+import org.primaresearch.io.FormatModel;
+import org.primaresearch.io.FormatVersion;
+import org.primaresearch.shared.variable.VariableMap;
+
+/**
+ * Central class representing one page of a document (e.g. a book page).
+ *
+ * @author Christian Clausner
+ */
+public class Page implements AttributeContainer {
+
+ private PageLayout layout;
+ private ContentFactory contentFactory;
+ private IdRegister idRegister;
+ private MetaData metaData;
+ private String imageFilename;
+ private Id gtsId = null;
+ private FormatVersion formatVersion = null;
+
+ private VariableMap attributes;
+
+ private List<AlternativeImage> alternativeImages;
+
+ private MeasurementUnit measurementUnit = MeasurementUnit.PIXEL;
+
+
+ /**
+ * Returns the version of the page format.
+ * @return Version object
+ */
+ public FormatVersion getFormatVersion() {
+ return formatVersion;
+ }
+
+ /**
+ * Converts the page to the specified format.
+ * Note that this might change the page layout.
+ */
+ public List<ConversionMessage> setFormatVersion(FormatModel formatModel) {
+ contentFactory.setAttributeFactory(createAttributeFactory(formatModel));
+ List<ConversionMessage> ret = ConverterHub.convert(this, formatModel);
+ this.formatVersion = formatModel.getVersion();
+ return ret;
+ }
+
+ private AttributeFactory createAttributeFactory(FormatModel formatModel) {
+ AttributeFactory attribFactory = null;
+ if (formatModel != null) {
+ attribFactory = new DefaultAttributeFactory(formatModel);
+ } else {
+ attribFactory = new DefaultAttributeFactory();
+ }
+ return attribFactory;
+ }
+
+ /**
+ * Returns the main image file that is associated with this page.
+ * @return Filename
+ */
+ public String getImageFilename() {
+ return imageFilename;
+ }
+
+ /**
+ * Sets the main image file that is associated with this page.
+ * @param imageFilename Filename (without path)
+ */
+ public void setImageFilename(String imageFilename) {
+ this.imageFilename = imageFilename;
+ }
+
+ /**
+ * Constructor using the default page format.
+ */
+ public Page() {
+ this(PageXmlInputOutput.getLatestSchemaModel());
+ }
+
+ /**
+ * Constructor using dynamic page format.
+ * @param formatModel Model for dynamic format
+ */
+ public Page(FormatModel formatModel) {
+ this.idRegister = new XmlIdRegister();
+ this.formatVersion = formatModel.getVersion();
+ AttributeFactory attrFactory = createAttributeFactory(formatModel);
+ contentFactory = new ContentFactory(idRegister, attrFactory);
+ layout = new PageLayout(contentFactory);
+ metaData = new MetaData();
+ attributes = attrFactory.createAttributes(ContentType.Page);
+ }
+
+ /**
+ * Returns the page layout
+ * @return Layout object
+ */
+ public PageLayout getLayout() {
+ return layout;
+ }
+
+ /**
+ * Returns the page metadata
+ * @return Metadata object
+ */
+ public MetaData getMetaData() {
+ return metaData;
+ }
+
+ /**
+ * Returns the ground truth and storage ID of this page
+ * @return ID object
+ */
+ public Id getGtsId() {
+ return gtsId;
+ }
+
+ /**
+ * Sets the ground truth and storage ID of this page
+ * @param gtsId ID object
+ * @throws InvalidIdException ID is being used already (must be unique)
+ */
+ public void setGtsId(Id gtsId) throws InvalidIdException {
+ idRegister.registerId(gtsId, this.gtsId);
+ this.gtsId = gtsId;
+ }
+
+ /**
+ * Sets the ground truth and storage ID of this page
+ * @param gtsId ID text
+ * @throws InvalidIdException ID is being used already (must be unique) or the format is invalid
+ */
+ public void setGtsId(String gtsId) throws InvalidIdException {
+ this.gtsId = idRegister.registerId(gtsId, this.gtsId);
+ }
+
+ @Override
+ public VariableMap getAttributes() {
+ return attributes;
+ }
+
+ /**
+ * Returns a list of alternative images that are associated with this page (e.g. bilevel/bitonal/black-and-white image)
+ * @return List with image objects
+ */
+ public List<AlternativeImage> getAlternativeImages() {
+ if (alternativeImages == null)
+ alternativeImages = new ArrayList<AlternativeImage>();
+ return alternativeImages;
+ }
+
+
+
+ /**
+ * Alternative document page image (e.g. black-and-white or grey level)
+ *
+ * @author Christian Clausner
+ *
+ */
+ public static final class AlternativeImage {
+ private String filename;
+ private String comments;
+
+ public AlternativeImage(String filename) {
+ this.filename = filename;
+ }
+
+ public String getFilename() {
+ return filename;
+ }
+ public void setFilename(String filename) {
+ this.filename = filename;
+ }
+ public String getComments() {
+ return comments;
+ }
+ public void setComments(String comments) {
+ this.comments = comments;
+ }
+ }
+
+ /**
+ * Returns the measurement unit for coordinates
+ * @return Current unit
+ */
+ public MeasurementUnit getMeasurementUnit() {
+ return measurementUnit;
+ }
+
+ /**
+ * Sets the measurement unit for coordinates
+ * @param unit Unit object
+ */
+ public void setMeasurementUnit(MeasurementUnit unit) {
+ this.measurementUnit = unit;
+ }
+
+
+
+
+ /**
+ * Measurement unit for coordinates.
+ * Introduced to support ALTO XML files. Use PageXmlInputOutput.postProcessPage(...)
+ * to scale all coordinates using the image information.
+ */
+ public static class MeasurementUnit {
+ public static final MeasurementUnit PIXEL = new MeasurementUnit("pixel", 0.0);
+ /** One tenth of a mm */
+ public static final MeasurementUnit MM_BY_10 = new MeasurementUnit("mm10", 254.0);
+ /** 1200th of an inch */
+ public static final MeasurementUnit INCH_BY_1200 = new MeasurementUnit("inch1200", 1200.0);
+
+ private String name;
+ private double discreteValuesPerInch;
+
+ public MeasurementUnit(String name, double discreteValuesPerInch) {
+ this.name = name;
+ this.discreteValuesPerInch = discreteValuesPerInch;
+ }
+
+ public double getDiscreteValuesPerInch() {
+ return discreteValuesPerInch;
+ }
+
+ public String getName() {
+ return name;
+ }
+
+ @Override
+ public boolean equals(Object other) {
+ return other instanceof MeasurementUnit && name.equals(((MeasurementUnit)other).getName());
+ }
+ @Override
+ public int hashCode() { return name.hashCode(); } // keep consistent with equals()
+ }
+}
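MeasurementUnit above compares instances by name. The sketch below shows that pattern with a hypothetical stand-in class (Unit), pairing equals with a hashCode derived from the same field so that equal instances are also found in hash-based collections. It illustrates the general Java contract and is not taken from the library.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical illustration of name-based equality; not part of the PRImA library. */
final class NamedUnitSketch {

    /** Stand-in for a unit that is identified by its name only. */
    static final class Unit {
        private final String name;
        private final double valuesPerInch;

        Unit(String name, double valuesPerInch) {
            this.name = Objects.requireNonNull(name, "name");
            this.valuesPerInch = valuesPerInch;
        }

        double getValuesPerInch() {
            return valuesPerInch;
        }

        @Override
        public boolean equals(Object other) {
            // instanceof is false for null, so no separate null check is needed.
            return other instanceof Unit && name.equals(((Unit) other).name);
        }

        @Override
        public int hashCode() {
            // Derived from the same field that equals() compares.
            return name.hashCode();
        }
    }

    public static void main(String[] args) {
        Set<Unit> units = new HashSet<>();
        units.add(new Unit("mm10", 254.0));
        // A second instance with the same name is recognised as equal.
        System.out.println(units.contains(new Unit("mm10", 254.0)));
        // A different name is not.
        System.out.println(units.contains(new Unit("pixel", 0.0)));
    }
}
```

Without a matching hashCode, the first lookup could fail even though equals returns true, because HashSet locates candidates by hash code before calling equals.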
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/FileInput.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/FileInput.java
new file mode 100644
index 00000000..d44e2e67
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/FileInput.java
@@ -0,0 +1,38 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io;
+
+import java.io.File;
+
+/**
+ * Input source implementation for files
+ *
+ * @author Christian Clausner
+ *
+ */
+public class FileInput implements InputSource {
+
+
+ private File file;
+
+ public FileInput(File file) {
+ this.file = file;
+ }
+
+ public File getFile() {
+ return file;
+ }
+}
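FileInput wraps a java.io.File behind the InputSource interface, which declares no methods of its own. One way a reader could consume such a marker interface is to check the concrete type before accessing the file. The sketch below shows that idea with hypothetical stand-in types (Source, FileSource); how the library's readers actually resolve an InputSource is not shown in this change.

```java
import java.io.File;

/** Hypothetical illustration of dispatching on a marker interface; not part of the PRImA library. */
final class InputSourceDispatchSketch {

    /** Stand-in for a marker interface without methods. */
    interface Source { }

    /** Stand-in for a file-backed source. */
    static final class FileSource implements Source {
        private final File file;

        FileSource(File file) {
            this.file = file;
        }

        File getFile() {
            return file;
        }
    }

    /** Returns a short description of where data would be read from. */
    static String describe(Source source) {
        if (source instanceof FileSource) {
            return "file:" + ((FileSource) source).getFile().getName();
        }
        // Unknown source types are rejected explicitly rather than ignored.
        throw new IllegalArgumentException("Unsupported input source: " + source);
    }

    public static void main(String[] args) {
        System.out.println(describe(new FileSource(new File("page.xml")))); // file:page.xml
    }
}
```

The trade-off of this design is that every new source type needs a matching branch in each reader, whereas an interface with an open-stream method would let sources supply their own data.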
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/FileTarget.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/FileTarget.java
new file mode 100644
index 00000000..c4b0bad5
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/FileTarget.java
@@ -0,0 +1,38 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io;
+
+import java.io.File;
+
+/**
+ * OutputTarget implementation for files.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class FileTarget implements OutputTarget {
+
+ private File file;
+
+ public FileTarget(File file) {
+ this.file = file;
+ }
+
+ public File getFile() {
+ return file;
+ }
+
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/InputSource.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/InputSource.java
new file mode 100644
index 00000000..aabc69f5
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/InputSource.java
@@ -0,0 +1,26 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io;
+
+/**
+ * Interface for input sources such as files or URLs.
+ *
+ * @author Christian Clausner
+ *
+ */
+public interface InputSource {
+
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/OutputTarget.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/OutputTarget.java
new file mode 100644
index 00000000..b1172c85
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/OutputTarget.java
@@ -0,0 +1,26 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io;
+
+/**
+ * Interface for output targets (e.g. files).
+ *
+ * @author Christian Clausner
+ *
+ */
+public interface OutputTarget {
+
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/PageReader.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/PageReader.java
new file mode 100644
index 00000000..e4a74ddd
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/PageReader.java
@@ -0,0 +1,38 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io;
+
+import org.primaresearch.dla.page.Page;
+import org.primaresearch.io.UnsupportedFormatVersionException;
+import org.primaresearch.io.xml.XmlModelAndValidatorProvider.UnsupportedSchemaVersionException;
+
+/**
+ * Reader interface for PAGE.
+ *
+ * @author Christian Clausner
+ *
+ */
+public interface PageReader {
+
+ /**
+ * Reads a PAGE input source and returns a Page object.
+ * @param source Input source of some kind (e.g. FileInput).
+ * @return Page object
+ * @throws UnsupportedFormatVersionException Format version not supported (e.g. {@link UnsupportedSchemaVersionException})
+ */
+ public Page read(InputSource source) throws UnsupportedFormatVersionException;
+
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/PageWriter.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/PageWriter.java
new file mode 100644
index 00000000..3743ce7f
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/PageWriter.java
@@ -0,0 +1,43 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io;
+
+import org.primaresearch.dla.page.Page;
+import org.primaresearch.io.UnsupportedFormatVersionException;
+
+/**
+ * Interface for writing PAGE.
+ *
+ * @author Christian Clausner
+ *
+ */
+public interface PageWriter {
+
+ /**
+ * Writes the given Page object to an output target.
+ *
+ * @return Returns true if written successfully, false otherwise.
+ */
+ public boolean write(Page page, OutputTarget target) throws UnsupportedFormatVersionException;
+
+ /**
+ * Validates the given page object against the format it is set to.
+ *
+ * @return Returns true if valid, false otherwise.
+ */
+ public boolean validate(Page page) throws UnsupportedFormatVersionException;
+
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/UrlInput.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/UrlInput.java
new file mode 100644
index 00000000..7531854e
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/UrlInput.java
@@ -0,0 +1,38 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io;
+
+import java.net.URL;
+
+/**
+ * InputSource implementation for URLs.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class UrlInput implements InputSource {
+
+ private URL url;
+
+ public UrlInput(URL url) {
+ this.url = url;
+ }
+
+ public URL getUrl() {
+ return url;
+ }
+
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/DefaultXmlNames.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/DefaultXmlNames.java
new file mode 100644
index 00000000..95de0c91
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/DefaultXmlNames.java
@@ -0,0 +1,169 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml;
+
+import org.primaresearch.dla.page.layout.physical.shared.ContentType;
+import org.primaresearch.dla.page.layout.physical.shared.LowLevelTextType;
+import org.primaresearch.dla.page.layout.physical.shared.RegionType;
+
+/**
+ * Class containing hard coded XML element and attribute names for the PAGE format.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class DefaultXmlNames implements XmlNameProvider {
+ public static final String ELEMENT_PcGts = "PcGts";
+ public static final String ELEMENT_Page = "Page";
+ public static final String ELEMENT_TextRegion = "TextRegion";
+ public static final String ELEMENT_ImageRegion = "ImageRegion";
+ public static final String ELEMENT_LineDrawingRegion = "LineDrawingRegion";
+ public static final String ELEMENT_GraphicRegion = "GraphicRegion";
+ public static final String ELEMENT_TableRegion = "TableRegion";
+ public static final String ELEMENT_ChartRegion = "ChartRegion";
+ public static final String ELEMENT_SeparatorRegion = "SeparatorRegion";
+ public static final String ELEMENT_MathsRegion = "MathsRegion";
+ public static final String ELEMENT_NoiseRegion = "NoiseRegion";
+ public static final String ELEMENT_FrameRegion = "FrameRegion";
+ public static final String ELEMENT_UnknownRegion = "UnknownRegion";
+ public static final String ELEMENT_AdvertRegion = "AdvertRegion";
+ public static final String ELEMENT_ChemRegion = "ChemRegion";
+ public static final String ELEMENT_MusicRegion = "MusicRegion";
+
+ public static final String ELEMENT_Border = "Border";
+ public static final String ELEMENT_ReadingOrder = "ReadingOrder";
+ public static final String ELEMENT_RegionRef = "RegionRef";
+ public static final String ELEMENT_UnorderedGroup = "UnorderedGroup";
+ public static final String ELEMENT_OrderedGroup = "OrderedGroup";
+ public static final String ELEMENT_RegionRefIndexed = "RegionRefIndexed";
+ public static final String ELEMENT_UnorderedGroupIndexed = "UnorderedGroupIndexed";
+ public static final String ELEMENT_OrderedGroupIndexed = "OrderedGroupIndexed";
+ public static final String ELEMENT_Layers = "Layers";
+ public static final String ELEMENT_Layer = "Layer";
+ public static final String ELEMENT_PrintSpace = "PrintSpace";
+
+ public static final String ELEMENT_Coords = "Coords";
+ public static final String ELEMENT_Point = "Point";
+ public static final String ELEMENT_TextEquiv = "TextEquiv";
+ public static final String ELEMENT_TextLine = "TextLine";
+ public static final String ELEMENT_Word = "Word";
+ public static final String ELEMENT_Glyph = "Glyph";
+ public static final String ELEMENT_PlainText = "PlainText";
+ public static final String ELEMENT_Unicode = "Unicode";
+ public static final String ELEMENT_Baseline = "Baseline";
+
+ public static final String ELEMENT_Metadata = "Metadata";
+ public static final String ELEMENT_Creator = "Creator";
+ public static final String ELEMENT_Created = "Created";
+ public static final String ELEMENT_LastChange = "LastChange";
+ public static final String ELEMENT_Comments = "Comments";
+
+ public static final String ELEMENT_AlternativeImage = "AlternativeImage";
+ public static final String ELEMENT_Relations = "Relations";
+ public static final String ELEMENT_Relation = "Relation";
+ public static final String ELEMENT_TextStyle = "TextStyle";
+
+
+ public static final String ATTR_pcGtsId = "pcGtsId";
+ public static final String ATTR_imageFilename = "imageFilename";
+ public static final String ATTR_imageWidth = "imageWidth";
+ public static final String ATTR_imageHeight = "imageHeight";
+ public static final String ATTR_id = "id";
+ public static final String ATTR_x = "x";
+ public static final String ATTR_y = "y";
+ public static final String ATTR_orientation = "orientation";
+ public static final String ATTR_readingOrientation = "readingOrientation";
+ public static final String ATTR_readingDirection = "readingDirection";
+ public static final String ATTR_leading = "leading";
+ public static final String ATTR_kerning = "kerning";
+ public static final String ATTR_fontSize = "fontSize";
+ public static final String ATTR_type = "type";
+ public static final String ATTR_textColour = "textColour";
+ public static final String ATTR_bgColour = "bgColour";
+ public static final String ATTR_reverseVideo = "reverseVideo";
+ public static final String ATTR_indented = "indented";
+ public static final String ATTR_primaryLanguage = "primaryLanguage";
+ public static final String ATTR_secondaryLanguage = "secondaryLanguage";
+ public static final String ATTR_language = "language";
+ public static final String ATTR_primaryScript = "primaryScript";
+ public static final String ATTR_secondaryScript = "secondaryScript";
+ public static final String ATTR_colourDepth = "colourDepth";
+ public static final String ATTR_embText = "embText";
+ public static final String ATTR_penColour = "penColour";
+ public static final String ATTR_numColours = "numColours";
+ public static final String ATTR_rows = "rows";
+ public static final String ATTR_columns = "columns";
+ public static final String ATTR_lineColour = "lineColour";
+ public static final String ATTR_lineSeparators = "lineSeparators";
+ public static final String ATTR_colour = "colour";
+ public static final String ATTR_borderPresent = "borderPresent";
+ public static final String ATTR_symbol = "symbol";
+ public static final String ATTR_ligature = "ligature";
+ public static final String ATTR_regionRef = "regionRef";
+ public static final String ATTR_index = "index";
+ public static final String ATTR_zIndex = "zIndex";
+ public static final String ATTR_points = "points";
+ public static final String ATTR_caption = "caption";
+ public static final String ATTR_conf = "conf";
+ public static final String ATTR_custom = "custom";
+ public static final String ATTR_comments = "comments";
+ public static final String ATTR_filename = "filename";
+ public static final String ATTR_bold = "bold";
+ public static final String ATTR_italic = "italic";
+ public static final String ATTR_underlined = "underlined";
+ public static final String ATTR_strikethrough = "strikethrough";
+ public static final String ATTR_subscript = "subscript";
+ public static final String ATTR_superscript = "superscript";
+ public static final String ATTR_smallCaps = "smallCaps";
+ public static final String ATTR_letterSpaced = "letterSpaced";
+
+ @Override
+ public String getXmlName(ContentType type) {
+ if (type == RegionType.ChartRegion)
+ return ELEMENT_ChartRegion;
+ if (type == RegionType.GraphicRegion)
+ return ELEMENT_GraphicRegion;
+ if (type == RegionType.ImageRegion)
+ return ELEMENT_ImageRegion;
+ if (type == RegionType.LineDrawingRegion)
+ return ELEMENT_LineDrawingRegion;
+ if (type == RegionType.MathsRegion)
+ return ELEMENT_MathsRegion;
+ if (type == RegionType.NoiseRegion)
+ return ELEMENT_NoiseRegion;
+ if (type == RegionType.SeparatorRegion)
+ return ELEMENT_SeparatorRegion;
+ if (type == RegionType.AdvertRegion)
+ return ELEMENT_AdvertRegion;
+ if (type == RegionType.ChemRegion)
+ return ELEMENT_ChemRegion;
+ if (type == RegionType.MusicRegion)
+ return ELEMENT_MusicRegion;
+ if (type == RegionType.TableRegion)
+ return ELEMENT_TableRegion;
+ if (type == RegionType.TextRegion)
+ return ELEMENT_TextRegion;
+ if (type == RegionType.UnknownRegion)
+ return ELEMENT_UnknownRegion;
+ if (type == LowLevelTextType.TextLine)
+ return ELEMENT_TextLine;
+ if (type == LowLevelTextType.Word)
+ return ELEMENT_Word;
+ if (type == LowLevelTextType.Glyph)
+ return ELEMENT_Glyph;
+ return type.getName();
+ }
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/MetsMultiPageReader.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/MetsMultiPageReader.java
new file mode 100644
index 00000000..2244bdb2
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/MetsMultiPageReader.java
@@ -0,0 +1,168 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml;
+
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.ArrayList;
+import java.util.List;
+
+import javax.xml.parsers.SAXParser;
+import javax.xml.parsers.SAXParserFactory;
+
+import org.primaresearch.dla.page.io.FileInput;
+import org.primaresearch.dla.page.io.InputSource;
+import org.primaresearch.dla.page.io.UrlInput;
+import org.primaresearch.io.xml.IOError;
+import org.xml.sax.Attributes;
+import org.xml.sax.SAXException;
+import org.xml.sax.XMLReader;
+import org.xml.sax.helpers.DefaultHandler;
+
+
+/**
+ * Reader for multi-page documents defined in METS XML format.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class MetsMultiPageReader {
+
+ private SAXParser parser;
+ private SaxMetsHandler metsHandler = null;
+ private PageErrorHandler lastErrors;
+
+ public MetsMultiPageReader() {
+ createParser();
+ }
+
+ public List<String> read(InputSource source) {
+
+ lastErrors = new PageErrorHandler();
+
+ parse(source, lastErrors);
+
+ List<String> pageFiles = null;
+
+ if (!lastErrors.hasErrors())
+ pageFiles = metsHandler.getPageFiles();
+
+ return pageFiles;
+ }
+
+ /**
+ * Parses a METS file
+ */
+ private void parse(InputSource input, PageErrorHandler errorHandler) {
+ try{
+ XMLReader reader = parser.getXMLReader();
+ reader.setErrorHandler(errorHandler);
+ reader.setContentHandler(metsHandler);
+ InputStream inputStream = getInputStream(input);
+ if (inputStream == null)
+ return;
+ org.xml.sax.InputSource saxInput = new org.xml.sax.InputSource(inputStream);
+ //saxInput.setEncoding("utf-8");
+ reader.parse(saxInput);
+ } catch (Throwable t) {
+ t.printStackTrace();
+ }
+ }
+
+ /**
+ * Creates the SAX parser for METS XML.
+ */
+ private void createParser() {
+ try {
+ // Obtain a new instance of a SAXParserFactory.
+ SAXParserFactory factory = SAXParserFactory.newInstance();
+ // Specifies that the parser produced by this code will provide support for XML namespaces.
+ factory.setNamespaceAware(true);
+ factory.setValidating(false);
+
+ this.metsHandler = new SaxMetsHandler();
+
+ // Creates a new instance of a SAXParser using the currently configured factory parameters.
+ parser = factory.newSAXParser();
+
+ } catch (Throwable t) {
+ t.printStackTrace();
+ }
+ }
+
+ private InputStream getInputStream(InputSource source) {
+ if (source instanceof FileInput) {
+ File f = ((FileInput)source).getFile();
+ try {
+ return new FileInputStream(f);
+ } catch (FileNotFoundException e) {
+ e.printStackTrace();
+ lastErrors.getErrors().add(new IOError("Could not open stream from file: "+e.getMessage()));
+ }
+ }
+ else if (source instanceof UrlInput) {
+ try {
+ return ((UrlInput)source).getUrl().openStream();
+ } catch (IOException e) {
+ e.printStackTrace();
+ lastErrors.getErrors().add(new IOError("Could not open stream from URL: "+e.getMessage()));
+ }
+ }
+ else
+ throw new IllegalArgumentException("Only FileInput and UrlInput allowed for MetsMultiPageReader");
+ return null;
+ }
+
+
+ /**
+ * SAX handler implementation to parse METS.
+ *
+ * @author Christian Clausner
+ */
+ private static class SaxMetsHandler extends DefaultHandler {
+
+ private static final String ELEMENT_FLocat = "FLocat";
+ private static final String ATTR_href = "xlink:href";
+
+ private List<String> pageFiles = new ArrayList<String>();
+
+ public List<String> getPageFiles() {
+ return pageFiles;
+ }
+
+ /**
+ * Receive notification of the start of an element.
+ * @param namespaceURI - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
+ * @param localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
+ * @param qName - The qualified name (with prefix), or the empty string if qualified names are not available.
+ * @param atts - The attributes attached to the element. If there are no attributes, it shall be an empty Attributes object.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void startElement(String namespaceURI, String localName, String qName, Attributes atts)
+ throws SAXException {
+
+ if (ELEMENT_FLocat.equals(localName)){
+ int i;
+ if ((i = atts.getIndex(ATTR_href)) >= 0) {
+ pageFiles.add(atts.getValue(i));
+ }
+ }
+ }
+ }
+}
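
Note on the METS handler above: it looks up the attribute by its qualified name `xlink:href`, which depends on the prefix used in the input file. The following is a minimal, self-contained sketch (not part of this patch) of the same FLocat collection that resolves the attribute by namespace URI instead. The class name `MetsHrefSketch` and the method `collectHrefs` are hypothetical and use only JDK SAX classes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not part of the PRImA library.
public class MetsHrefSketch {

    // Namespace URI of XLink; the prefix in the file may differ from "xlink"
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    /** Collects the href attribute of every FLocat element in the given XML. */
    static List<String> collectHrefs(String metsXml) {
        final List<String> hrefs = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // required for localName and namespace lookups
            SAXParser parser = factory.newSAXParser();
            parser.parse(new ByteArrayInputStream(metsXml.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName, String qName, Attributes atts) {
                            if ("FLocat".equals(localName)) {
                                // Look up by namespace URI + local name instead of "xlink:href"
                                String href = atts.getValue(XLINK_NS, "href");
                                if (href != null)
                                    hrefs.add(href);
                            }
                        }
                    });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse METS input", e);
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String xml = "<mets xmlns=\"http://www.loc.gov/METS/\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">"
                + "<fileSec><fileGrp>"
                + "<file><FLocat xlink:href=\"page1.xml\"/></file>"
                + "<file><FLocat xlink:href=\"page2.xml\"/></file>"
                + "</fileGrp></fileSec></mets>";
        System.out.println(collectHrefs(xml));
    }
}
```

Whether the reader in this patch should switch to the namespace-based lookup is a design choice for the author; the sketch only illustrates the alternative.
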
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/PageErrorHandler.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/PageErrorHandler.java
new file mode 100644
index 00000000..6eab334b
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/PageErrorHandler.java
@@ -0,0 +1,85 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import org.primaresearch.io.xml.IOError;
+import org.primaresearch.io.xml.XmlValidationError;
+import org.xml.sax.ErrorHandler;
+import org.xml.sax.SAXException;
+import org.xml.sax.SAXParseException;
+
+/**
+ * Error handler implementation that collects errors and warnings.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class PageErrorHandler implements ErrorHandler {
+
+ List errors = new ArrayList();
+ List warnings = new ArrayList();
+
+ @Override
+ public void error(SAXParseException exc) throws SAXException {
+ errors.add(new XmlValidationError(exc.getMessage(), "Line "+exc.getLineNumber()+", Column: "+exc.getColumnNumber()));
+ }
+
+ @Override
+ public void fatalError(SAXParseException exc) throws SAXException {
+ errors.add(new XmlValidationError(exc.getMessage(), "Line "+exc.getLineNumber()+", Column: "+exc.getColumnNumber()));
+ }
+
+ @Override
+ public void warning(SAXParseException exc) throws SAXException {
+ warnings.add(new XmlValidationError(exc.getMessage(), "Line "+exc.getLineNumber()+", Column: "+exc.getColumnNumber()));
+ }
+
+ /**
+ * Checks if there were errors
+ * @return true if errors were registered
+ */
+ public boolean hasErrors() {
+ return !errors.isEmpty();
+ }
+
+ /**
+ * Checks if there were warnings
+ * @return true if warnings were registered
+ */
+ public boolean hasWarnings() {
+ return !warnings.isEmpty();
+ }
+
+ /**
+ * Returns all registered errors
+ * @return List of error objects
+ */
+ public List getErrors() {
+ return errors;
+ }
+
+ /**
+ * Returns all registered warnings
+ * @return List of warning objects
+ */
+ public List getWarnings() {
+ return warnings;
+ }
+
+}
\ No newline at end of file
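
Note on the error handler above: a collecting `ErrorHandler` records problems instead of aborting, but a SAX parser may still throw after it has reported a fatal error, so callers need to catch the exception as well as inspect the collected list. The following is a minimal, self-contained sketch (not part of this patch) of that pattern. `CollectingHandlerSketch` and `Collector` are hypothetical names, and messages are stored as plain strings rather than the library's `XmlValidationError` objects.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical sketch, not part of the PRImA library.
public class CollectingHandlerSketch {

    /** Stand-in for PageErrorHandler that stores messages as strings. */
    static class Collector implements ErrorHandler {
        final List<String> errors = new ArrayList<>();
        final List<String> warnings = new ArrayList<>();

        private static String describe(SAXParseException e) {
            return "Line " + e.getLineNumber() + ", Column " + e.getColumnNumber() + ": " + e.getMessage();
        }

        @Override public void error(SAXParseException e) { errors.add(describe(e)); }
        @Override public void fatalError(SAXParseException e) { errors.add(describe(e)); }
        @Override public void warning(SAXParseException e) { warnings.add(describe(e)); }
    }

    /** Parses the XML string and returns the handler with everything it collected. */
    static Collector parse(String xml) {
        Collector collector = new Collector();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setErrorHandler(collector);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // The parser may throw after reporting a fatal error to the handler;
            // record the exception only if the handler collected nothing.
            if (collector.errors.isEmpty())
                collector.errors.add(String.valueOf(e.getMessage()));
        }
        return collector;
    }

    public static void main(String[] args) {
        System.out.println(parse("<a><b/></a>").errors.isEmpty()); // well-formed input
        System.out.println(parse("<a><b></a>").errors.isEmpty());  // mismatched tags
    }
}
```
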
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/PageXmlInputOutput.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/PageXmlInputOutput.java
new file mode 100644
index 00000000..95752a64
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/PageXmlInputOutput.java
@@ -0,0 +1,355 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml;
+
+import java.io.File;
+import java.net.URL;
+
+import org.primaresearch.dla.page.Page;
+import org.primaresearch.dla.page.Page.MeasurementUnit;
+import org.primaresearch.dla.page.io.FileInput;
+import org.primaresearch.dla.page.io.FileTarget;
+import org.primaresearch.dla.page.io.UrlInput;
+import org.primaresearch.dla.page.layout.PageLayout;
+import org.primaresearch.dla.page.layout.physical.ContentIterator;
+import org.primaresearch.dla.page.layout.physical.shared.ContentType;
+import org.primaresearch.dla.page.layout.physical.shared.LowLevelTextType;
+import org.primaresearch.io.FormatModel;
+import org.primaresearch.io.FormatModelSource;
+import org.primaresearch.io.FormatVersion;
+import org.primaresearch.io.UnsupportedFormatVersionException;
+import org.primaresearch.io.xml.SchemaModelParser;
+import org.primaresearch.io.xml.XmlFormatVersion;
+import org.primaresearch.io.xml.XmlValidator;
+import org.primaresearch.io.xml.XmlModelAndValidatorProvider;
+import org.primaresearch.io.xml.XmlModelAndValidatorProvider.NoSchemasException;
+import org.primaresearch.io.xml.XmlModelAndValidatorProvider.UnsupportedSchemaVersionException;
+import org.primaresearch.maths.geometry.Point;
+import org.primaresearch.maths.geometry.Polygon;
+
+/**
+ * Central access point for reading and writing PAGE XML.
+ *
+ * Note: Page objects can only be saved using the XML format they are set to.
+ * Call Page.setFormatVersion to convert the page object to another version
+ * if necessary.
+ *
+ * To validate a page object without writing a file call the validate() method
+ * of a PageWriter.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class PageXmlInputOutput implements FormatModelSource {
+
+ private static PageXmlInputOutput instance = null;
+ private XmlModelAndValidatorProvider validatorProvider;
+
+ /**
+ * Constructor (private because this is a singleton).
+ */
+ private PageXmlInputOutput() {
+ try {
+ validatorProvider = new PageXmlModelAndValidatorProvider(); //Provider supporting the default schemas
+ } catch (NoSchemasException e) {
+ e.printStackTrace();
+ }
+ }
+
+ /**
+ * Returns the instance of the singleton.
+ */
+ public static PageXmlInputOutput getInstance() {
+ if (instance == null)
+ instance = new PageXmlInputOutput();
+ return instance;
+ }
+
+ /**
+ * Returns the validator provider of the singleton.
+ */
+ private static XmlModelAndValidatorProvider getValidatorProvider() {
+ return getInstance().validatorProvider;
+ }
+
+ /**
+ * Sets the validator provider of the singleton.
+ */
+ private static void setValidatorProvider(XmlModelAndValidatorProvider provider) {
+ getInstance().validatorProvider = provider;
+ }
+
+ /**
+ * Sets the location of additional schema files and assumes the default schema file name 'pagecontent.xsd'.
+ * @param rootFolder Root of the schema folder structure containing the schema files.
+ * @throws NoSchemasException No schemas found at the given location
+ */
+ public static void setAdditionalSchemaLocation(String rootFolder) throws NoSchemasException {
+ setAdditionalSchemaLocation(rootFolder, "pagecontent.xsd");
+ }
+
+ /**
+ * Sets the location of additional schema files and searches for schemas having the specified name.
+ *
+ * @param rootFolder Root of the schema folder structure containing the schema files
+ * @param schemaFilename Usually a filename with extension .xsd
+ * @throws NoSchemasException No schemas found at the given location
+ */
+ public static void setAdditionalSchemaLocation(String rootFolder, String schemaFilename) throws NoSchemasException {
+ if (rootFolder == null)
+ setValidatorProvider(new PageXmlModelAndValidatorProvider());
+ else
+ setValidatorProvider(new PageXmlModelAndValidatorProvider(rootFolder, schemaFilename));
+ }
+
+ /**
+ * Creates and returns an XML writer for PAGE using the latest schema version.
+ *
+ * @throws UnsupportedSchemaVersionException Schema file could not be found.
+ */
+ public static XmlPageWriter getWriterForLastestXmlFormat(/*boolean validation*/) throws UnsupportedSchemaVersionException {
+ XmlValidator validator = null;
+ //if (validation) {
+ XmlModelAndValidatorProvider validatorProvider = getValidatorProvider();
+ if (validatorProvider != null) {
+ validator = validatorProvider.getValidator(new XmlFormatVersion("2013-07-15"));
+ }
+ //}
+ return new XmlPageWriter_2013_07_15(validator);
+ }
+
+ /**
+ * Creates and returns an XML writer for PAGE using the specified schema version.
+ * This might require the schema location to be set beforehand.
+ *
+ * @throws UnsupportedSchemaVersionException The schema file could not be found.
+ */
+ public static XmlPageWriter getWriter(XmlFormatVersion schemaVersion) throws UnsupportedSchemaVersionException {
+ XmlValidator validator = null;
+
+ XmlModelAndValidatorProvider validatorProvider = getValidatorProvider();
+ if (validatorProvider != null) {
+ validator = validatorProvider.getValidator(schemaVersion);
+ }
+
+ if (new XmlFormatVersion("2013-07-15").equals(schemaVersion))
+ return new XmlPageWriter_2013_07_15(validator);
+
+ //Legacy
+ return new XmlPageWriter_2010_03_19(validator);
+ }
+
+ /**
+ * Saves the given document page to an XML file at the specified
+ * location, using the latest PAGE XML format or the format the
+ * page object has been loaded with.
+ *
+ * @param page Page object
+ * @param filePath Target file
+ * @throws UnsupportedSchemaVersionException Schema file could not be found.
+ */
+ public static boolean writePage(Page page, String filePath/*, boolean validate*/) throws UnsupportedSchemaVersionException {
+ XmlPageWriter writer = null;
+ if (page.getFormatVersion() == null || !(page.getFormatVersion() instanceof XmlFormatVersion))
+ writer = getWriterForLastestXmlFormat(/*validate*/);
+ else {
+ writer = getWriter((XmlFormatVersion)page.getFormatVersion());
+ }
+ try {
+ return writer.write(page, new FileTarget(new File(filePath)));
+ } catch (UnsupportedFormatVersionException e) {
+ e.printStackTrace();
+ }
+ return false;
+ }
+
+ /**
+ * Creates and returns an XML reader for PAGE.
+ */
+ public static XmlPageReader getReader(/*boolean validation*/) {
+ XmlModelAndValidatorProvider validatorProvider = null;
+ //if (validation)
+ validatorProvider = getValidatorProvider();
+ return new XmlPageReader(validatorProvider);
+ }
+
+ /**
+ * Creates a page object from the given XML file (no validation).
+ *
+ * @param filePath Path to PAGE XML file.
+ * @return Page object
+ */
+ //public static Page readPage(String filePath) {
+ // try {
+ // return readPage(filePath, false);
+ // } catch (UnsupportedFormatVersionException e) {
+ // e.printStackTrace(); //Cannot happen
+ // }
+ // return null;
+ //}
+
+ /**
+ * Creates a page object from the given XML file.
+ *
+ * @param filePath Path to PAGE XML file.
+ * @return Page object
+ * @throws UnsupportedFormatVersionException Format version not supported (e.g. schema file not found)
+ */
+ public static Page readPage(String filePath/*, boolean validate*/) throws UnsupportedFormatVersionException {
+ XmlPageReader reader = getReader(/*validate*/);
+ return reader.read(new FileInput(new File(filePath)));
+ }
+
+ /**
+ * Creates a page object from the PAGE XML file at the given URL.
+ *
+ * @param url URL of PAGE XML file.
+ * @return Page object
+ * @throws UnsupportedFormatVersionException Format version not supported (e.g. schema file not found)
+ */
+ public static Page readPage(URL url) throws UnsupportedFormatVersionException {
+ XmlPageReader reader = getReader(/*validate*/);
+ return reader.read(new UrlInput(url));
+ }
+
+ /**
+ * Returns the model of the latest XML schema.
+ */
+ public static SchemaModelParser getLatestSchemaModel() {
+ PageXmlInputOutput instance = getInstance();
+ try {
+ return instance.validatorProvider.getSchemaParser(instance.validatorProvider.getLatestSchemaVersion());
+ } catch (UnsupportedSchemaVersionException e) {
+ e.printStackTrace();
+ }
+ return null;
+ }
+
+ /**
+ * Returns the model of the XML schema for the given format version.
+ */
+ public static SchemaModelParser getSchemaModel(XmlFormatVersion version) {
+ PageXmlInputOutput instance = getInstance();
+ try {
+ return instance.validatorProvider.getSchemaParser(version);
+ } catch (UnsupportedSchemaVersionException e) {
+ e.printStackTrace();
+ }
+ return null;
+ }
+
+ @Override
+ public FormatModel getFormatModel(FormatVersion version) throws UnsupportedFormatVersionException {
+ PageXmlInputOutput instance = getInstance();
+ return instance.validatorProvider.getSchemaParser((XmlFormatVersion)version);
+ }
+
+ /**
+ * Post-processes the given page object using the specified image dimension and resolution.
+ * This can include scaling all coordinates if they are not measured in pixels.
+ * @param page Page object to post-process
+ * @param imageWidth Width of the document image
+ * @param imageHeight Height of the document image
+ * @param dpiHor X resolution of the document image
+ * @param dpiVert Y resolution of the document image
+ */
+ public static void postProcessPage(Page page, int imageWidth, int imageHeight, double dpiHor, double dpiVert) {
+ if (page == null || page.getLayout() == null || MeasurementUnit.PIXEL.equals(page.getMeasurementUnit()))
+ return;
+
+ PageLayout layout = page.getLayout();
+
+ double scaleX = 1.0;
+ double scaleY = 1.0;
+
+ double conversion = page.getMeasurementUnit().getDiscreteValuesPerInch();
+
+ if (conversion == 0.0)
+ return;
+
+ //If the image dimensions don't equal the document dimensions, we can assume that the
+ //document width and height are using the measurement unit as well. In that case, we
+ //can calculate the scaling factor from the size difference.
+ if (layout.getWidth() != imageWidth && layout.getHeight() != imageHeight) {
+ //Sanity check: If the page dimensions are in pixels, even though the
+ // measurement unit is not pixel, we shouldn't use them.
+
+ //Go through all regions and see if they are within the page bounds
+ boolean ok = true;
+ for (ContentIterator it=layout.iterator(null); it.hasNext(); ) {
+ Polygon polygon = it.next().getCoords();
+ if (polygon != null) {
+ for (int i=0; i<polygon.getSize(); i++) {
+ Point p = polygon.getPoint(i);
+ if (p.x > layout.getWidth() || p.y > layout.getHeight()) {
+ ok = false;
+ break;
+ }
+ }
+ }
+ }
+
+ if (ok) {
+ scaleX = (double)imageWidth / (double)layout.getWidth();
+ scaleY = (double)imageHeight / (double)layout.getHeight();
+ }
+ }
+
+ //Use the image resolution to calculate the scaling factor
+ if (scaleX == 1.0 && scaleY == 1.0)
+ {
+ scaleX = dpiHor / conversion;
+ scaleY = dpiVert / conversion;
+ }
+
+ if (scaleX == 0 || scaleY == 0)
+ return;
+
+ //Now scale all coordinates
+ // Document size
+ layout.setSize(imageWidth, imageHeight);
+ //Region, lines, words, glyphs
+ ContentType types[] = new ContentType[]{null, LowLevelTextType.TextLine, LowLevelTextType.Word, LowLevelTextType.Glyph};
+ for (ContentType tp : types) {
+ for (ContentIterator it=layout.iterator(tp); it.hasNext(); ) {
+ scalePolygon(it.next().getCoords(), scaleX, scaleY);
+ }
+ }
+ //Border, print space
+ if (layout.getBorder() != null)
+ scalePolygon(layout.getBorder().getCoords(), scaleX, scaleY);
+ if (layout.getPrintSpace() != null)
+ scalePolygon(layout.getPrintSpace().getCoords(), scaleX, scaleY);
+ }
+
+ /**
+ * Scales all points of the given polygon
+ * @param polygon Polygon with 2D points
+ * @param scaleX Multiplier for x coordinates
+ * @param scaleY Multiplier for y coordinates
+ */
+ private static void scalePolygon(Polygon polygon, double scaleX, double scaleY) {
+ if (polygon == null)
+ return;
+ for (int i=0; i
+ *
+ * Example:
+ * -schema
+ * -2010-01-12
+ * -pagecontent.xsd
+ * -2010-03-19
+ * -pagecontent.xsd
+ *
+ * @param schemaRootFolder
+ * @param schemaFilename
+ * @throws NoSchemasException No schema files found at the given location.
+ */
+ public PageXmlModelAndValidatorProvider(String schemaRootFolder, String schemaFilename) throws NoSchemasException {
+ super(schemaRootFolder, schemaFilename);
+ }
+
+ /**
+ * Adds the internal default schemas to the list of schema sources.
+ */
+ protected void addDefaultSchemas() {
+ try {
+ //2009-03-16
+ addSchemaSource(new XmlFormatVersion("2009-03-16"),
+ getClass().getResource("/org/primaresearch/dla/page/io/xml/schema/2009-03-16_pagecontent.xsd"),
+ true);
+
+ //2010-01-12
+ addSchemaSource( new XmlFormatVersion("2010-01-12"),
+ getClass().getResource("/org/primaresearch/dla/page/io/xml/schema/2010-01-12_pagecontent.xsd"),
+ true);
+
+ //2010-03-19
+ addSchemaSource( new XmlFormatVersion("2010-03-19"),
+ getClass().getResource("/org/primaresearch/dla/page/io/xml/schema/2010-03-19_pagecontent.xsd"),
+ true);
+
+ //2013-07-15
+ addSchemaSource( new XmlFormatVersion("2013-07-15"),
+ getClass().getResource("/org/primaresearch/dla/page/io/xml/schema/2013-07-15_pagecontent.xsd"),
+ true);
+
+ //Abbyy FineReader 10
+ addSchemaSource( new XmlFormatVersion("http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml"),
+ new URL("http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml"),
+ false);
+
+ //ALTO 2.1
+ addSchemaSource( new XmlFormatVersion("http://www.loc.gov/standards/alto/ns-v2#"),
+ new URL("http://www.loc.gov/standards/alto/alto.xsd"),
+ false);
+
+ //HOCR
+ addSchemaSource( new XmlFormatVersion("HOCR"),
+ null,
+ false);
+ } catch (Exception e) {
+ e.printStackTrace();
+ }
+ }
+}
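Note on the unit conversion in postProcessPage (earlier in this patch): when the layout size gives no usable ratio, the scale factor is the image resolution divided by the number of measurement units per inch. The sketch below illustrates only that arithmetic. It is not part of the patch; the class and method names are hypothetical, and rounding to the nearest pixel is an assumption because the body of scalePolygon is not visible here.

```java
// Hypothetical helper, not part of the patch: illustrates the DPI-based scaling only.
class UnitScaleSketch {

    /** Pixels per measurement unit: (pixels per inch) / (units per inch). */
    static double scaleFactor(double dpi, double unitsPerInch) {
        if (unitsPerInch == 0.0)
            throw new IllegalArgumentException("unitsPerInch must not be zero");
        return dpi / unitsPerInch;
    }

    /** Scales one point; rounding to the nearest pixel is an assumption. */
    static int[] scalePoint(int x, int y, double scaleX, double scaleY) {
        return new int[] { (int) Math.round(x * scaleX), (int) Math.round(y * scaleY) };
    }

    public static void main(String[] args) {
        // 300 dpi image, coordinates stored at 1200 units per inch
        double scale = scaleFactor(300.0, 1200.0);
        int[] p = scalePoint(400, 800, scale, scale);
        System.out.println(p[0] + "," + p[1]); // prints 100,200
    }
}
```

Rejecting a zero units-per-inch value up front mirrors the early return on `conversion == 0.0` in the patch.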
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/StreamTarget.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/StreamTarget.java
new file mode 100644
index 00000000..5a6b30a9
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/StreamTarget.java
@@ -0,0 +1,39 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml;
+
+import java.io.OutputStream;
+
+import org.primaresearch.dla.page.io.OutputTarget;
+
+/**
+ * Generic output target that can be any type of stream.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class StreamTarget implements OutputTarget {
+
+ private OutputStream outputStream;
+
+ public StreamTarget(OutputStream outputStream) {
+ this.outputStream = outputStream;
+ }
+
+ public OutputStream getOutputStream() {
+ return outputStream;
+ }
+}
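Usage note for StreamTarget: a stream-based target makes it possible to write a page into memory, for example in tests, instead of a file. The sketch below is self-contained and therefore uses stand-in types (OutputTargetSketch, StreamTargetSketch) that only mirror the shape of the classes in this patch; writeToMemory is a hypothetical helper, not an existing API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Stand-in for org.primaresearch.dla.page.io.OutputTarget (assumed to be a marker interface).
interface OutputTargetSketch {}

// Stand-in mirroring StreamTarget from this patch.
class StreamTargetSketch implements OutputTargetSketch {
    private final OutputStream outputStream;

    StreamTargetSketch(OutputStream outputStream) {
        this.outputStream = outputStream;
    }

    OutputStream getOutputStream() {
        return outputStream;
    }
}

class StreamTargetDemo {

    /** Writes the content through a stream target and returns what was captured. */
    static String writeToMemory(String content) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        StreamTargetSketch target = new StreamTargetSketch(buffer);
        // try-with-resources closes the stream even if the write fails
        try (OutputStream os = target.getOutputStream()) {
            os.write(content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(writeToMemory("<PcGts/>"));
    }
}
```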
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlNameProvider.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlNameProvider.java
new file mode 100644
index 00000000..e38fae0a
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlNameProvider.java
@@ -0,0 +1,30 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml;
+
+import org.primaresearch.dla.page.layout.physical.shared.ContentType;
+
+
+/**
+ * Provides the XML element names for content objects.
+ *
+ * @author Christian Clausner
+ *
+ */
+public interface XmlNameProvider {
+
+ public String getXmlName(ContentType type);
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlPageReader.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlPageReader.java
new file mode 100644
index 00000000..24917b97
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlPageReader.java
@@ -0,0 +1,300 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml;
+
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.List;
+
+import javax.xml.parsers.SAXParser;
+import javax.xml.parsers.SAXParserFactory;
+
+import org.primaresearch.dla.page.Page;
+import org.primaresearch.dla.page.io.FileInput;
+import org.primaresearch.dla.page.io.InputSource;
+import org.primaresearch.dla.page.io.PageReader;
+import org.primaresearch.dla.page.io.UrlInput;
+import org.primaresearch.dla.page.io.xml.sax.SaxPageHandler;
+import org.primaresearch.dla.page.io.xml.sax.SaxPageHandlerFactory;
+import org.primaresearch.io.UnsupportedFormatVersionException;
+import org.primaresearch.io.xml.IOError;
+import org.primaresearch.io.xml.XmlFormatVersion;
+import org.primaresearch.io.xml.XmlModelAndValidatorProvider;
+import org.primaresearch.io.xml.XmlValidator;
+import org.xml.sax.Attributes;
+import org.xml.sax.SAXException;
+import org.xml.sax.XMLReader;
+import org.xml.sax.helpers.DefaultHandler;
+
+/**
+ * Page reader implementation for XML files (supports validation against schema).
+ *
+ * @author Christian Clausner
+ */
+public class XmlPageReader implements PageReader {
+
+ /** Constant for recognising a shortcut out of parsing. */
+ private static final String PARSING_COMPLETE = "PARSING_COMPLETE";
+
+ private SaxPageHandler pageHandler = null;
+ private SAXParser mainParser;
+ private SchemaVersionHandler schemaVersionHandler;
+ private SAXParser schemaVersionParser;
+ private XmlModelAndValidatorProvider validatorProvider;
+ private XmlFormatVersion schemaVersion = null;
+ private PageErrorHandler lastErrors;
+
+ /**
+ * Constructor
+ * @param validatorProvider Schema validator provider. (optional, set to null if no validation required).
+ */
+ public XmlPageReader(XmlModelAndValidatorProvider validatorProvider) {
+ this.validatorProvider = validatorProvider;
+ if (validatorProvider != null) {
+ createSchemaVersionParser();
+ schemaVersionHandler = new SchemaVersionHandler();
+ }
+ try {
+ createMainParser();
+ } catch (UnsupportedFormatVersionException e) {
+ e.printStackTrace(); //Cannot happen here, as we don't have the schema version yet...
+ }
+ }
+
+ /**
+ * Creates the SAX parser for PAGE XML.
+ * @throws UnsupportedFormatVersionException
+ */
+ private void createMainParser() throws UnsupportedFormatVersionException {
+ try {
+ // Obtain a new instance of a SAXParserFactory.
+ SAXParserFactory factory = SAXParserFactory.newInstance();
+ // Specifies that the parser produced by this code will provide support for XML namespaces.
+ factory.setNamespaceAware(true);
+ factory.setValidating(false);
+ //Fix for delay when reading HOCR (disables loading the external DTD that is defined in the HOCR file)
+ factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
+
+ //Validation
+ if (validatorProvider != null && schemaVersion != null) {
+ // Specifies that the parser produced by this code will validate documents as they are parsed.
+ XmlValidator validator = validatorProvider.getValidator(schemaVersion);
+ if (validator != null)
+ factory.setSchema(validator.getSchema());
+ }
+
+ //this.pageHandler = new PageHandler(validatorProvider, schemaVersion);
+ this.pageHandler = SaxPageHandlerFactory.createHandler(validatorProvider, schemaVersion);
+
+ // Creates a new instance of a SAXParser using the currently configured factory parameters.
+ mainParser = factory.newSAXParser();
+
+ } catch (UnsupportedFormatVersionException exc) {
+ throw exc;
+ } catch (Throwable t) {
+ t.printStackTrace();
+ }
+ }
+
+ /**
+ * Creates the parser that finds the schema version only.
+ */
+ private void createSchemaVersionParser() {
+ try {
+ // Obtain a new instance of a SAXParserFactory.
+ SAXParserFactory factory = SAXParserFactory.newInstance();
+ // Specifies that the parser produced by this code will provide support for XML namespaces.
+ factory.setNamespaceAware(true);
+ factory.setValidating(false);
+ //Fix for delay when reading HOCR (disables loading the external DTD that is defined in the HOCR file)
+ factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
+
+ schemaVersionParser = factory.newSAXParser();
+ } catch (Throwable t) {
+ t.printStackTrace();
+ }
+ }
+
+ /**
+ * Reads a PAGE XML file and returns a Page object.
+ *
+ * @param source FileInput or UrlInput representing an XML file
+ * @return Page object or null in case of errors (see getErrors()).
+ * @throws IllegalArgumentException Wrong input source type
+ */
+ @Override
+ public Page read(InputSource source) throws UnsupportedFormatVersionException {
+
+
+ lastErrors = new PageErrorHandler();
+
+ parse(source, lastErrors);
+
+ Page page = null;
+
+ if (!lastErrors.hasErrors())
+ page = pageHandler.getPageObject();
+
+ //if (!MeasurementUnit.PIXEL.equals(pageHandler.getMeasurementUnit()))
+
+
+ return page;
+ }
+
+ private InputStream getInputStream(InputSource source) {
+ if (source instanceof FileInput) {
+ File f = ((FileInput)source).getFile();
+ try {
+ return new FileInputStream(f);
+ } catch (FileNotFoundException e) {
+ e.printStackTrace();
+ lastErrors.getErrors().add(new IOError("Could not open stream from file: "+e.getMessage()));
+ }
+ }
+ else if (source instanceof UrlInput) {
+ try {
+ return ((UrlInput)source).getUrl().openStream();
+ } catch (IOException e) {
+ e.printStackTrace();
+ lastErrors.getErrors().add(new IOError("Could not open stream from URL: "+e.getMessage()));
+ }
+ }
+ else
+ throw new IllegalArgumentException("Only FileInput and UrlInput allowed for XmlPageReader");
+ return null;
+ }
+
+ /**
+ * Returns a list of errors that occurred on the last call of read().
+ */
+ public List<IOError> getErrors() {
+ return lastErrors != null ? lastErrors.getErrors() : null;
+ }
+
+ /**
+ * Returns a list of warnings that occurred on the last call of read().
+ */
+ public List<IOError> getWarnings() {
+ return lastErrors != null ? lastErrors.getWarnings() : null;
+ }
+
+ /**
+ * Parses a PAGE file
+ */
+ private void parse(InputSource input, PageErrorHandler errorHandler) throws UnsupportedFormatVersionException {
+ //Validation?
+ if (validatorProvider != null) {
+ try {
+ InputStream inputStream = getInputStream(input);
+ if (inputStream == null)
+ return;
+ schemaVersionParser.parse(inputStream, schemaVersionHandler);
+ //We shortcut the parsing with an exception (see below)
+ } catch (SAXException e) {
+ if (PARSING_COMPLETE.equals(e.getMessage())) { //Shortcut when no more parsing is required.
+ XmlFormatVersion version = schemaVersionHandler.getSchemaVersion();
+ if (version == null || !version.equals(schemaVersion)) {
+ schemaVersion = version;
+ createMainParser(); //If the schema version has changed, we have to create a new parser.
+ }
+ }
+ else
+ e.printStackTrace();
+ } catch (IOException e) {
+ e.printStackTrace();
+ }
+ }
+
+ try{
+ XMLReader reader = mainParser.getXMLReader();
+ reader.setErrorHandler(errorHandler);
+ reader.setContentHandler(pageHandler);
+ InputStream inputStream = getInputStream(input);
+ if (inputStream == null)
+ return;
+ org.xml.sax.InputSource saxInput = new org.xml.sax.InputSource(inputStream);
+ //saxInput.setEncoding("utf-8");
+ reader.parse(saxInput);
+ } catch (Throwable t) {
+ t.printStackTrace();
+ }
+ }
+
+
+
+ /**
+ * SAX handler implementation to parse the schema version only.
+ *
+ * @author Christian Clausner
+ */
+ private static class SchemaVersionHandler extends DefaultHandler {
+ private XmlFormatVersion schemaVersion = null;
+
+ public XmlFormatVersion getSchemaVersion() {
+ return schemaVersion;
+ }
+
+ /**
+ * Receive notification of the start of an element.
+ * @param namespaceURI - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
+ * @param localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
+ * @param qName - The qualified name (with prefix), or the empty string if qualified names are not available.
+ * @param atts - The attributes attached to the element. If there are no attributes, it shall be an empty Attributes object.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void startElement(String namespaceURI, String localName, String qName, Attributes atts)
+ throws SAXException {
+
+ if (DefaultXmlNames.ELEMENT_PcGts.equals(localName)){
+
+ String str = namespaceURI; //Example: http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19
+ int pos = str.lastIndexOf("/");
+ schemaVersion = new XmlFormatVersion(str.substring(pos+1));
+ throw new SAXException(PARSING_COMPLETE);
+ }
+ //Abbyy
+ else if ("document".equals(localName)) {
+ //String str = namespaceURI; //Example: http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml
+ if (namespaceURI.contains("abbyy")) {
+ schemaVersion = new XmlFormatVersion(namespaceURI);
+ throw new SAXException(PARSING_COMPLETE);
+ }
+ }
+ //ALTO
+ else if ("alto".equals(localName)) {
+ //String str = namespaceURI; //Example: http://www.loc.gov/standards/alto/ns-v2#
+ if (namespaceURI.contains("alto")) {
+ schemaVersion = new XmlFormatVersion(namespaceURI);
+ throw new SAXException(PARSING_COMPLETE);
+ }
+ }
+ //HOCR
+ else if ("html".equals(localName)) {
+ schemaVersion = new XmlFormatVersion("HOCR");
+ throw new SAXException(PARSING_COMPLETE);
+ }
+ }
+ }
+
+
+
+
+
+
+}
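Note on the version sniffing in XmlPageReader: SchemaVersionHandler stops the first parsing pass as soon as the root element is seen by throwing a SAXException with a marker message, and parse() treats that message as success. The sketch below shows the same technique using only JDK classes. The class RootNamespaceSniffer and its method are hypothetical, and the root element name "PcGts" is assumed to be the value of DefaultXmlNames.ELEMENT_PcGts.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not part of the patch.
class RootNamespaceSniffer {

    private static final String PARSING_COMPLETE = "PARSING_COMPLETE";

    /**
     * Returns the part of the root namespace URI after the last '/',
     * or null if the root element is not PcGts.
     */
    static String sniffVersion(String xml) {
        final String[] version = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts)
                    throws SAXException {
                if ("PcGts".equals(localName))
                    version[0] = uri.substring(uri.lastIndexOf('/') + 1);
                // The root element is all we need: abort the parse deliberately.
                throw new SAXException(PARSING_COMPLETE);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // required to receive namespace URIs
            SAXParser parser = factory.newSAXParser();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (SAXException e) {
            // Only the marker message counts as a normal exit.
            if (!PARSING_COMPLETE.equals(e.getMessage()))
                throw new IllegalStateException(e);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return version[0];
    }

    public static void main(String[] args) {
        String xml = "<PcGts xmlns=\"http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19\"><Page/></PcGts>";
        System.out.println(sniffVersion(xml));
    }
}
```

A dedicated exception subclass would be a more robust marker than comparing message strings; the sketch keeps the string comparison only to stay close to the patch.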
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlPageWriter.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlPageWriter.java
new file mode 100644
index 00000000..4ed609ff
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlPageWriter.java
@@ -0,0 +1,54 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml;
+
+import java.util.List;
+
+import org.primaresearch.dla.page.io.PageWriter;
+import org.primaresearch.dla.page.layout.converter.ConversionMessage;
+
+/**
+ * Interface for page writers producing XML.
+ *
+ * @author Christian Clausner
+ */
+public interface XmlPageWriter extends PageWriter {
+
+ /**
+ * Returns the XML schema version the writer supports (in format yyyy-mm-dd).
+ */
+ public String getSchemaVersion();
+
+ /**
+ * Returns the location of the schema (e.g. http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19).
+ */
+ public String getSchemaLocation();
+
+ /**
+ * Returns the URL of the schema file (e.g. http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19/pagecontent.xsd).
+ */
+ public String getSchemaUrl();
+
+ /**
+ * Returns the name space. This is usually the same as the schema location.
+ */
+ public String getNamespace();
+
+ /**
+ * Returns format conversion related messages
+ */
+ public List<ConversionMessage> getConversionInformation();
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlPageWriter_2010_03_19.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlPageWriter_2010_03_19.java
new file mode 100644
index 00000000..08f7cc6e
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlPageWriter_2010_03_19.java
@@ -0,0 +1,507 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml;
+
+import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.OutputStream;
+import java.text.DateFormat;
+import java.text.SimpleDateFormat;
+import java.util.List;
+
+import javax.xml.parsers.DocumentBuilder;
+import javax.xml.parsers.DocumentBuilderFactory;
+import javax.xml.parsers.ParserConfigurationException;
+import javax.xml.transform.Transformer;
+import javax.xml.transform.TransformerConfigurationException;
+import javax.xml.transform.TransformerException;
+import javax.xml.transform.TransformerFactory;
+import javax.xml.transform.dom.DOMSource;
+import javax.xml.transform.stream.StreamResult;
+import javax.xml.validation.Validator;
+
+import org.primaresearch.dla.page.MetaData;
+import org.primaresearch.dla.page.Page;
+import org.primaresearch.dla.page.io.FileTarget;
+import org.primaresearch.dla.page.io.OutputTarget;
+import org.primaresearch.dla.page.layout.PageLayout;
+import org.primaresearch.dla.page.layout.converter.ConversionMessage;
+import org.primaresearch.dla.page.layout.logical.Group;
+import org.primaresearch.dla.page.layout.logical.GroupMember;
+import org.primaresearch.dla.page.layout.logical.Layer;
+import org.primaresearch.dla.page.layout.logical.Layers;
+import org.primaresearch.dla.page.layout.logical.ReadingOrder;
+import org.primaresearch.dla.page.layout.logical.RegionRef;
+import org.primaresearch.dla.page.layout.physical.ContentObject;
+import org.primaresearch.dla.page.layout.physical.text.LowLevelTextContainer;
+import org.primaresearch.dla.page.layout.physical.text.TextObject;
+import org.primaresearch.dla.page.layout.shared.GeometricObject;
+import org.primaresearch.io.UnsupportedFormatVersionException;
+import org.primaresearch.io.xml.IOError;
+import org.primaresearch.io.xml.XmlValidator;
+import org.primaresearch.maths.geometry.Point;
+import org.primaresearch.maths.geometry.Polygon;
+import org.primaresearch.shared.variable.Variable;
+import org.primaresearch.shared.variable.VariableMap;
+import org.w3c.dom.Document;
+import org.w3c.dom.Element;
+import org.w3c.dom.Text;
+import org.xml.sax.SAXException;
+
+/**
+ * Page writer implementation for XML files.
+ *
+ * @author Christian Clausner
+ */
+public class XmlPageWriter_2010_03_19 implements XmlPageWriter {
+ //TODO Rename class? It may be used to save files conforming to other schemas.
+
+ private static DateFormat DATE_FORMAT = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
+
+ private Page page = null;
+ private PageLayout layout = null;
+ private XmlNameProvider xmlNameProvider;
+ private Document doc;
+ private XmlValidator validator;
+ private PageErrorHandler lastErrors;
+ private List<ConversionMessage> lastConversionMessages;
+
+
+ /**
+ * Constructor
+ *
+ * @param validator Optional schema validator (use null if not required).
+ */
+ public XmlPageWriter_2010_03_19(XmlValidator validator) {
+ xmlNameProvider = new DefaultXmlNames();
+ this.validator = validator;
+ }
+
+ public String getSchemaVersion() {
+ return validator != null ? validator.getSchemaVersion().toString() : "2010-03-19";
+ }
+
+ //TODO Path and filename need to be variable.
+ public String getSchemaLocation() {
+ return "http://schema.primaresearch.org/PAGE/gts/pagecontent/"+getSchemaVersion();
+ }
+
+ //TODO Path and filename need to be variable.
+ public String getSchemaUrl() {
+ return "http://schema.primaresearch.org/PAGE/gts/pagecontent/"+getSchemaVersion()+"/pagecontent.xsd";
+ }
+
+ public String getNamespace() {
+ return getSchemaLocation();
+ }
+
+ /**
+ * Writes the given Page object to an XML file.
+ *
+ * @param page Page object
+ * @param target FileTarget or StreamTarget that receives the XML output
+ * @return Returns true if written successfully, false otherwise.
+ */
+ @Override
+ public boolean write(Page page, OutputTarget target) throws UnsupportedFormatVersionException {
+ return run(page, target, false);
+ }
+
+ /**
+ * Validates the given Page object against the XML schema.
+ *
+ * @param page Page object
+ * @return Returns true if valid, false otherwise.
+ */
+ @Override
+ public boolean validate(Page page) throws UnsupportedFormatVersionException {
+ return run(page, null, true);
+ }
+
+ private boolean run(Page page, OutputTarget target, boolean validateOnly) throws UnsupportedFormatVersionException {
+ if (validator != null && !validator.getSchemaVersion().equals(page.getFormatVersion()))
+ throw new UnsupportedFormatVersionException("XML page writer doesn't support format: "+page.getFormatVersion().toString());
+
+ this.page = page;
+ layout = page.getLayout();
+ lastErrors = new PageErrorHandler();
+
+ //Convert page file if necessary and possible
+ //if (validator != null)
+ // lastConversionMessages = ConverterHub.convert(page, validator.getSchemaVersion());
+
+ DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance();
+ dbfac.setValidating(false);
+ dbfac.setNamespaceAware(true);
+ //if (validator != null)
+ //dbfac.setSchema(validator.getSchema());
+
+ DocumentBuilder docBuilder;
+ try {
+ docBuilder = dbfac.newDocumentBuilder();
+ //docBuilder.setErrorHandler(lastErrors);
+
+ doc = docBuilder.newDocument();
+
+ writeRoot();
+
+ //Validation errors?
+ if (validator != null) {
+ Validator domVal = validator.getSchema().newValidator();
+ domVal.setErrorHandler(lastErrors);
+
+ try {
+ domVal.validate(new DOMSource(doc));
+ } catch (SAXException e) {
+ e.printStackTrace();
+ } catch (IOException e) {
+ e.printStackTrace();
+ }
+ }
+ if (lastErrors.hasErrors()) {
+ return false;
+ }
+
+ //Write XML
+ if (!validateOnly) {
+
+ TransformerFactory transfac = TransformerFactory.newInstance();
+ Transformer trans = transfac.newTransformer();
+ DOMSource source = new DOMSource(doc);
+
+ OutputStream os = null;
+
+ if (target instanceof FileTarget) {
+ File f = ((FileTarget)target).getFile();
+ os = new FileOutputStream(f);
+ } else if (target instanceof StreamTarget)
+ os = ((StreamTarget) target).getOutputStream();
+
+ StreamResult result = new StreamResult(os);
+ trans.transform(source, result);
+ os.close();
+ }
+ return true;
+ } catch (ParserConfigurationException e) {
+ e.printStackTrace();
+ } catch (TransformerConfigurationException e) {
+ e.printStackTrace();
+ } catch (FileNotFoundException e) {
+ e.printStackTrace();
+ } catch (TransformerException e) {
+ e.printStackTrace();
+ } catch (IOException e) {
+ e.printStackTrace();
+ }
+ return false;
+ }
+
+ public List<IOError> getErrors() {
+ return lastErrors != null ? lastErrors.getErrors() : null;
+ }
+
+ public List<IOError> getWarnings() {
+ return lastErrors != null ? lastErrors.getWarnings() : null;
+ }
+
+ private void writeRoot() /*throws XMLStreamException*/ {
+ String xmlns = getSchemaLocation();
+ //String xsi = "http://www.w3.org/2001/XMLSchema-instance";
+
+ Element root = doc.createElementNS(xmlns, DefaultXmlNames.ELEMENT_PcGts);
+ doc.appendChild(root);
+
+ //xmlns
+ //addAttribute(root, "xmlns", xmlns);
+
+ //xmlns:xsi
+ //addAttribute(root, "xmlns:xsi", xsi);
+
+ //Schema location
+ String schemaLocation = getSchemaLocation() + " " + getSchemaUrl();
+ root.setAttributeNS("http://www.w3.org/2001/XMLSchema-instance", "xsi:schemaLocation", schemaLocation);
+ //addAttribute(root, "xsi:schemaLocation", schemaLocation);
+
+ //GtsID
+ if (page.getGtsId() != null)
+ addAttribute(root, DefaultXmlNames.ATTR_pcGtsId, page.getGtsId().toString());
+
+ addMetaData(root);
+ addPage(root);
+ }
+
+ private void addAttribute(Element node, String name, String value) {
+ node.setAttributeNS(null, name, value);
+ }
+
+ private void addMetaData(Element parent) /*throws XMLStreamException*/ {
+ MetaData metaData = page.getMetaData();
+ if (metaData == null)
+ return;
+
+ Element metaDataNode = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_Metadata);
+ parent.appendChild(metaDataNode);
+
+ //Creator
+ addTextElement(metaDataNode, DefaultXmlNames.ELEMENT_Creator, metaData.getCreator());
+
+ //Created
+ addTextElement(metaDataNode, DefaultXmlNames.ELEMENT_Created, DATE_FORMAT.format(metaData.getCreationTime()));
+
+ //Last modified
+ addTextElement(metaDataNode, DefaultXmlNames.ELEMENT_LastChange, DATE_FORMAT.format(metaData.getLastModificationTime()));
+
+ //Comments
+ addTextElement(metaDataNode, DefaultXmlNames.ELEMENT_Comments, metaData.getComments());
+ }
+
+ private void addPage(Element parent) {
+ Element pageNode = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_Page);
+ parent.appendChild(pageNode);
+
+ //Image filename
+ addAttribute(pageNode, DefaultXmlNames.ATTR_imageFilename, page.getImageFilename());
+
+ //Width/height
+ addAttribute(pageNode, DefaultXmlNames.ATTR_imageWidth, Integer.toString(layout.getWidth()));
+ addAttribute(pageNode, DefaultXmlNames.ATTR_imageHeight, Integer.toString(layout.getHeight()));
+
+ //Border
+ GeometricObject border = layout.getBorder();
+ if (border != null) {
+ Element node = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_Border);
+ pageNode.appendChild(node);
+ addCoords(node, border.getCoords());
+ }
+
+ //Print space
+ GeometricObject printSpace = layout.getPrintSpace();
+ if (printSpace != null) {
+ Element node = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_PrintSpace);
+ pageNode.appendChild(node);
+ addCoords(node, printSpace.getCoords());
+ }
+
+ //Reading order
+ addReadingOrder(pageNode, layout.getReadingOrder());
+
+ //Layers
+ addLayers(pageNode, layout.getLayers());
+
+ //Regions
+ for (int i=0; i= 0 ? DefaultXmlNames.ELEMENT_OrderedGroupIndexed : DefaultXmlNames.ELEMENT_OrderedGroup;
+ else
+ groupElementName = index >= 0 ? DefaultXmlNames.ELEMENT_UnorderedGroupIndexed : DefaultXmlNames.ELEMENT_UnorderedGroup;
+
+
+ Element groupNode = doc.createElementNS(getNamespace(), groupElementName);
+ parent.appendChild(groupNode);
+
+ //ID
+ addAttribute(groupNode, DefaultXmlNames.ATTR_id, group.getId().toString());
+
+ //Index
+ if (index >= 0)
+ addAttribute(groupNode, DefaultXmlNames.ATTR_index, Integer.toString(index));
+ //eventWriter.add(eventFactory.createAttribute(DefaultXmlNames.ATTR_index, Integer.toString(index)));
+
+ //Children
+ GroupMember member;
+ for (int i=0; i= 0 ? DefaultXmlNames.ELEMENT_RegionRefIndexed : DefaultXmlNames.ELEMENT_RegionRef;
+
+ Element refNode = doc.createElementNS(getNamespace(), elementName);
+ parent.appendChild(refNode);
+
+ //ID Ref
+ addAttribute(refNode, DefaultXmlNames.ATTR_regionRef, regionId);
+
+ //Index
+ if (index >= 0)
+ addAttribute(refNode, DefaultXmlNames.ATTR_index, Integer.toString(index));
+ }
+
+ private void addLayers(Element parent, Layers layers) /*throws XMLStreamException*/ {
+ if (layers == null || layers.getSize() == 0)
+ return;
+
+ //Check if there are non-empty layers
+ boolean foundNonEmptyLayer = false;
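+ for (int i=0; i<layers.getSize(); i++) {
+ if (layers.getLayer(i).getSize() > 0) {
+ foundNonEmptyLayer = true;
+ break;
+ }
+ }
+ if (!foundNonEmptyLayer) //Only empty layers -> skip the whole layers element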
+ return;
+
+
+ Element layersNode = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_Layers);
+ parent.appendChild(layersNode);
+
+ Layer layer;
+ for (int i=0; i<layers.getSize(); i++) {
+ layer = layers.getLayer(i);
+ if (layer.getSize() > 0)
+ addLayer(layersNode, layer);
+ }
+ }
+
+ private void addLayer(Element parent, Layer layer) /*throws XMLStreamException*/ {
+
+ Element layerNode = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_Layer);
+ parent.appendChild(layerNode);
+
+ //ID
+ addAttribute(layerNode, DefaultXmlNames.ATTR_id, layer.getId().toString());
+
+ //Z-Index
+ addAttribute(layerNode, DefaultXmlNames.ATTR_zIndex, Integer.toString(layer.getZIndex()));
+
+ //Region Refs
+ GroupMember member;
+ for (int i=0; i getConversionInformation() {
+ return lastConversionMessages;
+ }
+}
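
The writer above builds its output as a namespace-aware DOM tree and only serialises it at the end. A minimal, self-contained sketch of that pattern follows. The class name `DomWriteSketch`, the namespace URI and the literal element and attribute names are assumptions made for illustration; they are not taken from `DefaultXmlNames`.

```java
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomWriteSketch {

    // Hypothetical namespace, used only for this illustration
    static final String NS = "http://example.org/sketch/pagecontent";

    /** Builds a tiny namespaced document and serialises it to a string. */
    public static String buildAndSerialise() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();

            // Creating the document via the DOMImplementation sets up
            // the root element and its namespace in one step
            Document doc = builder.getDOMImplementation().createDocument(NS, "PcGts", null);
            Element root = doc.getDocumentElement();

            // Child elements live in the namespace, attributes do not
            Element page = doc.createElementNS(NS, "Page");
            root.appendChild(page);
            page.setAttributeNS(null, "imageWidth", Integer.toString(800));
            page.setAttributeNS(null, "imageHeight", Integer.toString(600));

            // An identity transformer turns the DOM tree into text
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build or serialise the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildAndSerialise());
    }
}
```

Building the whole tree first is what allows the writer to validate the document against a schema before anything is written to the target.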
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlPageWriter_2013_07_15.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlPageWriter_2013_07_15.java
new file mode 100644
index 00000000..ddd35642
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/XmlPageWriter_2013_07_15.java
@@ -0,0 +1,649 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml;
+
+import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.OutputStream;
+import java.text.DateFormat;
+import java.text.SimpleDateFormat;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import javax.xml.parsers.DocumentBuilder;
+import javax.xml.parsers.DocumentBuilderFactory;
+import javax.xml.parsers.ParserConfigurationException;
+import javax.xml.transform.Transformer;
+import javax.xml.transform.TransformerConfigurationException;
+import javax.xml.transform.TransformerException;
+import javax.xml.transform.TransformerFactory;
+import javax.xml.transform.dom.DOMSource;
+import javax.xml.transform.stream.StreamResult;
+import javax.xml.validation.Validator;
+
+import org.primaresearch.dla.page.MetaData;
+import org.primaresearch.dla.page.Page;
+import org.primaresearch.dla.page.Page.AlternativeImage;
+import org.primaresearch.dla.page.io.FileTarget;
+import org.primaresearch.dla.page.io.OutputTarget;
+import org.primaresearch.dla.page.layout.PageLayout;
+import org.primaresearch.dla.page.layout.converter.ConversionMessage;
+import org.primaresearch.dla.page.layout.logical.ContentObjectRelation;
+import org.primaresearch.dla.page.layout.logical.Group;
+import org.primaresearch.dla.page.layout.logical.GroupMember;
+import org.primaresearch.dla.page.layout.logical.Layer;
+import org.primaresearch.dla.page.layout.logical.Layers;
+import org.primaresearch.dla.page.layout.logical.ReadingOrder;
+import org.primaresearch.dla.page.layout.logical.RegionRef;
+import org.primaresearch.dla.page.layout.logical.Relations;
+import org.primaresearch.dla.page.layout.physical.ContentObject;
+import org.primaresearch.dla.page.layout.physical.RegionContainer;
+import org.primaresearch.dla.page.layout.physical.text.LowLevelTextContainer;
+import org.primaresearch.dla.page.layout.physical.text.TextObject;
+import org.primaresearch.dla.page.layout.physical.text.impl.TextLine;
+import org.primaresearch.dla.page.layout.shared.GeometricObject;
+import org.primaresearch.io.FormatModel;
+import org.primaresearch.io.UnsupportedFormatVersionException;
+import org.primaresearch.io.xml.IOError;
+import org.primaresearch.io.xml.XmlFormatVersion;
+import org.primaresearch.io.xml.XmlValidator;
+import org.primaresearch.maths.geometry.Point;
+import org.primaresearch.maths.geometry.Polygon;
+import org.primaresearch.shared.variable.DoubleValue;
+import org.primaresearch.shared.variable.Variable;
+import org.primaresearch.shared.variable.VariableMap;
+import org.w3c.dom.DOMImplementation;
+import org.w3c.dom.Document;
+import org.w3c.dom.Element;
+import org.w3c.dom.Text;
+import org.xml.sax.SAXException;
+
+/**
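+ * Page writer implementation for PAGE XML files (schema version 2013-07-15 by default).
+ * The output is built as a DOM document, optionally validated, and then serialised.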
+ *
+ * @author Christian Clausner
+ */
+public class XmlPageWriter_2013_07_15 implements XmlPageWriter {
+ //TODO Rename class? It may be used to save files conforming to other schemas.
+
+ private static DateFormat DATE_FORMAT = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
+
+ private Page page = null;
+ private PageLayout layout = null;
+ private XmlNameProvider xmlNameProvider;
+ private Document doc;
+ private XmlValidator validator;
+ private PageErrorHandler lastErrors;
+ private List<ConversionMessage> lastConversionMessages;
+ private String namespace;
+
+
+ /**
+ * Constructor
+ *
+ * @param validator Optional schema validator (use null if not required).
+ */
+ public XmlPageWriter_2013_07_15(XmlValidator validator) {
+ xmlNameProvider = new DefaultXmlNames();
+ this.validator = validator;
+ }
+
+ @Override
+ public String getSchemaVersion() {
+ return validator != null ? validator.getSchemaVersion().toString() : "2013-07-15";
+ }
+
+ //TODO Path and filename need to be variable.
+ @Override
+ public String getSchemaLocation() {
+ return "http://schema.primaresearch.org/PAGE/gts/pagecontent/"+getSchemaVersion();
+ }
+
+ //TODO Path and filename need to be variable.
+ @Override
+ public String getSchemaUrl() {
+ return "http://schema.primaresearch.org/PAGE/gts/pagecontent/"+getSchemaVersion()+"/pagecontent.xsd";
+ }
+
+ @Override
+ public String getNamespace() {
+ return getSchemaLocation();
+ }
+
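+ /**
+ * Writes the given Page object as XML to the given target.
+ *
+ * @param page Page object
+ * @param target Output target (a FileTarget or a StreamTarget)
+ * @return Returns true if written successfully, false otherwise (for example if validation errors were found).
+ * @throws UnsupportedFormatVersionException If a validator is set and its schema version differs from the format version of the page
+ */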
+ @Override
+ public boolean write(Page page, OutputTarget target) throws UnsupportedFormatVersionException {
+ return run(page, target, false);
+ }
+
+ /**
+ * Validates the given Page object against the XML schema.
+ *
+ * @param page Page object
+ * @return Returns true if valid, false otherwise.
+ */
+ @Override
+ public boolean validate(Page page) throws UnsupportedFormatVersionException {
+ return run(page, null, true);
+ }
+
+ private boolean run(Page page, OutputTarget target, boolean validateOnly) throws UnsupportedFormatVersionException {
+ if (validator != null && !validator.getSchemaVersion().equals(page.getFormatVersion()))
+ throw new UnsupportedFormatVersionException("XML page writer doesn't support format: "+page.getFormatVersion().toString());
+
+ this.page = page;
+ layout = page.getLayout();
+ lastErrors = new PageErrorHandler();
+
+ //Convert page file if necessary and possible
+ //if (validator != null)
+ // lastConversionMessages = ConverterHub.convert(page, validator.getSchemaVersion());
+
+ DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance();
+ dbfac.setValidating(false);
+ dbfac.setNamespaceAware(true);
+ //if (validator != null)
+ //dbfac.setSchema(validator.getSchema());
+
+ DocumentBuilder docBuilder;
+ try {
+ docBuilder = dbfac.newDocumentBuilder();
+ //docBuilder.setErrorHandler(lastErrors);
+
+ DOMImplementation domImpl = docBuilder.getDOMImplementation();
+ //doc = docBuilder.newDocument();
+ namespace = getSchemaLocation();
+ doc = domImpl.createDocument(namespace, DefaultXmlNames.ELEMENT_PcGts, null);
+
+ writeRoot();
+
+ //Validation errors?
+ if (validator != null) {
+ Validator domVal = validator.getSchema().newValidator();
+ domVal.setErrorHandler(lastErrors);
+
+ try {
+ domVal.validate(new DOMSource(doc));
+ } catch (SAXException e) {
+ e.printStackTrace();
+ } catch (IOException e) {
+ e.printStackTrace();
+ }
+ }
+ if (lastErrors.hasErrors()) {
+ return false;
+ }
+
+ //Write XML
+ if (!validateOnly) {
+
+ TransformerFactory transfac = TransformerFactory.newInstance();
+ Transformer trans = transfac.newTransformer();
+ DOMSource source = new DOMSource(doc);
+
+ OutputStream os = null;
+
+ if (target instanceof FileTarget) {
+ File f = ((FileTarget)target).getFile();
+ os = new FileOutputStream(f);
+ } else if (target instanceof StreamTarget)
+ os = ((StreamTarget) target).getOutputStream();
+
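+ //Unsupported target type -> nothing to write to
+ if (os == null)
+ return false;
+
+ //Close the stream even if the transformation fails
+ try {
+ trans.transform(source, new StreamResult(os));
+ } finally {
+ os.close();
+ }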
+ }
+ return true;
+ } catch (ParserConfigurationException e) {
+ e.printStackTrace();
+ } catch (TransformerConfigurationException e) {
+ e.printStackTrace();
+ } catch (FileNotFoundException e) {
+ e.printStackTrace();
+ } catch (TransformerException e) {
+ e.printStackTrace();
+ } catch (IOException e) {
+ e.printStackTrace();
+ }
+ return false;
+ }
+
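+ /**
+ * Returns the errors recorded during the most recent write or validate call.
+ * @return List of errors or null if the writer has not been run yet
+ */
+ public List<IOError> getErrors() {
+ return lastErrors != null ? lastErrors.getErrors() : null;
+ }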
+
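+ /**
+ * Returns the warnings recorded during the most recent write or validate call.
+ * @return List of warnings or null if the writer has not been run yet
+ */
+ public List<IOError> getWarnings() {
+ return lastErrors != null ? lastErrors.getWarnings() : null;
+ }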
+
+ private void writeRoot() /*throws XMLStreamException*/ {
+ //String xsi = "http://www.w3.org/2001/XMLSchema-instance";
+
+ //Element root = doc.createElementNS(namespace, DefaultXmlNames.ELEMENT_PcGts);
+ //doc.appendChild(root);
+
+ Element root = doc.getDocumentElement();
+
+ //xmlns
+ //addAttribute(root, "xmlns", xmlns);
+
+ //xmlns:xsi
+ //root.setAttribute("xmlns:xsi", xsi);
+ //addAttribute(root, "xmlns:xsi", xsi);
+
+ //Schema location
+ String schemaLocation = getSchemaLocation() + " " + getSchemaUrl();
+ root.setAttributeNS("http://www.w3.org/2001/XMLSchema-instance", "xsi:schemaLocation", schemaLocation);
+ //addAttribute(root, "xsi:schemaLocation", schemaLocation);
+
+ //GtsID
+ if (page.getGtsId() != null)
+ addAttribute(root, DefaultXmlNames.ATTR_pcGtsId, page.getGtsId().toString());
+
+ addMetaData(root);
+ addPage(root);
+ }
+
+ private void addAttribute(Element node, String name, String value) {
+ node.setAttributeNS(null, name, value);
+ }
+
+ private void addMetaData(Element parent) /*throws XMLStreamException*/ {
+ MetaData metaData = page.getMetaData();
+ if (metaData == null)
+ return;
+
+ Element metaDataNode = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_Metadata);
+ parent.appendChild(metaDataNode);
+
+ //Creator
+ addTextElement(metaDataNode, DefaultXmlNames.ELEMENT_Creator, metaData.getCreator());
+
+ //Created
+ addTextElement(metaDataNode, DefaultXmlNames.ELEMENT_Created, DATE_FORMAT.format(metaData.getCreationTime()));
+
+ //Last modified
+ addTextElement(metaDataNode, DefaultXmlNames.ELEMENT_LastChange, DATE_FORMAT.format(metaData.getLastModificationTime()));
+
+ //Comments
+ addTextElement(metaDataNode, DefaultXmlNames.ELEMENT_Comments, metaData.getComments());
+ }
+
+ private void addPage(Element parent) {
+ Element pageNode = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_Page);
+ parent.appendChild(pageNode);
+
+ //Image filename
+ addAttribute(pageNode, DefaultXmlNames.ATTR_imageFilename, page.getImageFilename());
+
+ //Width/height
+ addAttribute(pageNode, DefaultXmlNames.ATTR_imageWidth, Integer.toString(layout.getWidth()));
+ addAttribute(pageNode, DefaultXmlNames.ATTR_imageHeight, Integer.toString(layout.getHeight()));
+
+ //Other Attributes (page type, ...)
+ addContentObjectAttributes(pageNode, page.getAttributes());
+
+ //Alternative images
+ List<AlternativeImage> altImages = page.getAlternativeImages();
+ if (altImages != null) {
+ for (Iterator<AlternativeImage> it = altImages.iterator(); it.hasNext(); ) {
+ AlternativeImage img = it.next();
+
+ Element node = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_AlternativeImage);
+ pageNode.appendChild(node);
+ addAttribute(node, DefaultXmlNames.ATTR_filename, img.getFilename());
+ if (!img.getComments().isEmpty())
+ addAttribute(node, DefaultXmlNames.ATTR_comments, img.getComments());
+ }
+ }
+
+ //Border
+ GeometricObject border = layout.getBorder();
+ if (border != null) {
+ Element node = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_Border);
+ pageNode.appendChild(node);
+ addCoords(node, border.getCoords());
+ }
+
+ //Print space
+ GeometricObject printSpace = layout.getPrintSpace();
+ if (printSpace != null) {
+ Element node = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_PrintSpace);
+ pageNode.appendChild(node);
+ addCoords(node, printSpace.getCoords());
+ }
+
+ //Reading order
+ addReadingOrder(pageNode, layout.getReadingOrder());
+
+ //Layers
+ addLayers(pageNode, layout.getLayers());
+
+ //Relations
+ addRelations(pageNode, layout.getRelations());
+
+ //Regions
+ for (int i=0; i= 2) {
+ Element baselineNode = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_Baseline);
+ regionNode.appendChild(baselineNode);
+ addPointsAttribute(baselineNode, baseline);
+ }
+ }
+
+ // Nested regions
+ if (contentObj instanceof RegionContainer) {
+ RegionContainer cont = (RegionContainer)contentObj;
+ if (cont.hasRegions()) {
+ for (int i=0; i0)
+ pointList.append(" ");
+ pointList.append(Integer.toString(p.x));
+ pointList.append(",");
+ pointList.append(Integer.toString(p.y));
+ }
+ addAttribute(parent, DefaultXmlNames.ATTR_points, pointList.toString());
+ }
+
+ private void addReadingOrder(Element parent, ReadingOrder order) /*throws XMLStreamException*/ {
+ if (order == null || order.getRoot() == null || order.getRoot().getSize() == 0)
+ return;
+
+ Element node = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_ReadingOrder);
+ parent.appendChild(node);
+
+ //Root group
+ addReadingOrderGroup(node, order.getRoot(), -1);
+ }
+
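+ /**
+ * Writes a reading order group including its members.
+ * @param parent Parent XML element the group element is appended to
+ * @param group Group to write (ordered or unordered)
+ * @param index Index of the group in the parent group (use -1 if not indexed).
+ */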
+ private void addReadingOrderGroup(Element parent, Group group, int index) /*throws XMLStreamException*/ {
+ String groupElementName;
+ if (group.isOrdered())
+ groupElementName = index >= 0 ? DefaultXmlNames.ELEMENT_OrderedGroupIndexed : DefaultXmlNames.ELEMENT_OrderedGroup;
+ else
+ groupElementName = index >= 0 ? DefaultXmlNames.ELEMENT_UnorderedGroupIndexed : DefaultXmlNames.ELEMENT_UnorderedGroup;
+
+
+ Element groupNode = doc.createElementNS(getNamespace(), groupElementName);
+ parent.appendChild(groupNode);
+
+ //ID
+ addAttribute(groupNode, DefaultXmlNames.ATTR_id, group.getId().toString());
+
+ //Caption
+ if (group.getCaption() != null)
+ addAttribute(groupNode, DefaultXmlNames.ATTR_caption, group.getCaption());
+
+ //Index
+ if (index >= 0)
+ addAttribute(groupNode, DefaultXmlNames.ATTR_index, Integer.toString(index));
+ //eventWriter.add(eventFactory.createAttribute(DefaultXmlNames.ATTR_index, Integer.toString(index)));
+
+ //Children
+ GroupMember member;
+ for (int i=0; i= 0 ? DefaultXmlNames.ELEMENT_RegionRefIndexed : DefaultXmlNames.ELEMENT_RegionRef;
+
+ Element refNode = doc.createElementNS(getNamespace(), elementName);
+ parent.appendChild(refNode);
+
+ //ID Ref
+ addAttribute(refNode, DefaultXmlNames.ATTR_regionRef, regionId);
+
+ //Index
+ if (index >= 0)
+ addAttribute(refNode, DefaultXmlNames.ATTR_index, Integer.toString(index));
+ }
+
+ private void addLayers(Element parent, Layers layers) /*throws XMLStreamException*/ {
+ if (layers == null || layers.getSize() == 0)
+ return;
+
+ //Check if there are non-empty layers
+ boolean foundNonEmptyLayer = false;
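+ for (int i=0; i<layers.getSize(); i++) {
+ if (layers.getLayer(i).getSize() > 0) {
+ foundNonEmptyLayer = true;
+ break;
+ }
+ }
+ if (!foundNonEmptyLayer) //Only empty layers -> skip the whole layers element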
+ return;
+
+
+ Element layersNode = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_Layers);
+ parent.appendChild(layersNode);
+
+ Layer layer;
+ for (int i=0; i<layers.getSize(); i++) {
+ layer = layers.getLayer(i);
+ if (layer.getSize() > 0)
+ addLayer(layersNode, layer);
+ }
+ }
+
+ private void addLayer(Element parent, Layer layer) /*throws XMLStreamException*/ {
+
+ Element layerNode = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_Layer);
+ parent.appendChild(layerNode);
+
+ //ID
+ addAttribute(layerNode, DefaultXmlNames.ATTR_id, layer.getId().toString());
+
+ //Z-Index
+ addAttribute(layerNode, DefaultXmlNames.ATTR_zIndex, Integer.toString(layer.getZIndex()));
+
+ //Caption
+ if (layer.getCaption() != null)
+ addAttribute(layerNode, DefaultXmlNames.ATTR_caption, layer.getCaption());
+
+ //Region Refs
+ GroupMember member;
+ for (int i=0; i set = relations.exportRelations();
+ for (Iterator<ContentObjectRelation> it = set.iterator(); it.hasNext(); ) {
+ ContentObjectRelation rel = it.next();
+ if (rel != null) {
+ Element relationNode = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_Relation);
+ relationsNode.appendChild(relationNode);
+
+ //Type
+ addAttribute(relationNode, DefaultXmlNames.ATTR_type, rel.getRelationType().toString());
+ //Custom
+ if (!rel.getCustomField().isEmpty())
+ addAttribute(relationNode, DefaultXmlNames.ATTR_custom, rel.getCustomField());
+ //Comments
+ if (!rel.getComments().isEmpty())
+ addAttribute(relationNode, DefaultXmlNames.ATTR_comments, rel.getComments());
+
+ //Object 1
+ Element regionRefNode = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_RegionRef);
+ relationNode.appendChild(regionRefNode);
+ addAttribute(regionRefNode, DefaultXmlNames.ATTR_regionRef, rel.getObject1().getId().toString());
+
+ //Object 2
+ regionRefNode = doc.createElementNS(getNamespace(), DefaultXmlNames.ELEMENT_RegionRef);
+ relationNode.appendChild(regionRefNode);
+ addAttribute(regionRefNode, DefaultXmlNames.ATTR_regionRef, rel.getObject2().getId().toString());
+ }
+ }
+ }
+
+
+ /**
+ * Writes a single element with text content.
+ * A null text is written as an empty string.
+ */
+ private void addTextElement(Element parent, String elementName, String text) /*throws XMLStreamException*/ {
+ Element node = doc.createElementNS(getNamespace(), elementName);
+ parent.appendChild(node);
+
+ Text textNode = doc.createTextNode(text != null ? text : "");
+ node.appendChild(textNode);
+ }
+
+ @Override
+ public List<ConversionMessage> getConversionInformation() {
+ return lastConversionMessages;
+ }
+}
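
The writer stores polygon coordinates in a single `points` attribute: each point is written as `x,y` and consecutive points are separated by one space. A minimal sketch of that formatting step follows. The class `PointsAttributeSketch` and the `int[][]` input are assumptions for illustration; the original code iterates over a `Polygon` of `Point` objects instead.

```java
public class PointsAttributeSketch {

    /**
     * Joins points into the form "x1,y1 x2,y2 ...".
     * Each inner array is expected to hold exactly {x, y}.
     */
    public static String formatPoints(int[][] points) {
        StringBuilder pointList = new StringBuilder();
        for (int i = 0; i < points.length; i++) {
            // Separator goes between points, not after the last one
            if (i > 0)
                pointList.append(' ');
            pointList.append(points[i][0]);
            pointList.append(',');
            pointList.append(points[i][1]);
        }
        return pointList.toString();
    }

    public static void main(String[] args) {
        // prints "10,20 30,40 30,60"
        System.out.println(formatPoints(new int[][] {{10, 20}, {30, 40}, {30, 60}}));
    }
}
```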
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler.java
new file mode 100644
index 00000000..f8731af2
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler.java
@@ -0,0 +1,39 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml.sax;
+
+import org.primaresearch.dla.page.Page;
+import org.xml.sax.helpers.DefaultHandler;
+
+/**
+ * Abstract base class for SAX handlers intended for PAGE XML.
+ *
+ * @author Christian Clausner
+ *
+ */
+public abstract class SaxPageHandler extends DefaultHandler {
+
+ /**
+ * Returns the page object that has been created from XML
+ * @return Page object
+ */
+ public abstract Page getPageObject();
+
+
+
+
+
+}
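
Concrete handlers extend this base class, react to SAX events and hand back the object they have built once parsing is finished. The sketch below shows the same pattern on a much smaller scale: it collects region IDs instead of building a `Page`. The class names and the literal names `TextRegion` and `id` are assumptions for illustration and are not taken from `DefaultXmlNames`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxHandlerSketch {

    /** Collects the id attribute of every TextRegion element. */
    static class RegionIdHandler extends DefaultHandler {
        private final List<String> ids = new ArrayList<>();

        @Override
        public void startElement(String namespaceURI, String localName, String qName, Attributes atts) {
            if ("TextRegion".equals(localName)) {
                int i = atts.getIndex("id");
                if (i >= 0)
                    ids.add(atts.getValue(i));
            }
        }

        /** Result accessor, comparable to getPageObject() in the base class. */
        List<String> getIds() {
            return ids;
        }
    }

    public static List<String> readRegionIds(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Needed so that localName is filled in for each element
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            RegionIdHandler handler = new RegionIdHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.getIds();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Page><TextRegion id=\"r1\"/><ImageRegion id=\"r2\"/><TextRegion id=\"r3\"/></Page>";
        System.out.println(readRegionIds(xml));
    }
}
```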
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandlerFactory.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandlerFactory.java
new file mode 100644
index 00000000..e7bae4d4
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandlerFactory.java
@@ -0,0 +1,61 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml.sax;
+
+import org.primaresearch.io.xml.XmlFormatVersion;
+import org.primaresearch.io.xml.XmlModelAndValidatorProvider;
+
+/**
+ * Creates SAX handlers for PAGE XML.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class SaxPageHandlerFactory {
+
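+ /**
+ * Creates a handler for the given format.
+ * Falls back to the handler for the latest PAGE schema (2013-07-15) if the schema version is null or not matched by any other handler.
+ * @param validatorProvider Provider for XML validators
+ * @param schemaVersion XML schema version for the format
+ * @return New handler object
+ */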
+ public static SaxPageHandler createHandler(XmlModelAndValidatorProvider validatorProvider, XmlFormatVersion schemaVersion) {
+
+ if (schemaVersion != null) {
+
+ //Non-PAGE formats are identified by the schema name
+ String versionName = schemaVersion.toString();
+ //Abbyy
+ if ("http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml".equals(versionName))
+ return new SaxPageHandler_AbbyyFineReader10(validatorProvider, schemaVersion);
+ //HOCR
+ else if ("HOCR".equals(versionName))
+ return new SaxPageHandler_Hocr();
+ //ALTO
+ else if ("http://www.loc.gov/standards/alto/ns-v2#".equals(versionName))
+ return new SaxPageHandler_Alto_2_1(validatorProvider, schemaVersion);
+
+ //Old PAGE schemas
+ if (schemaVersion.isOlderThan(new XmlFormatVersion("2010-03-19")))
+ return new SaxPageHandlerLegacy(validatorProvider, schemaVersion);
+ else if (schemaVersion.isOlderThan(new XmlFormatVersion("2013-07-15")))
+ return new SaxPageHandler_2010_03_19(validatorProvider, schemaVersion);
+ }
+
+ //Latest schema
+ return new SaxPageHandler_2013_07_15(validatorProvider, schemaVersion);
+ }
+}
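
The factory picks a handler in three steps: non-PAGE formats are matched by name, older PAGE schemas by version comparison, and everything else (including a null version) gets the latest handler. The sketch below mirrors that order with plain strings. It assumes that `isOlderThan` compares versions by date; the string comparison used here is a hypothetical stand-in for `XmlFormatVersion.isOlderThan`, whose implementation is not shown in this diff, and the returned labels are placeholders for the handler classes.

```java
public class HandlerDispatchSketch {

    /**
     * Hypothetical stand-in for XmlFormatVersion.isOlderThan:
     * dates in the form yyyy-MM-dd sort correctly as plain strings.
     */
    static boolean isOlderThan(String version, String reference) {
        return version.compareTo(reference) < 0;
    }

    /** Returns a label for the handler that would be chosen. */
    public static String chooseHandler(String schemaVersion) {
        if (schemaVersion != null) {
            // Non-PAGE formats are matched by name first
            if ("HOCR".equals(schemaVersion))
                return "hocr";

            // Old PAGE schemas, oldest range first
            if (isOlderThan(schemaVersion, "2010-03-19"))
                return "legacy";
            else if (isOlderThan(schemaVersion, "2013-07-15"))
                return "2010-03-19";
        }
        // Latest schema (also used when no version is given)
        return "2013-07-15";
    }

    public static void main(String[] args) {
        System.out.println(chooseHandler("2010-01-12")); // legacy
        System.out.println(chooseHandler("2010-03-19")); // 2010-03-19
        System.out.println(chooseHandler(null));         // 2013-07-15
    }
}
```

Checking the oldest range first keeps each branch to a single comparison, since any version reaching the second check is already known to be 2010-03-19 or newer.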
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandlerLegacy.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandlerLegacy.java
new file mode 100644
index 00000000..434c454f
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandlerLegacy.java
@@ -0,0 +1,531 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml.sax;
+
+import java.text.DateFormat;
+import java.text.ParseException;
+import java.text.SimpleDateFormat;
+import java.util.Date;
+
+import org.primaresearch.dla.page.MetaData;
+import org.primaresearch.dla.page.Page;
+import org.primaresearch.dla.page.io.xml.DefaultXmlNames;
+import org.primaresearch.dla.page.layout.GeometricObjectImpl;
+import org.primaresearch.dla.page.layout.PageLayout;
+import org.primaresearch.dla.page.layout.logical.Group;
+import org.primaresearch.dla.page.layout.logical.Layer;
+import org.primaresearch.dla.page.layout.logical.ReadingOrder;
+import org.primaresearch.dla.page.layout.physical.ContentObject;
+import org.primaresearch.dla.page.layout.physical.Region;
+import org.primaresearch.dla.page.layout.physical.shared.RegionType;
+import org.primaresearch.dla.page.layout.physical.text.TextObject;
+import org.primaresearch.dla.page.layout.physical.text.impl.Glyph;
+import org.primaresearch.dla.page.layout.physical.text.impl.TextLine;
+import org.primaresearch.dla.page.layout.physical.text.impl.TextRegion;
+import org.primaresearch.dla.page.layout.physical.text.impl.Word;
+import org.primaresearch.dla.page.layout.shared.GeometricObject;
+import org.primaresearch.ident.IdRegister.InvalidIdException;
+import org.primaresearch.ident.Identifiable;
+import org.primaresearch.io.xml.XmlFormatVersion;
+import org.primaresearch.io.xml.XmlModelAndValidatorProvider;
+import org.primaresearch.io.xml.XmlModelAndValidatorProvider.UnsupportedSchemaVersionException;
+import org.primaresearch.maths.geometry.Polygon;
+import org.primaresearch.shared.variable.Variable;
+import org.primaresearch.shared.variable.VariableMap;
+import org.xml.sax.Attributes;
+import org.xml.sax.SAXException;
+
+/**
+ * Handler for PAGE schema version 2010-01-12 and older.
+ * @author Christian Clausner
+ *
+ */
+public class SaxPageHandlerLegacy extends SaxPageHandler {
+
+ private static DateFormat DATE_FORMAT = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
+
+
+ private Page page = null;
+ private PageLayout layout = null;
+ private MetaData metaData = null;
+
+ private GeometricObject currentGeometricObject = null;
+ private Region currentRegion = null;
+ private TextLine currentTextLine = null;
+ private Word currentWord = null;
+ private Glyph currentGlyph = null;
+ private TextObject currentTextObject = null;
+ private String insideElement = null;
+ private ReadingOrder readingOrder = null;
+ private Group currentLogicalGroup;
+ private StringBuffer currentText = null;
+ XmlModelAndValidatorProvider validatorProvider;
+ XmlFormatVersion schemaVersion;
+
+ public SaxPageHandlerLegacy(XmlModelAndValidatorProvider validatorProvider, XmlFormatVersion schemaVersion) {
+ this.validatorProvider = validatorProvider;
+ this.schemaVersion = schemaVersion;
+ }
+
+ @Override
+ public Page getPageObject() {
+ return page;
+ }
+
+ /**
+ * Receive notification of the start of an element.
+ *
+ * @param namespaceURI - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
+ * @param localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
+ * @param qName - The qualified name (with prefix), or the empty string if qualified names are not available.
+ * @param atts - The attributes attached to the element. If there are no attributes, it shall be an empty Attributes object.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ @Override
+ public void startElement(String namespaceURI, String localName, String qName, Attributes atts)
+ throws SAXException {
+
+ //Handle accumulated text
+ finishText();
+
+ insideElement = localName;
+
+ if (DefaultXmlNames.ELEMENT_PcGts.equals(localName)){
+ createPageObject();
+ //GtsID
+ int i;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_pcGtsId)) >= 0) {
+ try {
+ page.setGtsId(atts.getValue(i));
+ } catch (InvalidIdException e) {
+ e.printStackTrace();
+ }
+ }
+ }
+ else if (DefaultXmlNames.ELEMENT_Page.equals(localName)){
+ handlePageElement(atts);
+ }
+ else if ( DefaultXmlNames.ELEMENT_Border.equals(localName)
+ || DefaultXmlNames.ELEMENT_PrintSpace.equals(localName)) {
+ currentGeometricObject = new GeometricObjectImpl(new Polygon());
+ }
+ else if (DefaultXmlNames.ELEMENT_Coords.equals(localName)) {
+ if (currentGeometricObject != null)
+ currentGeometricObject.setCoords(new Polygon());
+ }
+ else if (DefaultXmlNames.ELEMENT_Point.equals(localName)) {
+ handlePolygonPoint(atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_TextRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.TextRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ currentTextObject = (TextObject)currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_ImageRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.ImageRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_GraphicRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.GraphicRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_LineDrawingRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.LineDrawingRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_ChartRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.ChartRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_SeparatorRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.SeparatorRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_MathsRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.MathsRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_TableRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.TableRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_FrameRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.GraphicRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_NoiseRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.NoiseRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_UnknownRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.UnknownRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_TextLine.equals(localName)) {
+ currentTextLine = null;
+ if (currentRegion != null && currentRegion.getType() == RegionType.TextRegion)
+ currentTextLine = ((TextRegion)currentRegion).createTextLine(readId(atts));
+ currentGeometricObject = currentTextLine;
+ currentTextObject = currentTextLine;
+ handleContentObject(currentTextLine, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_Word.equals(localName)) {
+ currentWord = null;
+ if (currentTextLine != null)
+ currentWord = currentTextLine.createWord(readId(atts));
+ currentGeometricObject = currentWord;
+ currentTextObject = currentWord;
+ handleContentObject(currentWord, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_Glyph.equals(localName)) {
+ currentGlyph = null;
+ if (currentWord != null)
+ currentGlyph = currentWord.createGlyph(readId(atts));
+ currentGeometricObject = currentGlyph;
+ currentTextObject = currentGlyph;
+ handleContentObject(currentGlyph, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_ReadingOrder.equals(localName)) {
+ readingOrder = layout.createReadingOrder();
+ currentLogicalGroup = readingOrder.getRoot();
+ }
+ else if ( DefaultXmlNames.ELEMENT_OrderedGroup.equals(localName)
+ || DefaultXmlNames.ELEMENT_OrderedGroupIndexed.equals(localName)) {
+ if (currentLogicalGroup == readingOrder.getRoot())
+ currentLogicalGroup.setOrdered(DefaultXmlNames.ELEMENT_OrderedGroupIndexed.equals(localName));
+
+ Group group;
+ try {
+ group = currentLogicalGroup.createChildGroup();
+ } catch (Exception e) {
+ e.printStackTrace();
+ return;
+ }
+ group.setOrdered(true);
+ parseId(group, atts);
+
+ currentLogicalGroup = group;
+ }
+ else if ( DefaultXmlNames.ELEMENT_UnorderedGroup.equals(localName)
+ || DefaultXmlNames.ELEMENT_UnorderedGroupIndexed.equals(localName)) {
+ if (currentLogicalGroup == readingOrder.getRoot())
+ currentLogicalGroup.setOrdered(DefaultXmlNames.ELEMENT_UnorderedGroupIndexed.equals(localName));
+
+ Group group;
+ try {
+ group = currentLogicalGroup.createChildGroup();
+ } catch (Exception e) {
+ e.printStackTrace();
+ return;
+ }
+ group.setOrdered(false);
+ parseId(group, atts);
+
+ currentLogicalGroup = group;
+ }
+ else if ( DefaultXmlNames.ELEMENT_RegionRef.equals(localName)
+ || DefaultXmlNames.ELEMENT_RegionRefIndexed.equals(localName)) {
+ if (readingOrder != null) {
+ if (currentLogicalGroup == readingOrder.getRoot())
+ currentLogicalGroup.setOrdered(DefaultXmlNames.ELEMENT_RegionRefIndexed.equals(localName));
+ }
+
+ int i;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_regionRef)) >= 0) {
+ currentLogicalGroup.addRegionRef(atts.getValue(i));
+ }
+ }
+ else if (DefaultXmlNames.ELEMENT_Layers.equals(localName)) {
+ layout.createLayers();
+ currentLogicalGroup = null;
+ }
+ else if (DefaultXmlNames.ELEMENT_Layer.equals(localName)) {
+ Layer layer = layout.getLayers().createLayer();
+ currentLogicalGroup = layer;
+ int i;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_zIndex)) >= 0) {
+ layer.setZIndex(Integer.parseInt(atts.getValue(i)));
+ }
+ parseId(layer, atts);
+ }
+ }
+
+ /**
+ * Receive notification of the end of an element.
+ *
+ * @param namespaceURI - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
+ * @param localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
+ * @param qName - The qualified name (with prefix), or the empty string if qualified names are not available.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void endElement(String namespaceURI, String localName, String qName)
+ throws SAXException {
+
+ //Handle accumulated text
+ finishText();
+
+ insideElement = null;
+
+ if (DefaultXmlNames.ELEMENT_Border.equals(localName)) {
+ layout.setBorder(currentGeometricObject);
+ currentGeometricObject = null;
+ }
+ else if (DefaultXmlNames.ELEMENT_PrintSpace.equals(localName)) {
+ layout.setPrintSpace(currentGeometricObject);
+ currentGeometricObject = null;
+ }
+ else if ( DefaultXmlNames.ELEMENT_TextRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_ImageRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_GraphicRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_LineDrawingRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_ChartRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_SeparatorRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_MathsRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_TableRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_FrameRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_NoiseRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_UnknownRegion.equals(localName)
+ ) {
+ currentRegion = null;
+ currentGeometricObject = null;
+ currentTextObject = null;
+ }
+ else if ( DefaultXmlNames.ELEMENT_TextLine.equals(localName)) {
+ currentTextLine = null;
+ currentGeometricObject = currentRegion; //Set to parent
+ currentTextObject = (TextObject)currentRegion;
+ }
+ else if ( DefaultXmlNames.ELEMENT_Word.equals(localName)) {
+ currentWord = null;
+ currentGeometricObject = currentTextLine; //Set to parent
+ currentTextObject = currentTextLine;
+ }
+ else if ( DefaultXmlNames.ELEMENT_Glyph.equals(localName)) {
+ currentGlyph = null;
+ currentGeometricObject = currentWord; //Set to parent
+ currentTextObject = currentWord;
+ }
+ else if (DefaultXmlNames.ELEMENT_ReadingOrder.equals(localName)) {
+
+ //If the root group only contains one group as member, we make that member the root
+ Group root = readingOrder.getRoot();
+ if (root.getSize() == 1 && root.getMember(0) instanceof Group) {
+ Group child = (Group)root.getMember(0);
+ root.setOrdered(child.isOrdered());
+ //Copy all children of the child group to the root group
+ while (child.getSize()>0)
+ child.getMember(0).moveTo(root);
+ //Remove the child group from the root
+ root.delete(child);
+ }
+
+ currentLogicalGroup = null;
+ readingOrder = null;
+ }
+ else if ( DefaultXmlNames.ELEMENT_OrderedGroup.equals(localName)
+ || DefaultXmlNames.ELEMENT_OrderedGroupIndexed.equals(localName)) {
+ currentLogicalGroup = currentLogicalGroup.getParent();
+ }
+ else if ( DefaultXmlNames.ELEMENT_UnorderedGroup.equals(localName)
+ || DefaultXmlNames.ELEMENT_UnorderedGroupIndexed.equals(localName)) {
+ currentLogicalGroup = currentLogicalGroup.getParent();
+ }
+ else if (DefaultXmlNames.ELEMENT_Layer.equals(localName)) {
+ currentLogicalGroup = null;
+ }
+ }
+
+ /**
+ * Receive notification of character data inside an element.
+ * @param ch - The characters.
+ * @param start - The start position in the character array.
+ * @param length - The number of characters to use from the character array.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void characters(char[] ch, int start, int length)
+ throws SAXException {
+
+ String strValue = new String(ch, start, length);
+
+ //Text might be parsed bit by bit, so we have to accumulate until a closing tag is found.
+ if (currentText == null)
+ currentText = new StringBuffer();
+ currentText.append(strValue);
+ }
+
+ /**
+ * Writes accumulated text to the right object.
+ */
+ private void finishText() {
+ if (currentText != null) {
+ String strValue = currentText.toString();
+
+ if (currentTextObject != null) {
+ if (DefaultXmlNames.ELEMENT_Unicode.equals(insideElement)) {
+ currentTextObject.setText(strValue);
+ }
+ else if (DefaultXmlNames.ELEMENT_PlainText.equals(insideElement)) {
+ currentTextObject.setPlainText(strValue);
+ }
+ }
+ if (metaData != null) {
+ if (DefaultXmlNames.ELEMENT_Creator.equals(insideElement)) {
+ metaData.setCreator(strValue);
+ }
+ else if (DefaultXmlNames.ELEMENT_Comments.equals(insideElement)) {
+ metaData.setComments(strValue);
+ }
+ else if (DefaultXmlNames.ELEMENT_Created.equals(insideElement)) {
+ metaData.setCreationTime(parseDate(strValue));
+ }
+ else if (DefaultXmlNames.ELEMENT_LastChange.equals(insideElement)) {
+ metaData.setLastModifiedTime(parseDate(strValue));
+ }
+ }
+
+ currentText = null;
+ }
+ }
+
+ private void createPageObject() {
+ if (validatorProvider != null && schemaVersion != null) {
+ try {
+ page = new Page(validatorProvider.getSchemaParser(schemaVersion));
+ //page.setFormatVersion(schemaVersion);
+ } catch (UnsupportedSchemaVersionException e) {
+ e.printStackTrace();
+ page = new Page();
+ }
+ }
+ else
+ page = new Page();
+
+ layout = page.getLayout();
+ metaData = page.getMetaData();
+ }
+
+ /**
+ * Reads the attributes of the Page element.
+ */
+ private void handlePageElement(Attributes atts) {
+ int i;
+
+ //Size
+ int width = 0;
+ int height = 0;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_imageWidth)) >= 0) {
+ width = Integer.parseInt(atts.getValue(i));
+ }
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_imageHeight)) >= 0) {
+ height = Integer.parseInt(atts.getValue(i));
+ }
+ page.getLayout().setSize(width, height);
+
+ //Image filename
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_imageFilename)) >= 0) {
+ page.setImageFilename(atts.getValue(i));
+ }
+
+ }
+
+ /**
+ * Reads the coordinates of a single polygon point.
+ */
+ private void handlePolygonPoint(Attributes atts) {
+ if (currentGeometricObject == null || currentGeometricObject.getCoords() == null)
+ return;
+
+ int x=0;
+ int y=0;
+ int i;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_x)) >= 0) {
+ x = Integer.parseInt(atts.getValue(i));
+ }
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_y)) >= 0) {
+ y = Integer.parseInt(atts.getValue(i));
+ }
+ currentGeometricObject.getCoords().addPoint(x, y);
+ }
+
+ /**
+ * Reads the attributes of a content object.
+ */
+ private void handleContentObject(ContentObject obj, Attributes atts) {
+ if (obj == null) //Object could not be created (e.g. a TextLine outside a TextRegion)
+ return;
+
+ //Id
+ //parseId(obj, atts);
+
+ //Attributes
+ VariableMap map = obj.getAttributes();
+ int p;
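+ for (int i=0; i<map.getSize(); i++) {
+ Variable var = map.get(i);
+ if ((p = atts.getIndex(getXmlAttributeName(var.getName()))) >= 0) {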
+ var.parseValue(atts.getValue(p));
+ }
+ }
+ }
+
+ private String getXmlAttributeName(String name) {
+ return name; //TODO Should there be a mechanism to translate attribute names to XML names?
+ }
+
+ /**
+ * Parses a date given as string using the default date format.
+ */
+ private Date parseDate(String str) {
+ try {
+ return DATE_FORMAT.parse(str);
+ } catch (ParseException e) {
+ return new Date();
+ }
+ }
+
+ private String readId(Attributes atts) {
+ int i;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_id)) >= 0)
+ return atts.getValue(i);
+ return "";
+ }
+
+ /**
+ * Reads the ID attribute and sets it in the Identifiable object.
+ */
+ private void parseId(Identifiable ident, Attributes atts) {
+ int i;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_id)) >= 0) {
+ try {
+ ident.setId(atts.getValue(i));
+ } catch (InvalidIdException e) {
+ //TODO Manage ID conflicts
+ e.printStackTrace();
+ }
+ }
+ }
+
+
+}
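The endElement handler above collapses a reading-order root that contains a single group: the root takes over the child's ordering flag and members, and the emptied child is removed. The following is a minimal sketch of that step only. The Node class and the method names are hypothetical stand-ins for illustration; they are not the PrimaDla Group API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; Node stands in for the library's group/member types.
class RootCollapseSketch {

    static class Node {
        final String name;
        final boolean isGroup;
        boolean ordered;
        final List<Node> members = new ArrayList<>();

        Node(String name, boolean isGroup, boolean ordered) {
            this.name = name;
            this.isGroup = isGroup;
            this.ordered = ordered;
        }
    }

    // If the root holds exactly one member and that member is a group,
    // the root adopts the child's ordering flag and its members.
    static void collapseSingleChildGroup(Node root) {
        if (root.members.size() == 1 && root.members.get(0).isGroup) {
            Node child = root.members.get(0);
            root.ordered = child.ordered;
            root.members.addAll(child.members); // move the grandchildren up
            child.members.clear();
            root.members.remove(child);         // drop the emptied wrapper
        }
    }

    // Builds root -> [ordered group -> [r1, r2]] and returns the collapsed root.
    static Node demo() {
        Node root = new Node("root", true, false);
        Node group = new Node("g1", true, true);
        group.members.add(new Node("r1", false, false));
        group.members.add(new Node("r2", false, false));
        root.members.add(group);
        collapseSingleChildGroup(root);
        return root;
    }

    public static void main(String[] args) {
        Node root = demo();
        System.out.println("ordered=" + root.ordered + ", members=" + root.members.size());
    }
}
```

A root with two or more members, or with a single non-group member, is left untouched by this sketch, which mirrors the condition checked in the handler.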
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_2010_03_19.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_2010_03_19.java
new file mode 100644
index 00000000..ff71442d
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_2010_03_19.java
@@ -0,0 +1,515 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml.sax;
+
+import java.text.DateFormat;
+import java.text.ParseException;
+import java.text.SimpleDateFormat;
+import java.util.Date;
+
+import org.primaresearch.dla.page.MetaData;
+import org.primaresearch.dla.page.Page;
+import org.primaresearch.dla.page.io.xml.DefaultXmlNames;
+import org.primaresearch.dla.page.layout.GeometricObjectImpl;
+import org.primaresearch.dla.page.layout.PageLayout;
+import org.primaresearch.dla.page.layout.logical.Group;
+import org.primaresearch.dla.page.layout.logical.Layer;
+import org.primaresearch.dla.page.layout.logical.ReadingOrder;
+import org.primaresearch.dla.page.layout.physical.ContentObject;
+import org.primaresearch.dla.page.layout.physical.Region;
+import org.primaresearch.dla.page.layout.physical.shared.RegionType;
+import org.primaresearch.dla.page.layout.physical.text.TextObject;
+import org.primaresearch.dla.page.layout.physical.text.impl.Glyph;
+import org.primaresearch.dla.page.layout.physical.text.impl.TextLine;
+import org.primaresearch.dla.page.layout.physical.text.impl.TextRegion;
+import org.primaresearch.dla.page.layout.physical.text.impl.Word;
+import org.primaresearch.dla.page.layout.shared.GeometricObject;
+import org.primaresearch.ident.IdRegister.InvalidIdException;
+import org.primaresearch.ident.Identifiable;
+import org.primaresearch.io.xml.XmlFormatVersion;
+import org.primaresearch.io.xml.XmlModelAndValidatorProvider;
+import org.primaresearch.io.xml.XmlModelAndValidatorProvider.UnsupportedSchemaVersionException;
+import org.primaresearch.maths.geometry.Polygon;
+import org.primaresearch.shared.variable.Variable;
+import org.primaresearch.shared.variable.VariableMap;
+import org.xml.sax.Attributes;
+import org.xml.sax.SAXException;
+
+/**
+ * Legacy PAGE XML handler for 2010 format.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class SaxPageHandler_2010_03_19 extends SaxPageHandler {
+
+ private static DateFormat DATE_FORMAT = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
+
+
+ private Page page = null;
+ private PageLayout layout = null;
+ private MetaData metaData = null;
+
+ private GeometricObject currentGeometricObject = null;
+ private Region currentRegion = null;
+ private TextLine currentTextLine = null;
+ private Word currentWord = null;
+ private Glyph currentGlyph = null;
+ private TextObject currentTextObject = null;
+ private String insideElement = null;
+ private ReadingOrder readingOrder = null;
+ private Group currentLogicalGroup;
+ private StringBuffer currentText = null;
+ private XmlModelAndValidatorProvider validatorProvider;
+ private XmlFormatVersion schemaVersion;
+
+ public SaxPageHandler_2010_03_19(XmlModelAndValidatorProvider validatorProvider, XmlFormatVersion schemaVersion) {
+ this.validatorProvider = validatorProvider;
+ this.schemaVersion = schemaVersion;
+ }
+
+ public Page getPageObject() {
+ return page;
+ }
+
+ /**
+ * Receive notification of the start of an element.
+ *
+ * @param namespaceURI - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
+ * @param localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
+ * @param qName - The qualified name (with prefix), or the empty string if qualified names are not available.
+ * @param atts - The attributes attached to the element. If there are no attributes, it shall be an empty Attributes object.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void startElement(String namespaceURI, String localName, String qName, Attributes atts)
+ throws SAXException {
+
+ //Handle accumulated text
+ finishText();
+
+ insideElement = localName;
+
+ if (DefaultXmlNames.ELEMENT_PcGts.equals(localName)){
+ createPageObject();
+ //GtsID
+ int i;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_pcGtsId)) >= 0) {
+ try {
+ page.setGtsId(atts.getValue(i));
+ } catch (InvalidIdException e) {
+ e.printStackTrace();
+ }
+ }
+ }
+ if (DefaultXmlNames.ELEMENT_Page.equals(localName)){
+ handlePageElement(atts);
+ }
+ else if ( DefaultXmlNames.ELEMENT_Border.equals(localName)
+ || DefaultXmlNames.ELEMENT_PrintSpace.equals(localName)) {
+ currentGeometricObject = new GeometricObjectImpl(new Polygon());
+ }
+ else if (DefaultXmlNames.ELEMENT_Coords.equals(localName)) {
+ if (currentGeometricObject != null)
+ currentGeometricObject.setCoords(new Polygon());
+ }
+ else if (DefaultXmlNames.ELEMENT_Point.equals(localName)) {
+ handlePolygonPoint(atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_TextRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.TextRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ currentTextObject = (TextObject)currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_ImageRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.ImageRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_GraphicRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.GraphicRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_LineDrawingRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.LineDrawingRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_ChartRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.ChartRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_SeparatorRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.SeparatorRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_MathsRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.MathsRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_TableRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.TableRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_FrameRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.GraphicRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_NoiseRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.NoiseRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_UnknownRegion.equals(localName)) {
+ currentRegion = layout.createRegion(RegionType.UnknownRegion, readId(atts));
+ currentGeometricObject = currentRegion;
+ handleContentObject(currentRegion, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_TextLine.equals(localName)) {
+ currentTextLine = null;
+ if (currentRegion != null && currentRegion.getType() == RegionType.TextRegion)
+ currentTextLine = ((TextRegion)currentRegion).createTextLine(readId(atts));
+ currentGeometricObject = currentTextLine;
+ currentTextObject = currentTextLine;
+ handleContentObject(currentTextLine, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_Word.equals(localName)) {
+ currentWord = null;
+ if (currentTextLine != null)
+ currentWord = currentTextLine.createWord(readId(atts));
+ currentGeometricObject = currentWord;
+ currentTextObject = currentWord;
+ handleContentObject(currentWord, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_Glyph.equals(localName)) {
+ currentGlyph = null;
+ if (currentWord != null)
+ currentGlyph = currentWord.createGlyph(readId(atts));
+ currentGeometricObject = currentGlyph;
+ currentTextObject = currentGlyph;
+ handleContentObject(currentGlyph, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_ReadingOrder.equals(localName)) {
+ readingOrder = layout.createReadingOrder();
+ currentLogicalGroup = null;
+ }
+ else if ( DefaultXmlNames.ELEMENT_OrderedGroup.equals(localName)
+ || DefaultXmlNames.ELEMENT_OrderedGroupIndexed.equals(localName)) {
+ Group group;
+ if (currentLogicalGroup == null) //Root group
+ group = readingOrder.getRoot();
+ else //Child group
+ {
+ try {
+ group = currentLogicalGroup.createChildGroup();
+ } catch (Exception e) {
+ e.printStackTrace();
+ return;
+ }
+ }
+ group.setOrdered(true);
+ currentLogicalGroup = group;
+ parseId(group, atts);
+ }
+ else if ( DefaultXmlNames.ELEMENT_UnorderedGroup.equals(localName)
+ || DefaultXmlNames.ELEMENT_UnorderedGroupIndexed.equals(localName)) {
+ Group group;
+ if (currentLogicalGroup == null) //Root group
+ group = readingOrder.getRoot();
+ else //Child group
+ {
+ try {
+ group = currentLogicalGroup.createChildGroup();
+ } catch (Exception e) {
+ e.printStackTrace();
+ return;
+ }
+ }
+ group.setOrdered(false);
+ currentLogicalGroup = group;
+ parseId(group, atts);
+ }
+ else if ( DefaultXmlNames.ELEMENT_RegionRef.equals(localName)
+ || DefaultXmlNames.ELEMENT_RegionRefIndexed.equals(localName)) {
+
+ int i;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_regionRef)) >= 0) {
+ currentLogicalGroup.addRegionRef(atts.getValue(i));
+ }
+ }
+ else if (DefaultXmlNames.ELEMENT_Layers.equals(localName)) {
+ layout.createLayers();
+ currentLogicalGroup = null;
+ }
+ else if (DefaultXmlNames.ELEMENT_Layer.equals(localName)) {
+ Layer layer = layout.getLayers().createLayer();
+ currentLogicalGroup = layer;
+ int i;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_zIndex)) >= 0) {
+ layer.setZIndex(Integer.parseInt(atts.getValue(i)));
+ }
+ parseId(layer, atts);
+ }
+ }
+
+ /**
+ * Receive notification of the end of an element.
+ *
+ * @param namespaceURI - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
+ * @param localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
+ * @param qName - The qualified name (with prefix), or the empty string if qualified names are not available.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void endElement(String namespaceURI, String localName, String qName)
+ throws SAXException {
+
+ //Handle accumulated text
+ finishText();
+
+ insideElement = null;
+
+ if (DefaultXmlNames.ELEMENT_Border.equals(localName)) {
+ layout.setBorder(currentGeometricObject);
+ currentGeometricObject = null;
+ }
+ else if (DefaultXmlNames.ELEMENT_PrintSpace.equals(localName)) {
+ layout.setPrintSpace(currentGeometricObject);
+ currentGeometricObject = null;
+ }
+ else if ( DefaultXmlNames.ELEMENT_TextRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_ImageRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_GraphicRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_LineDrawingRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_ChartRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_SeparatorRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_MathsRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_TableRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_FrameRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_NoiseRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_UnknownRegion.equals(localName)
+ ) {
+ currentRegion = null;
+ currentGeometricObject = null;
+ currentTextObject = null;
+ }
+ else if ( DefaultXmlNames.ELEMENT_TextLine.equals(localName)) {
+ currentTextLine = null;
+ currentGeometricObject = currentRegion; //Set to parent
+ currentTextObject = (TextObject)currentRegion;
+ }
+ else if ( DefaultXmlNames.ELEMENT_Word.equals(localName)) {
+ currentWord = null;
+ currentGeometricObject = currentTextLine; //Set to parent
+ currentTextObject = currentTextLine;
+ }
+ else if ( DefaultXmlNames.ELEMENT_Glyph.equals(localName)) {
+ currentGlyph = null;
+ currentGeometricObject = currentWord; //Set to parent
+ currentTextObject = currentWord;
+ }
+ else if (DefaultXmlNames.ELEMENT_ReadingOrder.equals(localName)) {
+ currentLogicalGroup = null;
+ readingOrder = null;
+ }
+ else if ( DefaultXmlNames.ELEMENT_OrderedGroup.equals(localName)
+ || DefaultXmlNames.ELEMENT_OrderedGroupIndexed.equals(localName)) {
+ currentLogicalGroup = currentLogicalGroup.getParent();
+ }
+ else if ( DefaultXmlNames.ELEMENT_UnorderedGroup.equals(localName)
+ || DefaultXmlNames.ELEMENT_UnorderedGroupIndexed.equals(localName)) {
+ currentLogicalGroup = currentLogicalGroup.getParent();
+ }
+ else if (DefaultXmlNames.ELEMENT_Layer.equals(localName)) {
+ currentLogicalGroup = null;
+ }
+ }
+
+ /**
+ * Receive notification of character data inside an element.
+ * @param ch - The characters.
+ * @param start - The start position in the character array.
+ * @param length - The number of characters to use from the character array.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void characters(char[] ch, int start, int length)
+ throws SAXException {
+
+ String strValue = new String(ch, start, length);
+
+ //Text might be parsed bit by bit, so we have to accumulate until a closing tag is found.
+ if (currentText == null)
+ currentText = new StringBuffer();
+ currentText.append(strValue);
+ }
+
+ /**
+ * Writes accumulated text to the right object.
+ */
+ private void finishText() {
+ if (currentText != null) {
+ String strValue = currentText.toString();
+
+ if (currentTextObject != null) {
+ if (DefaultXmlNames.ELEMENT_Unicode.equals(insideElement)) {
+ currentTextObject.setText(strValue);
+ }
+ else if (DefaultXmlNames.ELEMENT_PlainText.equals(insideElement)) {
+ currentTextObject.setPlainText(strValue);
+ }
+ }
+ if (metaData != null) {
+ if (DefaultXmlNames.ELEMENT_Creator.equals(insideElement)) {
+ metaData.setCreator(strValue);
+ }
+ else if (DefaultXmlNames.ELEMENT_Comments.equals(insideElement)) {
+ metaData.setComments(strValue);
+ }
+ else if (DefaultXmlNames.ELEMENT_Created.equals(insideElement)) {
+ metaData.setCreationTime(parseDate(strValue));
+ }
+ else if (DefaultXmlNames.ELEMENT_LastChange.equals(insideElement)) {
+ metaData.setLastModifiedTime(parseDate(strValue));
+ }
+ }
+
+ currentText = null;
+ }
+ }
+
+ private void createPageObject() {
+ if (validatorProvider != null && schemaVersion != null) {
+ try {
+ page = new Page(validatorProvider.getSchemaParser(schemaVersion));
+ //page.setFormatVersion(schemaVersion);
+ } catch (UnsupportedSchemaVersionException e) {
+ e.printStackTrace();
+ page = new Page();
+ }
+ }
+ else
+ page = new Page();
+
+ layout = page.getLayout();
+ metaData = page.getMetaData();
+ }
+
+ /**
+ * Reads the attributes of the Page element.
+ */
+ private void handlePageElement(Attributes atts) {
+ int i;
+
+ //Size
+ int width = 0;
+ int height = 0;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_imageWidth)) >= 0) {
+ width = Integer.parseInt(atts.getValue(i));
+ }
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_imageHeight)) >= 0) {
+ height = Integer.parseInt(atts.getValue(i));
+ }
+ page.getLayout().setSize(width, height);
+
+ //Image filename
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_imageFilename)) >= 0) {
+ page.setImageFilename(atts.getValue(i));
+ }
+
+ }
+
+ /**
+ * Reads the coordinates of a single polygon point.
+ */
+ private void handlePolygonPoint(Attributes atts) {
+ if (currentGeometricObject == null || currentGeometricObject.getCoords() == null)
+ return;
+
+ int x=0;
+ int y=0;
+ int i;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_x)) >= 0) {
+ x = Integer.parseInt(atts.getValue(i));
+ }
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_y)) >= 0) {
+ y = Integer.parseInt(atts.getValue(i));
+ }
+ currentGeometricObject.getCoords().addPoint(x, y);
+ }
+
+ /**
+ * Reads the attributes of a content object.
+ */
+ private void handleContentObject(ContentObject obj, Attributes atts) {
+ if (obj == null) //Object could not be created (e.g. a TextLine outside a TextRegion)
+ return;
+
+ //Id
+ //parseId(obj, atts);
+
+ //Attributes
+ VariableMap map = obj.getAttributes();
+ int p;
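+ for (int i=0; i<map.getSize(); i++) {
+ Variable var = map.get(i);
+ if ((p = atts.getIndex(getXmlAttributeName(var.getName()))) >= 0) {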
+ var.parseValue(atts.getValue(p));
+ }
+ }
+ }
+
+ private String getXmlAttributeName(String name) {
+ return name; //TODO Should there be a mechanism to translate attribute names to XML names?
+ }
+
+ /**
+ * Parses a date given as string using the default date format.
+ */
+ private Date parseDate(String str) {
+ try {
+ return DATE_FORMAT.parse(str);
+ } catch (ParseException e) {
+ return new Date();
+ }
+ }
+
+ private String readId(Attributes atts) {
+ int i;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_id)) >= 0)
+ return atts.getValue(i);
+ return "";
+ }
+
+ /**
+ * Reads the ID attribute and sets it in the Identifiable object.
+ */
+ private void parseId(Identifiable ident, Attributes atts) {
+ int i;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_id)) >= 0) {
+ try {
+ ident.setId(atts.getValue(i));
+ } catch (InvalidIdException e) {
+ //TODO Manage ID conflicts
+ e.printStackTrace();
+ }
+ }
+ }
+}
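The handler above buffers character data because a SAX parser may deliver the text of one element in several characters() calls; the buffer is flushed whenever an element starts or ends. The following is a minimal, self-contained sketch of that pattern using only the JDK's SAX classes. The class name TextAccumulationSketch and the literal element name "Unicode" are assumptions made for illustration; the real handler compares against DefaultXmlNames constants.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of PrimaDla.
class TextAccumulationSketch extends DefaultHandler {

    private StringBuilder currentText = null;
    private String insideElement = null;
    private String unicodeText = null;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        finishText();              // flush text that belongs to the previous element
        insideElement = localName;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        finishText();              // flush text before leaving the element
        insideElement = null;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so append instead of assigning.
        if (currentText == null)
            currentText = new StringBuilder();
        currentText.append(ch, start, length);
    }

    private void finishText() {
        if (currentText != null) {
            if ("Unicode".equals(insideElement)) // assumed element name
                unicodeText = currentText.toString();
            currentText = null;
        }
    }

    // Parses the given XML string and returns the text of the last Unicode element.
    static String readUnicode(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // required so that localName is populated
            SAXParser parser = factory.newSAXParser();
            TextAccumulationSketch handler = new TextAccumulationSketch();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.unicodeText;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readUnicode("<TextEquiv><Unicode>Tom &amp; Jerry</Unicode></TextEquiv>"));
    }
}
```

Entity references such as &amp; are a common reason for split characters() calls, which is why the sketch appends to a buffer rather than overwriting a field.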
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_2013_07_15.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_2013_07_15.java
new file mode 100644
index 00000000..3c90a31e
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_2013_07_15.java
@@ -0,0 +1,733 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml.sax;
+
+import java.text.DateFormat;
+import java.text.ParseException;
+import java.text.SimpleDateFormat;
+import java.util.ArrayList;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Stack;
+
+import org.primaresearch.dla.page.MetaData;
+import org.primaresearch.dla.page.Page;
+import org.primaresearch.dla.page.Page.AlternativeImage;
+import org.primaresearch.dla.page.io.xml.DefaultXmlNames;
+import org.primaresearch.dla.page.layout.GeometricObjectImpl;
+import org.primaresearch.dla.page.layout.PageLayout;
+import org.primaresearch.dla.page.layout.logical.ContentObjectRelation;
+import org.primaresearch.dla.page.layout.logical.ContentObjectRelation.RelationType;
+import org.primaresearch.dla.page.layout.logical.Group;
+import org.primaresearch.dla.page.layout.logical.Layer;
+import org.primaresearch.dla.page.layout.logical.ReadingOrder;
+import org.primaresearch.dla.page.layout.logical.Relations;
+import org.primaresearch.dla.page.layout.physical.AttributeContainer;
+import org.primaresearch.dla.page.layout.physical.ContentObject;
+import org.primaresearch.dla.page.layout.physical.Region;
+import org.primaresearch.dla.page.layout.physical.RegionContainer;
+import org.primaresearch.dla.page.layout.physical.shared.RegionType;
+import org.primaresearch.dla.page.layout.physical.text.TextObject;
+import org.primaresearch.dla.page.layout.physical.text.impl.Glyph;
+import org.primaresearch.dla.page.layout.physical.text.impl.TextLine;
+import org.primaresearch.dla.page.layout.physical.text.impl.TextRegion;
+import org.primaresearch.dla.page.layout.physical.text.impl.Word;
+import org.primaresearch.dla.page.layout.shared.GeometricObject;
+import org.primaresearch.ident.IdRegister.InvalidIdException;
+import org.primaresearch.ident.Identifiable;
+import org.primaresearch.io.xml.XmlFormatVersion;
+import org.primaresearch.io.xml.XmlModelAndValidatorProvider;
+import org.primaresearch.io.xml.XmlModelAndValidatorProvider.UnsupportedSchemaVersionException;
+import org.primaresearch.maths.geometry.Polygon;
+import org.primaresearch.shared.variable.StringValue;
+import org.primaresearch.shared.variable.Variable;
+import org.primaresearch.shared.variable.VariableMap;
+import org.xml.sax.Attributes;
+import org.xml.sax.SAXException;
+
+/**
+ * XML handler for 2013 PAGE format.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class SaxPageHandler_2013_07_15 extends SaxPageHandler {
+
+ private static DateFormat DATE_FORMAT = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
+
+
+ private Page page = null;
+ private PageLayout layout = null;
+ private MetaData metaData = null;
+
+ private GeometricObject currentGeometricObject = null;
+ private Region currentRegion = null;
+ private Stack<Region> regionStack = new Stack<Region>(); //for nesting of regions
+ private TextLine currentTextLine = null;
+ private Word currentWord = null;
+ private Glyph currentGlyph = null;
+ private TextObject currentTextObject = null;
+ private String insideElement = null;
+ private ReadingOrder readingOrder = null;
+ private Group currentLogicalGroup;
+ private StringBuffer currentText = null;
+ private XmlModelAndValidatorProvider validatorProvider;
+ private XmlFormatVersion schemaVersion;
+ private List<List<String>> tempRelations;
+ private List<String> currentRelation; //[type, custom, comments, id1, id2]
+ Map<String, ContentObject> contentObjects = new HashMap<String, ContentObject>();
+
+ public SaxPageHandler_2013_07_15(XmlModelAndValidatorProvider validatorProvider, XmlFormatVersion schemaVersion) {
+ this.validatorProvider = validatorProvider;
+ this.schemaVersion = schemaVersion;
+ }
+
+ public Page getPageObject() {
+ return page;
+ }
+
+ /**
+ * Receive notification of the start of an element.
+ *
+ * @param namespaceURI - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
+ * @param localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
+ * @param qName - The qualified name (with prefix), or the empty string if qualified names are not available.
+ * @param atts - The attributes attached to the element. If there are no attributes, it shall be an empty Attributes object.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void startElement(String namespaceURI, String localName, String qName, Attributes atts)
+ throws SAXException {
+
+ //Handle accumulated text
+ finishText();
+
+ insideElement = localName;
+
+ if (DefaultXmlNames.ELEMENT_PcGts.equals(localName)){
+ createPageObject();
+ //GtsID
+ int i;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_pcGtsId)) >= 0) {
+ try {
+ page.setGtsId(atts.getValue(i));
+ } catch (InvalidIdException e) {
+ e.printStackTrace();
+ }
+ }
+ }
+ else if (DefaultXmlNames.ELEMENT_Page.equals(localName)){
+ handlePageElement(atts);
+ }
+ else if ( DefaultXmlNames.ELEMENT_Border.equals(localName)
+ || DefaultXmlNames.ELEMENT_PrintSpace.equals(localName)) {
+ currentGeometricObject = new GeometricObjectImpl(new Polygon());
+ }
+ else if (DefaultXmlNames.ELEMENT_Coords.equals(localName)) {
+ handleCoords(atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_Baseline.equals(localName)) {
+ handleBaseline(atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_TextRegion.equals(localName)) {
+ handleRegion(atts, RegionType.TextRegion);
+ currentTextObject = (TextObject)currentRegion;
+
+ }
+ else if (DefaultXmlNames.ELEMENT_ImageRegion.equals(localName)) {
+ handleRegion(atts, RegionType.ImageRegion);
+ }
+ else if (DefaultXmlNames.ELEMENT_GraphicRegion.equals(localName)) {
+ handleRegion(atts, RegionType.GraphicRegion);
+ }
+ else if (DefaultXmlNames.ELEMENT_LineDrawingRegion.equals(localName)) {
+ handleRegion(atts, RegionType.LineDrawingRegion);
+ }
+ else if (DefaultXmlNames.ELEMENT_ChartRegion.equals(localName)) {
+ handleRegion(atts, RegionType.ChartRegion);
+ }
+ else if (DefaultXmlNames.ELEMENT_SeparatorRegion.equals(localName)) {
+ handleRegion(atts, RegionType.SeparatorRegion);
+ }
+ else if (DefaultXmlNames.ELEMENT_MathsRegion.equals(localName)) {
+ handleRegion(atts, RegionType.MathsRegion);
+ }
+ else if (DefaultXmlNames.ELEMENT_TableRegion.equals(localName)) {
+ handleRegion(atts, RegionType.TableRegion);
+ }
+ else if (DefaultXmlNames.ELEMENT_AdvertRegion.equals(localName)) {
+ handleRegion(atts, RegionType.AdvertRegion);
+ }
+ else if (DefaultXmlNames.ELEMENT_ChemRegion.equals(localName)) {
+ handleRegion(atts, RegionType.ChemRegion);
+ }
+ else if (DefaultXmlNames.ELEMENT_MusicRegion.equals(localName)) {
+ handleRegion(atts, RegionType.MusicRegion);
+ }
+ else if (DefaultXmlNames.ELEMENT_FrameRegion.equals(localName)) {
+ handleRegion(atts, RegionType.GraphicRegion);
+ Variable v = currentRegion.getAttributes().get("type");
+ if (v != null)
+ {
+ try {
+ v.setValue(new StringValue("frame"));
+ } catch (Exception e) {
+ }
+ }
+ }
+ else if (DefaultXmlNames.ELEMENT_NoiseRegion.equals(localName)) {
+ handleRegion(atts, RegionType.NoiseRegion);
+ }
+ else if (DefaultXmlNames.ELEMENT_UnknownRegion.equals(localName)) {
+ handleRegion(atts, RegionType.UnknownRegion);
+ }
+ else if (DefaultXmlNames.ELEMENT_TextLine.equals(localName)) {
+ currentTextLine = null;
+ if (currentRegion != null && currentRegion.getType() == RegionType.TextRegion)
+ currentTextLine = ((TextRegion)currentRegion).createTextLine(readId(atts));
+ currentGeometricObject = currentTextLine;
+ currentTextObject = currentTextLine;
+ contentObjects.put(currentTextLine.getId().toString(), currentTextLine);
+ handleAttributeContainer(currentTextLine, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_Word.equals(localName)) {
+ currentWord = null;
+ if (currentTextLine != null)
+ currentWord = currentTextLine.createWord(readId(atts));
+ currentGeometricObject = currentWord;
+ currentTextObject = currentWord;
+ contentObjects.put(currentWord.getId().toString(), currentWord);
+ handleAttributeContainer(currentWord, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_Glyph.equals(localName)) {
+ currentGlyph = null;
+ if (currentWord != null)
+ currentGlyph = currentWord.createGlyph(readId(atts));
+ currentGeometricObject = currentGlyph;
+ currentTextObject = currentGlyph;
+ contentObjects.put(currentGlyph.getId().toString(), currentGlyph);
+ handleAttributeContainer(currentGlyph, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_TextEquiv.equals(localName)) {
+ handleTextEquiv(atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_ReadingOrder.equals(localName)) {
+ readingOrder = layout.createReadingOrder();
+ currentLogicalGroup = null;
+ }
+ else if ( DefaultXmlNames.ELEMENT_OrderedGroup.equals(localName)
+ || DefaultXmlNames.ELEMENT_OrderedGroupIndexed.equals(localName)) {
+ Group group;
+ if (currentLogicalGroup == null) //Root group
+ group = readingOrder.getRoot();
+ else //Child group
+ {
+ try {
+ group = currentLogicalGroup.createChildGroup();
+ } catch (Exception e) {
+ e.printStackTrace();
+ return;
+ }
+ }
+ group.setOrdered(true);
+ currentLogicalGroup = group;
+ int i;
+ //Caption
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_caption)) >= 0) {
+ group.setCaption(atts.getValue(i));
+ }
+ parseId(group, atts);
+ }
+ else if ( DefaultXmlNames.ELEMENT_UnorderedGroup.equals(localName)
+ || DefaultXmlNames.ELEMENT_UnorderedGroupIndexed.equals(localName)) {
+ Group group;
+ if (currentLogicalGroup == null) //Root group
+ group = readingOrder.getRoot();
+ else //Child group
+ {
+ try {
+ group = currentLogicalGroup.createChildGroup();
+ } catch (Exception e) {
+ e.printStackTrace();
+ return;
+ }
+ }
+ group.setOrdered(false);
+ currentLogicalGroup = group;
+ int i;
+ //Caption
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_caption)) >= 0) {
+ group.setCaption(atts.getValue(i));
+ }
+ parseId(group, atts);
+ }
+ else if ( DefaultXmlNames.ELEMENT_RegionRef.equals(localName)
+ || DefaultXmlNames.ELEMENT_RegionRefIndexed.equals(localName)) {
+
+ int i;
+ if (currentRelation != null) {
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_regionRef)) >= 0) {
+ currentRelation.add(atts.getValue(i));
+ }
+ }
+ else if (currentLogicalGroup != null) {
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_regionRef)) >= 0) {
+ currentLogicalGroup.addRegionRef(atts.getValue(i));
+ }
+ }
+ }
+ else if (DefaultXmlNames.ELEMENT_Layers.equals(localName)) {
+ layout.createLayers();
+ currentLogicalGroup = null;
+ }
+ else if (DefaultXmlNames.ELEMENT_Layer.equals(localName)) {
+ Layer layer = layout.getLayers().createLayer();
+ currentLogicalGroup = layer;
+ int i;
+ //Z-Index
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_zIndex)) >= 0) {
+ layer.setZIndex(Integer.valueOf(atts.getValue(i)));
+ }
+ //Caption
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_caption)) >= 0) {
+ layer.setCaption(atts.getValue(i));
+ }
+ parseId(layer, atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_AlternativeImage.equals(localName)) {
+ handleAlternativeImage(atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_Relation.equals(localName)) {
+ handleRelationStart(atts);
+ }
+ else if (DefaultXmlNames.ELEMENT_TextStyle.equals(localName)) {
+ parseTextStyle((ContentObject)currentTextObject, atts);
+ }
+ }
+
+ /**
+ * Receive notification of the end of an element.
+ *
+ * @param namespaceURI - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
+ * @param localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
+ * @param qName - The qualified name (with prefix), or the empty string if qualified names are not available.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void endElement(String namespaceURI, String localName, String qName)
+ throws SAXException {
+
+ //Handle accumulated text
+ finishText();
+
+ insideElement = null;
+
+ if (DefaultXmlNames.ELEMENT_Page.equals(localName)) {
+ finaliseRelations();
+ }
+ else if (DefaultXmlNames.ELEMENT_Border.equals(localName)) {
+ layout.setBorder(currentGeometricObject);
+ currentGeometricObject = null;
+ }
+ else if (DefaultXmlNames.ELEMENT_PrintSpace.equals(localName)) {
+ layout.setPrintSpace(currentGeometricObject);
+ currentGeometricObject = null;
+ }
+ else if ( DefaultXmlNames.ELEMENT_TextRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_ImageRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_GraphicRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_LineDrawingRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_ChartRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_SeparatorRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_MathsRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_TableRegion.equals(localName)
+ //|| DefaultXmlNames.ELEMENT_FrameRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_NoiseRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_UnknownRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_AdvertRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_ChemRegion.equals(localName)
+ || DefaultXmlNames.ELEMENT_MusicRegion.equals(localName)
+ ) {
+ handleRegionEnd();
+ }
+ else if ( DefaultXmlNames.ELEMENT_TextLine.equals(localName)) {
+ currentTextLine = null;
+ currentGeometricObject = currentRegion; //Set to parent
+ currentTextObject = (TextObject)currentRegion;
+ }
+ else if ( DefaultXmlNames.ELEMENT_Word.equals(localName)) {
+ currentWord = null;
+ currentGeometricObject = currentTextLine; //Set to parent
+ currentTextObject = currentTextLine;
+ }
+ else if ( DefaultXmlNames.ELEMENT_Glyph.equals(localName)) {
+ currentGlyph = null;
+ currentGeometricObject = currentWord; //Set to parent
+ currentTextObject = currentWord;
+ }
+ else if (DefaultXmlNames.ELEMENT_ReadingOrder.equals(localName)) {
+ currentLogicalGroup = null;
+ readingOrder = null;
+ }
+ else if ( DefaultXmlNames.ELEMENT_OrderedGroup.equals(localName)
+ || DefaultXmlNames.ELEMENT_OrderedGroupIndexed.equals(localName)) {
+ currentLogicalGroup = currentLogicalGroup.getParent();
+ }
+ else if ( DefaultXmlNames.ELEMENT_UnorderedGroup.equals(localName)
+ || DefaultXmlNames.ELEMENT_UnorderedGroupIndexed.equals(localName)) {
+ currentLogicalGroup = currentLogicalGroup.getParent();
+ }
+ else if (DefaultXmlNames.ELEMENT_Layer.equals(localName)) {
+ currentLogicalGroup = null;
+ }
+ else if (DefaultXmlNames.ELEMENT_Relation.equals(localName)) {
+ currentRelation = null;
+ }
+ }
+
+ /**
+ * Receive notification of character data inside an element.
+ * @param ch - The characters.
+ * @param start - The start position in the character array.
+ * @param length - The number of characters to use from the character array.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void characters(char[] ch, int start, int length)
+ throws SAXException {
+
+ String strValue = new String(ch, start, length);
+
+ //Text might be parsed bit by bit, so we have to accumulate until a closing tag is found.
+ if (currentText == null)
+ currentText = new StringBuffer();
+ currentText.append(strValue);
+ }
+
+ /**
+ * Writes accumulated text to the right object.
+ */
+ private void finishText() {
+ if (currentText != null) {
+ String strValue = currentText.toString();
+
+ if (currentTextObject != null) {
+ if (DefaultXmlNames.ELEMENT_Unicode.equals(insideElement)) {
+ currentTextObject.setText(strValue);
+ }
+ else if (DefaultXmlNames.ELEMENT_PlainText.equals(insideElement)) {
+ currentTextObject.setPlainText(strValue);
+ }
+ }
+ if (metaData != null) {
+ if (DefaultXmlNames.ELEMENT_Creator.equals(insideElement)) {
+ metaData.setCreator(strValue);
+ }
+ else if (DefaultXmlNames.ELEMENT_Comments.equals(insideElement)) {
+ metaData.setComments(strValue);
+ }
+ else if (DefaultXmlNames.ELEMENT_Created.equals(insideElement)) {
+ metaData.setCreationTime(parseDate(strValue));
+ }
+ else if (DefaultXmlNames.ELEMENT_LastChange.equals(insideElement)) {
+ metaData.setLastModifiedTime(parseDate(strValue));
+ }
+ }
+
+ currentText = null;
+ }
+ }
+
+ private void createPageObject() {
+ if (validatorProvider != null && schemaVersion != null) {
+ try {
+ page = new Page(validatorProvider.getSchemaParser(schemaVersion));
+ //page.setFormatVersion(schemaVersion);
+ } catch (UnsupportedSchemaVersionException e) {
+ e.printStackTrace();
+ page = new Page();
+ }
+ }
+ else
+ page = new Page();
+
+ layout = page.getLayout();
+ metaData = page.getMetaData();
+ }
+
+ /**
+ * Reads the attributes of the Page element.
+ */
+ private void handlePageElement(Attributes atts) {
+ int i;
+
+ //Size
+ int width = 0;
+ int height = 0;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_imageWidth)) >= 0) {
+ width = Integer.parseInt(atts.getValue(i));
+ }
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_imageHeight)) >= 0) {
+ height = Integer.parseInt(atts.getValue(i));
+ }
+ page.getLayout().setSize(width, height);
+
+ //Image filename
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_imageFilename)) >= 0) {
+ page.setImageFilename(atts.getValue(i));
+ }
+
+ //Other attributes (page type, ...)
+ handleAttributeContainer(page, atts);
+ }
+
+ /**
+ * Adds an alternative image to the list of images of the page object.
+ */
+ private void handleAlternativeImage(Attributes atts) {
+ if (page.getAlternativeImages() == null)
+ return;
+
+ AlternativeImage img = null;
+ int i;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_filename)) >= 0) {
+ img = new AlternativeImage(atts.getValue(i));
+ page.getAlternativeImages().add(img);
+ }
+ else
+ return;
+
+ //Comments
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_comments)) >= 0) {
+ img.setComments(atts.getValue(i));
+ }
+ }
+
+ /**
+ * Handles attributes of the TextEquiv element.
+ */
+ private void handleTextEquiv(Attributes atts) {
+ if (currentTextObject == null)
+ return;
+
+ int i;
+
+ //OCR confidence
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_conf)) >= 0) {
+ Double confidence = Double.valueOf(atts.getValue(i));
+ currentTextObject.setConfidence(confidence);
+ }
+ }
+
+ /**
+ * Reads the attributes of a content object.
+ */
+ private void handleAttributeContainer(AttributeContainer obj, Attributes atts) {
+
+ //Id
+ //parseId(obj, atts);
+
+ //Attributes
+ VariableMap map = obj.getAttributes();
+ int p;
+ for (int i=0; i= 0) {
+ var.parseValue(atts.getValue(p));
+ }
+ }
+ }
+
+ /**
+ * Reads text style specific attributes
+ */
+ private void parseTextStyle(ContentObject obj, Attributes atts) {
+ if (obj == null)
+ return;
+ VariableMap objectAttrs = obj.getAttributes();
+ if (objectAttrs != null) {
+ int p;
+ for (int i=0; i= 0) {
+ var.parseValue(atts.getValue(p));
+ }
+ }
+ }
+ }
+
+ private void handleCoords(Attributes atts) {
+ if (currentGeometricObject != null) {
+ Polygon polygon = new Polygon();
+ currentGeometricObject.setCoords(polygon);
+ handlePointsAttribute(polygon, atts);
+ }
+ }
+
+ private void handleBaseline(Attributes atts) {
+ if (currentTextLine != null) {
+ Polygon baseline = new Polygon();
+ currentTextLine.setBaseline(baseline);
+ handlePointsAttribute(baseline, atts);
+ }
+ }
+
+ private void handleRelationStart(Attributes atts) {
+ if (tempRelations == null)
+ tempRelations = new ArrayList<List<String>>();
+
+ int i;
+ //Type
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_type)) >= 0) {
+ List<String> rel = new ArrayList<String>();
+ tempRelations.add(rel);
+ currentRelation = rel;
+
+ rel.add(atts.getValue(i));
+
+ //Custom
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_custom)) >= 0) {
+ currentRelation.add(atts.getValue(i));
+ } else
+ currentRelation.add("");
+ //Comments
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_comments)) >= 0) {
+ currentRelation.add(atts.getValue(i));
+ } else
+ currentRelation.add("");
+ }
+ }
+
+ /**
+ * Translates the temporary relations data structure to the proper one.
+ */
+ private void finaliseRelations() {
+ if (tempRelations == null)
+ return;
+
+ PageLayout layout = page.getLayout();
+ Relations relations = layout.getRelations();
+ if (relations == null)
+ return;
+
+ for (int i=0; i<tempRelations.size(); i++) {
+ List<String> rel = tempRelations.get(i);
+ if (rel != null && rel.size() == 5) {
+ RelationType type = "link".equals(rel.get(0)) ? RelationType.Link : RelationType.Join;
+ String custom = rel.get(1);
+ String comments = rel.get(2);
+ String id1 = rel.get(3);
+ String id2 = rel.get(4);
+
+ ContentObject obj1 = contentObjects.get(id1);
+ ContentObject obj2 = contentObjects.get(id2);
+
+ if (obj1 != null && obj2 != null) {
+ ContentObjectRelation relation = new ContentObjectRelation(obj1, obj2, type);
+ relation.setCustomField(custom);
+ relation.setComments(comments);
+ relations.addRelation(relation);
+ }
+ }
+ }
+ }
+
+
+ private void handlePointsAttribute(Polygon polygon, Attributes atts) {
+ //Points
+ int i;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_points)) >= 0) {
+ String pointList = atts.getValue(i);
+
+ //Split using space
+ String[] pointStrings = pointList.split(" ");
+
+ for (i = 0; i= 0)
+ return atts.getValue(i);
+ return "";
+ }
+
+ /**
+ * Reads the ID attribute and sets it in the Identifiable object.
+ */
+ private void parseId(Identifiable ident, Attributes atts) {
+ int i;
+ if ((i = atts.getIndex(DefaultXmlNames.ATTR_id)) >= 0) {
+ try {
+ ident.setId(atts.getValue(i));
+ } catch (InvalidIdException e) {
+ //TODO Manage ID conflicts
+ e.printStackTrace();
+ }
+ }
+ }
+
+ private void handleRegion(Attributes atts, RegionType type) {
+ currentRegion = layout.createRegion(type, readId(atts), (RegionContainer)currentRegion); //Either adds it to the page (if currentRegion is null) or to the current region (as nested region)
+ regionStack.push(currentRegion);
+ currentGeometricObject = currentRegion;
+ contentObjects.put(currentRegion.getId().toString(), currentRegion);
+ handleAttributeContainer(currentRegion, atts);
+ }
+
+ private void handleRegionEnd() {
+ regionStack.pop();
+ if (!regionStack.isEmpty()) {
+ currentRegion = regionStack.lastElement();
+ if (currentRegion instanceof TextObject)
+ currentTextObject = (TextObject)currentRegion;
+ currentGeometricObject = currentRegion;
+ }
+ else {
+ currentRegion = null;
+ currentGeometricObject = null;
+ currentTextObject = null;
+ }
+ }
+
+
+}
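Note on the text handling in the handler above: SAX may deliver the character data of one element in several `characters` calls, so the handler buffers the text and writes it out at the next element boundary. The following is a minimal standalone sketch of that pattern using only the JDK SAX API. The class name, the `collect` helper, and the literal element name `Unicode` are assumptions for illustration; the value of `DefaultXmlNames.ELEMENT_Unicode` is not shown in this diff.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextAccumulationSketch {

    /** Collects the text of every element whose local name is "Unicode" (assumed name). */
    static final class UnicodeCollector extends DefaultHandler {
        final List<String> texts = new ArrayList<>();
        private StringBuilder currentText;
        private String insideElement;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            flush(); // write out anything buffered before the new element starts
            insideElement = localName;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flush(); // the closing tag ends the buffered text
            insideElement = null;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so append rather than assign
            if (currentText == null)
                currentText = new StringBuilder();
            currentText.append(ch, start, length);
        }

        private void flush() {
            if (currentText == null)
                return;
            if ("Unicode".equals(insideElement))
                texts.add(currentText.toString());
            currentText = null;
        }
    }

    static List<String> collect(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so that localName is populated
        UnicodeCollector handler = new UnicodeCollector();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.texts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<TextEquiv><Unicode>Tom &amp; Jerry</Unicode></TextEquiv>";
        System.out.println(collect(xml));
    }
}
```

An entity reference such as `&amp;` is a typical place where a parser splits the text into several chunks, which is why the buffer is only emptied at element boundaries.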
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_AbbyyFineReader10.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_AbbyyFineReader10.java
new file mode 100644
index 00000000..2144f42e
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_AbbyyFineReader10.java
@@ -0,0 +1,401 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml.sax;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.List;
+
+import org.primaresearch.dla.page.MetaData;
+import org.primaresearch.dla.page.Page;
+import org.primaresearch.dla.page.layout.PageLayout;
+import org.primaresearch.dla.page.layout.physical.Region;
+import org.primaresearch.dla.page.layout.physical.shared.RegionType;
+import org.primaresearch.dla.page.layout.shared.GeometricObject;
+import org.primaresearch.io.xml.XmlFormatVersion;
+import org.primaresearch.io.xml.XmlModelAndValidatorProvider;
+import org.primaresearch.io.xml.XmlModelAndValidatorProvider.UnsupportedSchemaVersionException;
+import org.primaresearch.maths.geometry.Polygon;
+import org.primaresearch.maths.geometry.Rect;
+import org.xml.sax.Attributes;
+import org.xml.sax.SAXException;
+
+/**
+ * Experimental SAX XML handler to read Abbyy FineReader 10 XML files.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class SaxPageHandler_AbbyyFineReader10 extends SaxPageHandler {
+
+ private static final String ELEMENT_document = "document";
+ private static final String ELEMENT_page = "page";
+ private static final String ELEMENT_region = "region";
+ private static final String ELEMENT_rect = "rect";
+ private static final String ELEMENT_block = "block";
+
+ private static final String ATTR_producer = "producer";
+ private static final String ATTR_width = "width";
+ private static final String ATTR_height = "height";
+ private static final String ATTR_originalCoords = "originalCoords";
+ private static final String ATTR_l = "l";
+ private static final String ATTR_t = "t";
+ private static final String ATTR_r = "r";
+ private static final String ATTR_b = "b";
+ private static final String ATTR_blockType = "blockType";
+
+ private static Comparator<Rect> rectComparator = new SortRectsVertically();
+
+ private Page page;
+ private PageLayout layout = null;
+ private MetaData metaData = null;
+
+ @SuppressWarnings("unused")
+ private String insideElement = null;
+ private List<Rect> currentRects = null;
+ private GeometricObject currentGeometricObject = null;
+ private Region currentRegion = null;
+
+ private XmlModelAndValidatorProvider validatorProvider;
+ private XmlFormatVersion schemaVersion;
+
+ public SaxPageHandler_AbbyyFineReader10(XmlModelAndValidatorProvider validatorProvider, XmlFormatVersion schemaVersion) {
+ this.validatorProvider = validatorProvider;
+ this.schemaVersion = schemaVersion;
+ }
+
+ @Override
+ public Page getPageObject() {
+ return page;
+ }
+
+ /**
+ * Receive notification of the start of an element.
+ *
+ * @param namespaceURI - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
+ * @param localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
+ * @param qName - The qualified name (with prefix), or the empty string if qualified names are not available.
+ * @param atts - The attributes attached to the element. If there are no attributes, it shall be an empty Attributes object.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void startElement(String namespaceURI, String localName, String qName, Attributes atts)
+ throws SAXException {
+
+ //Handle accumulated text
+ finishText();
+
+ insideElement = localName;
+
+ if (ELEMENT_document.equals(localName)){
+ createPageObject();
+ //Producer
+ int i;
+ if ((i = atts.getIndex(ATTR_producer)) >= 0) {
+ if (metaData != null)
+ metaData.setCreator(atts.getValue(i));
+ }
+ }
+ else if (ELEMENT_page.equals(localName)){
+ handlePageElement(atts);
+ }
+ else if (ELEMENT_region.equals(localName)) {
+ if (currentGeometricObject != null)
+ currentGeometricObject.setCoords(new Polygon());
+ currentRects = new ArrayList<Rect>();
+ }
+ else if (ELEMENT_rect.equals(localName)) {
+ handleRegionRect(atts);
+ }
+ else if (ELEMENT_block.equals(localName)) {
+ currentRegion = createRegion(atts);
+ currentGeometricObject = currentRegion;
+ }
+ /*else if (ELEMENT_TextLine.equals(localName)) {
+ currentTextLine = null;
+ if (currentRegion != null && currentRegion.getType() == RegionType.TextRegion)
+ currentTextLine = ((TextRegion)currentRegion).createTextLine(readId(atts));
+ currentGeometricObject = currentTextLine;
+ currentTextObject = currentTextLine;
+ handleContentObject(currentTextLine, atts);
+ }
+ else if (ELEMENT_Word.equals(localName)) {
+ currentWord = null;
+ if (currentTextLine != null)
+ currentWord = currentTextLine.createWord(readId(atts));
+ currentGeometricObject = currentWord;
+ currentTextObject = currentWord;
+ handleContentObject(currentWord, atts);
+ }
+ else if (ELEMENT_Glyph.equals(localName)) {
+ currentGlyph = null;
+ if (currentWord != null)
+ currentGlyph = currentWord.createGlyph(readId(atts));
+ currentGeometricObject = currentGlyph;
+ currentTextObject = currentGlyph;
+ handleContentObject(currentGlyph, atts);
+ }*/
+ }
+
+ /**
+ * Receive notification of the end of an element.
+ *
+ * @param namespaceURI - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
+ * @param localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
+ * @param qName - The qualified name (with prefix), or the empty string if qualified names are not available.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void endElement(String namespaceURI, String localName, String qName)
+ throws SAXException {
+
+ //Handle accumulated text
+ finishText();
+
+ insideElement = null;
+
+ if ( ELEMENT_block.equals(localName)) {
+ if (currentRects != null && !currentRects.isEmpty() && currentGeometricObject != null) {
+ Polygon polygon = convertToPolygon(currentRects);
+ currentGeometricObject.setCoords(polygon);
+ }
+ currentRegion = null;
+ currentGeometricObject = null;
+ }
+ /*else if ( ELEMENT_TextLine.equals(localName)) {
+ currentTextLine = null;
+ currentGeometricObject = currentRegion; //Set to parent
+ currentTextObject = (TextObject)currentRegion;
+ }
+ else if ( ELEMENT_Word.equals(localName)) {
+ currentWord = null;
+ currentGeometricObject = currentTextLine; //Set to parent
+ currentTextObject = currentTextLine;
+ }
+ else if ( ELEMENT_Glyph.equals(localName)) {
+ currentGlyph = null;
+ currentGeometricObject = currentWord; //Set to parent
+ currentTextObject = currentWord;
+ }*/
+ }
+
+ /**
+ * Receive notification of character data inside an element.
+ * @param ch - The characters.
+ * @param start - The start position in the character array.
+ * @param length - The number of characters to use from the character array.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void characters(char[] ch, int start, int length)
+ throws SAXException {
+
+ //String strValue = new String(ch, start, length);
+
+ //Text might be parsed bit by bit, so we have to accumulate until a closing tag is found.
+ //if (currentText == null)
+ // currentText = new StringBuffer();
+ //currentText.append(strValue);
+ }
+
+ /**
+ * Writes accumulated text to the right object.
+ */
+ private void finishText() {
+ /*if (currentText != null) {
+ String strValue = currentText.toString();
+
+ if (currentTextObject != null) {
+ if (ELEMENT_Unicode.equals(insideElement)) {
+ currentTextObject.setText(strValue);
+ }
+ else if (ELEMENT_PlainText.equals(insideElement)) {
+ currentTextObject.setPlainText(strValue);
+ }
+ }
+ if (metaData != null) {
+ if (ELEMENT_Creator.equals(insideElement)) {
+ metaData.setCreator(strValue);
+ }
+ else if (ELEMENT_Comments.equals(insideElement)) {
+ metaData.setComments(strValue);
+ }
+ else if (ELEMENT_Created.equals(insideElement)) {
+ metaData.setCreationTime(parseDate(strValue));
+ }
+ else if (ELEMENT_LastChange.equals(insideElement)) {
+ metaData.setLastModifiedTime(parseDate(strValue));
+ }
+ }
+
+ currentText = null;
+ }*/
+ }
+
+ private void createPageObject() {
+ if (validatorProvider != null && schemaVersion != null) {
+ try {
+ page = new Page(validatorProvider.getSchemaParser(validatorProvider.getLatestSchemaVersion()));
+ } catch (UnsupportedSchemaVersionException e) {
+ e.printStackTrace();
+ page = new Page();
+ }
+ }
+ else
+ page = new Page();
+
+ layout = page.getLayout();
+ metaData = page.getMetaData();
+ }
+
+ /**
+ * Reads the attributes of the Page element.
+ */
+ private void handlePageElement(Attributes atts) {
+ int i;
+
+ //Size
+ int width = 0;
+ int height = 0;
+ if ((i = atts.getIndex(ATTR_width)) >= 0) {
+ width = Integer.parseInt(atts.getValue(i));
+ }
+ if ((i = atts.getIndex(ATTR_height)) >= 0) {
+ height = Integer.parseInt(atts.getValue(i));
+ }
+ page.getLayout().setSize(width, height);
+
+ //Original coords (1==coords relative to original image, otherwise coords relative to deskewed image)
+ if ((i = atts.getIndex(ATTR_originalCoords)) >= 0) {
+ if (metaData != null) {
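+ String comments = metaData.getComments();
+ //Append to existing comments, if any, on a new line
+ comments = (comments == null) ? "" : comments + "\n";
+ comments += "Original coords: "+(Integer.parseInt(atts.getValue(i)) == 1 ? "true" : "false");
+ //Store the result; without this the extended comment is discarded
+ metaData.setComments(comments);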
+ }
+ }
+ }
+
+ /**
+ * Reads the coordinates of a single rectangle.
+ */
+ private void handleRegionRect(Attributes atts) {
+ if (currentRects == null)
+ return;
+
+
+ int i, l=0, t=0, r=0, b=0;
+ if ((i = atts.getIndex(ATTR_l)) >= 0) {
+ l = Integer.parseInt(atts.getValue(i));
+ }
+ if ((i = atts.getIndex(ATTR_t)) >= 0) {
+ t = Integer.parseInt(atts.getValue(i));
+ }
+ if ((i = atts.getIndex(ATTR_r)) >= 0) {
+ r = Integer.parseInt(atts.getValue(i));
+ }
+ if ((i = atts.getIndex(ATTR_b)) >= 0) {
+ b = Integer.parseInt(atts.getValue(i));
+ }
+ currentRects.add(new Rect(l, t, r, b));
+ }
+
+
+ //private String getXmlAttributeName(String name) {
+ // return name; //TODO Should there be a mechanism to translate attribute names to XML names?
+ //}
+
+ private Region createRegion(Attributes atts) {
+ String abbyyType = null;
+ int i;
+ if ((i = atts.getIndex(ATTR_blockType)) >= 0) {
+ abbyyType = atts.getValue(i);
+ }
+
+ RegionType primaType = RegionType.UnknownRegion;
+ if ("Text".equals(abbyyType))
+ primaType = RegionType.TextRegion;
+ else if ("Table".equals(abbyyType))
+ primaType = RegionType.TableRegion;
+ else if ("Picture".equals(abbyyType))
+ primaType = RegionType.ImageRegion;
+ else if ("Barcode".equals(abbyyType))
+ primaType = RegionType.GraphicRegion;
+ else if ("Separator".equals(abbyyType))
+ primaType = RegionType.SeparatorRegion;
+ else if ("SeparatorsBox".equals(abbyyType))
+ primaType = RegionType.SeparatorRegion;
+
+ return layout.createRegion(primaType);
+ }
+
+ /**
+ * Converts a stack of rectangles to a polygon.
+ */
+ private Polygon convertToPolygon(List<Rect> rects) {
+
+ if (rects.isEmpty())
+ return null;
+
+ Polygon polygon = new Polygon();
+
+ //One rectangle
+ if (rects.size() == 1) {
+ Rect rect = rects.get(0);
+ polygon.addPoint(rect.left, rect.top);
+ polygon.addPoint(rect.right, rect.top);
+ polygon.addPoint(rect.right, rect.bottom);
+ polygon.addPoint(rect.left, rect.bottom);
+ }
+ //Multiple rectangles
+ else {
+ //Sort rects vertically
+ Collections.sort(rects, rectComparator);
+
+ //Create polygon
+ // Right sides
+ Rect rect;
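+ for (int i=0; i<rects.size(); i++) {
+ rect = rects.get(i);
+ polygon.addPoint(rect.right, rect.top);
+ polygon.addPoint(rect.right, rect.bottom);
+ }
+ // Left sides
+ for (int i=rects.size()-1; i>=0; i--) {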
+ rect = rects.get(i);
+ polygon.addPoint(rect.left, rect.bottom);
+ polygon.addPoint(rect.left, rect.top);
+ }
+ }
+
+ return polygon;
+ }
+
+ private static class SortRectsVertically implements Comparator<Rect> {
+ @Override
+ public int compare(Rect rect1, Rect rect2) {
+ int center1 = (rect1.top + rect1.bottom) / 2;
+ int center2 = (rect2.top + rect2.bottom) / 2;
+ if (center1 < center2)
+ return -1;
+ if (center1 > center2)
+ return 1;
+ return 0;
+ }
+ }
+
+
+}
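Note on `convertToPolygon` in the handler above: the outline of a vertical stack of rectangles can be traced by walking down the right-hand edges from top to bottom and then back up the left-hand edges. The following is a small standalone sketch of that traversal. The `Box` class and the `outline` method are hypothetical stand-ins for the library's `Rect` and `Polygon`, whose full APIs are not shown in this diff.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RectOutlineSketch {

    /** Hypothetical stand-in for the library's Rect type. */
    static final class Box {
        final int left, top, right, bottom;

        Box(int left, int top, int right, int bottom) {
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }
    }

    /** Returns the outline as {x, y} pairs; the input list is not modified. */
    static List<int[]> outline(List<Box> boxes) {
        // Sort by vertical centre so the boxes are visited top to bottom
        List<Box> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingInt((Box b) -> (b.top + b.bottom) / 2));

        List<int[]> points = new ArrayList<>();
        // Down the right-hand sides
        for (Box b : sorted) {
            points.add(new int[] {b.right, b.top});
            points.add(new int[] {b.right, b.bottom});
        }
        // Back up the left-hand sides
        for (int i = sorted.size() - 1; i >= 0; i--) {
            Box b = sorted.get(i);
            points.add(new int[] {b.left, b.bottom});
            points.add(new int[] {b.left, b.top});
        }
        return points;
    }

    public static void main(String[] args) {
        List<Box> boxes = new ArrayList<>();
        boxes.add(new Box(0, 10, 50, 20)); // lower, wider box
        boxes.add(new Box(10, 0, 40, 10)); // upper, narrower box
        for (int[] p : outline(boxes))
            System.out.println(p[0] + "," + p[1]);
    }
}
```

Each box contributes two points per side, so n boxes yield 4n outline points. Like the handler, the sketch assumes the rectangles do not overlap vertically in a way that would make the outline self-intersect.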
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_Alto_2_1.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_Alto_2_1.java
new file mode 100644
index 00000000..288914cb
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_Alto_2_1.java
@@ -0,0 +1,1064 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml.sax;
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.primaresearch.dla.page.MetaData;
+import org.primaresearch.dla.page.Page;
+import org.primaresearch.dla.page.Page.MeasurementUnit;
+import org.primaresearch.dla.page.layout.GeometricObjectImpl;
+import org.primaresearch.dla.page.layout.PageLayout;
+import org.primaresearch.dla.page.layout.logical.Group;
+import org.primaresearch.dla.page.layout.logical.ReadingOrder;
+import org.primaresearch.dla.page.layout.physical.ContentIterator;
+import org.primaresearch.dla.page.layout.physical.Region;
+import org.primaresearch.dla.page.layout.physical.impl.ImageRegion;
+import org.primaresearch.dla.page.layout.physical.impl.SeparatorRegion;
+import org.primaresearch.dla.page.layout.physical.shared.RegionType;
+import org.primaresearch.dla.page.layout.physical.text.impl.TextLine;
+import org.primaresearch.dla.page.layout.physical.text.impl.TextRegion;
+import org.primaresearch.dla.page.layout.physical.text.impl.Word;
+import org.primaresearch.ident.IdRegister.InvalidIdException;
+import org.primaresearch.io.xml.XmlFormatVersion;
+import org.primaresearch.io.xml.XmlModelAndValidatorProvider;
+import org.primaresearch.io.xml.XmlModelAndValidatorProvider.UnsupportedSchemaVersionException;
+import org.primaresearch.maths.geometry.Polygon;
+import org.xml.sax.Attributes;
+import org.xml.sax.SAXException;
+
+/**
+ * Experimental SAX XML handler to read ALTO XML files.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class SaxPageHandler_Alto_2_1 extends SaxPageHandler {
+
+ private static final String ELEMENT_alto = "alto";
+ //private static final String ELEMENT_Description = "Description";
+ //private static final String ELEMENT_Layout = "Layout";
+ private static final String ELEMENT_MeasurementUnit = "MeasurementUnit";
+ //private static final String ELEMENT_sourceImageInformation = "sourceImageInformation";
+ private static final String ELEMENT_fileName = "fileName";
+ private static final String ELEMENT_OCRProcessing = "OCRProcessing";
+ private static final String ELEMENT_preProcessingStep = "preProcessingStep";
+ private static final String ELEMENT_ocrProcessingStep = "ocrProcessingStep";
+ private static final String ELEMENT_postProcessingStep = "postProcessingStep";
+ private static final String ELEMENT_processingDateTime = "processingDateTime";
+ private static final String ELEMENT_processingAgency = "processingAgency";
+ private static final String ELEMENT_processingStepDescription = "processingStepDescription";
+ private static final String ELEMENT_processingStepSettings = "processingStepSettings";
+ private static final String ELEMENT_processingSoftware = "processingSoftware";
+ private static final String ELEMENT_softwareName = "softwareName";
+ private static final String ELEMENT_softwareCreator = "softwareCreator";
+ private static final String ELEMENT_softwareVersion = "softwareVersion";
+ private static final String ELEMENT_applicationDescription = "applicationDescription";
+ private static final String ELEMENT_Page = "Page";
+ private static final String ELEMENT_TopMargin = "TopMargin";
+ private static final String ELEMENT_LeftMargin = "LeftMargin";
+ private static final String ELEMENT_RightMargin = "RightMargin";
+ private static final String ELEMENT_BottomMargin = "BottomMargin";
+ private static final String ELEMENT_PrintSpace = "PrintSpace";
+ private static final String ELEMENT_TextBlock = "TextBlock";
+ private static final String ELEMENT_Illustration = "Illustration";
+ private static final String ELEMENT_GraphicalElement = "GraphicalElement";
+ private static final String ELEMENT_ComposedBlock = "ComposedBlock";
+ //private static final String ELEMENT_Shape = "Shape";
+ private static final String ELEMENT_Polygon = "Polygon";
+ private static final String ELEMENT_Ellipse = "Ellipse";
+ private static final String ELEMENT_Circle = "Circle";
+ private static final String ELEMENT_TextLine = "TextLine";
+ private static final String ELEMENT_String = "String";
+ private static final String ELEMENT_HYP = "HYP";
+
+ private static final String ATTR_ID = "ID";
+ private static final String ATTR_IDNEXT = "IDNEXT";
+ private static final String ATTR_PAGECLASS = "PAGECLASS";
+ private static final String ATTR_HEIGHT = "HEIGHT";
+ private static final String ATTR_WIDTH = "WIDTH";
+ private static final String ATTR_PHYSICAL_IMG_NR = "PHYSICAL_IMG_NR";
+ private static final String ATTR_PRINTED_IMG_NR = "PRINTED_IMG_NR";
+ private static final String ATTR_QUALITY = "QUALITY";
+ private static final String ATTR_QUALITY_DETAIL = "QUALITY_DETAIL";
+ private static final String ATTR_POSITION = "POSITION";
+ private static final String ATTR_PROCESSING = "PROCESSING";
+ private static final String ATTR_ACCURACY = "ACCURACY";
+ private static final String ATTR_PC = "PC";
+ private static final String ATTR_HPOS = "HPOS";
+ private static final String ATTR_VPOS = "VPOS";
+ private static final String ATTR_ROTATION = "ROTATION";
+ private static final String ATTR_POINTS = "POINTS";
+ private static final String ATTR_RADIUS = "RADIUS";
+ private static final String ATTR_HLENGTH = "HLENGTH";
+ private static final String ATTR_VLENGTH = "VLENGTH";
+ private static final String ATTR_CONTENT = "CONTENT";
+ private static final String ATTR_LANG = "LANG";
+
+
+ private Page page;
+ private PageLayout layout = null;
+ private MetaData metaData = null;
+ private String comments;
+ private StringBuffer currentTextBuffer;
+ private String currentText;
+ private MeasurementUnit measurementUnit = MeasurementUnit.MM_BY_10; //Default
+ private boolean firstPageDone = false;
+ private boolean hasMargins = false;
+ private Region currentRegion = null;
+ private TextLine currentTextLine = null;
+ private Word currentWord = null;
+ private Word lastWord = null;
+ private Map<String, List<String>> idPartialReadingOrderMap = new HashMap<String, List<String>>();
+ private List<List<String>> partialReadingOrder = new ArrayList<List<String>>();
+
+ private XmlModelAndValidatorProvider validatorProvider;
+ private XmlFormatVersion schemaVersion;
+
+
+ public SaxPageHandler_Alto_2_1(XmlModelAndValidatorProvider validatorProvider, XmlFormatVersion schemaVersion) {
+ this.validatorProvider = validatorProvider;
+ this.schemaVersion = schemaVersion;
+ }
+
+
+ @Override
+ public Page getPageObject() {
+ return page;
+ }
+
+ private void createPageObject() {
+ if (validatorProvider != null && schemaVersion != null) {
+ try {
+ page = new Page(validatorProvider.getSchemaParser(validatorProvider.getLatestSchemaVersion()));
+ } catch (UnsupportedSchemaVersionException e) {
+ e.printStackTrace();
+ page = new Page();
+ }
+ }
+ else
+ page = new Page();
+
+ layout = page.getLayout();
+ metaData = page.getMetaData();
+ comments = "Converted from ALTO";
+ page.setMeasurementUnit(measurementUnit);
+ }
+
+ /**
+ * Receive notification of the start of an element.
+ *
+ * @param namespaceURI - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
+ * @param localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
+ * @param qName - The qualified name (with prefix), or the empty string if qualified names are not available.
+ * @param atts - The attributes attached to the element. If there are no attributes, it shall be an empty Attributes object.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void startElement(String namespaceURI, String localName, String qName, Attributes atts)
+ throws SAXException {
+
+ if (firstPageDone) //No multi-page support
+ return;
+
+ //Handle accumulated text
+ finishText();
+
+ if (ELEMENT_alto.equals(localName)){
+ createPageObject();
+ }
+ else if (ELEMENT_OCRProcessing.equals(localName)){
+ comments += "\nOCR Processing Information";
+ }
+ else if (ELEMENT_preProcessingStep.equals(localName)){
+ comments += "\nPreprocessing:";
+ }
+ else if (ELEMENT_ocrProcessingStep.equals(localName)){
+ comments += "\nOCR:";
+ }
+ else if (ELEMENT_postProcessingStep.equals(localName)){
+ comments += "\nPostprocessing:";
+ }
+ else if (ELEMENT_Page.equals(localName)){
+ handlePageNode(atts);
+ }
+ else if (ELEMENT_TopMargin.equals(localName)
+ || ELEMENT_LeftMargin.equals(localName)
+ || ELEMENT_BottomMargin.equals(localName)
+ || ELEMENT_RightMargin.equals(localName)){
+ hasMargins = true;
+ }
+ else if (ELEMENT_PrintSpace.equals(localName)) {
+ handlePrintSpaceNode(atts);
+ }
+ else if (ELEMENT_TextBlock.equals(localName)) {
+ handleBlockNode(atts, RegionType.TextRegion);
+ handleTextBlock(atts);
+ }
+ else if (ELEMENT_Illustration.equals(localName)) {
+ handleBlockNode(atts, RegionType.ImageRegion);
+ handleIllustrationBlock(atts);
+ }
+ else if (ELEMENT_GraphicalElement.equals(localName)) {
+ handleBlockNode(atts, RegionType.SeparatorRegion);
+ handleGraphicsBlock(atts);
+ }
+ else if (ELEMENT_ComposedBlock.equals(localName)) {
+ //At the moment we do not create a region for composed blocks.
+ //Only non-composed children get a region.
+ }
+ else if (ELEMENT_Polygon.equals(localName)) {
+ handlePolygonNode(atts);
+ }
+ else if (ELEMENT_Ellipse.equals(localName)) {
+ handleEllipseNode(atts);
+ }
+ else if (ELEMENT_Circle.equals(localName)) {
+ handleCircleNode(atts);
+ }
+ else if (ELEMENT_TextLine.equals(localName)) {
+ handleTextLineNode(atts);
+ }
+ else if (ELEMENT_String.equals(localName)) {
+ handleWordNode(atts);
+ }
+ else if (ELEMENT_HYP.equals(localName)) {
+ handleHyphenNode(atts);
+ }
+ }
+
+
+ /**
+ * Receive notification of the end of an element.
+ *
+ * @param namespaceURI - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
+ * @param localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
+ * @param qName - The qualified name (with prefix), or the empty string if qualified names are not available.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void endElement(String namespaceURI, String localName, String qName)
+ throws SAXException {
+
+
+ //Handle accumulated text
+ finishText();
+
+ if (ELEMENT_alto.equals(localName)){
+ metaData.setComments(comments);
+ createReadingOrder();
+ composehighLevelText();
+ }
+
+ if (firstPageDone) //No multi-page support
+ return;
+
+ if (ELEMENT_MeasurementUnit.equals(localName)){
+ handleMeasurementUnit(currentText);
+ }
+ else if (ELEMENT_fileName.equals(localName)){
+ handleFilename(currentText);
+ }
+ else if (ELEMENT_processingDateTime.equals(localName)){
+ comments += "\n Date: "+currentText;
+ }
+ else if (ELEMENT_processingAgency.equals(localName)){
+ comments += "\n Agency: "+currentText;
+ }
+ else if (ELEMENT_processingStepDescription.equals(localName)){
+ comments += "\n Description: "+currentText;
+ }
+ else if (ELEMENT_processingStepSettings.equals(localName)){
+ comments += "\n Settings: "+currentText;
+ }
+ else if (ELEMENT_processingSoftware.equals(localName)){
+ comments += "\n Software: ";
+ }
+ else if (ELEMENT_softwareCreator.equals(localName)){
+ comments += "\n Creator: "+currentText;
+ }
+ else if (ELEMENT_softwareName.equals(localName)){
+ comments += "\n Name: "+currentText;
+ }
+ else if (ELEMENT_softwareVersion.equals(localName)){
+ comments += "\n Version: "+currentText;
+ }
+ else if (ELEMENT_applicationDescription.equals(localName)){
+ comments += "\n Application: "+currentText;
+ }
+ else if (ELEMENT_Page.equals(localName)){
+ firstPageDone = true;
+ }
+ else if (ELEMENT_TextBlock.equals(localName)) {
+ currentRegion = null;
+ }
+ else if (ELEMENT_Illustration.equals(localName)) {
+ currentRegion = null;
+ }
+ else if (ELEMENT_GraphicalElement.equals(localName)) {
+ currentRegion = null;
+ }
+ else if (ELEMENT_ComposedBlock.equals(localName)) {
+ currentRegion = null;
+ }
+ else if (ELEMENT_TextLine.equals(localName)) {
+ currentTextLine = null;
+ lastWord = null;
+ }
+ else if (ELEMENT_String.equals(localName)) {
+ currentWord = null;
+ }
+
+ }
+
+ /**
+ * Receive notification of character data inside an element.
+ * @param ch - The characters.
+ * @param start - The start position in the character array.
+ * @param length - The number of characters to use from the character array.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void characters(char[] ch, int start, int length)
+ throws SAXException {
+
+ if (firstPageDone) //No multi-page support
+ return;
+
+ String strValue = new String(ch, start, length);
+
+ //Text might be parsed bit by bit, so we have to accumulate until a closing tag is found.
+ if (currentTextBuffer == null)
+ currentTextBuffer = new StringBuffer();
+ currentTextBuffer.append(strValue);
+
+ }
+
+ /**
+ * Writes accumulated text to the right object.
+ */
+ private void finishText() {
+ if (firstPageDone) //No multi-page support
+ return;
+ if (currentTextBuffer != null) {
+ currentText = currentTextBuffer.toString();
+
+ currentTextBuffer = null;
+ }
+ }
+
+ /**
+ * Creates the text region and text line text content from the words
+ */
+ private void composehighLevelText() {
+ for (ContentIterator it = layout.iterator(RegionType.TextRegion); it.hasNext(); ) {
+ ((TextRegion)it.next()).composeText(true, true);
+ }
+ }
+
+ private void handleMeasurementUnit(String textContent) {
+ comments += "\n\nMeasurement unit: "+textContent;
+
+ if ("pixel".equals(textContent))
+ measurementUnit = MeasurementUnit.PIXEL;
+ else if ("mm10".equals(textContent))
+ measurementUnit = MeasurementUnit.MM_BY_10;
+ else if ("inch1200".equals(textContent))
+ measurementUnit = MeasurementUnit.INCH_BY_1200;
+
+ if (page != null)
+ page.setMeasurementUnit(measurementUnit);
+ }
+
+ private void handleFilename(String textContent) {
+ page.setImageFilename(textContent);
+ }
+
+ private void handlePageNode(Attributes atts) {
+ int i;
+ //Id
+ if ((i = atts.getIndex(ATTR_ID)) >= 0) {
+ try {
+ page.setGtsId(atts.getValue(i));
+ } catch (InvalidIdException e) {
+ e.printStackTrace();
+ }
+ }
+ //Page class
+ if ((i = atts.getIndex(ATTR_PAGECLASS)) >= 0) {
+ comments += "\nPage class: "+atts.getValue(i);
+ }
+ //Width + height
+ if ((i = atts.getIndex(ATTR_WIDTH)) >= 0) {
+ int width = (int)Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_HEIGHT)) >= 0)
+ layout.setSize(width, (int)Double.parseDouble(atts.getValue(i)));
+ }
+ //Physical image number
+ if ((i = atts.getIndex(ATTR_PHYSICAL_IMG_NR)) >= 0) {
+ comments += "\nPhysical image number: "+atts.getValue(i);
+ }
+ //Printed image number
+ if ((i = atts.getIndex(ATTR_PRINTED_IMG_NR)) >= 0) {
+ comments += "\nPrinted image number: "+atts.getValue(i);
+ }
+ //Quality
+ if ((i = atts.getIndex(ATTR_QUALITY)) >= 0) {
+ comments += "\nQuality: "+atts.getValue(i);
+ }
+ //Quality details
+ if ((i = atts.getIndex(ATTR_QUALITY_DETAIL)) >= 0) {
+ comments += "\n Quality details: "+atts.getValue(i);
+ }
+ //Position
+ if ((i = atts.getIndex(ATTR_POSITION)) >= 0) {
+ comments += "\nPosition: "+atts.getValue(i);
+ }
+ //Processing ID
+ if ((i = atts.getIndex(ATTR_PROCESSING)) >= 0) {
+ comments += "\nProcessing ID: "+atts.getValue(i);
+ }
+ //Accuracy
+ if ((i = atts.getIndex(ATTR_ACCURACY)) >= 0) {
+ comments += "\nAccuracy: "+atts.getValue(i);
+ }
+ //Confidence
+ if ((i = atts.getIndex(ATTR_PC)) >= 0) {
+ comments += "\nConfidence: "+atts.getValue(i);
+ }
+ }
+
+ private void handlePrintSpaceNode(Attributes atts) {
+ int i;
+
+ //Width + height
+ if (!hasMargins && layout.getWidth() <= 0) {
+ //No width defined in page node
+ //and there is no margin (meaning the print space equals the full page)
+ if ((i = atts.getIndex(ATTR_WIDTH)) >= 0) {
+ int width = (int)Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_HEIGHT)) >= 0)
+ layout.setSize(width, (int)Double.parseDouble(atts.getValue(i)));
+ }
+ }
+
+ //Polygon
+ if (hasMargins) {
+ int w=0, h=0, l=0, t=0;
+ if ((i = atts.getIndex(ATTR_WIDTH)) >= 0)
+ w = (int)Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_HEIGHT)) >= 0)
+ h = (int)Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_HPOS)) >= 0)
+ l = (int)Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_VPOS)) >= 0)
+ t = (int)Double.parseDouble(atts.getValue(i));
+ layout.setPrintSpace(new GeometricObjectImpl(createPolygonFromBoundingBox(l, t, w, h)));
+ }
+ }
+
+ private Polygon createPolygonFromBoundingBox(int left, int top, int width, int height) {
+ int right = left+width-1;
+ int bottom = top+height-1;
+ Polygon polygon = new Polygon();
+
+ polygon.addPoint(left, top);
+ polygon.addPoint(right, top);
+ polygon.addPoint(right, bottom);
+ polygon.addPoint(left, bottom);
+ return polygon;
+ }
+
+ private void handleBlockNode(Attributes atts, RegionType type) {
+ int i;
+ //Id
+ String id = null;
+ if ((i = atts.getIndex(ATTR_ID)) >= 0)
+ id = atts.getValue(i);
+
+ if (id != null)
+ currentRegion = layout.createRegion(type, id);
+ else
+ currentRegion = layout.createRegion(type);
+
+ //IdNext
+ if ((i = atts.getIndex(ATTR_IDNEXT)) >= 0)
+ addRelationToReadingOrder(id, atts.getValue(i));
+
+ //Polygon from bounding box
+ int w=0, h=0, l=0, t=0;
+ if ((i = atts.getIndex(ATTR_WIDTH)) >= 0)
+ w = (int)Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_HEIGHT)) >= 0)
+ h = (int)Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_HPOS)) >= 0)
+ l = (int)Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_VPOS)) >= 0)
+ t = (int)Double.parseDouble(atts.getValue(i));
+ currentRegion.setCoords(createPolygonFromBoundingBox(l, t, w, h));
+ }
+
+ private void handleTextBlock(Attributes atts) {
+ int i;
+ TextRegion region = (TextRegion)currentRegion;
+
+ //Rotation
+ if ((i = atts.getIndex(ATTR_ROTATION)) >= 0)
+ region.setOrientation(Double.parseDouble(atts.getValue(i)));
+
+ //Language
+ if ((i = atts.getIndex(ATTR_LANG)) >= 0) {
+ String language = altoToPrimaLanguage(atts.getValue(i));
+ if (language != null)
+ region.setPrimaryLanguage(language);
+ }
+ }
+
+ private void handleTextLineNode(Attributes atts) {
+ if (currentRegion == null || !(currentRegion instanceof TextRegion))
+ return;
+
+
+ //Create line
+ currentTextLine = ((TextRegion)currentRegion).createTextLine();
+
+ int i;
+
+ //Language
+ if ((i = atts.getIndex(ATTR_LANG)) >= 0) {
+ String language = altoToPrimaLanguage(atts.getValue(i));
+ if (language != null)
+ currentTextLine.setPrimaryLanguage(language);
+ }
+
+ //Polygon from bounding box
+ int w=0, h=0, l=0, t=0;
+ if ((i = atts.getIndex(ATTR_WIDTH)) >= 0)
+ w = (int)Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_HEIGHT)) >= 0)
+ h = (int)Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_HPOS)) >= 0)
+ l = (int)Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_VPOS)) >= 0)
+ t = (int)Double.parseDouble(atts.getValue(i));
+ currentTextLine.setCoords(createPolygonFromBoundingBox(l, t, w, h));
+ }
+
+ private void handleWordNode(Attributes atts) {
+ if (currentTextLine == null)
+ return;
+
+ //Create word
+ currentWord = currentTextLine.createWord();
+ lastWord = currentWord;
+
+ int i;
+
+ //Language
+ if ((i = atts.getIndex(ATTR_LANG)) >= 0) {
+ String language = altoToPrimaLanguage(atts.getValue(i));
+ if (language != null)
+ currentWord.setLanguage(language);
+ }
+
+ //Polygon from bounding box
+ int w=0, h=0, l=0, t=0;
+ if ((i = atts.getIndex(ATTR_WIDTH)) >= 0)
+ w = (int)Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_HEIGHT)) >= 0)
+ h = (int)Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_HPOS)) >= 0)
+ l = (int)Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_VPOS)) >= 0)
+ t = (int)Double.parseDouble(atts.getValue(i));
+ currentWord.setCoords(createPolygonFromBoundingBox(l, t, w, h));
+
+ //Text content
+ if ((i = atts.getIndex(ATTR_CONTENT)) >= 0)
+ currentWord.setText(atts.getValue(i));
+ }
+
+ private void handleHyphenNode(Attributes atts) {
+ if (lastWord == null)
+ return;
+
+ //Attach the hyphen to the last word
+ // Text
+ int i;
+ String hypContent = "-";
+ if ((i = atts.getIndex(ATTR_CONTENT)) >= 0) {
+ char code = 0;
+ //Is it a number?
+ try {
+ code = (char)Integer.parseInt(atts.getValue(i));
+ } catch (NumberFormatException exc) {
+ //No need to handle
+ }
+ if (code > 0) { //We got a number
+ //Assuming Unicode
+ hypContent = Character.toString(code);
+ }
+ else //If the conversion to int doesn't return a positive number, we assume the content is the actual character
+ hypContent = atts.getValue(i);
+ }
+ lastWord.setText(lastWord.getText() + hypContent);
+
+ // Coords (extend word)
+ int updatedRight = 0;
+ Polygon polygon = lastWord.getCoords();
+ if (atts.getIndex(ATTR_HPOS) >= 0 && atts.getIndex(ATTR_WIDTH) >= 0) {
+ int hpos = (int)Double.parseDouble(atts.getValue(ATTR_HPOS));
+ int width = (int)Double.parseDouble(atts.getValue(ATTR_WIDTH));
+
+ updatedRight = hpos + width - 1;
+ }
+ //Missing coordinates
+ else {
+ //Use text line
+ TextLine parentLine = (TextLine)lastWord.getParent();
+ if (parentLine != null && parentLine.getCoords() != null)
+ updatedRight = parentLine.getCoords().getBoundingBox().right;
+ }
+ if (polygon.getBoundingBox().right < updatedRight) {
+ polygon.getPoint(1).x = updatedRight;
+ polygon.getPoint(2).x = updatedRight;
+ lastWord.getCoords().setBoundingBoxOutdated();
+ }
+ }
+
+ private void handleIllustrationBlock(Attributes atts) {
+ int i;
+ ImageRegion region = (ImageRegion)currentRegion;
+
+ //Rotation
+ if ((i = atts.getIndex(ATTR_ROTATION)) >= 0)
+ region.setOrientation(Double.parseDouble(atts.getValue(i)));
+ }
+
+ private void handleGraphicsBlock(Attributes atts) {
+ int i;
+ SeparatorRegion region = (SeparatorRegion)currentRegion;
+
+ //Rotation
+ if ((i = atts.getIndex(ATTR_ROTATION)) >= 0)
+ region.setOrientation(Double.parseDouble(atts.getValue(i)));
+ }
+
+ private String altoToPrimaLanguage(String altoLanguageValue) {
+ if ("ab".equals(altoLanguageValue)) return "Abkhaz";
+ if ("aa".equals(altoLanguageValue)) return "Afar";
+ if ("af".equals(altoLanguageValue)) return "Afrikaans";
+ if ("ak".equals(altoLanguageValue)) return "Akan";
+ if ("sq".equals(altoLanguageValue)) return "Albanian";
+ if ("am".equals(altoLanguageValue)) return "Amharic";
+ if ("ar".equals(altoLanguageValue)) return "Arabic";
+ if ("an".equals(altoLanguageValue)) return "Aragonese";
+ if ("hy".equals(altoLanguageValue)) return "Armenian";
+ if ("as".equals(altoLanguageValue)) return "Assamese";
+ if ("av".equals(altoLanguageValue)) return "Avaric";
+ if ("ae".equals(altoLanguageValue)) return "Avestan";
+ if ("ay".equals(altoLanguageValue)) return "Aymara";
+ if ("az".equals(altoLanguageValue)) return "Azerbaijani";
+ if ("bm".equals(altoLanguageValue)) return "Bambara";
+ if ("ba".equals(altoLanguageValue)) return "Bashkir";
+ if ("eu".equals(altoLanguageValue)) return "Basque";
+ if ("be".equals(altoLanguageValue)) return "Belarusian";
+ if ("bn".equals(altoLanguageValue)) return "Bengali";
+ if ("bh".equals(altoLanguageValue)) return "Bihari";
+ if ("bi".equals(altoLanguageValue)) return "Bislama";
+ if ("bs".equals(altoLanguageValue)) return "Bosnian";
+ if ("br".equals(altoLanguageValue)) return "Breton";
+ if ("bg".equals(altoLanguageValue)) return "Bulgarian";
+ if ("my".equals(altoLanguageValue)) return "Burmese";
+ if ("km".equals(altoLanguageValue)) return "Cambodian";
+ //if ("".equals(altoLanguageValue)) return "Cantonese";
+ if ("ca".equals(altoLanguageValue)) return "Catalan";
+ if ("ch".equals(altoLanguageValue)) return "Chamorro";
+ if ("ce".equals(altoLanguageValue)) return "Chechen";
+ if ("ny".equals(altoLanguageValue)) return "Chichewa";
+ if ("zh".equals(altoLanguageValue)) return "Chinese";
+ if ("cv".equals(altoLanguageValue)) return "Chuvash";
+ if ("kw".equals(altoLanguageValue)) return "Cornish";
+ if ("co".equals(altoLanguageValue)) return "Corsican";
+ if ("cr".equals(altoLanguageValue)) return "Cree";
+ if ("hr".equals(altoLanguageValue)) return "Croatian";
+ if ("cs".equals(altoLanguageValue)) return "Czech";
+ if ("da".equals(altoLanguageValue)) return "Danish";
+ if ("dv".equals(altoLanguageValue)) return "Divehi";
+ if ("nl".equals(altoLanguageValue)) return "Dutch";
+ if ("dz".equals(altoLanguageValue)) return "Dzongkha";
+ if ("en".equals(altoLanguageValue)) return "English";
+ if ("en-GB".equals(altoLanguageValue)) return "English";
+ if ("en-US".equals(altoLanguageValue)) return "English";
+ if ("eo".equals(altoLanguageValue)) return "Esperanto";
+ if ("et".equals(altoLanguageValue)) return "Estonian";
+ if ("ee".equals(altoLanguageValue)) return "Ewe";
+ if ("fo".equals(altoLanguageValue)) return "Faroese";
+ if ("fj".equals(altoLanguageValue)) return "Fijian";
+ if ("fi".equals(altoLanguageValue)) return "Finnish";
+ if ("fr".equals(altoLanguageValue)) return "French";
+ if ("ff".equals(altoLanguageValue)) return "Fula";
+ if ("gd".equals(altoLanguageValue)) return "Gaelic";
+ if ("gl".equals(altoLanguageValue)) return "Galician";
+ if ("lg".equals(altoLanguageValue)) return "Ganda";
+ if ("ka".equals(altoLanguageValue)) return "Georgian";
+ if ("de".equals(altoLanguageValue)) return "German";
+ if ("el".equals(altoLanguageValue)) return "Greek";
+ if ("gn".equals(altoLanguageValue)) return "Guaraní";
+ if ("gu".equals(altoLanguageValue)) return "Gujarati";
+ if ("ht".equals(altoLanguageValue)) return "Haitian";
+ if ("ha".equals(altoLanguageValue)) return "Hausa";
+ if ("he".equals(altoLanguageValue)) return "Hebrew";
+ if ("hz".equals(altoLanguageValue)) return "Herero";
+ if ("hi".equals(altoLanguageValue)) return "Hindi";
+ if ("ho".equals(altoLanguageValue)) return "Hiri Motu";
+ if ("hu".equals(altoLanguageValue)) return "Hungarian";
+ if ("is".equals(altoLanguageValue)) return "Icelandic";
+ if ("io".equals(altoLanguageValue)) return "Ido";
+ if ("ig".equals(altoLanguageValue)) return "Igbo";
+ if ("id".equals(altoLanguageValue)) return "Indonesian";
+ if ("ia".equals(altoLanguageValue)) return "Interlingua";
+ if ("ie".equals(altoLanguageValue)) return "Interlingue";
+ if ("iu".equals(altoLanguageValue)) return "Inuktitut";
+ if ("ik".equals(altoLanguageValue)) return "Inupiaq";
+ if ("ga".equals(altoLanguageValue)) return "Irish";
+ if ("it".equals(altoLanguageValue)) return "Italian";
+ if ("ja".equals(altoLanguageValue)) return "Japanese";
+ if ("jv".equals(altoLanguageValue)) return "Javanese";
+ if ("kl".equals(altoLanguageValue)) return "Kalaallisut";
+ if ("kn".equals(altoLanguageValue)) return "Kannada";
+ if ("kr".equals(altoLanguageValue)) return "Kanuri";
+ if ("ks".equals(altoLanguageValue)) return "Kashmiri";
+ if ("kk".equals(altoLanguageValue)) return "Kazakh";
+ if ("km".equals(altoLanguageValue)) return "Khmer";
+ if ("ki".equals(altoLanguageValue)) return "Kikuyu";
+ if ("rw".equals(altoLanguageValue)) return "Kinyarwanda";
+ if ("rn".equals(altoLanguageValue)) return "Kirundi";
+ if ("kv".equals(altoLanguageValue)) return "Komi";
+ if ("kg".equals(altoLanguageValue)) return "Kongo";
+ if ("ko".equals(altoLanguageValue)) return "Korean";
+ if ("ku".equals(altoLanguageValue)) return "Kurdish";
+ if ("kj".equals(altoLanguageValue)) return "Kwanyama";
+ if ("ky".equals(altoLanguageValue)) return "Kyrgyz";
+ if ("lo".equals(altoLanguageValue)) return "Lao";
+ if ("la".equals(altoLanguageValue)) return "Latin";
+ if ("lv".equals(altoLanguageValue)) return "Latvian";
+ if ("li".equals(altoLanguageValue)) return "Limburgish";
+ if ("ln".equals(altoLanguageValue)) return "Lingala";
+ if ("lt".equals(altoLanguageValue)) return "Lithuanian";
+ if ("lu".equals(altoLanguageValue)) return "Luba-Katanga";
+ if ("lb".equals(altoLanguageValue)) return "Luxembourgish";
+ if ("mk".equals(altoLanguageValue)) return "Macedonian";
+ if ("mg".equals(altoLanguageValue)) return "Malagasy";
+ if ("ms".equals(altoLanguageValue)) return "Malay";
+ if ("ml".equals(altoLanguageValue)) return "Malayalam";
+ if ("mt".equals(altoLanguageValue)) return "Maltese";
+ if ("gv".equals(altoLanguageValue)) return "Manx";
+ if ("mi".equals(altoLanguageValue)) return "Māori";
+ if ("mr".equals(altoLanguageValue)) return "Marathi";
+ if ("mh".equals(altoLanguageValue)) return "Marshallese";
+ if ("mn".equals(altoLanguageValue)) return "Mongolian";
+ if ("na".equals(altoLanguageValue)) return "Nauru";
+ if ("nv".equals(altoLanguageValue)) return "Navajo";
+ if ("ng".equals(altoLanguageValue)) return "Ndonga";
+ if ("ne".equals(altoLanguageValue)) return "Nepali";
+ if ("nd".equals(altoLanguageValue)) return "North Ndebele";
+ if ("se".equals(altoLanguageValue)) return "Northern Sami";
+ if ("no".equals(altoLanguageValue)) return "Norwegian";
+ if ("nb".equals(altoLanguageValue)) return "Norwegian Bokmål";
+ if ("nn".equals(altoLanguageValue)) return "Norwegian Nynorsk";
+ if ("ii".equals(altoLanguageValue)) return "Nuosu";
+ if ("oc".equals(altoLanguageValue)) return "Occitan";
+ if ("oj".equals(altoLanguageValue)) return "Ojibwe";
+ if ("cu".equals(altoLanguageValue)) return "Old Church Slavonic";
+ if ("or".equals(altoLanguageValue)) return "Oriya";
+ if ("om".equals(altoLanguageValue)) return "Oromo";
+ if ("os".equals(altoLanguageValue)) return "Ossetian";
+ if ("pi".equals(altoLanguageValue)) return "Pāli";
+ if ("pa".equals(altoLanguageValue)) return "Panjabi";
+ if ("ps".equals(altoLanguageValue)) return "Pashto";
+ if ("fa".equals(altoLanguageValue)) return "Persian";
+ if ("pl".equals(altoLanguageValue)) return "Polish";
+ if ("pt".equals(altoLanguageValue)) return "Portuguese";
+ if ("pa".equals(altoLanguageValue)) return "Punjabi";
+ if ("qu".equals(altoLanguageValue)) return "Quechua";
+ if ("ro".equals(altoLanguageValue)) return "Romanian";
+ if ("rm".equals(altoLanguageValue)) return "Romansh";
+ if ("ru".equals(altoLanguageValue)) return "Russian";
+ if ("sm".equals(altoLanguageValue)) return "Samoan";
+ if ("sg".equals(altoLanguageValue)) return "Sango";
+ if ("sa".equals(altoLanguageValue)) return "Sanskrit";
+ if ("sc".equals(altoLanguageValue)) return "Sardinian";
+ if ("sr".equals(altoLanguageValue)) return "Serbian";
+ if ("sn".equals(altoLanguageValue)) return "Shona";
+ if ("sd".equals(altoLanguageValue)) return "Sindhi";
+ if ("si".equals(altoLanguageValue)) return "Sinhala";
+ if ("sk".equals(altoLanguageValue)) return "Slovak";
+ if ("sl".equals(altoLanguageValue)) return "Slovene";
+ if ("so".equals(altoLanguageValue)) return "Somali";
+ if ("nr".equals(altoLanguageValue)) return "South Ndebele";
+ if ("st".equals(altoLanguageValue)) return "Southern Sotho";
+ if ("es".equals(altoLanguageValue)) return "Spanish";
+ if ("su".equals(altoLanguageValue)) return "Sundanese";
+ if ("sw".equals(altoLanguageValue)) return "Swahili";
+ if ("ss".equals(altoLanguageValue)) return "Swati";
+ if ("sv".equals(altoLanguageValue)) return "Swedish";
+ if ("tl".equals(altoLanguageValue)) return "Tagalog";
+ if ("ty".equals(altoLanguageValue)) return "Tahitian";
+ if ("tg".equals(altoLanguageValue)) return "Tajik";
+ if ("ta".equals(altoLanguageValue)) return "Tamil";
+ if ("tt".equals(altoLanguageValue)) return "Tatar";
+ if ("te".equals(altoLanguageValue)) return "Telugu";
+ if ("th".equals(altoLanguageValue)) return "Thai";
+ if ("bo".equals(altoLanguageValue)) return "Tibetan";
+ if ("ti".equals(altoLanguageValue)) return "Tigrinya";
+ if ("to".equals(altoLanguageValue)) return "Tonga";
+ if ("ts".equals(altoLanguageValue)) return "Tsonga";
+ if ("tn".equals(altoLanguageValue)) return "Tswana";
+ if ("tr".equals(altoLanguageValue)) return "Turkish";
+ if ("tk".equals(altoLanguageValue)) return "Turkmen";
+ if ("tw".equals(altoLanguageValue)) return "Twi";
+ if ("ug".equals(altoLanguageValue)) return "Uighur";
+ if ("uk".equals(altoLanguageValue)) return "Ukrainian";
+ if ("ur".equals(altoLanguageValue)) return "Urdu";
+ if ("uz".equals(altoLanguageValue)) return "Uzbek";
+ if ("ve".equals(altoLanguageValue)) return "Venda";
+ if ("vi".equals(altoLanguageValue)) return "Vietnamese";
+ if ("vo".equals(altoLanguageValue)) return "Volapük";
+ if ("wa".equals(altoLanguageValue)) return "Walloon";
+ if ("cy".equals(altoLanguageValue)) return "Welsh";
+ if ("fy".equals(altoLanguageValue)) return "Western Frisian";
+ if ("wo".equals(altoLanguageValue)) return "Wolof";
+ if ("xh".equals(altoLanguageValue)) return "Xhosa";
+ if ("yi".equals(altoLanguageValue)) return "Yiddish";
+ if ("yo".equals(altoLanguageValue)) return "Yoruba";
+ if ("za".equals(altoLanguageValue)) return "Zhuang";
+ if ("zu".equals(altoLanguageValue)) return "Zulu";
+ if (!altoLanguageValue.isEmpty()) return "other";
+
+ return null;
+ }
+
+ /**
+ * Adds a relation (from region, to region) to the temporary reading order data structure.
+ */
+ private void addRelationToReadingOrder(String fromRegion, String toRegion) {
+ List group = null;
+
+ //Find reading order group of 'toRegion' to make sure it's not pointing to 'fromRegion' (illegal loop)
+ if (idPartialReadingOrderMap.containsKey(toRegion)) { //Found
+ group = idPartialReadingOrderMap.get(toRegion);
+ for (int i=0; i loop -> ignore and return
+ return;
+ }
+ }
+ }
+
+ if (group != null) { //The toRegion already has a group
+ //Find 'toRegion' in the group
+ for (int i=0; i create a group
+ group = new ArrayList();
+ partialReadingOrder.add(group);
+ //Add fromRegion
+ group.add(fromRegion);
+ idPartialReadingOrderMap.put(fromRegion, group);
+ //Add toRegion
+ group.add(toRegion);
+ }
+ else { //Group exists
+ group = idPartialReadingOrderMap.get(fromRegion);
+ //Find 'fromRegion' in the group
+ for (int i=0; i group = partialReadingOrder.get(0);
+ for (int i=0; i group = partialReadingOrder.get(i);
+ for (int j=0; j= 3)
+ currentRegion.setCoords(polygon);
+ }
+
+ /**
+ * Parses an ellipse node.
+ */
+ private void handleEllipseNode(Attributes atts) {
+ int i;
+ double x=0, y=0, horLength=0, vertLength=0;
+
+ if ((i = atts.getIndex(ATTR_HPOS)) >= 0) //getIndex returns -1 if the attribute is missing
+ x = Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_VPOS)) >= 0)
+ y = Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_HLENGTH)) >= 0)
+ horLength = Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_VLENGTH)) >= 0)
+ vertLength = Double.parseDouble(atts.getValue(i));
+
+ double radiusX = horLength/2;
+ double radiusY = vertLength/2;
+
+ Polygon polygon = ellipseToPolygon(x, y, radiusX, radiusY);
+ if (polygon != null && polygon.getSize() >= 3)
+ currentRegion.setCoords(polygon);
+ }
+
+ /**
+ * Parses a circle node.
+ */
+ private void handleCircleNode(Attributes atts) {
+ int i;
+ double x=0, y=0, radius=0;
+
+ if ((i = atts.getIndex(ATTR_HPOS)) >= 0) //getIndex returns -1 if the attribute is missing
+ x = Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_VPOS)) >= 0)
+ y = Double.parseDouble(atts.getValue(i));
+ if ((i = atts.getIndex(ATTR_RADIUS)) >= 0)
+ radius = Double.parseDouble(atts.getValue(i));
+
+ Polygon polygon = ellipseToPolygon(x, y, radius, radius);
+ if (polygon != null && polygon.getSize() >= 3)
+ currentRegion.setCoords(polygon);
+ }
+
+ /**
+ * Converts an ellipse to a polygon.
+ */
+ private Polygon ellipseToPolygon(double centerX, double centerY, double radiusX, double radiusY) {
+
+ double step = 0.00873; //0.5 degrees
+
+ Polygon polygon = new Polygon();
+ int x,y,xold=-1,yold=-1;
+ for (double angle = 0; angle < 6.2832; angle += step) { //0 to 360 degrees
+ x = (int)(centerX + Math.cos(angle) * radiusX);
+ y = (int)(centerY + Math.sin(angle) * radiusY);
+ if (x != xold || y != yold) //Coords changed?
+ polygon.addPoint(x, y);
+ xold = x;
+ yold = y;
+ }
+
+ //polygon.SimplifyPolygon();
+ //polygon->ConvertToIsothetic();
+ return polygon;
+ }
+}
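
Review note on `ellipseToPolygon`: the loop above advances a floating-point angle by a fixed step and truncates each coordinate to `int`. The following is a minimal, self-contained sketch of the same idea, not the library's implementation. The class name `EllipseSketch`, the method `ellipseToPoints`, and the `steps` parameter are hypothetical. It differs from the diff in three deliberate ways: it uses an integer step counter so no rounding error accumulates in the angle, it rounds instead of truncating, and it drops a final vertex that repeats the first one.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of the PRImA library. */
final class EllipseSketch {

    /**
     * Approximates an ellipse by integer {x, y} vertices.
     * Consecutive duplicate vertices are skipped.
     */
    static List<int[]> ellipseToPoints(double centerX, double centerY,
                                       double radiusX, double radiusY, int steps) {
        List<int[]> points = new ArrayList<>();
        int xOld = 0, yOld = 0;
        for (int k = 0; k < steps; k++) {
            // Derive the angle from the counter so errors do not accumulate
            double angle = 2 * Math.PI * k / steps;
            int x = (int) Math.round(centerX + Math.cos(angle) * radiusX);
            int y = (int) Math.round(centerY + Math.sin(angle) * radiusY);
            if (points.isEmpty() || x != xOld || y != yOld) // Coords changed?
                points.add(new int[] {x, y});
            xOld = x;
            yOld = y;
        }
        // The polygon is implicitly closed, so drop a last vertex equal to the first
        int last = points.size() - 1;
        if (last > 0
                && points.get(last)[0] == points.get(0)[0]
                && points.get(last)[1] == points.get(0)[1])
            points.remove(last);
        return points;
    }
}
```

With `steps = 720` the angular resolution matches the 0.5 degree step used in the diff.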
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_Hocr.java b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_Hocr.java
new file mode 100644
index 00000000..28ef0b1c
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/sax/SaxPageHandler_Hocr.java
@@ -0,0 +1,436 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.io.xml.sax;
+
+import java.io.File;
+import java.util.Iterator;
+import java.util.List;
+
+import org.primaresearch.dla.page.MetaData;
+import org.primaresearch.dla.page.Page;
+import org.primaresearch.dla.page.io.xml.PageXmlInputOutput;
+import org.primaresearch.dla.page.layout.PageLayout;
+import org.primaresearch.dla.page.layout.physical.shared.RegionType;
+import org.primaresearch.dla.page.layout.physical.text.LowLevelTextObject;
+import org.primaresearch.dla.page.layout.physical.text.impl.TextLine;
+import org.primaresearch.dla.page.layout.physical.text.impl.TextRegion;
+import org.primaresearch.dla.page.layout.physical.text.impl.Word;
+import org.primaresearch.ident.IdRegister.InvalidIdException;
+import org.primaresearch.maths.geometry.Polygon;
+import org.xml.sax.Attributes;
+import org.xml.sax.SAXException;
+
+/**
+ * Experimental SAX XML handler to read HOCR XHTML files (e.g. output from Tesseract OCR engine).
+ *
+ * @author Christian Clausner
+ *
+ */
+public class SaxPageHandler_Hocr extends SaxPageHandler {
+
+ private static final String ELEMENT_html = "html";
+ //private static final String ELEMENT_head = "head";
+ //private static final String ELEMENT_body = "body";
+ private static final String ELEMENT_meta = "meta";
+ private static final String ELEMENT_div = "div";
+ private static final String ELEMENT_p = "p";
+ private static final String ELEMENT_span = "span";
+
+ private static final String ATTR_name = "name";
+ private static final String ATTR_content = "content";
+ private static final String ATTR_class = "class";
+ private static final String ATTR_id = "id";
+ private static final String ATTR_title = "title";
+
+ private static final String CLASS_page = "ocr_page";
+ private static final String CLASS_area = "ocr_carea";
+ private static final String CLASS_paragraph = "ocr_par";
+ private static final String CLASS_line = "ocr_line";
+ private static final String CLASS_word = "ocrx_word";
+
+ private Page page;
+ private PageLayout layout = null;
+ private TextRegion currentTextRegion = null;
+ private TextLine currentLine = null;
+ private Word currentWord = null;
+ private StringBuffer currentText = null;
+
+ @Override
+ public Page getPageObject() {
+ return page;
+ }
+
+ /**
+ * Receive notification of the start of an element.
+ *
+ * @param namespaceURI - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
+ * @param localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
+ * @param qName - The qualified name (with prefix), or the empty string if qualified names are not available.
+ * @param atts - The attributes attached to the element. If there are no attributes, it shall be an empty Attributes object.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void startElement(String namespaceURI, String localName, String qName, Attributes atts)
+ throws SAXException {
+ int i;
+
+ //Handle accumulated text
+ //finishText();
+
+
+ if (ELEMENT_html.equals(localName)){
+ page = new Page(PageXmlInputOutput.getLatestSchemaModel());
+ layout = page.getLayout();
+ }
+ else if (ELEMENT_meta.equals(localName)){
+ handleMetaElement(atts);
+ }
+ else if (ELEMENT_div.equals(localName)){
+ //Check class
+ if ((i = atts.getIndex(ATTR_class)) >= 0) {
+ String elementClass = atts.getValue(i);
+ //Page
+ if (CLASS_page.equals(elementClass))
+ handlePageElement(atts);
+ //Area (block)
+ else if (CLASS_area.equals(elementClass))
+ ;
+ }
+ }
+ else if (ELEMENT_p.equals(localName)){
+ //Check class
+ if ((i = atts.getIndex(ATTR_class)) >= 0) {
+ String elementClass = atts.getValue(i);
+ //Paragraph
+ if (CLASS_paragraph.equals(elementClass))
+ handleParagraphElement(atts);
+ }
+ }
+ else if (ELEMENT_span.equals(localName)){
+ //Check class
+ if ((i = atts.getIndex(ATTR_class)) >= 0) {
+ String elementClass = atts.getValue(i);
+ //Text line
+ if (CLASS_line.equals(elementClass))
+ handleTextLineElement(atts);
+ //Word
+ else if (CLASS_word.equals(elementClass))
+ handleWordElement(atts);
+ }
+ }
+
+ }
+
+ /**
+ * Receive notification of the end of an element.
+ *
+ * @param namespaceURI - The Namespace URI, or the empty string if the element has no Namespace URI or if Namespace processing is not being performed.
+ * @param localName - The local name (without prefix), or the empty string if Namespace processing is not being performed.
+ * @param qName - The qualified name (with prefix), or the empty string if qualified names are not available.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void endElement(String namespaceURI, String localName, String qName)
+ throws SAXException {
+ //Handle accumulated text
+ //finishText();
+
+ if (ELEMENT_html.equals(localName)){
+ }
+ else if (ELEMENT_div.equals(localName)){
+ }
+ else if (ELEMENT_p.equals(localName) && currentTextRegion != null) { //Region is null if the 'p' was not an 'ocr_par'
+ //Accumulate text from lines
+ List<LowLevelTextObject> lines = currentTextRegion.getTextObjectsSorted();
+ if (lines != null) {
+ String text = "";
+ for (Iterator<LowLevelTextObject> it = lines.iterator(); it.hasNext(); ) {
+ if (!text.isEmpty())
+ text += "\r\n";
+ text += it.next().getText();
+ }
+ currentTextRegion.setText(text);
+ }
+ currentTextRegion = null;
+ }
+ else if (ELEMENT_span.equals(localName)) {
+ if (currentWord != null) {
+ if (currentText != null) {
+ currentWord.setText(currentText.toString().trim());
+ currentText = null;
+ }
+ currentWord = null;
+ }
+ else if (currentLine != null) {
+ //Accumulate text from words
+ List<LowLevelTextObject> words = currentLine.getTextObjectsSorted();
+ if (words != null) {
+ String text = "";
+ for (Iterator<LowLevelTextObject> it = words.iterator(); it.hasNext(); ) {
+ if (!text.isEmpty())
+ text += " ";
+ text += it.next().getText();
+ }
+ currentLine.setText(text);
+ }
+ currentLine = null;
+ }
+ }
+ }
+
+ /**
+ * Receive notification of character data inside an element.
+ * @param ch - The characters.
+ * @param start - The start position in the character array.
+ * @param length - The number of characters to use from the character array.
+ * @throws SAXException - Any SAX exception, possibly wrapping another exception.
+ */
+ public void characters(char[] ch, int start, int length)
+ throws SAXException {
+
+ String strValue = new String(ch, start, length);
+
+ //Text might be parsed bit by bit, so we have to accumulate until a closing tag is found.
+ if (currentText == null)
+ currentText = new StringBuffer();
+ currentText.append(strValue);
+ }
+
+ ///**
+ // * Writes accumulated text to the right object.
+ // */
+ //private void finishText() {
+ /*if (currentText != null) {
+ String strValue = currentText.toString();
+
+ if (currentTextObject != null) {
+ if (ELEMENT_Unicode.equals(insideElement)) {
+ currentTextObject.setText(strValue);
+ }
+ else if (ELEMENT_PlainText.equals(insideElement)) {
+ currentTextObject.setPlainText(strValue);
+ }
+ }
+ if (metaData != null) {
+ if (ELEMENT_Creator.equals(insideElement)) {
+ metaData.setCreator(strValue);
+ }
+ else if (ELEMENT_Comments.equals(insideElement)) {
+ metaData.setComments(strValue);
+ }
+ else if (ELEMENT_Created.equals(insideElement)) {
+ metaData.setCreationTime(parseDate(strValue));
+ }
+ else if (ELEMENT_LastChange.equals(insideElement)) {
+ metaData.setLastModifiedTime(parseDate(strValue));
+ }
+ }
+
+ currentText = null;
+ }*/
+ //}
+
+ /**
+ * Parses a metadata element from the header.
+ */
+ private void handleMetaElement(Attributes atts) {
+ int i;
+
+ //Name and content
+ String name = null;
+ String content = null;
+ if ((i = atts.getIndex(ATTR_name)) >= 0) {
+ name = atts.getValue(i);
+ }
+ if ((i = atts.getIndex(ATTR_content)) >= 0) {
+ content = atts.getValue(i);
+ }
+
+ if (name != null && content != null)
+ addComment(name + ": " + content);
+ }
+
+ /**
+ * Adds a comment to the comments text (also adds a line break before if not the first entry).
+ */
+ private void addComment(String comment) {
+ if (page == null || comment == null || comment.isEmpty())
+ return;
+
+ MetaData metadata = page.getMetaData();
+ if (metadata != null) {
+ String comments = metadata.getComments();
+ if (comments == null)
+ comments = "";
+ if (!comments.isEmpty())
+ comments += "\r\n";
+ comments += comment;
+ metadata.setComments(comments);
+ }
+ }
+
+ /**
+ * Parses the attributes of the page 'div' node.
+ */
+ private void handlePageElement(Attributes atts) {
+ int i;
+
+ if (page == null)
+ return;
+
+ //ID
+ if ((i = atts.getIndex(ATTR_id)) >= 0) {
+ try {
+ page.setGtsId(atts.getValue(i));
+ } catch (InvalidIdException e) {
+ e.printStackTrace();
+ }
+ }
+
+ //Image name and dimensions
+ if ((i = atts.getIndex(ATTR_title)) >= 0) {
+ String title = atts.getValue(i);
+ String parts[] = title.split("; ");
+ for (String part : parts) {
+ //Image
+ if (part.startsWith("image")) {
+ String image = null;
+ //Filename
+ // Path
+ if (part.contains(File.separator))
+ image = part.substring(part.lastIndexOf(File.separator)+1);
+ // No path
+ else if (part.contains(" \""))
+ image = part.substring(part.indexOf(" \"")+1);
+
+ if (image != null) {
+ //Remove quotation mark
+ if (image.endsWith("\""))
+ image = image.substring(0, image.length()-1);
+ page.setImageFilename(image);
+ }
+ }
+ //Bounding box
+ else if (part.startsWith("bbox")) {
+ String coords[] = part.split(" ");
+ if (coords.length == 5) {
+ layout.setSize(Integer.parseInt(coords[3]), Integer.parseInt(coords[4])); //This should be +1 but they seem to use x2/y2 as width/height
+ }
+ }
+ }
+ }
+ }
+
+ /**
+ * Parses the given paragraph 'p' node.
+ */
+ private void handleParagraphElement(Attributes atts) {
+ int i;
+
+ if (page == null)
+ return;
+
+ //ID
+ String id = null;
+ if ((i = atts.getIndex(ATTR_id)) >= 0) {
+ id = atts.getValue(i);
+ }
+
+ //Create region
+ currentTextRegion = (TextRegion)layout.createRegion(RegionType.TextRegion, id);
+
+ //Coords
+ if ((i = atts.getIndex(ATTR_title)) >= 0) {
+ Polygon coords = parseCoords(atts.getValue(i));
+ if (coords != null)
+ currentTextRegion.setCoords(coords);
+ }
+
+ }
+
+ /**
+ * Parses the given 'span' node of class 'ocr_line'.
+ */
+ private void handleTextLineElement(Attributes atts) {
+ int i;
+
+ if (page == null || currentTextRegion == null)
+ return;
+
+ //ID
+ String id = null;
+ if ((i = atts.getIndex(ATTR_id)) >= 0) {
+ id = atts.getValue(i);
+ }
+
+ //Create line
+ currentLine = currentTextRegion.createTextLine(id);
+
+ //Coords
+ if ((i = atts.getIndex(ATTR_title)) >= 0) {
+ Polygon coords = parseCoords(atts.getValue(i));
+ if (coords != null)
+ currentLine.setCoords(coords);
+ }
+ }
+
+ /**
+ * Parses the given 'span' node of class 'ocrx_word'.
+ */
+ private void handleWordElement(Attributes atts) {
+ int i;
+
+ if (page == null || currentLine == null)
+ return;
+
+ //ID
+ String id = null;
+ if ((i = atts.getIndex(ATTR_id)) >= 0) {
+ id = atts.getValue(i);
+ }
+
+ //Create word
+ currentWord = currentLine.createWord(id);
+
+ //Coords
+ if ((i = atts.getIndex(ATTR_title)) >= 0) {
+ Polygon coords = parseCoords(atts.getValue(i));
+ if (coords != null)
+ currentWord.setCoords(coords);
+ }
+ }
+
+ /**
+ * Parses a text encoded bounding box and returns a polygon.
+ * @return Box shaped polygon or null
+ */
+ private Polygon parseCoords(String coordsString) {
+ Polygon ret = null;
+ String parts[] = coordsString.split(" ");
+ if (parts.length == 5) {
+ ret = new Polygon();
+ int x1 = Integer.parseInt(parts[1]);
+ int y1 = Integer.parseInt(parts[2]);
+ int x2 = Integer.parseInt(parts[3]);
+ int y2 = Integer.parseInt(parts[4]);
+ ret.addPoint(x1,y1);
+ ret.addPoint(x2,y1);
+ ret.addPoint(x2,y2);
+ ret.addPoint(x1,y2);
+ }
+ return ret;
+ }
+
+
+}
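
Review note on `parseCoords`: the method above splits the whole `title` value on spaces and only accepts exactly five tokens, so a title that carries further properties (for example `bbox 10 20 30 40; x_wconf 93`) yields `null`. hOCR titles are lists of properties separated by semicolons, so a more tolerant approach is to split on `;` first and then look for the `bbox` property. The sketch below illustrates that under stated assumptions: `HocrTitleSketch` and `parseBbox` are hypothetical names, and the method returns a plain `int[]` rather than the library's `Polygon` so that it stays self-contained.

```java
/** Hypothetical helper for illustration only; not part of the PRImA library. */
final class HocrTitleSketch {

    /**
     * Extracts the bounding box from an hOCR 'title' attribute value.
     * @return {x1, y1, x2, y2}, or null if no well-formed 'bbox' property is present
     */
    static int[] parseBbox(String title) {
        if (title == null)
            return null;
        for (String part : title.split(";")) {
            String[] tokens = part.trim().split("\\s+");
            if (tokens.length == 5 && "bbox".equals(tokens[0])) {
                try {
                    return new int[] {
                        Integer.parseInt(tokens[1]), Integer.parseInt(tokens[2]),
                        Integer.parseInt(tokens[3]), Integer.parseInt(tokens[4])
                    };
                } catch (NumberFormatException e) {
                    return null; // Malformed coordinate
                }
            }
        }
        return null; // No bbox property found
    }
}
```

A caller could build the four corner points from the returned array in the same order `parseCoords` uses: (x1,y1), (x2,y1), (x2,y2), (x1,y2).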
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/schema/2009-03-16_pagecontent.xsd b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/schema/2009-03-16_pagecontent.xsd
new file mode 100644
index 00000000..7a0d2d64
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/schema/2009-03-16_pagecontent.xsd
@@ -0,0 +1,734 @@
+
+
+
+
+
+ Page Content - Ground Truth and Storage
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If the reading order element exists, all regions have to be covered (i.e. all region ids must be mentioned exactly once)!
+
+
+
+
+ If the layers element exists, all regions have to be covered (i.e. all region ids must be mentioned exactly once)!
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Pure text is represented as a text region. This includes drop capitals, but practically ornate text may be considered as a graphic.
+
+
+
+
+
+
+
+
+
+
+ Individual skew of the region in degrees (Range: -89.999,90)
+
+
+
+ The nature of the text in the region
+
+
+
+
+ The text colour of the region
+
+
+
+ The background colour of the region
+
+
+
+ Specifies whether the colour of the text appears reversed against a background colour
+
+
+
+ The size of the characters in points
+
+
+ The degree of space in points between the lines of text
+
+
+
+ The degree of space in points between the characters in a string of text
+
+
+
+ The direction in which text in a region should be read (within lines)
+
+
+
+
+ The degrees by which you need to turn your head in order to read the text when it is placed on the horizontal (Range: -89.999,90)
+
+
+
+ Defines whether a region of text is indented or not
+
+
+
+ The primary language used in the region
+
+
+
+ The secondary language used in the region
+
+
+
+
+ The primary script used in the region
+
+
+
+
+ The secondary script used in the region
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Text in a "simple" form (ASCII or extended ASCII as mostly used for typing).
+I.e. no use of special characters for ligatures (should be stored as two separate characters)
+etc.
+
+
+
+ Correct encoding of the original, always using the corresponding Unicode code point.
+I.e. ligatures have to be represented as one character
+etc.
+
+
+
+
+
+ An image is considered to be more intricate and complex than a graphic. These can be photos or drawings.
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the rectangle that encapsulates the region (Range: -89.999,90)
+
+
+
+ The colour bit depth required for the region
+
+
+
+
+ The background colour of the region
+
+
+
+
+ Specifies whether the region also contains text
+
+
+
+
+ A line drawing is a single colour illustration without solid areas.
+
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the rectangle that encapsulates the region (Range: -89.999,90)
+
+
+
+ The pen (foreground) colour of the region
+
+
+
+ The background colour of the region
+
+
+
+ Specifies whether the region also contains text
+
+
+
+
+
+ Regions containing simple graphics, such as a company logo, should be marked as graphic regions.
+
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the rectangle that encapsulates the region (Range: -89.999,90).
+
+
+
+ The type of graphic in the region
+
+
+
+ An approximation of the number of colours used in the region
+
+
+
+ Specifies whether the region also contains text.
+
+
+
+
+
+ Tabular data in any form is represented with a table region. Rows and columns may or may not have separator lines; these lines are not separator regions.
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the region (Range: -89.999,90).
+
+
+
+ The number of rows present in the table
+
+
+
+ The number of columns present in the table
+
+
+
+ The colour of the lines used in the region
+
+
+
+
+ The background colour of the region
+
+
+
+
+ Specifies the presence of line separators
+
+
+
+
+ Specifies whether the region also contains text
+
+
+
+
+
+ Regions containing charts or graphs of any type should be marked as chart regions.
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the rectangle that encapsulates the region (Range: -89.999,90)
+
+
+
+ The type of chart in the region
+
+
+
+ An approximation of the number of colours used in the region
+
+
+ The background colour of the region
+
+
+
+
+ Specifies whether the region also contains text
+
+
+
+
+ Separators are lines that lie between columns and paragraphs and can be used to logically separate different articles from each other.
+
+
+
+
+
+
+ The orientation in degrees of the region (Range: -89.999,90)
+
+
+
+ The colour of the separator
+
+
+
+
+
+ Regions containing equations and mathematical symbols should be marked as maths regions.
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the rectangle that encapsulates the region (Range: -89.999,90)
+
+
+
+ The background colour of the region
+
+
+
+
+ Noise regions are regions where no real data lies, only false data created by artifacts on the document or scanner noise.
+
+
+
+
+
+
+
+
+ To be used if the region type cannot be ascertained.
+
+
+
+
+
+
+
+
+
+ A region that surrounds other regions (e.g. a box with
+ blue background containing text regions)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Determines the effective area on the paper of a printed page. Its size is equal for all pages of a book (exceptions: titlepage, multipage pictures).
+It contains all living elements (except marginals) like body type, footnotes, headings, running titles.
+It does not contain pagenumber (if not part of running title), marginals, signature mark, preview words.
+
+
+
+
+
+
+
+
+
+ Definition of the reading order within the page. Without further grouping, regions are supposed to be unordered on the page.
+To express a reading order between elements they have to be included in an ordered group. Groups may contain further groups.
+
+
+
+
+
+
+
+
+
+
+ Numbered region
+
+
+
+ Position (order number) of this item within the current hierarchy level.
+
+
+
+
+
+
+ Indexed group containing ordered elements
+
+
+
+
+
+
+
+
+
+ Position (order number) of this item within the current hierarchy level.
+
+
+
+
+
+ Indexed group containing unordered elements
+
+
+
+
+
+
+
+
+
+
+ Position (order number) of this item within the
+ current hierarchy level.
+
+
+
+
+
+
+
+
+
+
+
+ Numbered group (contains ordered elements)
+
+
+
+
+
+
+
+
+
+
+
+ Numbered group (contains unordered elements)
+
+
+
+
+
+
+
+
+
+
+
+ Border of the actual page (if the scanned image contains parts not belonging to the page).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Can be used to express the z-index of overlapping
+ regions. An element with a greater z-index is always in
+ front of another element with lower z-index.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
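
Review note on the schema documentation above: it states that if a reading order element exists, all region ids must be mentioned exactly once. XML Schema validation alone may not enforce that, so a reader could check it in code. The following is a minimal sketch of such a check, assuming the region ids and the flattened list of reading order references have already been collected as strings. The class and method names are hypothetical and not part of the library.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of the PRImA library. */
final class ReadingOrderCoverageSketch {

    /**
     * Returns true if every region id is referenced exactly once
     * and no unknown id is referenced.
     */
    static boolean coversEachRegionOnce(Set<String> regionIds, List<String> readingOrderRefs) {
        Set<String> seen = new HashSet<>();
        for (String ref : readingOrderRefs) {
            // Reject references to unknown regions and repeated references
            if (!regionIds.contains(ref) || !seen.add(ref))
                return false;
        }
        // Every region must have been referenced
        return seen.size() == regionIds.size();
    }
}
```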
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/schema/2010-01-12_pagecontent.xsd b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/schema/2010-01-12_pagecontent.xsd
new file mode 100644
index 00000000..db966599
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/schema/2010-01-12_pagecontent.xsd
@@ -0,0 +1,734 @@
+
+
+
+
+
+ Page Content - Ground Truth and Storage
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If the reading order element exists, all regions have to be covered (i.e. all region ids must be mentioned exactly once)!
+
+
+
+
+ If the layers element exists, all regions have to be covered (i.e. all region ids must be mentioned exactly once)!
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Pure text is represented as a text region. This includes drop capitals, but practically ornate text may be considered as a graphic.
+
+
+
+
+
+
+
+
+
+
+ Individual skew of the region in degrees (Range: -89.999,90)
+
+
+
+ The nature of the text in the region
+
+
+
+
+ The text colour of the region
+
+
+
+ The background colour of the region
+
+
+
+ Specifies whether the colour of the text appears reversed against a background colour
+
+
+
+ The size of the characters in points
+
+
+ The degree of space in points between the lines of text
+
+
+
+ The degree of space in points between the characters in a string of text
+
+
+
+ The direction in which text in a region should be read (within lines)
+
+
+
+
+ The degrees by which you need to turn your head in order to read the text when it is placed on the horizontal (Range: -89.999,90)
+
+
+
+ Defines whether a region of text is indented or not
+
+
+
+ The primary language used in the region
+
+
+
+ The secondary language used in the region
+
+
+
+
+ The primary script used in the region
+
+
+
+
+ The secondary script used in the region
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Text in a "simple" form (ASCII or extended ASCII as mostly used for typing).
+I.e. no use of special characters for ligatures (should be stored as two separate characters)
+etc.
+
+
+
+ Correct encoding of the original, always using the corresponding Unicode code point.
+I.e. ligatures have to be represented as one character
+etc.
+
+
+
+
+
+ An image is considered to be more intricate and complex than a graphic. These can be photos or drawings.
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the rectangle that encapsulates the region (Range: -89.999,90)
+
+
+
+ The colour bit depth required for the region
+
+
+
+
+ The background colour of the region
+
+
+
+
+ Specifies whether the region also contains text
+
+
+
+
+ A line drawing is a single colour illustration without solid areas.
+
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the rectangle that encapsulates the region (Range: -89.999,90)
+
+
+
+ The pen (foreground) colour of the region
+
+
+
+ The background colour of the region
+
+
+
+ Specifies whether the region also contains text
+
+
+
+
+
+ Regions containing simple graphics, such as a company logo, should be marked as graphic regions.
+
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the rectangle that encapsulates the region (Range: -89.999,90).
+
+
+
+ The type of graphic in the region
+
+
+
+ An approximation of the number of colours used in the region
+
+
+
+ Specifies whether the region also contains text.
+
+
+
+
+
+ Tabular data in any form is represented with a table region. Rows and columns may or may not have separator lines; these lines are not separator regions.
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the region (Range: -89.999,90).
+
+
+
+ The number of rows present in the table
+
+
+
+ The number of columns present in the table
+
+
+
+ The colour of the lines used in the region
+
+
+
+
+ The background colour of the region
+
+
+
+
+ Specifies the presence of line separators
+
+
+
+
+ Specifies whether the region also contains text
+
+
+
+
+
+ Regions containing charts or graphs of any type should be marked as chart regions.
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the rectangle that encapsulates the region (Range: -89.999,90)
+
+
+
+ The type of chart in the region
+
+
+
+ An approximation of the number of colours used in the region
+
+
+ The background colour of the region
+
+
+
+
+ Specifies whether the region also contains text
+
+
+
+
+ Separators are lines that lie between columns and paragraphs and can be used to logically separate different articles from each other.
+
+
+
+
+
+
+ The orientation in degrees of the region (Range: -89.999,90)
+
+
+
+ The colour of the separator
+
+
+
+
+
+ Regions containing equations and mathematical symbols should be marked as maths regions.
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the rectangle that encapsulates the region (Range: -89.999,90)
+
+
+
+ The background colour of the region
+
+
+
+
+ Noise regions are regions where no real data lies, only false data created by artifacts on the document or scanner noise.
+
+
+
+
+
+
+
+
+ To be used if the region type cannot be ascertained.
+
+
+
+
+
+
+
+
+
+ A region that surrounds other regions (e.g. a box with
+ blue background containing text regions)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Determines the effective area on the paper of a printed page. Its size is equal for all pages of a book (exceptions: titlepage, multipage pictures).
+It contains all living elements (except marginals) like body type, footnotes, headings, running titles.
+It does not contain pagenumber (if not part of running title), marginals, signature mark, preview words.
+
+
+
+
+
+
+
+
+
+ Definition of the reading order within the page. Without further grouping, regions are supposed to be unordered on the page.
+To express a reading order between elements they have to be included in an ordered group. Groups may contain further groups.
+
+
+
+
+
+
+
+
+
+
+ Numbered region
+
+
+
+ Position (order number) of this item within the current hierarchy level.
+
+
+
+
+
+
+ Indexed group containing ordered elements
+
+
+
+
+
+
+
+
+
+ Position (order number) of this item within the current hierarchy level.
+
+
+
+
+
+ Indexed group containing unordered elements
+
+
+
+
+
+
+
+
+
+
+ Position (order number) of this item within the
+ current hierarchy level.
+
+
+
+
+
+
+
+
+
+
+
+ Numbered group (contains ordered elements)
+
+
+
+
+
+
+
+
+
+
+
+ Numbered group (contains unordered elements)
+
+
+
+
+
+
+
+
+
+
+
+ Border of the actual page (if the scanned image contains parts not belonging to the page).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Can be used to express the z-index of overlapping
+ regions. An element with a greater z-index is always in
+ front of another element with lower z-index.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/schema/2010-03-19_pagecontent.xsd b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/schema/2010-03-19_pagecontent.xsd
new file mode 100644
index 00000000..4f2f68e6
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/schema/2010-03-19_pagecontent.xsd
@@ -0,0 +1,738 @@
+
+
+
+
+
+ Page Content - Ground Truth and Storage
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ If the reading order element exists, all regions have to be covered (i.e. all region ids must be mentioned exactly once)!
+
+
+
+
+ If the layers element exists, all regions have to be covered (i.e. all region ids must be mentioned exactly once)!
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Pure text is represented as a text region. This includes drop capitals, but practically ornate text may be considered as a graphic.
+
+
+
+
+
+
+
+
+
+
+ Individual skew of the region in degrees (Range: -89.999,90)
+
+
+
+ The nature of the text in the region
+
+
+
+
+ The text colour of the region
+
+
+
+ The background colour of the region
+
+
+
+ Specifies whether the colour of the text appears reversed against a background colour
+
+
+
+ The size of the characters in points
+
+
+ The degree of space in points between the lines of text
+
+
+
+ The degree of space in points between the characters in a string of text
+
+
+
+ The direction in which text in a region should be read (within lines)
+
+
+
+
+ The degrees by which you need to turn your head in order to read the text when it is placed on the horizontal (Range: -89.999,90)
+
+
+
+ Defines whether a region of text is indented or not
+
+
+
+ The primary language used in the region
+
+
+
+ The secondary language used in the region
+
+
+
+
+ The primary script used in the region
+
+
+
+
+ The secondary script used in the region
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Text in a "simple" form (ASCII or extended ASCII as mostly used for typing).
+I.e. no use of special characters for ligatures (should be stored as two separate characters)
+etc.
+
+
+
+ Correct encoding of the original, always using the corresponding Unicode code point.
+I.e. ligatures have to be represented as one character
+etc.
+
+
+
+
+
+ An image is considered to be more intricate and complex than a graphic. These can be photos or drawings.
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the rectangle that encapsulates the region (Range: -89.999,90)
+
+
+
+ The colour bit depth required for the region
+
+
+
+
+ The background colour of the region
+
+
+
+
+ Specifies whether the region also contains text
+
+
+
+
+ A line drawing is a single colour illustration without solid areas.
+
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the rectangle that encapsulates the region (Range: -89.999,90)
+
+
+
+ The pen (foreground) colour of the region
+
+
+
+ The background colour of the region
+
+
+
+ Specifies whether the region also contains text
+
+
+
+
+
+ Regions containing simple graphics, such as a company logo, should be marked as graphic regions.
+
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the rectangle that encapsulates the region (Range: -89.999,90).
+
+
+
+ The type of graphic in the region
+
+
+
+ An approximation of the number of colours used in the region
+
+
+
+ Specifies whether the region also contains text.
+
+
+
+
+
+ Tabular data in any form is represented with a table region. Rows and columns may or may not have separator lines; these lines are not separator regions.
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the region (Range: -89.999,90).
+
+
+
+ The number of rows present in the table
+
+
+
+ The number of columns present in the table
+
+
+
+ The colour of the lines used in the region
+
+
+
+
+ The background colour of the region
+
+
+
+
+ Specifies the presence of line separators
+
+
+
+
+ Specifies whether the region also contains text
+
+
+
+
+
+ Regions containing charts or graphs of any type should be marked as chart regions.
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the rectangle that encapsulates the region (Range: -89.999,90)
+
+
+
+ The type of chart in the region
+
+
+
+ An approximation of the number of colours used in the region
+
+
+ The background colour of the region
+
+
+
+
+ Specifies whether the region also contains text
+
+
+
+
+ Separators are lines that lie between columns and paragraphs and can be used to logically separate different articles from each other.
+
+
+
+
+
+
+ The orientation in degrees of the region (Range: -89.999,90)
+
+
+
+ The colour of the separator
+
+
+
+
+
+ Regions containing equations and mathematical symbols should be marked as maths regions.
+
+
+
+
+
+
+ The orientation in degrees of the baseline of the rectangle that encapsulates the region (Range: -89.999,90)
+
+
+
+ The background colour of the region
+
+
+
+
+ Noise regions are regions where no real data lies, only false data created by artifacts on the document or scanner noise.
+
+
+
+
+
+
+
+
+ To be used if the region type cannot be ascertained.
+
+
+
+
+
+
+
+
+
+ A region that surrounds other regions (e.g. a box with
+ blue background containing text regions)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Determines the effective area on the paper of a printed page. Its size is equal for all pages of a book (exceptions: titlepage, multipage pictures).
+It contains all living elements (except marginals) like body type, footnotes, headings, running titles.
+It does not contain pagenumber (if not part of running title), marginals, signature mark, preview words.
+
+
+
+
+
+
+
+
+
+ Definition of the reading order within the page. To express a reading order between elements they have to be included in an OrderedGroup. Groups may contain further groups.
+
+
+
+
+
+
+
+
+
+ Numbered region
+
+
+
+ Position (order number) of this item within the current hierarchy level.
+
+
+
+
+
+
+ Indexed group containing ordered elements
+
+
+
+
+
+
+
+
+
+ Position (order number) of this item within the current hierarchy level.
+
+
+
+
+
+ Indexed group containing unordered elements
+
+
+
+
+
+
+
+
+
+
+ Position (order number) of this item within the
+ current hierarchy level.
+
+
+
+
+
+
+
+
+
+
+
+ Numbered group (contains ordered elements)
+
+
+
+
+
+
+
+
+
+
+
+ Numbered group (contains unordered elements)
+
+
+
+
+
+
+
+
+
+
+
+ Border of the actual page (if the scanned image contains parts not belonging to the page).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Can be used to express the z-index of overlapping
+ regions. An element with a greater z-index is always in
+ front of another element with lower z-index.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/schema/2013-07-15_pagecontent.xsd b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/schema/2013-07-15_pagecontent.xsd
new file mode 100644
index 00000000..b69d9bbd
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/io/xml/schema/2013-07-15_pagecontent.xsd
@@ -0,0 +1,1441 @@
+
+
+
+
+
+ Page Content - Ground Truth and Storage
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The timestamp has to be in UTC (Coordinated Universal Time) and not local time.
+
+
+ The timestamp has to be in UTC (Coordinated Universal Time) and not local time.
+
+
+
+
+
+
+
+
+ Alternative document page images (e.g.
+ black-and-white)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Unassigned regions are considered to be in the (virtual) default layer which is to be treated as below any other layers.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For generic use
+
+
+
+
+ Page type
+
+
+
+
+
+
+ Pure text is represented as a text region. This includes
+ drop capitals, but practically ornate text may be
+ considered as a graphic.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The angle the rectangle encapsulating a region has to be rotated in clockwise direction in order to correct the present skew (negative values indicate anti-clockwise rotation).
+Range: -179.999,180
+
+
+
+
+
+ The nature of the text in the region
+
+
+
+
+
+
+ The degree of space in points between the lines of
+ text (line spacing)
+
+
+
+
+
+
+ The direction in which text in a region should be
+ read (within lines)
+
+
+
+
+
+ The angle the baseline of text within a region has to be rotated (relative to the rectangle encapsulating the region) in clockwise direction in order to correct the present skew (negative values indicate anti-clockwise rotation).
+Range: -179.999,180
+
+
+
+
+
+ Defines whether a region of text is indented or not
+
+
+
+
+
+ Text align
+
+
+
+
+
+ The primary language used in the region
+
+
+
+
+
+
+ The secondary language used in the region
+
+
+
+
+
+
+ The primary script used in the region
+
+
+
+
+
+
+ The secondary script used in the region
+
+
+
+
+
+
+
+
+
+
+ Point list with format "x1,y1 x2,y2 ..."
+
+
+
+
+
+
+
+
+
+ Multiple connected points that mark the baseline
+ of the glyphs
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Overrides primaryLanguage attribute of parent text
+ region
+
+
+
+
+
+
+ Overrides the production attribute of the parent
+ text region
+
+
+
+
+
+ For generic use
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Overrides primaryLanguage attribute of parent line
+ and/or text region
+
+
+
+
+
+
+ Overrides the production attribute of the parent
+ text line and/or text region.
+
+
+
+
+
+ For generic use
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Overrides the production attribute of the parent
+ word / text line / text region.
+
+
+
+
+
+ For generic use
+
+
+
+
+
+
+
+
+
+ Text in a "simple" form (ASCII or extended ASCII
+ as mostly used for typing). I.e. no use of
+ special characters for ligatures (should be
+ stored as two separate characters) etc.
+
+
+
+
+
+
+ Correct encoding of the original, always using
+ the corresponding Unicode code point. I.e.
+ ligatures have to be represented as one
+ character etc.
+
+
+
+
+
+
+ OCR confidence value (between 0 and 1)
+
+
+
+
+
+
+
+
+
+
+
+
+ An image is considered to be more intricate and complex
+ than a graphic. These can be photos or drawings.
+
+
+
+
+
+
+ The angle the rectangle encapsulating a region has to be rotated in clockwise direction in order to correct the present skew (negative values indicate anti-clockwise rotation).
+Range: -179.999,180
+
+
+
+
+
+ The colour bit depth required for the region
+
+
+
+
+
+
+ The background colour of the region
+
+
+
+
+
+
+ Specifies whether the region also contains
+ text
+
+
+
+
+
+
+
+
+
+ A line drawing is a single colour illustration without
+ solid areas.
+
+
+
+
+
+
+ The angle the rectangle encapsulating a region has to be rotated in clockwise direction in order to correct the present skew (negative values indicate anti-clockwise rotation).
+Range: -179.999,180
+
+
+
+
+
+ The pen (foreground) colour of the region
+
+
+
+
+
+
+ The background colour of the region
+
+
+
+
+
+
+ Specifies whether the region also contains
+ text
+
+
+
+
+
+
+
+
+
+ Regions containing simple graphics, such as a company
+ logo, should be marked as graphic regions.
+
+
+
+
+
+
+ The angle the rectangle encapsulating a region has to be rotated in clockwise direction in order to correct the present skew (negative values indicate anti-clockwise rotation).
+Range: -179.999,180
+
+
+
+
+
+ The type of graphic in the region
+
+
+
+
+
+
+ An approximation of the number of colours
+ used in the region
+
+
+
+
+
+
+ Specifies whether the region also contains
+ text.
+
+
+
+
+
+
+
+
+
+ Tabular data in any form is represented with a table
+ region. Rows and columns may or may not have separator
+ lines; these lines are not separator regions.
+
+
+
+
+
+
+ The angle the rectangle encapsulating a region has to be rotated in clockwise direction in order to correct the present skew (negative values indicate anti-clockwise rotation).
+Range: -179.999,180
+
+
+
+
+
+ The number of rows present in the table
+
+
+
+
+
+
+ The number of columns present in the table
+
+
+
+
+
+
+ The colour of the lines used in the region
+
+
+
+
+
+
+ The background colour of the region
+
+
+
+
+
+
+ Specifies the presence of line separators
+
+
+
+
+
+
+ Specifies whether the region also contains
+ text
+
+
+
+
+
+
+
+
+
+ Regions containing charts or graphs of any type should
+ be marked as chart regions.
+
+
+
+
+
+
+ The angle the rectangle encapsulating a region has to be rotated in clockwise direction in order to correct the present skew (negative values indicate anti-clockwise rotation).
+Range: -179.999,180
+
+
+
+
+
+ The type of chart in the region
+
+
+
+
+
+
+ An approximation of the number of colours
+ used in the region
+
+
+
+
+
+
+ The background colour of the region
+
+
+
+
+
+
+ Specifies whether the region also contains
+ text
+
+
+
+
+
+
+
+
+
+ Separators are lines that lie between columns and
+ paragraphs and can be used to logically separate
+ different articles from each other.
+
+
+
+
+
+
+ The angle the rectangle encapsulating a region has to be rotated in clockwise direction in order to correct the present skew (negative values indicate anti-clockwise rotation).
+Range: -179.999,180
+
+
+
+
+
+ The colour of the separator
+
+
+
+
+
+
+
+
+
+ Regions containing equations and mathematical symbols
+ should be marked as maths regions.
+
+
+
+
+
+
+ The angle the rectangle encapsulating a region has to be rotated in clockwise direction in order to correct the present skew (negative values indicate anti-clockwise rotation).
+Range: -179.999,180
+
+
+
+
+
+ The background colour of the region
+
+
+
+
+
+
+
+
+
+ Regions containing chemical formulas.
+
+
+
+
+
+
+
+ The angle the rectangle encapsulating a
+ region has to be rotated in clockwise
+ direction in order to correct the present
+ skew (negative values indicate
+ anti-clockwise rotation). Range:
+ -179.999,180
+
+
+
+
+
+
+
+ The background colour of the region
+
+
+
+
+
+
+
+
+
+
+ Regions containing musical notations.
+
+
+
+
+
+
+ The angle the rectangle encapsulating a region has to be rotated in clockwise direction in order to correct the present skew (negative values indicate anti-clockwise rotation).
+Range: -179.999,180
+
+
+
+
+
+ The background colour of the region
+
+
+
+
+
+
+
+
+
+ Regions containing advertisements.
+
+
+
+
+
+
+ The angle the rectangle encapsulating a region has to be rotated in clockwise direction in order to correct the present skew (negative values indicate anti-clockwise rotation).
+Range: -179.999,180
+
+
+
+
+
+
+ The background colour of the region
+
+
+
+
+
+
+
+
+
+ Noise regions are regions where no real data lies, only
+ false data created by artifacts on the document or
+ scanner noise.
+
+
+
+
+
+
+
+
+
+ To be used if the region type cannot be ascertained.
+
+
+
+
+
+
+
+
+
+ Determines the effective area on the paper of a printed page. Its size is equal for all pages of a book (exceptions: titlepage, multipage pictures).
+It contains all living elements (except marginals) like body type, footnotes, headings, running titles.
+It does not contain pagenumber (if not part of running title), marginals, signature mark, preview words.
+
+
+
+
+
+
+
+
+
+ Definition of the reading order within the page. To express a reading order between elements they have to be included in an OrderedGroup. Groups may contain further groups.
+
+
+
+
+
+
+
+
+
+ Numbered region
+
+
+
+ Position (order number) of this item within the current hierarchy level.
+
+
+
+
+
+
+
+ Indexed group containing ordered elements
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Position (order number) of this item within the
+ current hierarchy level.
+
+
+
+
+
+
+
+
+
+ Indexed group containing unordered elements
+
+
+
+
+
+
+
+
+
+
+
+
+ Position (order number) of this item within the
+ current hierarchy level.
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Numbered group (contains ordered elements)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Numbered group (contains unordered elements)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Border of the actual page (if the scanned image contains parts not belonging to the page).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Can be used to express the z-index of overlapping
+ regions. An element with a greater z-index is always in
+ front of another element with lower z-index.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Point list with format "x1,y1 x2,y2 ..."
+
+
+
+
+
+
+
+
+
+ Container for one-to-one relations between layout
+ objects (for example: DropCap - paragraph, caption -
+ image)
+
+
+
+
+
+
+
+
+
+
+ One-to-one relation between two layout objects. Use 'link'
+ for loose relations and 'join' for strong relations
+ (where something is fragmented for instance).
+
+ Examples for 'link':
+   caption - image
+   floating - paragraph
+   paragraph - paragraph (when a paragraph is split across
+     columns and the last word of the first paragraph DOES NOT
+     continue in the second paragraph)
+   drop-cap - paragraph (when the drop-cap is a whole word)
+
+ Examples for 'join':
+   word - word (separated word at the end of a line)
+   drop-cap - paragraph (when the drop-cap is not a whole word)
+   paragraph - paragraph (when a paragraph is split across
+     columns and the last word of the first paragraph DOES
+     continue in the second paragraph)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For generic use
+
+
+
+
+
+ Text production type
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Monospace (fixed-pitch, non-proportional) or
+ proportional font
+
+
+
+
+
+
+ For instance: Arial, Times New Roman. Add more
+ information if necessary (e.g. blackletter,
+ antiqua).
+
+
+
+
+
+
+ Serif or sans-serif typeface
+
+
+
+
+
+
+
+ The size of the characters in points
+
+
+
+
+
+
+ The degree of space (in points) between the
+ characters in a string of text
+
+
+
+
+
+
+ Background colour
+
+
+
+
+
+ Specifies whether the colour of the text appears
+ reversed against a background colour
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ For generic use
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
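Note on the schema above: its point lists are documented as having the format "x1,y1 x2,y2 ...". A minimal sketch of reading such a string follows. It assumes integer pixel coordinates and whitespace-separated pairs; the class `PointListSketch` and the method `parsePoints` are hypothetical names for illustration and are not part of this patch.

```java
import java.util.ArrayList;
import java.util.List;

public class PointListSketch {

    /**
     * Parses a point list of the form "x1,y1 x2,y2 ..." into {x, y} pairs.
     * Assumption: coordinates are integers (not confirmed by the schema text shown here).
     */
    public static List<int[]> parsePoints(String points) {
        List<int[]> result = new ArrayList<int[]>();
        if (points == null || points.trim().isEmpty())
            return result;
        // Points are separated by whitespace, the two coordinates by a comma
        for (String pair : points.trim().split("\\s+")) {
            String[] xy = pair.split(",");
            if (xy.length != 2)
                throw new IllegalArgumentException("Malformed point: " + pair);
            result.add(new int[] { Integer.parseInt(xy[0]), Integer.parseInt(xy[1]) });
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> pts = parsePoints("10,20 30,40 50,60");
        System.out.println(pts.size() + " points, first = " + pts.get(0)[0] + "," + pts.get(0)[1]);
    }
}
```

Rejecting malformed pairs with an exception is one possible choice; a lenient reader could skip them instead.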
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/layout/GeometricObjectImpl.java b/java/PrimaDla/src/org/primaresearch/dla/page/layout/GeometricObjectImpl.java
new file mode 100644
index 00000000..86462f73
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/layout/GeometricObjectImpl.java
@@ -0,0 +1,52 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.layout;
+
+import org.primaresearch.dla.page.layout.shared.GeometricObject;
+import org.primaresearch.maths.geometry.Polygon;
+
+/**
+ * Basic implementation of an object that can be located on the document page.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class GeometricObjectImpl implements GeometricObject {
+
+ private Polygon coords = null;
+
+ /**
+ * Constructor
+ * @param coords The polygon locating the object on the page.
+ * @throws IllegalArgumentException if the passed polygon is null.
+ */
+ public GeometricObjectImpl(Polygon coords) throws IllegalArgumentException {
+ if (coords == null)
+ throw new IllegalArgumentException("GeometricObjectImpl requires a polygon");
+ this.coords = coords;
+ }
+
+ @Override
+ public Polygon getCoords() {
+ return coords;
+ }
+
+ @Override
+ public void setCoords(Polygon coords) {
+ this.coords = coords;
+ }
+
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/layout/GeometricObjectPositionComparator.java b/java/PrimaDla/src/org/primaresearch/dla/page/layout/GeometricObjectPositionComparator.java
new file mode 100644
index 00000000..97848d7f
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/layout/GeometricObjectPositionComparator.java
@@ -0,0 +1,76 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.layout;
+
+import java.util.Comparator;
+
+import org.primaresearch.dla.page.layout.shared.GeometricObject;
+
+/**
+ * Comparator to sort geometric objects by bounding box position (left or top).
+ * Use the getInstance method to get a static instance of the comparator.
+ *
+ * @author Christian Clausner
+ */
+public class GeometricObjectPositionComparator implements Comparator<GeometricObject> {
+
+ private static GeometricObjectPositionComparator instanceToSortByX = null;
+ private static GeometricObjectPositionComparator instanceToSortByY = null;
+
+ private boolean sortByX;
+
+ /**
+ * Constructor
+ * @param sortByX Set to true to sort objects left-to-right or to false to sort top-to-bottom
+ */
+ private GeometricObjectPositionComparator(boolean sortByX) {
+ this.sortByX = sortByX;
+ }
+
+ /**
+ * Creates a comparator
+ * @param sortByX Set to true to sort objects left-to-right or to false to sort top-to-bottom
+ * @return Comparator object
+ */
+ public static GeometricObjectPositionComparator getInstance(boolean sortByX) {
+ if (sortByX) {
+ if (instanceToSortByX == null)
+ instanceToSortByX = new GeometricObjectPositionComparator(true);
+ return instanceToSortByX;
+ }
+ else { //sort by y
+ if (instanceToSortByY == null)
+ instanceToSortByY = new GeometricObjectPositionComparator(false);
+ return instanceToSortByY;
+ }
+ }
+
+
+ @Override
+ public int compare(GeometricObject obj1, GeometricObject obj2) {
+ if (sortByX) {
+ int x1 = obj1.getCoords().getBoundingBox().left;
+ int x2 = obj2.getCoords().getBoundingBox().left;
+ return x1 - x2;
+ } else {
+ int y1 = obj1.getCoords().getBoundingBox().top;
+ int y2 = obj2.getCoords().getBoundingBox().top;
+ return y1 - y2;
+ }
+ }
+
+
+}
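Note on the comparator above: the sketch below shows the same left-to-right / top-to-bottom ordering in isolation. `Box` is a hypothetical stand-in for the bounding box of a geometric object, and `byPosition` is an illustrative name; neither is part of this patch. The sketch uses `Integer.compare` rather than subtraction, which is an alternative that stays correct even for extreme coordinate values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class PositionSortSketch {

    /** Hypothetical stand-in for a bounding box; only left and top are needed here. */
    public static class Box {
        public final int left;
        public final int top;

        public Box(int left, int top) {
            this.left = left;
            this.top = top;
        }
    }

    /** Orders boxes left-to-right (sortByX == true) or top-to-bottom (sortByX == false). */
    public static Comparator<Box> byPosition(final boolean sortByX) {
        return new Comparator<Box>() {
            @Override
            public int compare(Box a, Box b) {
                // Integer.compare cannot overflow, unlike a.left - b.left
                return sortByX ? Integer.compare(a.left, b.left)
                               : Integer.compare(a.top, b.top);
            }
        };
    }

    public static void main(String[] args) {
        List<Box> boxes = new ArrayList<Box>(Arrays.asList(
                new Box(300, 10), new Box(20, 500), new Box(150, 250)));
        Collections.sort(boxes, byPosition(true));
        System.out.println("leftmost: " + boxes.get(0).left);
        Collections.sort(boxes, byPosition(false));
        System.out.println("topmost: " + boxes.get(0).top);
    }
}
```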
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/layout/PageLayout.java b/java/PrimaDla/src/org/primaresearch/dla/page/layout/PageLayout.java
new file mode 100644
index 00000000..391cf039
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/layout/PageLayout.java
@@ -0,0 +1,638 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.layout;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.HashSet;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Set;
+
+import org.primaresearch.collections.IndexedMap;
+import org.primaresearch.collections.IndexedMapImpl;
+import org.primaresearch.dla.page.layout.logical.ContentObjectRelation;
+import org.primaresearch.dla.page.layout.logical.ContentObjectRelation.RelationType;
+import org.primaresearch.dla.page.layout.logical.Group;
+import org.primaresearch.dla.page.layout.logical.GroupMember;
+import org.primaresearch.dla.page.layout.logical.Layer;
+import org.primaresearch.dla.page.layout.logical.Layers;
+import org.primaresearch.dla.page.layout.logical.ReadingOrder;
+import org.primaresearch.dla.page.layout.logical.RegionRef;
+import org.primaresearch.dla.page.layout.logical.Relations;
+import org.primaresearch.dla.page.layout.physical.ContentFactory;
+import org.primaresearch.dla.page.layout.physical.ContentIterator;
+import org.primaresearch.dla.page.layout.physical.ContentObject;
+import org.primaresearch.dla.page.layout.physical.Region;
+import org.primaresearch.dla.page.layout.physical.RegionContainer;
+import org.primaresearch.dla.page.layout.physical.impl.LowLevelTextObjectIterator;
+import org.primaresearch.dla.page.layout.physical.impl.RegionIterator;
+import org.primaresearch.dla.page.layout.physical.shared.ContentType;
+import org.primaresearch.dla.page.layout.physical.shared.LowLevelTextType;
+import org.primaresearch.dla.page.layout.physical.shared.RegionType;
+import org.primaresearch.dla.page.layout.physical.text.LowLevelTextContainer;
+import org.primaresearch.dla.page.layout.physical.text.LowLevelTextObject;
+import org.primaresearch.dla.page.layout.shared.GeometricObject;
+import org.primaresearch.ident.Id;
+import org.primaresearch.ident.IdRegister.InvalidIdException;
+import org.primaresearch.maths.geometry.Dimension;
+import org.primaresearch.maths.geometry.Polygon;
+
+/**
+ * Class representing the layout and text content of a document page.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class PageLayout {
+
+ private Dimension size = new Dimension();
+
+ private GeometricObject border = null;
+ private GeometricObject printSpace = null;
+
+ private ContentFactory contentFactory;
+
+ private IndexedMap<Id, Region> regions = new IndexedMapImpl<Id, Region>();
+
+ private ReadingOrder readingOrder = null;
+ private Layers layers = null;
+ private Relations relations = null;
+
+ private static Comparator<ContentObject> contentObjectSizeComparator = null;
+
+ /**
+ * Constructor
+ * @param contentFactory Internal factory to create content such as regions, lines, reading order, ...
+ */
+ public PageLayout(ContentFactory contentFactory) {
+ this.contentFactory = contentFactory;
+ }
+
+ /**
+ * Creates a region of the specified type.
+ * @param type Type of region (e.g. TextRegion, ImageRegion, ...)
+ * @return Region object
+ */
+ public Region createRegion(RegionType type) {
+ return createRegion(type, null);
+ }
+
+ /**
+ * Creates a region of the specified type.
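+ * @param type Type of region (e.g. TextRegion, ImageRegion, ...)
+ * @param id Preferred ID for the region (not guaranteed, check the returned region for the actual ID)
+ * @return Region object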
+ */
+ public Region createRegion(RegionType type, String id) {
+ return createRegion(type, id, null);
+ }
+
+ /**
+ * Creates a region of the specified type.
+ * @param type Type of region (e.g. TextRegion, ImageRegion, ...)
+ * @param id Preferred ID for the region (not guaranteed, check the returned region for the actual ID)
+ * @param parentRegion Parent region (for nesting of regions)
+ * @return Region object
+ */
+ public Region createRegion(RegionType type, String id, RegionContainer parentRegion) {
+ Region reg = (Region)contentFactory.createContent(type);
+ if (id != null) {
+ try {
+ reg.setId(id);
+ } catch (InvalidIdException e) {
+ e.printStackTrace();
+ }
+ }
+ if (parentRegion == null )
+ regions.put(reg.getId(), reg);
+ else
+ parentRegion.addRegion(reg);
+ return reg;
+ }
+
+ /**
+ * Returns the number of regions in this page layout.
+ */
+ public int getRegionCount() {
+ return regions.size();
+ }
+
+ /**
+ * Returns the region at the specified index.
+ * @throws IndexOutOfBoundsException
+ */
+ public Region getRegion(int index) {
+ return regions.getAt(index);
+ }
+
+ /**
+ * Returns the region with the given ID.
+ */
+ public Region getRegion(Id regionId) {
+ if (regionId == null)
+ return null;
+ return regions.get(regionId);
+ }
+
+ /**
+ * Returns the region with the given ID.
+ */
+ public Region getRegion(String regionId) {
+ if (regionId == null)
+ return null;
+ try {
+ return regions.get(contentFactory.getIdRegister().getId(regionId));
+ } catch (InvalidIdException e) {
+ return null;
+ }
+ }
+
+ /**
+ * Looks for a region at the given position within the document page.
+ * @return A region object or null.
+ */
+ public Region getRegionAt(int x, int y) {
+ List<Region> candidates = new LinkedList<Region>();
+ for (ContentIterator it = this.iterator(null); it.hasNext(); ) {
+ ContentObject region = it.next();
+ if (region.getCoords() != null) {
+ Polygon coords = region.getCoords();
+ if (coords.isPointInside(x, y)) {
+ candidates.add((Region)region);
+ }
+ }
+ }
+
+ if (candidates.size() > 1) {
+ //If multiple candidates -> sort by size and return smallest
+ Collections.sort(candidates, getContentObjectSizeComparator());
+ }
+ if (!candidates.isEmpty())
+ return candidates.get(0);
+
+ /*for (int i=0; inull if it could not be found
+ */
+ public ContentObject getObject(ContentType type, String id) {
+ if (type instanceof RegionType)
+ return getRegion(id);
+ else if (type instanceof LowLevelTextType) { //Text lines, word, glyph
+ for (int i=0; inull
+ */
+ public ContentObjectRelation getParentChildRelation(ContentType childType, String childId) {
+ if (childType instanceof RegionType)
+ return null;
+ else if (childType instanceof LowLevelTextType) { //Text lines, word, glyph
+ for (int i=0; itrue the ID will be removed from the ID register and is free to be used again
+ */
+ public void removeRegion(Id regionId, boolean unregisterId) {
+ if (regionId == null)
+ return;
+ regions.remove(regionId);
+ if (unregisterId)
+ this.contentFactory.getIdRegister().unregisterId(regionId);
+ }
+
+ /**
+ * Removes the region at the specified index from the page layout.
+ * @throws IndexOutOfBoundsException
+ */
+ public void removeRegion(int index) {
+ removeRegion(index, false);
+ }
+
+ /**
+ * Removes the region at the specified index from the page layout.
+ * @param unregisterId If set to true the ID will be removed from the ID register and is free to be used again
+ * @throws IndexOutOfBoundsException
+ */
+ public void removeRegion(int index, boolean unregisterId) {
+ Region reg = regions.removeAt(index);
+ if (unregisterId && reg != null)
+ this.contentFactory.getIdRegister().unregisterId(reg.getId());
+ }
+
+ /**
+ * Returns a sorted list of all regions. The sorting is primarily done by reading order and secondarily by y position.
+ * @return List of region objects
+ */
+ public List<Region> getRegionsSorted() {
+ List<Region> sortedRegions = new ArrayList<Region>(this.getRegionCount());
+
+ List<Region> notInReadingOrder = new ArrayList<Region>();
+
+ if (readingOrder != null) {
+ addRegionsFromReadingOrder(readingOrder.getRoot(), sortedRegions);
+
+ //Save ids in a set for fast lookup
+ Set idSet = new HashSet();
+ for (int i=0; i list) {
+ if (group == null)
+ return;
+ //Children
+ for (int i=0; inull for an iterator that includes all regions.
+ * @return The iterator
+ */
+ public ContentIterator iterator(ContentType contentType) {
+ return iterator(contentType, null);
+ }
+
+ /**
+ * Returns a new iterator for a specific page content type.
+ * @param contentType A specific region type or low level text object type. Use null for an iterator that includes all regions.
+ * @param layer Restrict the iterator to this layer (use null for no restriction)
+ * @return The iterator
+ */
+ public ContentIterator iterator(ContentType contentType, Layer layer) {
+ if (contentType == null || contentType instanceof RegionType)
+ return new RegionIterator(this, (RegionType)contentType, layer);
+ else if (contentType instanceof LowLevelTextType)
+ return new LowLevelTextObjectIterator(this, (LowLevelTextType)contentType, layer);
+ throw new IllegalArgumentException("Unsupported content type for iterator");
+ }
+
+ /**
+ * Creates a comparator using the bounding box area of content objects
+ * @return Comparator object
+ */
+ private static Comparator<ContentObject> getContentObjectSizeComparator() {
+ if (contentObjectSizeComparator == null) {
+ contentObjectSizeComparator = new Comparator<ContentObject>() {
+ @Override
+ public int compare(ContentObject o1, ContentObject o2) {
+ if (o1 == null || o2 == null || o1.getCoords() == null || o2.getCoords() == null
+ || o1.getCoords().getSize() < 3 || o2.getCoords().getSize() < 3)
+ return 0;
+ //Bounding box area = width * height
+ return new Integer(o1.getCoords().getBoundingBox().getWidth()
+ * o1.getCoords().getBoundingBox().getHeight()).compareTo(
+ new Integer(o2.getCoords().getBoundingBox().getWidth()
+ * o2.getCoords().getBoundingBox().getHeight()));
+ }
+ };
+ }
+ return contentObjectSizeComparator;
+ }
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/ChainConverter.java b/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/ChainConverter.java
new file mode 100644
index 00000000..9f7af58a
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/ChainConverter.java
@@ -0,0 +1,80 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.layout.converter;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import org.primaresearch.dla.page.layout.PageLayout;
+import org.primaresearch.io.FormatVersion;
+
+/**
+ * Meta converter representing a chain of converters.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class ChainConverter implements LayoutConverter {
+
+ private List<LayoutConverter> converters = new ArrayList<LayoutConverter>();
+
+ /**
+ * Adds a converter to the chain
+ * @param converter Converter object
+ */
+ public void addConverter(LayoutConverter converter) {
+ if (!converters.isEmpty() && !converters.get(converters.size()-1).getTargetVersion().equals(converter.getSourceVersion()))
+ throw new IllegalArgumentException("Source format version of given converter doesn't match the target format version of the last converter in the chain.");
+ converters.add(converter);
+ }
+
+ @Override
+ public FormatVersion getSourceVersion() {
+ return converters.get(0).getSourceVersion();
+ }
+
+ @Override
+ public FormatVersion getTargetVersion() {
+ return converters.get(converters.size()-1).getTargetVersion();
+ }
+
+ @Override
+ public List convert(PageLayout layout) {
+ List messages = new ArrayList();
+
+ for (int i=0; i<converters.size(); i++) {
+ List<ConversionMessage> localMsg = converters.get(i).convert(layout);
+ if (localMsg != null)
+ messages.addAll(localMsg);
+ }
+
+ return messages;
+ }
+
+ @Override
+ public List checkForCompliance(PageLayout layout) {
+ List messages = new ArrayList();
+
+ for (int i=0; i<converters.size(); i++) {
+ List<ConversionMessage> localMsg = converters.get(i).checkForCompliance(layout);
+ if (localMsg != null)
+ messages.addAll(localMsg);
+ }
+
+ return messages;
+ }
+
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/ConversionMessage.java b/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/ConversionMessage.java
new file mode 100644
index 00000000..cd128cd5
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/ConversionMessage.java
@@ -0,0 +1,65 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.layout.converter;
+
+/**
+ * Format conversion related message
+ *
+ * @author Christian Clausner
+ *
+ */
+public class ConversionMessage {
+
+ public static final int CONVERSION_GENERAL = 0;
+ public static final int CONVERSION_RESET_INVALID_ATTRIBUTE = 1;
+ public static final int CONVERSION_ADD_REQUIRED_REGION = 2;
+
+
+ private String text;
+ private int code;
+
+ /**
+ * Constructor for general message
+ * @param text Message content
+ */
+ public ConversionMessage(String text) {
+ this(text, CONVERSION_GENERAL);
+ }
+
+ /**
+ * Constructor for specific message code
+ * @param text Message content
+ * @param code Message code (see CONVERSION_... constants)
+ */
+ public ConversionMessage(String text, int code) {
+ this.text = text;
+ this.code = code;
+ }
+
+ /**
+ * Returns the message content
+ */
+ public String getText() {
+ return text;
+ }
+
+ /**
+ * Returns the message code (see CONVERSION_... constants)
+ */
+ public int getCode() {
+ return code;
+ }
+
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/ConverterHub.java b/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/ConverterHub.java
new file mode 100644
index 00000000..699fcdf4
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/ConverterHub.java
@@ -0,0 +1,226 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.layout.converter;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.primaresearch.dla.page.Page;
+import org.primaresearch.dla.page.layout.PageLayout;
+import org.primaresearch.dla.page.layout.physical.ContentObject;
+import org.primaresearch.dla.page.layout.physical.Region;
+import org.primaresearch.dla.page.layout.physical.RegionContainer;
+import org.primaresearch.dla.page.layout.physical.text.LowLevelTextContainer;
+import org.primaresearch.dla.page.layout.physical.text.LowLevelTextObject;
+import org.primaresearch.io.FormatModel;
+import org.primaresearch.io.FormatVersion;
+import org.primaresearch.shared.variable.Variable;
+import org.primaresearch.shared.variable.VariableMap;
+
+/**
+ * Central access point for converting page objects to specific format versions.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class ConverterHub {
+
+ /** Singleton instance */
+ private static ConverterHub instance = null;
+
+
+ /** Registered converters */
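+ private Map<FormatVersion, Map<FormatVersion, LayoutConverter>> layoutConverters = new HashMap<FormatVersion, Map<FormatVersion, LayoutConverter>>();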
+
+ /**
+ * Private constructor (Singleton)
+ */
+ private ConverterHub() {
+ //Register layout converters
+ addConverter(new Converter_2010_03_19_to_2010_01_12());
+ addConverter(new Converter_2010_01_12_to_2009_03_16());
+ addConverter(new Converter_2013_07_15_to_2010_03_19());
+
+ //TODO If more schemas are added, we could dynamically create chain converters.
+ //2010-03-19 to 2009-03-16
+ ChainConverter chain_2010_03_19_to_2009_03_16 = new ChainConverter();
+ chain_2010_03_19_to_2009_03_16.addConverter(new Converter_2010_03_19_to_2010_01_12());
+ chain_2010_03_19_to_2009_03_16.addConverter(new Converter_2010_01_12_to_2009_03_16());
+ addConverter(chain_2010_03_19_to_2009_03_16);
+
+ //2013-07-15 to 2010-01-12
+ ChainConverter chain_2013_07_15_to_2010_01_12 = new ChainConverter();
+ chain_2013_07_15_to_2010_01_12.addConverter(new Converter_2013_07_15_to_2010_03_19());
+ chain_2013_07_15_to_2010_01_12.addConverter(new Converter_2010_03_19_to_2010_01_12());
+ addConverter(chain_2013_07_15_to_2010_01_12);
+
+ //2013-07-15 to 2009-03-16
+ ChainConverter chain_2013_07_15_to_2009_03_16 = new ChainConverter();
+ chain_2013_07_15_to_2009_03_16.addConverter(new Converter_2013_07_15_to_2010_03_19());
+ chain_2013_07_15_to_2009_03_16.addConverter(new Converter_2010_03_19_to_2010_01_12());
+ chain_2013_07_15_to_2009_03_16.addConverter(new Converter_2010_01_12_to_2009_03_16());
+ addConverter(chain_2013_07_15_to_2009_03_16);
+
+ }
+
+ /**
+ * Returns singleton instance
+ */
+ public static ConverterHub getInstance() {
+ if (instance == null)
+ instance = new ConverterHub();
+ return instance;
+ }
+
+ /**
+ * Registers a converter.
+ */
+ private void addConverter(LayoutConverter converter) {
+ Map<FormatVersion, LayoutConverter> targets = layoutConverters.get(converter.getSourceVersion());
+ if (targets == null) {
+ targets = new HashMap<FormatVersion, LayoutConverter>();
+ layoutConverters.put(converter.getSourceVersion(), targets);
+ }
+ targets.put(converter.getTargetVersion(), converter);
+ }
+
+ /**
+ * Converts the given page (layout) to the specified target format (might change attributes, attribute values and attribute constraints).
+ *
+ * @param page Page object containing the layout
+ * @param targetModel Target model for a specific format version
+ * @return A list of conversion messages or null.
+ */
+ public static List convert(Page page, FormatModel targetModel) {
+ FormatVersion sourceVersion = page.getFormatVersion();
+ if (sourceVersion == null || sourceVersion.equals(targetModel.getVersion()))
+ return null;
+
+ //Layout conversion
+ List messages = null;
+ ConverterHub instance = getInstance();
+
+ LayoutConverter layoutConverter = instance.findConverter(page.getFormatVersion(), targetModel.getVersion());
+
+ if (layoutConverter != null)
+ messages = layoutConverter.convert(page.getLayout());
+
+ //Adapt existing attributes and constraints
+ adaptAttributes(page.getLayout(), targetModel);
+
+ return messages;
+ }
+
+ private static void adaptAttributes(PageLayout layout, FormatModel model) {
+ Map templates = model.getTypeAttributeTemplates();
+
+ for (int i=0; i templates) {
+ for (int i=0; i templates) {
+ for (int i=0; i templates) {
+ VariableMap attributes = obj.getAttributes();
+ if (attributes == null || attributes.getType() == null)
+ return;
+ VariableMap template = templates.get(attributes.getType());
+ if (template != null) {
+ //Remove not supported attributes and update constraints
+ for (int i=0; i checkForCompliance(Page page, FormatVersion targetVersion) {
+ FormatVersion sourceVersion = page.getFormatVersion();
+ if (sourceVersion == null || sourceVersion.equals(targetVersion))
+ return null;
+
+ List messages = null;
+ ConverterHub instance = getInstance();
+
+ LayoutConverter converter = instance.findConverter(page.getFormatVersion(), targetVersion);
+
+ if (converter != null)
+ messages = converter.checkForCompliance(page.getLayout());
+
+ return messages;
+ }
+
+ //TODO If more schemas are added, we could dynamically create chain converters.
+ /**
+ * Tries to find a converter matching the given source and target versions.
+ * @return Converter object or null
+ */
+ private LayoutConverter findConverter(FormatVersion source, FormatVersion target) {
+ Map<FormatVersion, LayoutConverter> targets = layoutConverters.get(source);
+ if (targets == null)
+ return null;
+ LayoutConverter conv = targets.get(target);
+ return conv;
+ }
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/Converter_2010_01_12_to_2009_03_16.java b/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/Converter_2010_01_12_to_2009_03_16.java
new file mode 100644
index 00000000..c270665d
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/Converter_2010_01_12_to_2009_03_16.java
@@ -0,0 +1,82 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.layout.converter;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import org.primaresearch.dla.page.layout.PageLayout;
+import org.primaresearch.dla.page.layout.physical.ContentObject;
+import org.primaresearch.dla.page.layout.physical.Region;
+import org.primaresearch.dla.page.layout.physical.shared.RegionType;
+import org.primaresearch.ident.IdRegister.InvalidIdException;
+import org.primaresearch.io.FormatVersion;
+import org.primaresearch.io.xml.XmlFormatVersion;
+
+/**
+ * Converter for 2010-01-12 format to 2009-03-16 format.
+ *
+ * <ul>
+ * <li>Adds a temporary region if there is no region (at least one region is required)</li>
+ * </ul>
+ *
+ * @author Christian Clausner
+ *
+ */
+public class Converter_2010_01_12_to_2009_03_16 implements LayoutConverter {
+
+ @Override
+ public FormatVersion getSourceVersion() {
+ return new XmlFormatVersion("2010-01-12");
+ }
+
+ @Override
+ public FormatVersion getTargetVersion() {
+ return new XmlFormatVersion("2009-03-16");
+ }
+
+ @Override
+ public List convert(PageLayout layout) {
+ return run(layout, false);
+ }
+
+ @Override
+ public List checkForCompliance(PageLayout layout) {
+ return run(layout, true);
+ }
+
+ /**
+ * Runs check or conversion
+ * @param checkOnly If true, no conversion is carried out (dry run).
+ */
+ public List run(PageLayout layout, boolean checkOnly) {
+ List messages = new ArrayList();
+
+ //Add a temporary region if there is no region at all
+ if (layout.getRegionCount() == 0) {
+ try {
+ //Only modify the layout if this is not a dry run
+ if (!checkOnly) {
+ Region reg = layout.createRegion(RegionType.TextRegion);
+ reg.setId("r"+ContentObject.TEMP_ID_SUFFIX);
+ }
+ messages.add(new ConversionMessage("Added temporary text region", ConversionMessage.CONVERSION_ADD_REQUIRED_REGION));
+ } catch (InvalidIdException e) {
+ e.printStackTrace();
+ }
+ }
+
+ return messages;
+ }
+
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/Converter_2010_03_19_to_2010_01_12.java b/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/Converter_2010_03_19_to_2010_01_12.java
new file mode 100644
index 00000000..5f05af34
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/Converter_2010_03_19_to_2010_01_12.java
@@ -0,0 +1,101 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.layout.converter;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import org.primaresearch.dla.page.layout.PageLayout;
+import org.primaresearch.dla.page.layout.physical.Region;
+import org.primaresearch.dla.page.layout.physical.text.impl.TextRegion;
+import org.primaresearch.io.FormatVersion;
+import org.primaresearch.io.xml.XmlFormatVersion;
+import org.primaresearch.shared.variable.Variable;
+import org.primaresearch.shared.variable.Variable.WrongVariableTypeException;
+
+/**
+ * Converter for 2010-03-19 format to 2010-01-12 format.
+ *
+ * <ul>
+ * <li>Removes unsupported text types (signature-mark, catch-word, marginalia, footnote, footnote-continued, TOC-entry)</li>
+ * </ul>
+ *
+ * @author Christian Clausner
+ *
+ */
+public class Converter_2010_03_19_to_2010_01_12 implements LayoutConverter {
+
+ @Override
+ public FormatVersion getSourceVersion() {
+ return new XmlFormatVersion("2010-03-19");
+ }
+
+ @Override
+ public FormatVersion getTargetVersion() {
+ return new XmlFormatVersion("2010-01-12");
+ }
+
+ @Override
+ public List convert(PageLayout layout) {
+ return run(layout, false);
+ }
+
+ @Override
+ public List checkForCompliance(PageLayout layout) {
+ return run(layout, true);
+ }
+
+ /**
+ * Runs check or conversion
+ * @param checkOnly If true, no conversion is carried out (dry run).
+ */
+ public List run(PageLayout layout, boolean checkOnly) {
+ List messages = new ArrayList();
+
+ //Remove text type values:
+ // signature-mark
+ // catch-word
+ // marginalia
+ // footnote
+ // footnote-continued
+ // TOC-entry
+ for (int i=0; i
+ * <ul>
+ * <li>Converts unsupported regions to 'Unknown' (Music, Chem, Advert)</li>
+ * <li>Removes unsupported attributes from regions, lines, words, and glyphs</li>
+ * </ul>
+ *
+ * @author Christian Clausner
+ *
+ */
+public class Converter_2013_07_15_to_2010_03_19 implements LayoutConverter {
+
+ @Override
+ public FormatVersion getSourceVersion() {
+ return new XmlFormatVersion("2013-07-15");
+ }
+
+ @Override
+ public FormatVersion getTargetVersion() {
+ return new XmlFormatVersion("2010-03-19");
+ }
+
+ @Override
+ public List convert(PageLayout layout) {
+ return run(layout, false);
+ }
+
+ @Override
+ public List checkForCompliance(PageLayout layout) {
+ return run(layout, true);
+ }
+
+ /**
+ * Runs check or conversion
+ * @param checkOnly If true, no conversion is carried out (dry run).
+ */
+ public List run(PageLayout layout, boolean checkOnly) {
+ List messages = new ArrayList();
+
+ //Regions
+ List<Region> unsupportedRegions = new ArrayList<Region>();
+ for (ContentIterator it = layout.iterator(null); it.hasNext(); ) {
+ Region reg = (Region)it.next();
+
+ //Graphic types frame, barcode, decoration
+ if (reg.getType().equals(RegionType.GraphicRegion)
+ && ("frame".equals(((GraphicRegion)reg).getGraphicType())
+ || "barcode".equals(((GraphicRegion)reg).getGraphicType())
+ || "decoration".equals(((GraphicRegion)reg).getGraphicType()))) {
+
+ if (!checkOnly)
+ ((GraphicRegion)reg).setGraphicType(null);
+
+ messages.add(new ConversionMessage("Reset unsupported graphic type for region '"+reg.getId()+"'", ConversionMessage.CONVERSION_RESET_INVALID_ATTRIBUTE));
+ }
+
+ //Text region types endnote, other
+ if (reg.getType().equals(RegionType.TextRegion)
+ && ("endnote".equals(((TextRegion)reg).getTextType())
+ || "other".equals(((TextRegion)reg).getTextType())
+ )) {
+
+ if (!checkOnly)
+ ((TextRegion)reg).setTextType(null);
+
+ messages.add(new ConversionMessage("Reset unsupported text type for region '"+reg.getId()+"'", ConversionMessage.CONVERSION_RESET_INVALID_ATTRIBUTE));
+ }
+
+ //Colours
+ try {
+ Variable v = reg.getAttributes().get("penColour");
+ if (v != null && v.getValue() != null && v.getValue().equals(new StringValue("other"))) {
+ if (!checkOnly)
+ v.setValue(null);
+ messages.add(new ConversionMessage("Reset unsupported colour for region '"+reg.getId()+"'", ConversionMessage.CONVERSION_RESET_INVALID_ATTRIBUTE));
+ }
+
+ v = reg.getAttributes().get("bgColour");
+ if (v != null && v.getValue() != null && v.getValue().equals(new StringValue("other"))) {
+ if (!checkOnly)
+ v.setValue(null);
+ messages.add(new ConversionMessage("Reset unsupported colour for region '"+reg.getId()+"'", ConversionMessage.CONVERSION_RESET_INVALID_ATTRIBUTE));
+ }
+
+ v = reg.getAttributes().get("lineColour");
+ if (v != null && v.getValue() != null && v.getValue().equals(new StringValue("other"))) {
+ if (!checkOnly)
+ v.setValue(null);
+ messages.add(new ConversionMessage("Reset unsupported colour for region '"+reg.getId()+"'", ConversionMessage.CONVERSION_RESET_INVALID_ATTRIBUTE));
+ }
+
+ v = reg.getAttributes().get("colour");
+ if (v != null && v.getValue() != null && v.getValue().equals(new StringValue("other"))) {
+ if (!checkOnly)
+ v.setValue(null);
+ messages.add(new ConversionMessage("Reset unsupported colour for region '"+reg.getId()+"'", ConversionMessage.CONVERSION_RESET_INVALID_ATTRIBUTE));
+ }
+
+ v = reg.getAttributes().get("textColour");
+ if (v != null && v.getValue() != null && v.getValue().equals(new StringValue("other"))) {
+ if (!checkOnly)
+ v.setValue(null);
+ messages.add(new ConversionMessage("Reset unsupported colour for region '"+reg.getId()+"'", ConversionMessage.CONVERSION_RESET_INVALID_ATTRIBUTE));
+ }
+ } catch (Exception e) {
+ e.printStackTrace();
+ }
+
+ //Colour depth
+ try {
+ Variable v = reg.getAttributes().get("colourDepth");
+ if (v != null && v.getValue() != null && v.getValue().equals(new StringValue("other"))) {
+ if (!checkOnly)
+ v.setValue(null);
+ messages.add(new ConversionMessage("Reset unsupported colour depth for region '"+reg.getId()+"'", ConversionMessage.CONVERSION_RESET_INVALID_ATTRIBUTE));
+ }
+ } catch (Exception e) {
+ e.printStackTrace();
+ }
+
+ //Language
+ try {
+ Variable v = reg.getAttributes().get("primaryLanguage");
+ if (v != null && v.getValue() != null) {
+ if (!v.getValue().equals(new StringValue("other"))
+ && !v.getValue().equals(new StringValue("Afrikaans"))
+ && !v.getValue().equals(new StringValue("Albanian"))
+ && !v.getValue().equals(new StringValue("Amharic"))
+ && !v.getValue().equals(new StringValue("Arabic"))
+ && !v.getValue().equals(new StringValue("Basque"))
+ && !v.getValue().equals(new StringValue("Bengali"))
+ && !v.getValue().equals(new StringValue("Bulgarian"))
+ && !v.getValue().equals(new StringValue("Cambodian"))
+ && !v.getValue().equals(new StringValue("Cantonese"))
+ && !v.getValue().equals(new StringValue("Chinese"))
+ && !v.getValue().equals(new StringValue("Czech"))
+ && !v.getValue().equals(new StringValue("Danish"))
+ && !v.getValue().equals(new StringValue("Dutch"))
+ && !v.getValue().equals(new StringValue("English"))
+ && !v.getValue().equals(new StringValue("Estonian"))
+ && !v.getValue().equals(new StringValue("Finnish"))
+ && !v.getValue().equals(new StringValue("French"))
+ && !v.getValue().equals(new StringValue("German"))
+ && !v.getValue().equals(new StringValue("Greek"))
+ && !v.getValue().equals(new StringValue("Gujarati"))
+ && !v.getValue().equals(new StringValue("Hebrew"))
+ && !v.getValue().equals(new StringValue("Hindi"))
+ && !v.getValue().equals(new StringValue("Hungarian"))
+ && !v.getValue().equals(new StringValue("Icelandic"))
+ && !v.getValue().equals(new StringValue("Gaelic"))
+ && !v.getValue().equals(new StringValue("Italian"))
+ && !v.getValue().equals(new StringValue("Japanese"))
+ && !v.getValue().equals(new StringValue("Korean"))
+ && !v.getValue().equals(new StringValue("Latin"))
+ && !v.getValue().equals(new StringValue("Latvian"))
+ && !v.getValue().equals(new StringValue("Malay"))
+ && !v.getValue().equals(new StringValue("Norwegian"))
+ && !v.getValue().equals(new StringValue("Polish"))
+ && !v.getValue().equals(new StringValue("Portuguese"))
+ && !v.getValue().equals(new StringValue("Punjabi"))
+ && !v.getValue().equals(new StringValue("Russian"))
+ && !v.getValue().equals(new StringValue("Spanish"))
+ && !v.getValue().equals(new StringValue("Swedish"))
+ && !v.getValue().equals(new StringValue("Thai"))
+ && !v.getValue().equals(new StringValue("Turkish"))
+ && !v.getValue().equals(new StringValue("Urdu"))
+ && !v.getValue().equals(new StringValue("Welsh"))
+ ) {
+
+ if (!checkOnly)
+ v.setValue(null);
+ messages.add(new ConversionMessage("Reset unsupported language for region '"+reg.getId()+"'", ConversionMessage.CONVERSION_RESET_INVALID_ATTRIBUTE));
+
+ }
+ }
+ } catch (Exception e) {
+ e.printStackTrace();
+ }
+
+ //Advert, Chem and Music
+ if (reg.getType().equals(RegionType.AdvertRegion)
+ || reg.getType().equals(RegionType.ChemRegion)
+ || reg.getType().equals(RegionType.MusicRegion)) {
+ unsupportedRegions.add(reg);
+ }
+
+ }
+
+ //Handle unsupported regions
+ for (Iterator<Region> it = unsupportedRegions.iterator(); it.hasNext(); ) {
+ Region unsupported = it.next();
+
+ if (!checkOnly) {
+ layout.removeRegion(unsupported.getId(), true);
+ Region unknownRegion = layout.createRegion(RegionType.UnknownRegion, unsupported.getId().toString());
+ unknownRegion.setCoords(unsupported.getCoords());
+ }
+ messages.add(new ConversionMessage("Changed region type to 'unknown' for region '"+unsupported.getId()+"'", ConversionMessage.CONVERSION_RESET_INVALID_ATTRIBUTE));
+ }
+
+ //Unsupported attributes (skipped in a dry run because removal modifies the layout)
+ if (checkOnly)
+ return messages;
+
+ // Regions
+ for (ContentIterator it = layout.iterator(null); it.hasNext(); ) {
+ ContentObject obj = it.next();
+ if (obj == null)
+ continue;
+ VariableMap atts = obj.getAttributes();
+ if (atts != null) {
+ atts.remove("custom");
+ atts.remove("comments");
+ atts.remove("production");
+ atts.remove("fontFamily");
+ atts.remove("bold");
+ atts.remove("italic");
+ atts.remove("underlined");
+ atts.remove("subscript");
+ atts.remove("superscript");
+ atts.remove("strikethrough");
+ atts.remove("smallCaps");
+ atts.remove("letterSpaced");
+ }
+ }
+ // Lines
+ for (ContentIterator it = layout.iterator(LowLevelTextType.TextLine); it.hasNext(); ) {
+ ContentObject obj = it.next();
+ if (obj == null)
+ continue;
+ VariableMap atts = obj.getAttributes();
+ if (atts != null) {
+ atts.remove("custom");
+ atts.remove("comments");
+ atts.remove("primaryLanguage");
+ atts.remove("production");
+ atts.remove("fontFamily");
+ atts.remove("bold");
+ atts.remove("italic");
+ atts.remove("underlined");
+ atts.remove("subscript");
+ atts.remove("superscript");
+ atts.remove("strikethrough");
+ atts.remove("smallCaps");
+ atts.remove("letterSpaced");
+ atts.remove("serif");
+ atts.remove("monospace");
+ atts.remove("fontSize");
+ atts.remove("kerning");
+ atts.remove("textColour");
+ atts.remove("bgColour");
+ atts.remove("reverseVideo");
+ }
+ }
+ // Words
+ for (ContentIterator it = layout.iterator(LowLevelTextType.Word); it.hasNext(); ) {
+ ContentObject obj = it.next();
+ if (obj == null)
+ continue;
+ VariableMap atts = obj.getAttributes();
+ if (atts != null) {
+ atts.remove("custom");
+ atts.remove("comments");
+ atts.remove("language");
+ atts.remove("production");
+ atts.remove("fontFamily");
+ atts.remove("bold");
+ atts.remove("italic");
+ atts.remove("underlined");
+ atts.remove("subscript");
+ atts.remove("superscript");
+ atts.remove("strikethrough");
+ atts.remove("smallCaps");
+ atts.remove("letterSpaced");
+ atts.remove("serif");
+ atts.remove("monospace");
+ atts.remove("fontSize");
+ atts.remove("kerning");
+ atts.remove("textColour");
+ atts.remove("bgColour");
+ atts.remove("reverseVideo");
+ }
+ }
+ // Glyphs
+ for (ContentIterator it = layout.iterator(LowLevelTextType.Glyph); it.hasNext(); ) {
+ ContentObject obj = it.next();
+ if (obj == null)
+ continue;
+ VariableMap atts = obj.getAttributes();
+ if (atts != null) {
+ atts.remove("custom");
+ atts.remove("comments");
+ atts.remove("production");
+ atts.remove("fontFamily");
+ atts.remove("bold");
+ atts.remove("italic");
+ atts.remove("underlined");
+ atts.remove("subscript");
+ atts.remove("superscript");
+ atts.remove("strikethrough");
+ atts.remove("smallCaps");
+ atts.remove("letterSpaced");
+ atts.remove("serif");
+ atts.remove("monospace");
+ atts.remove("fontSize");
+ atts.remove("kerning");
+ atts.remove("textColour");
+ atts.remove("bgColour");
+ atts.remove("reverseVideo");
+ }
+ }
+
+
+ return messages;
+ }
+
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/LayoutConverter.java b/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/LayoutConverter.java
new file mode 100644
index 00000000..d9bbd52e
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/layout/converter/LayoutConverter.java
@@ -0,0 +1,53 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.layout.converter;
+
+import java.util.List;
+
+import org.primaresearch.dla.page.layout.PageLayout;
+import org.primaresearch.io.FormatVersion;
+
+/**
+ * Interface for converters that convert a page layout to comply with a certain format version.
+ *
+ * @author Christian Clausner
+ *
+ */
+public interface LayoutConverter {
+
+ /**
+ * Format version before the conversion
+ */
+ public FormatVersion getSourceVersion();
+
+ /**
+ * Format version after the conversion
+ */
+ public FormatVersion getTargetVersion();
+
+ /**
+ * Converts the given page layout to the specified target format.
+ * @return A list of conversion messages
+ */
+ public List convert(PageLayout layout);
+
+ /**
+ * Checks whether the given page layout complies with the target format version
+ * of this converter.
+ * @return A list of inconsistencies
+ */
+ public List checkForCompliance(PageLayout layout);
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/ContentObjectRelation.java b/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/ContentObjectRelation.java
new file mode 100644
index 00000000..00722f31
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/ContentObjectRelation.java
@@ -0,0 +1,131 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.layout.logical;
+
+import org.primaresearch.dla.page.layout.physical.ContentObject;
+
+/**
+ * Represents a relation between two page content objects (e.g. parent-child relation).
+ *
+ * @author Christian Clausner
+ *
+ */
+public class ContentObjectRelation {
+
+ private ContentObject object1;
+ private ContentObject object2;
+ private RelationType relationType;
+ private String customField;
+ private String comments;
+
+ /**
+ * Constructor
+ *
+ * @param object1 Page content object one
+ * @param object2 Page content object two
+ * @param relation Relation between object one and object two
+ */
+ public ContentObjectRelation(ContentObject object1, ContentObject object2, RelationType relation) {
+ this.object1 = object1;
+ this.object2 = object2;
+ this.relationType = relation;
+ }
+
+ public ContentObject getObject1() {
+ return object1;
+ }
+
+ public ContentObject getObject2() {
+ return object2;
+ }
+
+ public RelationType getRelationType() {
+ return relationType;
+ }
+
+ /**
+ * Returns custom content
+ */
+ public String getCustomField() {
+ return customField;
+ }
+
+ /**
+ * Sets custom content
+ */
+ public void setCustomField(String customField) {
+ this.customField = customField;
+ }
+
+ /**
+ * Returns comments
+ * @return Comments text
+ */
+ public String getComments() {
+ return comments;
+ }
+
+ /**
+ * Sets comments
+ * @param comments Comments text
+ */
+ public void setComments(String comments) {
+ this.comments = comments;
+ }
+
+
+
+
+ /**
+ * Relation type for page content objects.
+ *
+ * @author Christian Clausner
+ *
+ */
+ public static class RelationType {
+
+ /**
+ * Parent-child relation (e.g. word-glyph)
+ */
+ public static final RelationType ParentChildRelation = new RelationType("ParentChildRelation");
+
+ /**
+ * Weak relation (e.g. image-caption)
+ */
+ public static final RelationType Link = new RelationType("link");
+
+ /**
+ * Strong relation (e.g. drop capital - following text region or two parts of a word that has been wrapped)
+ */
+ public static final RelationType Join = new RelationType("join");
+
+ private String id;
+ private RelationType(String id) {
+ this.id = id;
+ }
+
+ @Override
+ public boolean equals(Object other) {
+ return other instanceof RelationType && id.equals(((RelationType)other).id);
+ }
+
+ @Override
+ public int hashCode() { return id.hashCode(); } // Keep consistent with equals()
+
+ @Override
+ public String toString() { return id; }
+ }
+}
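Review note: RelationType compares instances by their string ID. The following is a minimal, standalone sketch of the same constant-object pattern with equals() and hashCode() kept consistent, so that equal instances can be found in hash-based collections. RelationKind, its of() factory and RelationKindDemo are hypothetical names used only for illustration; they are not part of this patch.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for ContentObjectRelation.RelationType (not part of the patch).
final class RelationKind {
    static final RelationKind LINK = new RelationKind("link");
    static final RelationKind JOIN = new RelationKind("join");

    private final String id;

    private RelationKind(String id) {
        this.id = id;
    }

    // Hypothetical factory, e.g. for IDs read from a file.
    static RelationKind of(String id) {
        return new RelationKind(id);
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof RelationKind && id.equals(((RelationKind) other).id);
    }

    // Equal objects must produce equal hash codes, so hash the same field equals() compares.
    @Override
    public int hashCode() {
        return id.hashCode();
    }

    @Override
    public String toString() {
        return id;
    }
}

final class RelationKindDemo {
    // Looks up a freshly created, but equal, instance in a HashSet.
    static boolean lookupWorks() {
        Set<RelationKind> kinds = new HashSet<RelationKind>();
        kinds.add(RelationKind.LINK);
        return kinds.contains(RelationKind.of("link"));
    }
}
```

Without the hashCode() override, the lookup in lookupWorks() would fall back to identity hash codes and would most likely miss the equal instance.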
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/Group.java b/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/Group.java
new file mode 100644
index 00000000..5ba05e54
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/Group.java
@@ -0,0 +1,251 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.layout.logical;
+
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+
+import org.primaresearch.dla.page.layout.PageLayout;
+import org.primaresearch.dla.page.layout.physical.ContentFactory;
+import org.primaresearch.ident.Id;
+import org.primaresearch.ident.IdRegister;
+import org.primaresearch.ident.IdRegister.InvalidIdException;
+import org.primaresearch.ident.Identifiable;
+
+/**
+ * A logical group within a page layout (e.g. a reading order group). Groups can also be GroupMembers.
+ *
+ * @author Christian Clausner
+ */
+public class Group implements GroupMember, Identifiable {
+
+ private PageLayout layout;
+ private IdRegister idRegister;
+ private ContentFactory contentFactory;
+ private boolean canHaveGroupsAsChildren;
+ private Group parentGroup;
+ private Id id;
+ private String caption;
+ private boolean ordered;
+ private List<GroupMember> members = new ArrayList<GroupMember>();
+
+ /**
+ * Constructor
+ * @param layout Page layout the group belongs to
+ * @param idRegister ID register (needed when creating child groups)
+ * @param contentFactory Factory needed when creating child groups
+ * @param id Group ID
+ * @param parentGroup Parent group (null for root)
+ * @param canHaveGroupsAsChildren Set to true to allow child groups
+ */
+ Group(PageLayout layout, IdRegister idRegister, ContentFactory contentFactory, Id id, Group parentGroup, boolean canHaveGroupsAsChildren) {
+ this.layout = layout;
+ this.idRegister = idRegister;
+ this.contentFactory = contentFactory;
+ this.id = id;
+ this.parentGroup = parentGroup;
+ this.canHaveGroupsAsChildren = canHaveGroupsAsChildren;
+ }
+
+ /**
+ * Returns the parent of this group or null if it is a root group
+ */
+ public Group getParent() {
+ return parentGroup;
+ }
+
+ @Override
+ public Id getId() {
+ return id;
+ }
+
+ /**
+ * Returns the caption (display name)
+ */
+ public String getCaption() {
+ return caption;
+ }
+
+ /**
+ * Sets the caption (display name)
+ */
+ public void setCaption(String caption) {
+ this.caption = caption;
+ }
+
+ /**
+ * Returns the 'ordered' state of this group
+ * @return true if an ordered group; false if an unordered group
+ */
+ public boolean isOrdered() {
+ return ordered;
+ }
+
+ /**
+ * Sets the 'ordered' state of this group
+ * @param ordered Set to true for an ordered group or false for an unordered group
+ */
+ public void setOrdered(boolean ordered) {
+ this.ordered = ordered;
+ }
+
+ /**
+ * Returns the size of this group
+ * @return Number of members
+ */
+ public int getSize() {
+ return members.size();
+ }
+
+ /**
+ * Returns the group member at the given position
+ * @param index Position
+ * @return Group member object
+ */
+ public GroupMember getMember(int index) {
+ return members.get(index);
+ }
+
+ /**
+ * Creates a group and adds it as child to this group
+ * @return The new group
+ * @throws Exception The group is not allowed to have children
+ */
+ public Group createChildGroup() throws Exception {
+ if (!canHaveGroupsAsChildren)
+ throw new Exception("This group is not allowed to have child groups");
+ Group group = new Group(layout, idRegister, contentFactory, idRegister.generateId("g"), this, canHaveGroupsAsChildren);
+ members.add(group);
+ return group;
+ }
+
+ /**
+ * Adds a reference to a region as group member
+ * @param id Region ID
+ */
+ public void addRegionRef(String id) {
+ try {
+ members.add(new RegionRef(this, contentFactory.getIdRegister().getId(id)));
+ } catch (InvalidIdException e) {
+ e.printStackTrace();
+ }
+ }
+
+ /**
+ * Removes a reference to a region
+ * @param id Region ID
+ */
+ public void removeRegionRef(String id) {
+ if (members != null) {
+ GroupMember toRemove = null;
+ for (Iterator<GroupMember> it = members.iterator(); it.hasNext(); ) {
+ GroupMember member = it.next();
+ if (member instanceof RegionRef) {
+ if (((RegionRef)member).getRegionId().equals(id)) {
+ toRemove = member;
+ break;
+ }
+ }
+ }
+
+ if (toRemove != null)
+ members.remove(toRemove);
+ }
+ }
+
+ /**
+ * Recursively checks if this group or a child group contains a region reference with the given ID.
+ *
+ * @param regionId ID of referenced region
+ * @return True, if a reference has been found; false otherwise.
+ */
+ public boolean containsRegionRef(Id regionId) {
+ if (members != null) {
+ for (Iterator<GroupMember> it = members.iterator(); it.hasNext(); ) {
+ GroupMember member = it.next();
+ if (member instanceof RegionRef) {
+ if (((RegionRef)member).getRegionId().equals(regionId))
+ return true;
+ } else { //if (member instanceof Group)
+ if (((Group)member).containsRegionRef(regionId)) //Recursion
+ return true;
+ }
+ }
+ }
+ return false;
+ }
+
+ /**
+ * Adds the given group member
+ */
+ public void add(GroupMember member) {
+ members.add(member);
+ }
+
+ @Override
+ public IdRegister getIdRegister() {
+ return idRegister;
+ }
+
+ @Override
+ public void setId(String id) throws InvalidIdException {
+ this.id = idRegister.registerId(id, this.id);
+ }
+
+ @Override
+ public void setId(Id id) throws InvalidIdException {
+ idRegister.registerId(id, this.id);
+ this.id = id;
+ }
+
+ @Override
+ public void moveTo(Group newParent) {
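+ if (parentGroup != null) // Root groups have no parent
+ parentGroup.remove(this);
+ newParent.add(this);
+ parentGroup = newParent; // Keep getParent() consistent with the new location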
+ }
+
+ /**
+ * Removes the specified member from this group.
+ * This method does not unregister the ID of a group.
+ * Intended for internal use (e.g. moveTo() of GroupMember).
+ *
+ * @return true, if the member has been found and removed, false otherwise
+ */
+ boolean remove(GroupMember member) {
+ for (int i=0; i layers = new ArrayList();
+
+ /**
+ * Constructor
+ * @param layout Page layout which the layers are intended for
+ * @param idRegister ID register (for creating layers)
+ * @param contentFactory Content factory (for creating layers)
+ */
+ public Layers(PageLayout layout, IdRegister idRegister, ContentFactory contentFactory) {
+ this.layout = layout;
+ this.idRegister = idRegister;
+ this.contentFactory = contentFactory;
+ }
+
+ /**
+ * Returns the number of layers.
+ */
+ public int getSize() {
+ return layers.size();
+ }
+
+ /**
+ * Returns the layer at the given index.
+ * @throws IndexOutOfBoundsException
+ */
+ public Layer getLayer(int index) {
+ return layers.get(index);
+ }
+
+ /**
+ * Creates and returns a new layer.
+ * @return The new layer, or null if no ID could be generated
+ */
+ public Layer createLayer() {
+ Layer layer;
+ try {
+ layer = new Layer(layout, idRegister, contentFactory, idRegister.generateId("lay"));
+ layers.add(layer);
+ return layer;
+ } catch (InvalidIdException e) {
+ }
+ return null;
+ }
+}
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/ReadingOrder.java b/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/ReadingOrder.java
new file mode 100644
index 00000000..525c4e4d
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/ReadingOrder.java
@@ -0,0 +1,66 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.layout.logical;
+
+import org.primaresearch.dla.page.layout.PageLayout;
+import org.primaresearch.dla.page.layout.physical.ContentFactory;
+import org.primaresearch.ident.Id;
+import org.primaresearch.ident.IdRegister;
+import org.primaresearch.ident.IdRegister.InvalidIdException;
+
+/**
+ * Class for logical reading order of layout regions.
+ * The root group provides access to the actual reading order members.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class ReadingOrder {
+
+ private Group root;
+
+ /**
+ * Constructor
+ * @param layout Page layout the reading order is intended for
+ * @param idRegister ID register (for creating groups)
+ * @param contentFactory Content factory (for creating groups)
+ */
+ public ReadingOrder(PageLayout layout, IdRegister idRegister, ContentFactory contentFactory) {
+ try {
+ root = new Group(layout, idRegister, contentFactory, idRegister.generateId("g"), null, true);
+ } catch (InvalidIdException e) {
+ }
+ }
+
+ /**
+ * Returns the root group (the reading order always has a root group)
+ */
+ public Group getRoot() {
+ return root;
+ }
+
+ /**
+ * Checks whether the region with the given ID is referenced in the reading order.
+ *
+ * @param regionId ID of referenced region
+ * @return True, if the region has been found; false otherwise
+ */
+ public boolean contains(Id regionId) {
+ if (root != null)
+ return root.containsRegionRef(regionId);
+ return false;
+ }
+}
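Review note: ReadingOrder.contains() delegates to the root group, which searches its own members and then descends into child groups. A minimal sketch of that depth-first search follows. Member, Ref and SimpleGroup are hypothetical, simplified stand-ins for GroupMember, RegionRef and Group, and region IDs are plain strings here rather than Id objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for GroupMember.
interface Member {
}

// Hypothetical stand-in for RegionRef; the region ID is a plain string in this sketch.
final class Ref implements Member {
    final String regionId;

    Ref(String regionId) {
        this.regionId = regionId;
    }
}

// Hypothetical stand-in for Group.
final class SimpleGroup implements Member {
    private final List<Member> members = new ArrayList<Member>();

    // Returns this group so that calls can be chained when building a tree.
    SimpleGroup add(Member member) {
        members.add(member);
        return this;
    }

    // Depth-first search: check direct references, then recurse into child groups.
    boolean containsRegionRef(String regionId) {
        for (Member member : members) {
            if (member instanceof Ref) {
                if (((Ref) member).regionId.equals(regionId))
                    return true;
            } else if (member instanceof SimpleGroup) {
                if (((SimpleGroup) member).containsRegionRef(regionId))
                    return true;
            }
        }
        return false;
    }
}
```

For example, a root group holding a reference "r1" and a child group holding "r2" reports both IDs as contained, and any other ID as missing.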
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/RegionRef.java b/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/RegionRef.java
new file mode 100644
index 00000000..eda4385c
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/RegionRef.java
@@ -0,0 +1,60 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.layout.logical;
+
+import org.primaresearch.ident.Id;
+
+/**
+ * A group member pointing to an Identifiable object (e.g. a region).
+ * @author Christian Clausner
+ *
+ */
+public class RegionRef implements GroupMember {
+
+ private Id regionId;
+ private Group parentGroup;
+
+ /**
+ * Constructor
+ * @param parentGroup Parent group (this constructor does NOT add the RegionRef object to the parent group)
+ * @param regionId ID of referenced region
+ */
+ RegionRef(Group parentGroup, Id regionId) {
+ this.parentGroup = parentGroup;
+ this.regionId = regionId;
+ }
+
+ /**
+ * Returns the ID of the referenced region
+ */
+ public Id getRegionId() {
+ return regionId;
+ }
+
+ @Override
+ public Group getParent() {
+ return parentGroup;
+ }
+
+ /** Moves this group member to another group. */
+ @Override
+ public void moveTo(Group newParent) {
+ parentGroup.remove(this);
+ newParent.add(this);
+ // Keep getParent() consistent with the new location
+ parentGroup = newParent;
+ }
+}
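Review note: moveTo() removes the member from its old group and adds it to the new one. A minimal sketch of a move that also keeps the member's parent reference in sync is shown below. Box and Item are hypothetical, simplified stand-ins for Group and RegionRef. Unlike the RegionRef constructor in this patch, the Item constructor in the sketch adds the item to its parent, which is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Group.
final class Box {
    private final List<Item> items = new ArrayList<Item>();

    void add(Item item) {
        items.add(item);
    }

    boolean remove(Item item) {
        return items.remove(item);
    }

    boolean contains(Item item) {
        return items.contains(item);
    }
}

// Hypothetical stand-in for RegionRef.
final class Item {
    private Box parent;

    // Assumption for this sketch only: the constructor registers the item with its parent.
    Item(Box parent) {
        this.parent = parent;
        parent.add(this);
    }

    Box getParent() {
        return parent;
    }

    // Detach from the old parent (if any), attach to the new one, then update the back-reference.
    void moveTo(Box newParent) {
        if (parent != null)
            parent.remove(this);
        newParent.add(this);
        parent = newParent;
    }
}
```

After item.moveTo(b), the old box no longer contains the item, the new box does, and getParent() returns the new box.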
diff --git a/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/Relations.java b/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/Relations.java
new file mode 100644
index 00000000..62a8985f
--- /dev/null
+++ b/java/PrimaDla/src/org/primaresearch/dla/page/layout/logical/Relations.java
@@ -0,0 +1,82 @@
+/*
+ * Copyright 2014 PRImA Research Lab, University of Salford, United Kingdom
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.primaresearch.dla.page.layout.logical;
+
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Map;
+import java.util.Set;
+
+import org.primaresearch.ident.Id;
+
+/**
+ * Container class for relations between content objects.
+ *
+ * @author Christian Clausner
+ *
+ */
+public class Relations {
+
+ private Map<Id, Map<Id, ContentObjectRelation>> relations = new HashMap<Id, Map<Id, ContentObjectRelation>>();
+
+ /**
+ * Checks if there are relations in this container
+ * @return true if empty, false otherwise
+ */
+ public boolean isEmpty() {
+ return relations.isEmpty();
+ }
+
+ /**
+ * Adds a relation to this container.
+ * @param relation Relation object to add
+ */
+ public void addRelation(ContentObjectRelation relation) {
+ Map<Id, ContentObjectRelation> targetMap = relations.get(relation.getObject1().getId());
+ if (targetMap == null) {
+ targetMap = new HashMap<Id, ContentObjectRelation>();
+ relations.put(relation.getObject1().getId(), targetMap);
+ }
+ targetMap.put(relation.getObject2().getId(), relation);
+ }
+
+ /**
+ * Returns the relation for the objects with id1 and id2 or 'null', if no such relation exists.
+ */
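+ public ContentObjectRelation getRelation(Id id1, Id id2) {
+ Map<Id, ContentObjectRelation> targetMap = relations.get(id1);
+ if (targetMap != null && targetMap.containsKey(id2))
+ return targetMap.get(id2);
+ // Relations are stored in one direction only: try the reverse direction once, without recursion
+ targetMap = relations.get(id2);
+ return targetMap != null ? targetMap.get(id1) : null;
+ }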
+
+ /**
+ * Exports a set of all relations.
+ */
+ public Set<ContentObjectRelation> exportRelations() {
+ Set<ContentObjectRelation> rels = new HashSet<ContentObjectRelation>();
+ for (Iterator