A tool that aligns text/ground truth to a given list of ConfMats
- Java >= version 8
- Maven
- All further dependencies are gathered via Maven
git clone https://github.com/CITlabRostock/CITlabTextAlignment
cd CITlabTextAlignment
mvn package [-DskipTests=true]
The most important class is de.uros.citlab.textalignment.TextAligner.
The method
public List<LineMatch> getAlignmentResult(
List<String> refs,
List<ConfMat> recos
)
tries to align the references to the given ConfMats.
The returned List<LineMatch> result has the same size and order as the list of ConfMats.
If result.get(i) != null, a corresponding transcript is available for the ConfMat recos.get(i).
The reference and confidence for ConfMat i can be obtained with
LineMatch match = result.get(i);
String reference = match.getReference();
double confidence = match.getConfidence();
The confidence lies in [0.0, 1.0]; the higher the value, the better the alignment. A confidence greater than 0.1 can be considered very trustworthy, whereas confidences below 0.01 should be treated with caution.
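The null check and the confidence rule of thumb described above can be combined into a small filtering step. The following is only a sketch: MatchStub is a hypothetical stand-in for LineMatch that carries just the two getters mentioned above, so the example compiles without the library, and ConfidenceFilter and acceptedReferences are illustrative names that are not part of the tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for LineMatch, reduced to the two getters described above. */
final class MatchStub {
    private final String reference;
    private final double confidence;

    MatchStub(String reference, double confidence) {
        this.reference = reference;
        this.confidence = confidence;
    }

    String getReference() { return reference; }

    double getConfidence() { return confidence; }
}

final class ConfidenceFilter {

    /**
     * Returns one entry per ConfMat, in the same order as the result list:
     * the aligned reference if a match exists and its confidence is at least
     * minConfidence, otherwise null.
     */
    static List<String> acceptedReferences(List<MatchStub> result, double minConfidence) {
        List<String> accepted = new ArrayList<>(result.size());
        for (MatchStub match : result) {
            boolean keep = match != null && match.getConfidence() >= minConfidence;
            accepted.add(keep ? match.getReference() : null);
        }
        return accepted;
    }

    public static void main(String[] args) {
        // The second ConfMat has no match; the third falls below the cautious bound of 0.01.
        List<MatchStub> result = Arrays.asList(
                new MatchStub("first line", 0.35),
                null,
                new MatchStub("third line", 0.004));
        System.out.println(acceptedReferences(result, 0.01));
    }
}
```

With the real API, the same loop would run over the List<LineMatch> returned by getAlignmentResult.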
To configure the alignment tool, several parameters are available. The most important parameters are required via the constructor:
public TextAligner(
String lineBreakCharacters,
Double costSkipWords,
Double costSkipConfMat,
Double costJumpConfMat
)
Its parameters are briefly explained below.
- lineBreakCharacters
The algorithm requires a String of characters that can be interpreted as line breaks.
In most cases the space character U+0020 is used, but the character tabulation U+0009 can also make sense. If both characters should be used, set String lineBreakCharacters = "\u0009\u0020". Note that setting String lineBreakCharacters = "" is also possible if only whole lines of the reference should be mapped to the ConfMats. The line feed U+000A must not be used as a line break character.
- costSkipWords
In some scenarios there are transcripts that do not occur in the ConfMats, or ConfMats are missing.
In both cases transcripts have to be skipped when it is too hard to align them to the ConfMats.
If Double costSkipWords = null, the algorithm is not allowed to skip words, so it is forced to read each reference in the ConfMats. If Double costSkipWords = 0.0, the algorithm will skip any word whenever this is possible. A good value is Double costSkipWords = 4.0.
- costSkipConfMat
In some scenarios there are ConfMats that have no corresponding transcripts, or the transcripts for some ConfMats are not given.
In both cases the algorithm should not be forced to align text to a ConfMat.
If Double costSkipConfMat = null, the algorithm is not allowed to skip a ConfMat, so it is forced to align the ConfMat to any available transcript. If Double costSkipConfMat = 0.0, the algorithm will skip any ConfMat whenever this is possible. A good value is Double costSkipConfMat = 0.2.
- costJumpConfMat
In some scenarios the reading order of the transcripts and the ConfMats is not consistent.
In these cases it is necessary to ignore the given reading order.
If Double costJumpConfMat = null, the algorithm is not allowed to change the reading order. For Double costJumpConfMat = 0.0 the algorithm can choose an arbitrary reading order. Note that the complexity of the algorithm increases significantly when Double costJumpConfMat != null, and the alignment result is then only a heuristic. Especially if the alignment task contains many short lines and ConfMats, the algorithm can fail. If a value > 0 is used, the algorithm is penalized whenever it breaks the original reading order of the ConfMats. A good value is Double costJumpConfMat = 6.0.
- threshold
Instead of filtering the alignments afterwards by comparing lineMatch.getConfidence() against a threshold, the threshold can be set in advance with the method public void setThreshold(double threshold). The default is threshold = 0.0.
- hyphenation
In some cases the transcription contains text without hyphenations, whereas hyphenations occur in the ConfMats. It is possible to define the hyphenation property
public HyphenationProperty( boolean skipSuffix, boolean skipPrefix, char[] prefixes, char[] suffixes, double hypCosts )
In general one can specify the characters that are used as hyphenation signs (see prefixes and suffixes). In addition it is possible to make their occurrence optional (see skipSuffix and skipPrefix). To prevent the algorithm from seeing hyphenations in too many places, extra costs for a hyphenation can be added. With hypCosts = 0 there are no extra costs, whereas hypCosts = Double.POSITIVE_INFINITY would prohibit any hyphenation. A good value is hypCosts = 6.0.
Examples:
With the hyphenation property
new HyphenationProperty(false, false, null, new char[]{'-', '¬'}, 6.0)
the ground truth hyphen can be interpreted as "h-" "yphen", "h¬" "yphen", "hy-" "phen", "hy¬" "phen", ... , "hyphe-" "n" or "hyphe¬" "n".
With the hyphenation property
new HyphenationProperty(false, true, new char[]{'='}, new char[]{'='}, 6.0)
the ground truth hyphen can be interpreted as "h=" "yphen", "h=" "=yphen", "hy=" "phen", "hy=" "=phen", ... , "hyphe=" "n" or "hyphe=" "=n".
In practice, hyphenations are not allowed between arbitrary characters (like "h-" "yphen"). Therefore, a language pattern can be provided so that hyphenations are restricted to language-specific positions. So
new HyphenationProperty(false, false, null, new char[]{'-', '¬'}, 6.0, Hyphenator.HyphenationPattern.EN_US)
would only allow the hyphenations "hy-" "phen" and "hy¬" "phen". Note that these HyphenationPattern rules can fail for special words, so they have to be used with caution.
- debug output
The alignment problem is solved by finding a shortest path through a graph. The graph can be embedded into 2D space: the concatenated ConfMats are placed along the y-dimension, the concatenated transcripts along the x-dimension. The algorithm searches for the shortest (cost-minimal) path through this space from top-left to bottom-right. The 2D space can be plotted by calling
public void setDebugOutput(int size, File file)
with the desired size of the image (e.g. size = 1000) and the desired location to save the image after the process has finished. It is recommended to use PNG, because its compression is lossless.
- update scheme
When solving the alignment problem, the algorithm stops as soon as it has found the best solution. By setting
textAligner.setUpdateScheme(PathCalculatorGraph.UpdateScheme.ALL);
it is possible to also calculate paths that cannot lead to the best solution. This only makes sense if one is interested in the debug output image, because calculating more paths takes more time and the result stays the same.
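The two hyphenation examples given above (without a language pattern) can be reproduced with a small standalone sketch. It does not use the library: HyphenationSketch, options and variants are hypothetical names, and the semantics are only inferred from the examples, namely that a sign is mandatory unless its skip flag is set and that a null character array means "no sign on that side". Costs and language patterns are left out.

```java
import java.util.ArrayList;
import java.util.List;

final class HyphenationSketch {

    /** Possible signs on one side: "" if the sign is optional or none is given, then each sign. */
    private static List<String> options(char[] signs, boolean optional) {
        List<String> out = new ArrayList<>();
        if (optional || signs == null || signs.length == 0) {
            out.add("");
        }
        if (signs != null) {
            for (char c : signs) {
                out.add(String.valueOf(c));
            }
        }
        return out;
    }

    /** Enumerates every {lineEnd, nextLineStart} pair for a word split between any two characters. */
    static List<String[]> variants(String word, boolean skipSuffix, boolean skipPrefix,
                                   char[] prefixes, char[] suffixes) {
        List<String> suffixOptions = options(suffixes, skipSuffix);
        List<String> prefixOptions = options(prefixes, skipPrefix);
        List<String[]> out = new ArrayList<>();
        for (int i = 1; i < word.length(); i++) {
            for (String suffix : suffixOptions) {
                for (String prefix : prefixOptions) {
                    out.add(new String[]{word.substring(0, i) + suffix, prefix + word.substring(i)});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // First example above; '\u00AC' is the not sign used as a hyphenation character.
        for (String[] v : variants("hyphen", false, false, null, new char[]{'-', '\u00AC'})) {
            System.out.println("\"" + v[0] + "\" \"" + v[1] + "\"");
        }
    }
}
```

Calling variants("hyphen", false, true, new char[]{'='}, new char[]{'='}) enumerates the second example in the same way, with the optional prefix appearing in every other pair.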