that these chars are normalized, and that punctuation is attached to the preceding char without an extra space.
that would be consistent with our GT transcription guidelines. This also ensures that the text at line level is simply the concatenation of the text at word level, separated by single spaces.
I would add the important special case of whitespace at the start and end of the line: these should be stripped.
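To make the proposal concrete, here is a minimal sketch of such a normalization in Python. The function names `normalize_line` and `line_from_words` are hypothetical and not part of the specification. The sketch assumes that every kind of Unicode whitespace (tabs, no-break spaces, etc.) should be treated like a plain space, and it leaves punctuation untouched.

```python
import re

def normalize_line(text: str) -> str:
    """Collapse whitespace runs to one space and strip both line ends.

    Assumption: any Unicode whitespace (as matched by `\\s` on str
    patterns) counts as a space. This is an illustration, not the spec.
    """
    collapsed = re.sub(r"\s+", " ", text)
    return collapsed.strip()

def line_from_words(words: list[str]) -> str:
    """Build the line-level text by joining word-level texts with single spaces.

    Words that are empty after normalization are skipped, so they cannot
    introduce double spaces.
    """
    cleaned = (normalize_line(word) for word in words)
    return " ".join(word for word in cleaned if word)
```

With this normalization, `normalize_line(line)` and `line_from_words(line.split())` give the same result for any line, which is the "line equals concatenation of words" property described above.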
The technical background for all this is that, in principle, LSTMs cannot reliably (learn to) represent a sequence of whitespace characters, because there is nothing overt/visual that could be propagated. So forcing multiple whitespace characters during training can be expected to make the models less robust, not only around whitespace but also in general. And metrics in turn influence how models are made and evaluated.
The specification currently makes no suggestion on how to deal with more than one consecutive whitespace character.