QA Specs: How to deal with consecutive white spaces #237

mweidling · 2023-02-09T10:07:57Z

The specification currently makes no suggestion on how to deal with more than one consecutive white space character.

M3ssman · 2023-06-06T20:19:21Z

IIRC, in the GT discussions it was said that these characters are normalized, and that punctuation is attached to the preceding character without an extra space.

Maybe @tboenig can shed more light on this.

bertsky · 2023-06-06T20:30:19Z

> that these characters are normalized, and that punctuation is attached to the preceding character without an extra space.

That would be consistent with our GT transcription guidelines. It also ensures that the text at line level is simply the concatenation of the text at word level, joined by single spaces.

I would add the important special case of whitespace at the start and end of the line: these should be stripped.
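To make the concatenation property concrete, here is a minimal Python sketch. The function names `join_words` and `line_matches_words` are hypothetical and not part of any spec or tool; the sketch assumes the word-level text is available as a plain list of strings, with punctuation already attached to the preceding word.

```python
def join_words(words):
    """Build line-level text from word-level strings (hypothetical helper)."""
    # Strip each word and drop empty ones, so the result can contain
    # neither leading/trailing whitespace nor double spaces.
    cleaned = [w.strip() for w in words]
    return " ".join(w for w in cleaned if w)


def line_matches_words(line_text, words):
    """Return True if the line text equals the words joined by single spaces."""
    return line_text == join_words(words)
```

With this, `line_matches_words("Hello, world!", ["Hello,", "world!"])` holds, while a line with leading or doubled spaces would fail the check.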

The technical background for all this is that, in principle, LSTMs cannot reliably (learn to) represent a sequence of whitespace characters, because there is nothing overt/visual that could be propagated. So forcing multiple whitespace characters during training can be expected to make the models less robust, not only around whitespace but also in general. And metrics in turn influence how models are made and evaluated.

kba · 2023-06-08T16:30:52Z

I also think that

- consecutive whitespace should be normalized and
- trailing/leading whitespace removed

i.e.

sed -E -e 's/^\s*//' -e 's/\s*$//' -e 's/\s{2,}/ /g'
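For anyone who would rather apply this in Python than in sed, the same three steps could look roughly like the sketch below. The name `normalize_whitespace` is hypothetical, and the sketch assumes the input is the text of a single line. Like the sed expression, it only collapses runs of two or more whitespace characters, so a lone tab between words is left as it is.

```python
import re


def normalize_whitespace(text):
    """Strip leading/trailing whitespace and collapse whitespace runs."""
    # Remove leading whitespace (sed: s/^\s*//).
    text = re.sub(r"\A\s+", "", text)
    # Remove trailing whitespace (sed: s/\s*$//).
    text = re.sub(r"\s+\Z", "", text)
    # Replace runs of two or more whitespace characters with one space.
    return re.sub(r"\s{2,}", " ", text)
```

If every whitespace character should end up as a plain space, the last pattern could be `\s+` instead of `\s{2,}`; whether that is wanted is a question for the spec.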
