QA Specs: How to deal with consecutive white spaces #237

Open
mweidling opened this issue Feb 9, 2023 · 3 comments

@mweidling
Contributor

The specification currently makes no suggestion on how to deal with more than one consecutive white space character.

@M3ssman

M3ssman commented Jun 6, 2023

IIRC, in the GT discussions it was said that these characters are normalized, and that punctuation is attached to the preceding character without an extra space.

Maybe @tboenig can shed more light on this.

@bertsky
Collaborator

bertsky commented Jun 6, 2023

> that these chars are normalized, as well as punctuations are tied to preceding char without extra space.

That would be consistent with our GT transcription guidelines. It also ensures that the text at line level is simply the concatenation of the text at word level, joined by single spaces.

I would add the important special case of whitespace at the start and end of the line: these should be stripped.
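To make the line-level/word-level relationship concrete, here is a minimal Python sketch. The function names `normalize_whitespace` and `line_matches_words` are hypothetical and not part of any specification or existing tool; the sketch only assumes that a line's text and its words' texts are available as plain strings.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace to one space and strip both ends.

    str.split() without arguments splits on runs of whitespace and drops
    empty leading/trailing fields, so joining with " " does both steps.
    """
    return " ".join(text.split())


def line_matches_words(line_text: str, word_texts: list[str]) -> bool:
    """Check that the normalized line equals its words joined by single spaces.

    Hypothetical helper for illustration; words that are empty after
    normalization are skipped so they cannot introduce double spaces.
    """
    words = [normalize_whitespace(w) for w in word_texts]
    expected = " ".join(w for w in words if w)
    return normalize_whitespace(line_text) == expected


if __name__ == "__main__":
    # Punctuation stays attached to the preceding word ("bellt.").
    print(line_matches_words("  Der  Hund \t bellt. ", ["Der", "Hund", "bellt."]))
```

With this kind of normalization, leading and trailing whitespace and repeated spaces in the line no longer cause a mismatch against the word-level text.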

The technical background for all this is that, in principle, LSTMs cannot reliably learn to represent a sequence of whitespace characters, because there is nothing overt or visual that could be propagated. Forcing multiple whitespace characters during training can therefore be expected to make the models less robust, not only around whitespace but in general. And metrics in turn influence how models are made and evaluated.

@kba
Member

kba commented Jun 8, 2023

I also think that

  • consecutive whitespace should be normalized and
  • trailing/leading whitespace removed

i.e.

sed -E -e 's/^\s+//' -e 's/\s+$//' -e 's/\s{2,}/ /g'
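For comparison, the same three steps could be sketched in Python with the standard `re` module. The function name `normalize_like_sed` is made up for this illustration, and the patterns simply mirror the intent of the sed expressions above rather than any agreed specification.

```python
import re


def normalize_like_sed(line: str) -> str:
    """Apply the three substitutions in the same order as the sed command."""
    line = re.sub(r"^\s+", "", line)      # remove leading whitespace
    line = re.sub(r"\s+$", "", line)      # remove trailing whitespace
    line = re.sub(r"\s{2,}", " ", line)   # collapse runs of 2+ whitespace chars
    return line


if __name__ == "__main__":
    print(repr(normalize_like_sed("  foo \t  bar  \n")))
```

Note that `\s{2,}` only touches runs of two or more whitespace characters, so a single tab between two words is left as a tab. If every whitespace run, including a lone tab, should become one space, `\s+` would be the pattern to use instead; which behaviour is wanted is a question for the specification.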
