GT and PAGE spec: represent gap character #243

Open
bertsky opened this issue Mar 14, 2023 · 0 comments

bertsky commented Mar 14, 2023

IMO we are still lacking a convention for representing illegible substrings. DTABf (TEI) uses the gap element for this.

Since there is a chain of dependencies from GT to OCR training to OCR inference to OCR postcorrection, we should make this as concrete as possible without breaking existing habits. For example, in public GT datasets you often see or £ to represent this directly in the string. The obvious downside is that these substitutes can end up being confused with literal occurrences of the same characters.

If possible, we could also try to enforce a non-printable character such as ASCII bell (BEL, U+0007), substitute (SUB, U+001A) or unit separator (US, U+001F). In the simplest form, we would just use the empty string, but that only works when transcribing at character level, and OCR is trained at line level.
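
To make the trade-off concrete, here is a minimal Python sketch of the options above. The helper names (`mark_gap`, `gap_positions`, `is_xml10_char`) and the sample line are hypothetical, not part of any existing tool or of the PAGE spec. The last helper checks a candidate against the XML 1.0 `Char` production, which may matter because PAGE is an XML format.

```python
from typing import List

# Candidate non-printable gap markers named above (ASCII control codes).
GAP_CANDIDATES = {
    "BEL": "\x07",  # bell
    "SUB": "\x1a",  # substitute
    "US": "\x1f",   # unit separator
}


def mark_gap(line: str, start: int, end: int, marker: str) -> str:
    """Replace the illegible span line[start:end] with a single marker."""
    return line[:start] + marker + line[end:]


def gap_positions(line: str, marker: str) -> List[int]:
    """Return the character indices at which the marker occurs in a line."""
    return [i for i, ch in enumerate(line) if ch == marker]


def is_xml10_char(ch: str) -> bool:
    """Check one character against the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Hypothetical line in which two characters (indices 6 and 7) are illegible.
line = "the qu??k brown fox"

with_sub = mark_gap(line, 6, 8, GAP_CANDIDATES["SUB"])
with_empty = mark_gap(line, 6, 8, "")

print(gap_positions(with_sub, GAP_CANDIDATES["SUB"]))  # gap stays locatable
print(repr(with_empty))                                # gap is no longer visible
print({name: is_xml10_char(ch) for name, ch in GAP_CANDIDATES.items()})
```

With an explicit marker, the gap stays locatable in the line-level string. With the empty string, nothing in the line shows that anything is missing. The last check suggests that none of the three control characters could be stored literally in an XML 1.0 document, so a convention built on them would need some form of escaping or a different representation at the PAGE level.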
