Format question #9
This comes from the source material. HDT development started ~20 years ago with scraped HTML3, and we converted the already tokenized HDT to UD. Obtaining the untokenized text is possible (the correspondences are kept in some extra files) but not trivial. Automatically un-tokenizing it would be odd: everyone else could do that themselves, and it would give the impression of being the original text even though it is not. Therefore, we do not provide non-tokenized text.
The text attribute should contain the untokenized sentence. For treebanks where it is available, it contains the original text before any processing started, which is of course highly preferable. However, there are many other treebanks for which the original is not available, and in those treebanks the de-tokenization has been estimated automatically using heuristics. So if HDT goes that way, it will definitely not be the only treebank to do so. I would say that this approach is preferable (while the README should explain that it is not the original text). If people train an end-to-end model, such as UDPipe, on a treebank where there are spaces between all tokens, they will get a model that is useless.
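The heuristic de-tokenization mentioned above could be sketched roughly as follows (a toy illustration only, not the actual UD conversion tooling; the function name and punctuation rules are invented for this example):

```python
def detokenize(tokens):
    """Join a list of tokens into running text, suppressing the space
    before closing punctuation and after opening brackets.

    This is a minimal heuristic sketch; real treebank scripts handle
    many more cases (quotes, contractions, multiword tokens, ...).
    """
    no_space_before = {".", ",", ";", ":", "!", "?", ")", "]", "%"}
    no_space_after = {"(", "["}
    parts = []
    for i, tok in enumerate(tokens):
        # Insert a space unless a punctuation rule forbids it.
        if i > 0 and tok not in no_space_before and tokens[i - 1] not in no_space_after:
            parts.append(" ")
        parts.append(tok)
    return "".join(parts)


# Example: a tokenized German sentence rejoined into plain text.
print(detokenize(["Das", "ist", "ein", "Test", "."]))   # Das ist ein Test.
print(detokenize(["(", "siehe", "oben", ")"]))          # (siehe oben)
```

A model trained on such estimated text at least sees realistic spacing, which is the point of Dan's argument above.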
bump - the tokenization is still essentially useless for training a model, per @dan-zeman's explanation above.
Hi guys,
I'm currently wondering why the "text" line contains the tokenized sentence, whereas the "text" line in the German GSD corpus contains the un-tokenized sentence.
Is there any format specification that defines whether the tokenized or untokenized sentence should be used there? 🤔
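For reference, the CoNLL-U format stores the raw sentence in a `# text` comment and records missing spaces between tokens via `SpaceAfter=No` in the MISC column, which is what GSD does. A minimal sketch with an invented sentence (non-word columns abbreviated with `_`):

```
# text = Ein kurzer Satz.
1	Ein	ein	DET	_	_	3	det	_	_
2	kurzer	kurz	ADJ	_	_	3	amod	_	_
3	Satz	Satz	NOUN	_	_	0	root	_	SpaceAfter=No
4	.	.	PUNCT	_	_	3	punct	_	_
```

Here the `SpaceAfter=No` on token 3 lets tools reconstruct "Satz." without a space before the period.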