Severe errors in source texts in sentences from TIGER #12
Comments
Thanks for the detailed analysis! I will look into this issue (although I agree that it may be difficult to fix everything).
Reinserting FixTigerDep comments for APPRART insertions that still need to be reannotated. Additionally a few internal marks on inserted tokens and lemmas have been removed as intended after manual inspection/updates.
Some of the issues may have been fixed in #15 and #16 but we should check #17 (or the source at master...adrianeboyd:UD_German-GSD:mateposfeats-tiger-inserted) for any fixes that did not make it to the dev branch. Furthermore, there are still 360 instances of |
There are major errors (words missing, punctuation in the wrong order) in sentences derived from the TIGER treebank. Something seems to have gone terribly wrong in pre-processing steps or in the dependency conversion process.
Looking for nearly identical sentences in UD and TIGER, I can find around 1450 sentences that appear to come from TIGER and about 740 of these contain errors in the source text. Approximately 310 only concern punctuation, while the remaining ~430 additionally involve missing sentence-initial words.
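A minimal sketch of how such a near-identical sentence search could be set up, not the exact heuristics used for this report. It assumes both corpora are already loaded as dictionaries from sentence id to token list; the function names, the quote and dash normalization, and the 0.8 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher

# Tokens treated as quotes for matching purposes (an assumption of this sketch).
QUOTE_TOKENS = {'"', "``", "''"}


def normalize(tokens):
    """Drop quote tokens and map "--" to "-" so that punctuation noise
    does not prevent two sentences from being paired."""
    return ["-" if tok == "--" else tok for tok in tokens if tok not in QUOTE_TOKENS]


def similarity(ud_tokens, tiger_tokens):
    """Token-level similarity in [0, 1] after normalization."""
    matcher = SequenceMatcher(
        None, normalize(ud_tokens), normalize(tiger_tokens), autojunk=False
    )
    return matcher.ratio()


def find_near_matches(ud_sents, tiger_sents, threshold=0.8):
    """Return (ud_id, tiger_id, ratio) for the best TIGER candidate of each
    UD sentence whose similarity reaches the threshold.

    This brute-force loop is quadratic; a real run over full corpora would
    need some pre-filtering (for example by sentence length).
    """
    matches = []
    for ud_id, ud_tokens in ud_sents.items():
        best = None
        for tiger_id, tiger_tokens in tiger_sents.items():
            ratio = similarity(ud_tokens, tiger_tokens)
            if ratio >= threshold and (best is None or ratio > best[2]):
                best = (ud_id, tiger_id, ratio)
        if best is not None:
            matches.append(best)
    return matches
```

Pairs that score below 1.0 are the candidates for source-text errors; exact matches are sentences that were carried over intact.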
The errors are not equally distributed across the UD subcorpora. They affect 2% of the training corpus, 17% of the development corpus, and 29% (!) of the test corpus. A full half of the sentences with missing initial words are in the test corpus, which means that 22% of sentences in the test corpus are missing the first word in the sentence.
The problems are described in detail below, but here is a quick summary of the distribution of errors:
Because the errors involve missing and misordered tokens, fixing things would require a fair amount of reannotation. I don't know what is reasonable to do/expect within the constraints of the UD project and obviously some annotation errors and noise are expected in any corpus, but this seems egregious, especially to this degree in the dev/test corpora.
These kinds of artificially ill-formed sentences do not really seem to be representative of German, which is concerning when the UD corpora are being used more and more for development and evaluation. I would at a minimum propose marking the problematic sentences somehow, especially the ones with missing words, so that developers can exclude them as desired.
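One way the proposed marking could look, as a sketch only: CoNLL-U allows sentence-level comment lines starting with `#`, so a flag can be stored next to `# sent_id`. The comment key `source_text_error` and both function names are hypothetical and not an existing UD convention.

```python
SENT_ID_PREFIX = "# sent_id = "


def mark_sentences(conllu_text, flagged, key="source_text_error"):
    """Add a '# key = value' comment after the sent_id line of each flagged
    sentence. `flagged` maps sentence ids to an error label."""
    out = []
    for line in conllu_text.split("\n"):
        out.append(line)
        if line.startswith(SENT_ID_PREFIX):
            sent_id = line[len(SENT_ID_PREFIX):].strip()
            if sent_id in flagged:
                out.append(f"# {key} = {flagged[sent_id]}")
    return "\n".join(out)


def drop_marked(conllu_text, key="source_text_error"):
    """Return the CoNLL-U text without the sentences carrying the marker."""
    marker = f"# {key} = "
    blocks = conllu_text.strip("\n").split("\n\n")  # one block per sentence
    kept = [
        block
        for block in blocks
        if not any(line.startswith(marker) for line in block.split("\n"))
    ]
    # Every CoNLL-U sentence is terminated by a blank line.
    return "".join(block + "\n\n" for block in kept)
```

With a marker like this, the sentences stay in the treebank, and anyone training or evaluating on the data can decide whether to call something like `drop_marked` first.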
Problems
The problems I've found:
--
" (a character that does not appear in TIGER*) is added at the beginning and/or end of a full sentence (where often the first word is missing, too) or appears as a normalization of `` sentence-initially (almost exclusively in the test corpus)
22 . Oktober)
Examples
Here is a sentence that shows problems 1-3 (train-s2181):
The original sentence from TIGER is:
Ihr is missing, `` is reversed, and PC-Welt becomes PC -- Welt.
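Two of these differences (a missing sentence-initial word and a hyphenated word split around `--`) can be detected mechanically from an aligned sentence pair. The following is a rough sketch under that assumption; the function name and labels are hypothetical, the token lists in the usage example are invented toy data rather than the actual corpus sentences, and quote reversal is not covered.

```python
from difflib import SequenceMatcher


def classify_mismatch(ud_tokens, tiger_tokens):
    """Label differences between a UD sentence and its TIGER source.

    Returns a list that may contain "missing_initial_word" and/or
    "hyphen_split".
    """
    labels = []
    matcher = SequenceMatcher(None, tiger_tokens, ud_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # TIGER tokens at the very start that have no UD counterpart.
        if tag == "delete" and i1 == 0 and "missing_initial_word" not in labels:
            labels.append("missing_initial_word")
        # A TIGER span replaced by UD tokens that spell the same string
        # once "--" is read as "-" (e.g. PC-Welt vs. PC -- Welt).
        elif tag == "replace" and "--" in ud_tokens[j1:j2]:
            ud_joined = "".join(ud_tokens[j1:j2]).replace("--", "-")
            if ud_joined == "".join(tiger_tokens[i1:i2]) and "hyphen_split" not in labels:
                labels.append("hyphen_split")
    return labels


# Invented toy pair, not taken from either treebank.
tiger = ["Ihr", "Heft", "heißt", "PC-Welt", "."]
ud = ["Heft", "heißt", "PC", "--", "Welt", "."]
print(classify_mismatch(ud, tiger))
```

Labels of this kind would be one possible source for the error categories recorded per sentence.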
Here is another sentence (test-s544) with problems 3-4:
And the original TIGER sentence:
As you would expect, the missing words often lead to sentences that are not well-formed (test-s545):
Instead of:
(Verletzungen is annotated as dep and has no morphological features.)
And the even more entertaining (test-s374):
Instead of:
(Touristin is indeed annotated as nsubj, deutscher and Touristin are Case=Nom, and deutscher is somehow Degree=cmp,pos!)
Detailed Results
After normalizing emdash "--" vs. "-", ignoring cases that result in matched rather than mismatched quotes, and skipping full sentences that were merely embedded in longer sentences, I have found:
I've attached a summary of the mismatches with the following columns:
I've removed a number of cases by hand that were accidentally caught by my simple heuristics or that didn't seem problematic (typically a full sentence from a quote within a longer sentence, with an initial list numbering or dash, or with an intro like Auch: or FR: or Richter: or Klartext:). I've left a few cases in categories 1-3 where there are differences in punctuation within an embedded sentence (so they are more like category 0 in effect, which is reflected in the counts in the table in the introduction). I would not be surprised if there are still some errors in this list, either cases that are not problematic or cases from TIGER that I didn't detect.
*To be accurate: ASCII double quotes do appear a few times in TIGER, but they look like mistakes.
ud-tiger-misalignments.csv.txt