Severe errors in source texts in sentences from TIGER #12

Open
adrianeboyd opened this issue Feb 5, 2018 · 2 comments

@adrianeboyd

There are major errors (words missing, punctuation in the wrong order) in sentences derived from the TIGER treebank. Something seems to have gone terribly wrong in pre-processing steps or in the dependency conversion process.

Looking for nearly identical sentences in UD and TIGER, I can find around 1450 sentences that appear to come from TIGER and about 740 of these contain errors in the source text. Approximately 310 only concern punctuation, while the remaining ~430 additionally involve missing sentence-initial words.
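The matching procedure itself is not spelled out here. As a rough illustration only, one way to pair up nearly identical sentences is to strip everything except letters and digits and compare the remaining strings with Python's `difflib`. The function names and the 0.9 threshold below are assumptions made for this sketch, not the method actually used for the counts above.

```python
from difflib import SequenceMatcher


def match_key(tokens):
    """Collapse a token list to its letters and digits only."""
    return "".join(ch for ch in "".join(tokens) if ch.isalnum())


def similarity(ud_tokens, tiger_tokens):
    """Character-level similarity in [0, 1], ignoring punctuation and spacing."""
    return SequenceMatcher(
        None, match_key(ud_tokens), match_key(tiger_tokens)
    ).ratio()


def best_match(ud_tokens, tiger_sentences, threshold=0.9):
    """Return (score, tiger_tokens) for the closest TIGER sentence, or None."""
    scored = [(similarity(ud_tokens, t), t) for t in tiger_sentences]
    score, tokens = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    return (score, tokens) if score >= threshold else None


# Token lists taken from the examples in this issue (test-s544, test-s374).
ud_s544 = ('" verletzt wurde eine Korrespondentin des deutschen '
           'ARD -- Fernsehens .').split()
tiger_candidates = [
    "Leicht verletzt wurde eine Korrespondentin des deutschen ARD-Fernsehens .".split(),
    "Mörder deutscher Touristin muß lebenslang in Haft".split(),
]

result = best_match(ud_s544, tiger_candidates)
print(result)
```

Comparing every UD sentence against every TIGER sentence this way is quadratic, so a real run would probably need an index (for example on a few sentence-final tokens) to narrow the candidates first.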

The errors are not equally distributed across the UD subcorpora. They affect 2% of the training corpus, 17% of the development corpus, and 29% (!) of the test corpus. A full half of the sentences with missing initial words are in the test corpus, which means that 22% of sentences in the test corpus are missing the first word in the sentence.

The problems are described in detail below, but here is a quick summary of the distribution of errors:

Subcorpus   Punctuation Only   Missing Words (and maybe also Punct.)
Train       162                131
Dev         80                 85
Test        63                 220
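The totals and the share of missing-word sentences per subcorpus can be re-derived from the table. The short check below uses only the six counts shown there; the dictionary layout and variable names are just for illustration.

```python
# Counts copied from the summary table above.
counts = {
    "Train": {"punct_only": 162, "missing_words": 131},
    "Dev": {"punct_only": 80, "missing_words": 85},
    "Test": {"punct_only": 63, "missing_words": 220},
}

total_punct = sum(c["punct_only"] for c in counts.values())
total_missing = sum(c["missing_words"] for c in counts.values())
total_errors = total_punct + total_missing

# Share of all missing-word sentences that fall in each subcorpus.
missing_share = {
    name: c["missing_words"] / total_missing for name, c in counts.items()
}

print(total_punct, total_missing, total_errors)  # 305 436 741
for name, share in missing_share.items():
    print(f"{name}: {share:.1%} of missing-word sentences")
```

The test corpus accounts for roughly half of the missing-word sentences, which is consistent with the figure given above.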

Because the errors involve missing and misordered tokens, fixing things would require a fair amount of reannotation. I don't know what is reasonable to do/expect within the constraints of the UD project and obviously some annotation errors and noise are expected in any corpus, but this seems egregious, especially to this degree in the dev/test corpora.

These kinds of artificially ill-formed sentences do not really seem to be representative of German, which is concerning when the UD corpora are being used more and more for development and evaluation. I would at a minimum propose marking the problematic sentences somehow, especially the ones with missing words, so that developers can exclude them as desired.
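Until such marking exists, a developer with a list of affected sentence IDs could filter them out directly. The following is a minimal sketch, assuming standard CoNLL-U input in which each sentence carries a `# sent_id = ...` comment. The function names are hypothetical, and the toy sample (including the ID `test-s1`) is invented for illustration rather than copied from the treebank.

```python
def split_sentences(conllu_text):
    """Yield CoNLL-U sentence blocks (lists of lines separated by blank lines)."""
    block = []
    for line in conllu_text.splitlines():
        if line.strip():
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        yield block


def sentence_id(block):
    """Return the value of the '# sent_id = ...' comment, or None if absent."""
    for line in block:
        if line.startswith("# sent_id") and "=" in line:
            return line.partition("=")[2].strip()
    return None


def drop_flagged(conllu_text, flagged_ids):
    """Return CoNLL-U text without the sentences whose ID is in flagged_ids."""
    return "".join(
        "\n".join(block) + "\n\n"
        for block in split_sentences(conllu_text)
        if sentence_id(block) not in flagged_ids
    )


# Invented two-sentence sample; only the ID test-s544 comes from this issue.
sample = (
    "# sent_id = test-s544\n"
    "1\tverletzt\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "# sent_id = test-s1\n"
    "1\tHallo\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

filtered = drop_flagged(sample, {"test-s544"})
print(filtered)
```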

Problems

The problems I've found:

  1. The first token is missing in many sentences
  2. The order of adjacent sentence-internal punctuation tokens is reversed
  3. Hyphens from compounds have been converted to --
  4. ASCII double quote " (a character that does not appear in TIGER*) is added at the beginning and/or end of a full sentence (where often the first word is missing, too) or appears as a normalization of `` sentence-initially (almost exclusively in the test corpus)
  5. Ordinal numbers are split incorrectly (or at the very least inconsistently) into two tokens (e.g., 22 . Oktober)

Examples

Here is a sentence that shows problems 1-3 (train-s2181):

Chef Andy Grove sieht die größte Herausforderung darin `` , alles zu 
tun , um die Zahl der Nutzer in der PC -- Welt zu steigern '' .

The original sentence from TIGER is:

Ihr Chef Andy Grove sieht die größte Herausforderung darin , `` alles zu 
tun , um die Zahl der Nutzer in der PC-Welt zu steigern '' .
  • Ihr is missing
  • , `` is reversed
  • PC-Welt becomes PC -- Welt
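
For illustration, the three differences in this pair can be detected mechanically. The sketch below is not the script used to produce the counts in this issue; the function names, labels, and punctuation set are assumptions, and it handles only the simple case where the UD tokens line up with the tail of the TIGER sentence.

```python
PUNCT = {",", ".", "``", "''", '"', ":", ";", "(", ")"}


def rejoin_hyphens(ud_tokens, tiger_tokens):
    """Merge 'PC', '--', 'Welt' into 'PC-Welt' when TIGER has the joined form."""
    tiger_set = set(tiger_tokens)
    out, i = [], 0
    while i < len(ud_tokens):
        if i + 2 < len(ud_tokens) and ud_tokens[i + 1] == "--":
            joined = f"{ud_tokens[i]}-{ud_tokens[i + 2]}"
            if joined in tiger_set:
                out.append(joined)
                i += 3
                continue
        out.append(ud_tokens[i])
        i += 1
    return out


def describe_mismatch(ud_tokens, tiger_tokens):
    """Return a set of labels for the differences between UD and TIGER tokens."""
    problems = set()
    ud = rejoin_hyphens(ud_tokens, tiger_tokens)
    if ud != ud_tokens:
        problems.add("hyphen_split")
    # Number of tokens TIGER has at the start that UD lacks (if any).
    missing = len(tiger_tokens) - len(ud)
    tiger_rest = tiger_tokens[missing:] if missing > 0 else tiger_tokens
    if missing > 0 and sorted(ud) == sorted(tiger_rest):
        problems.add("missing_initial")
    if len(ud) == len(tiger_rest):
        for i in range(len(ud) - 1):
            swapped = (ud[i], ud[i + 1]) == (tiger_rest[i + 1], tiger_rest[i])
            if swapped and ud[i] != ud[i + 1] and {ud[i], ud[i + 1]} <= PUNCT:
                problems.add("punct_reversed")
    return problems


# The train-s2181 pair quoted above.
ud = ("Chef Andy Grove sieht die größte Herausforderung darin `` , alles zu "
      "tun , um die Zahl der Nutzer in der PC -- Welt zu steigern '' .").split()
tiger = ("Ihr Chef Andy Grove sieht die größte Herausforderung darin , `` alles zu "
         "tun , um die Zahl der Nutzer in der PC-Welt zu steigern '' .").split()

found = describe_mismatch(ud, tiger)
print(sorted(found))  # ['hyphen_split', 'missing_initial', 'punct_reversed']
```

Rejoining `--` only when the hyphenated form exists in the TIGER sentence avoids merging tokens around a genuine dash.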

Here is another sentence (test-s544) with problems 3-4:

" verletzt wurde eine Korrespondentin des deutschen ARD -- Fernsehens .

And the original TIGER sentence:

Leicht verletzt wurde eine Korrespondentin des deutschen ARD-Fernsehens .

As you would expect, the missing words often lead to sentences that are not well-formed (test-s545):

" Verletzungen können Zahlungen und Handelserleichterungen künftig 
ausgesetzt werden .

Instead of:

Bei Verletzungen können Zahlungen und Handelserleichterungen künftig 
ausgesetzt werden .

(Verletzungen is annotated as dep and has no morphological features.)

And the even more entertaining (test-s374):

deutscher Touristin muß lebenslang in Haft

Instead of:

Mörder deutscher Touristin muß lebenslang in Haft

(Touristin is indeed annotated as nsubj, deutscher and Touristin are Case=Nom, and deutscher is somehow Degree=cmp,pos!)

Detailed Results

After normalizing the double hyphen "--" vs. "-", ignoring cases that result in matched rather than mismatched quotes, and skipping full sentences that were merely embedded in longer sentences, I have found:

  • 302 sentences that differ only in punctuation presence, appearance, and/or order
  • 432 sentences that differ in initial words (and maybe also punctuation)
  • 10 sentences that differ in final or both initial and final tokens (and maybe also punctuation)

I've attached a summary of the mismatches with the following columns:

  1. Category / Degree (0: only punctuation, 1: punctuation + sentence-initial < ~25 letters, 2: punctuation + sentence-initial > ~25 letters, 3: other overlap)
  2. UD Sentence ID
  3. UD Tokens
  4. TIGER Tokens (hyphenated compounds remain unsplit)

I've removed a number of cases by hand that were accidentally caught by my simple heuristics or that didn't seem problematic (typically a full sentence from a quote within a longer sentence, with an initial list numbering or dash, or with an intro like Auch: or FR: or Richter: or Klartext:). I've left a few cases in categories 1-3 where there are differences in punctuation within an embedded sentence (so they are more like category 0 in effect, which is reflected in the counts in the table in the introduction). I would not be surprised if there are still some errors in this list, either cases that are not problematic or cases from TIGER that I didn't detect.

*To be accurate: ASCII double quotes do appear a few times in TIGER, but they look like mistakes.

ud-tiger-misalignments.csv.txt

@dan-zeman
Member

Thanks for the detailed analysis! I will look into this issue (although I agree that it may be difficult to fix everything).

@dan-zeman self-assigned this Feb 5, 2018
@dan-zeman added the bug label Feb 5, 2018
adrianeboyd pushed a commit to adrianeboyd/UD_German-GSD that referenced this issue Mar 22, 2018
adrianeboyd pushed a commit to adrianeboyd/UD_German-GSD that referenced this issue Mar 22, 2018
Reinserting FixTigerDep comments for APPRART insertions that still need
to be reannotated. Additionally a few internal marks on inserted tokens
and lemmas have been removed as intended after manual
inspection/updates.
@dan-zeman
Member

Some of the issues may have been fixed in #15 and #16 but we should check #17 (or the source at master...adrianeboyd:UD_German-GSD:mateposfeats-tiger-inserted) for any fixes that did not make it to the dev branch.

Furthermore, there are still 360 instances of FixTigerDep=Yes in the MISC column. Those should be checked manually before closing this issue.
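
A quick way to re-count the remaining flags is to scan the tenth (MISC) column of the CoNLL-U file. This is a minimal sketch that assumes the standard 10-column tab-separated layout with `|`-separated MISC entries. The function name is hypothetical, and the two sample rows are invented for illustration, not taken from the treebank.

```python
def count_misc_flag(conllu_text, flag="FixTigerDep=Yes"):
    """Count token lines whose MISC column (10th field) contains the flag."""
    count = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank separators and sentence-level comments
        columns = line.split("\t")
        if len(columns) != 10:
            continue  # not a well-formed CoNLL-U token line
        if flag in columns[9].split("|"):
            count += 1
    return count


# Invented sample: one flagged token, one unflagged token.
sample = (
    "# sent_id = example\n"
    "1\tBei\t_\t_\t_\t_\t_\t_\t_\tFixTigerDep=Yes\n"
    "2\tVerletzungen\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No\n"
    "\n"
)

print(count_misc_flag(sample))  # 1
```

To check a real file, read it with `open(path, encoding="utf-8").read()` and pass the text to `count_misc_flag`.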
