Morphological tags need verification / sanity checks #14

Closed
adrianeboyd opened this issue Feb 5, 2018 · 3 comments

@adrianeboyd

Along with improvements from TIGER suggested in #13, all the morphological tags could benefit from some verification and basic sanity checks. Things like:

Function words:

  • sind: 278 singular (389 plural)
  • der: 5 acc sing, 1 nom pl
  • wird: 8 plural
  • eines: 2 dat, 2 acc, 1 pl (238 gen sing)
  • einem: 1 nom

Common gender-specific noun endings:

  • words ending in -ung: 34 plural (2663 singular)
  • words ending in -ungen: 70 singular (558 plural)
  • words ending in -[hk]eit: 6 plural (340 singular)
  • words ending in -[hk]eiten: 10 singular (89 plural)
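Counts like the ones above can be reproduced with a short script. The following is a minimal sketch, assuming the treebank is in the standard 10-column CoNLL-U format with features such as `Number=Sing` in the FEATS column; the function names (`parse_feats`, `iter_tokens`, `number_counts`) are illustrative and not part of any existing tool.

```python
import re
from collections import Counter


def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def iter_tokens(conllu_text):
    """Yield (form, feats_dict) for each regular token line in CoNLL-U text."""
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        yield cols[1], parse_feats(cols[5])


def number_counts(conllu_text, pattern):
    """Count Number values over tokens whose lowercased form fully matches pattern."""
    regex = re.compile(pattern)
    counts = Counter()
    for form, feats in iter_tokens(conllu_text):
        if regex.fullmatch(form.lower()):
            # "_" stands for tokens that carry no Number feature at all
            counts[feats.get("Number", "_")] += 1
    return counts
```

For example, `number_counts(text, r"sind")` tallies the Number values assigned to sind, and `number_counts(text, r".*ungen")` does the same for all forms ending in -ungen.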

Simple NPs like die Arbeitgeber are frequently feminine singular even though the NP can only be plural.

How were the morphological tags generated? They seem to rely too much on questionable parse trees or on tags for words that are ambiguous out of context. (In Zimmer sind..., does Zimmer being incorrectly tagged as singular lead to sind being tagged as singular?) Coordinated subjects also seem to cause many inconsistencies in verb number, where it looks like one singular feature in the coordination is somehow passed on to the verb.
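The coordinated-subject pattern can be searched for directly in the dependency annotation. This is a rough sketch, assuming standard CoNLL-U columns and UD relations (`nsubj`, `conj`); the function names are hypothetical. It flags heads marked `Number=Sing` whose subject has a `conj` dependent. In copula constructions the finite verb is not the head of the subject, so those cases would need a separate check, and the hits are candidates for manual review rather than confirmed errors.

```python
def read_sentences(conllu_text):
    """Yield each sentence as a list of 10-column token rows (regular tokens only)."""
    sentence = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) == 10 and cols[0].isdigit():
            sentence.append(cols)
    if sentence:
        yield sentence


def singular_heads_of_coordinated_subjects(conllu_text):
    """Return (head form, subject form) pairs where a coordinated subject has a singular head."""
    hits = []
    for sent in read_sentences(conllu_text):
        # IDs of tokens that have at least one conj dependent
        has_conj = {c[6] for c in sent if c[7].split(":")[0] == "conj"}
        by_id = {c[0]: c for c in sent}
        for cols in sent:
            if cols[7].split(":")[0] == "nsubj" and cols[0] in has_conj:
                head = by_id.get(cols[6])
                if head and "Number=Sing" in head[5].split("|"):
                    hits.append((head[1], cols[1]))
    return hits
```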

I can see some cases where grammatical errors make the choice of tags/features complicated (i.e., what information from the three possible sources (distribution, morphological marking, lexical stem) do you rely on? see Diaz-Negrillo et al. (2009)), but these cases are rare in comparison to grammatical / unambiguous cases with obvious errors.

@adrianeboyd

I see that I didn't read the README carefully enough:

Morphological features were assigned using rules based on the values of the other columns (UPOSTAG, XPOSTAG, LEMMA, FORM, DEPREL). Gender, number and case of nouns and their det/amod children are based on the (manual) syntactic annotation, e.g. nsubj => nominative. They should have high precision but lower recall because we did not add them where the context did not provide enough clues (morphological analyzer / lexicon was not used at this stage).

The precision of some of the features (particularly number and gender) is very low because the other columns definitely do not provide enough information to predict this (unless you also have a lexicon and a morphological analyzer, of course).

If this is the approach, many cases that currently have annotation should be underspecified instead, and some of the rules need to be updated. As an example, German has zero plurals, so an identical form and lemma do not mean that a word is singular. As a result, Zimmer should have no Number value. Die Wagen should be underspecified instead of feminine singular (it is actually masculine plural). Coordinated subjects should have plural verbs.
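One way to implement the underspecification for zero plurals is sketched below. It assumes CoNLL-U token rows and a hand-made list of zero-plural lemmas; `ZERO_PLURAL_LEMMAS`, `drop_feature` and `underspecify_number` are hypothetical names, and the three lemmas are only the examples mentioned in this issue, not a real lexicon.

```python
# Illustrative only: a real list would come from a lexicon or morphological analyzer.
ZERO_PLURAL_LEMMAS = {"Zimmer", "Wagen", "Arbeitgeber"}


def drop_feature(feats, name):
    """Remove one feature from a CoNLL-U FEATS string, returning '_' if none remain."""
    if feats == "_":
        return feats
    kept = [f for f in feats.split("|") if f.split("=", 1)[0] != name]
    return "|".join(kept) if kept else "_"


def underspecify_number(cols):
    """Drop Number from nouns whose form equals a zero-plural lemma (form alone is ambiguous)."""
    form, lemma, upos = cols[1], cols[2], cols[3]
    if upos == "NOUN" and form == lemma and lemma in ZERO_PLURAL_LEMMAS:
        cols = list(cols)  # copy so the caller's row is not modified
        cols[5] = drop_feature(cols[5], "Number")
    return cols
```

This only removes a Number value that the form cannot justify; it does not try to decide between singular and plural, which would need agreement evidence or an analyzer.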

I would argue that this kind of rule-based derivation of morphological features from insufficient evidence is worse than having no morphological information. It adds no value to the corpus, since you can just derive these (incorrect) features with a few rules, and misleads developers by providing so much incorrect data. It makes no sense to evaluate a morphological analyzer on this data as you intend to in the upcoming shared task.

I could potentially understand if an underresourced language used such an approach, but there is no need for German annotation to look like this. There is no lack of lexical resources and morphological analyzers available that could be used here.

@amir-zeldes

Tiger itself has manual gold morphological tags (at least case, number, gender, tense etc.) - wouldn't using those be the best, at least for anything from Tiger?

If non Tiger data needs to be annotated too, I think there are also decent RFTagger/Marmot models trained on the gold data which should perform much better than these rules.

@jnivre

jnivre commented Feb 7, 2018

The problem with German is not the lack of tools, but the lack of manpower. Bear in mind that UD is an open community effort with no dedicated funding. Hence, we are completely dependent on contributions from the community, and it has proven surprisingly hard to find someone who is willing to assume responsibility for cleaning up German. If anyone is interested, please let us know. :)

dan-zeman added a commit that referenced this issue Mar 27, 2018
#14 Improving POSTAG and FEATS with mate and Tiger

Thanks a lot for the great work!