Morphological tags need verification / sanity checks #14
Comments
I see that I didn't read the README carefully enough:
The precision of some of the features (particularly number and gender) is very low because the other columns definitely do not provide enough information to predict this (unless you also have a lexicon and a morphological analyzer, of course). If this is the approach, many cases that currently have annotation should be underspecified instead and some of the rules need to be updated. As an example, German has zero plurals, so an identical form and lemma do not mean a word is singular. As a result, I would argue that this kind of rule-based derivation of morphological features from insufficient evidence is worse than having no morphological information. It adds no value to the corpus, since you can just derive these (incorrect) features with a few rules, and it misleads developers by providing so much incorrect data. It makes no sense to evaluate a morphological analyzer on this data as you intend to in the upcoming shared task. I could potentially understand if an under-resourced language used such an approach, but there is no need for German annotation to look like this. There is no lack of lexical resources and morphological analyzers available that could be used here.
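To make the zero-plural point concrete, here is a minimal sketch. The function names and the tiny lexicon are hypothetical and only illustrate the argument; they are not the rules the treebank actually uses. A rule that equates "form equals lemma" with singular mislabels plural uses of nouns like Zimmer, whereas a cautious rule leaves Number unset when the form alone cannot decide.

```python
# Hypothetical illustration; not the treebank's actual conversion rules.

def naive_number(form, lemma):
    """Naive rule: a form identical to its lemma is assumed to be singular."""
    return "Sing" if form == lemma else None


def cautious_number(form, lemma, zero_plural_lemmas):
    """Leave Number underspecified (None) whenever the form cannot decide.

    zero_plural_lemmas is an assumed lexicon of nouns whose plural
    is identical to the singular (e.g. Zimmer, Arbeitgeber).
    """
    if form == lemma and lemma not in zero_plural_lemmas:
        return "Sing"
    return None


ZERO_PLURALS = {"Zimmer", "Arbeitgeber"}  # tiny example lexicon

# "Zimmer" in "Zimmer sind ..." is plural, but the naive rule says singular.
print(naive_number("Zimmer", "Zimmer"))
print(cautious_number("Zimmer", "Zimmer", ZERO_PLURALS))
```

Even the cautious variant needs a lexicon, which is the point of the comment above: without one, the honest output for such tokens is no Number feature at all.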
Tiger itself has manual gold morphological tags (at least case, number, gender, tense etc.). Wouldn't using those be best, at least for anything from Tiger? If non-Tiger data needs to be annotated too, I think there are also decent RFTagger/Marmot models trained on the gold data, which should perform much better than these rules.
The problem with German is not the lack of tools, but the lack of manpower. Bear in mind that UD is an open community effort with no dedicated funding. Hence, we are completely dependent on contributions from the community, and it has proven surprisingly hard to find someone who is willing to assume responsibility for cleaning up German. If anyone is interested, please let us know. :)
#14 Improving POSTAG and FEATS with mate and Tiger
Thanks a lot for the great work!
Along with improvements from TIGER suggested in #13, all the morphological tags could benefit from some verification and basic sanity checks. Things like:
Function words:
- sind: 278 singular (389 plural)
- der: 5 acc sing, 1 nom pl
- wird: 8 plural
- eines: 2 dat, 2 acc, 1 pl (238 gen sing)
- einem: 1 nom

Common gender-specific noun endings:
- -ung: 34 plural (2663 singular)
- -ungen: 70 singular (558 plural)
- -[hk]eit: 6 plural (340 singular)
- -[hk]eiten: 10 singular (89 plural)

Simple NPs like die Arbeitgeber are frequently tagged as feminine singular even though the NP can only be plural.

How were the morphological tags generated? They seem to rely too much on questionable parse trees or on tags for words that are ambiguous out of context. (Is Zimmer sind... singular because Zimmer, tagged incorrectly as singular, leads to sind being singular?) Coordinated subjects also seem to lead to lots of inconsistencies in verb number, where it looks like one singular feature in the coordination is somehow passed down to the verb.

I can see some cases where grammatical errors make the choice of tags/features complicated (i.e., which of the three possible sources of information (distribution, morphological marking, lexical stem) do you rely on? See Diaz-Negrillo et al. (2009)), but these cases are rare in comparison to grammatical / unambiguous cases with obvious errors.
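Counts like the ones above can be reproduced with a short script. The following is a minimal sketch that assumes the treebank is in the standard 10-column, tab-separated CoNLL-U layout with FEATS in the sixth column; the helper names and the sample rows are made up for illustration and are not taken from the corpus.

```python
import re
from collections import Counter


def parse_feats(feats):
    """Turn a FEATS string such as 'Case=Nom|Number=Plur' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|") if "=" in pair)


def number_counts(conllu_text, pattern):
    """Count Number values for tokens whose lowercased form fully matches pattern."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        if regex.fullmatch(cols[1].lower()):
            counts[parse_feats(cols[5]).get("Number", "unspecified")] += 1
    return counts


# Invented sample rows (not corpus data), joined with tabs into CoNLL-U lines.
ROWS = [
    ("1", "Zimmer", "Zimmer", "NOUN", "NN", "Gender=Neut|Number=Sing", "2", "nsubj", "_", "_"),
    ("2", "sind", "sein", "AUX", "VAFIN", "Number=Sing|Person=3", "3", "cop", "_", "_"),
    ("3", "klein", "klein", "ADJ", "ADJD", "_", "0", "root", "_", "_"),
    ("1", "Meinungen", "Meinung", "NOUN", "NN", "Gender=Fem|Number=Sing", "2", "nsubj", "_", "_"),
    ("2", "sind", "sein", "AUX", "VAFIN", "Number=Plur|Person=3", "0", "root", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)

for pattern in (r"sind", r".*ungen", r".*[hk]eit"):
    print(pattern, dict(number_counts(SAMPLE, pattern)))
```

Running the same function over the real treebank files with patterns for sind, wird, -ung, -ungen, -[hk]eit and -[hk]eiten would give a quick regression check after each conversion change.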