Lemmas with two options #35
These look like ambiguous strings which could have either lemma if context is ignored, but the lemma is actually unambiguous in context. For example, the word "Montage" in the last example is ambiguous between "Mondays" and "mounting/assembly", but it is definitely the latter in context (the former would also have to be plural and the FEATS show you it isn't), so "Montage" (mounting) is the correct lemma in context. |
Ideally would those be resolved to single lemma options, then?
Also, does the upos help clarify, or only the feats?
On Fri, Aug 9, 2024, 11:04 AM Amir Zeldes wrote:
These look like ambiguous strings which could have either lemma if context
is ignored, but the lemma is actually unambiguous in context. For example,
the word "Montage" in the last example is ambiguous between "Mondays" and
"mounting/assembly", but it is definitely the latter in context (the former
would also have to be plural and the FEATS show you it isn't), so "Montage"
(mounting) is the correct lemma in context.
|
Yes, these can all be disambiguated IMO, but only some of them are trivial or close to. Spielen can only have the Lemma Spiel if it's dative plural, so that's easy. Gedacht is non-trivial, though it's probably 99% denken. Cases of the verb gedenken usually have the rare genitive case object, but that's not 100% guaranteed. In practice, it's probably fine to say "denken unless it has a genitive dependent"? Speisen is Lemma Speise if it's plural, otherwise it's Speisen. Gefallen is maybe the hardest here since both verbs are not uncommon. I would say if it has an auxiliary with the lemma sein it's probably fallen, otherwise gefallen. |
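The rules of thumb in this comment could be sketched roughly as follows. This is only an illustration, not code from the treebank's tooling: the function names, the `(lemma, upos, feats)` tuple layout for dependents, and the candidate lists passed in are all assumptions, and the Spielen rule treats "dative plural" as decisive even though the comment only states it as a necessary condition.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'Case=Dat|Number=Plur' into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def pick_lemma(form, options, feats, dependents=()):
    """Pick one lemma from `options`, or return None if no rule applies.

    dependents: iterable of (lemma, upos, feats_string) tuples describing the
    token's syntactic dependents (hypothetical layout chosen for this sketch).
    """
    f = parse_feats(feats)
    deps = [(lemma, upos, parse_feats(df)) for lemma, upos, df in dependents]
    key = form.lower()

    if key == "spielen":
        # "Spiel" only if dative plural
        preferred = "Spiel"
        holds = f.get("Case") == "Dat" and f.get("Number") == "Plur"
    elif key == "speisen":
        # "Speise" if plural, otherwise the other candidate
        preferred = "Speise"
        holds = f.get("Number") == "Plur"
    elif key == "gedacht":
        # "denken" unless there is a genitive dependent
        preferred = "gedenken"
        holds = any(d.get("Case") == "Gen" for _, _, d in deps)
    elif key == "gefallen":
        # "fallen" if there is an auxiliary with lemma "sein"
        preferred = "fallen"
        holds = any(l == "sein" and u == "AUX" for l, u, _ in deps)
    else:
        return None

    if preferred not in options:
        return None
    if holds:
        return preferred
    # Otherwise fall back to the single remaining candidate, if there is one.
    others = [o for o in options if o != preferred]
    return others[0] if len(others) == 1 else None
```

For instance, `pick_lemma("gefallen", ["fallen", "gefallen"], "VerbForm=Part", [("sein", "AUX", "_")])` selects `fallen`, while the same call without the `sein` auxiliary falls back to `gefallen`. Forms not covered by a rule return `None` and would still need manual review.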
There are 246 such ambiguous lemma strings (in 791 instances). Ideally they should be disambiguated; but I'm afraid it means mostly manual work. |
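Counts of this kind (types and instances) can be reproduced with a short script. The sketch below assumes the ambiguity is encoded as `A|B` in the LEMMA column of a CoNLL-U file; the function name and the sample lemma strings in the usage note are illustrative, not taken from the treebank.

```python
from collections import Counter


def count_ambiguous_lemmas(conllu_lines):
    """Return (distinct ambiguous lemma strings, tokens carrying one)."""
    counts = Counter()
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank line between sentences, or a comment
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id, lemma = cols[0], cols[2]
        if "-" in token_id or "." in token_id:
            continue  # multiword token range or empty node
        # Assumption: "A|B" marks two candidate lemmas. A literal "|" token
        # splits into empty strings only, so it is not counted.
        options = [part for part in lemma.split("|") if part]
        if len(options) >= 2:
            counts[lemma] += 1
    return len(counts), sum(counts.values())
```

Usage would look like `types, instances = count_ambiguous_lemmas(open(path, encoding="utf-8"))`, where `path` points at one of the treebank's `.conllu` files.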
Is there room to pick the most likely one and put a notation of the ambiguous lemma in the MISC column? It's kind of strange to have a lemmatizer pick up this dataset and learn to write two possible tags for a word.
Obviously we could do that data cleaning on our end before training.
|
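As a rough sketch of that suggestion, a token line could be rewritten so that LEMMA holds a single choice and the original candidates are kept in MISC. The `AmbigLemma` key is hypothetical (not an established MISC attribute), and the `choose` callback stands in for whatever "most likely" decision procedure is actually used.

```python
def resolve_with_misc_note(line, choose, key="AmbigLemma"):
    """Rewrite one CoNLL-U token line, keeping a single lemma.

    choose: callable(form, options) -> the lemma to keep.
    The original candidates are recorded in MISC as key=opt1,opt2
    (comma-joined, since "|" separates MISC attributes).
    Lines without an ambiguous lemma are returned unchanged, minus the
    trailing newline.
    """
    stripped = line.rstrip("\n")
    cols = stripped.split("\t")
    if len(cols) != 10:
        return stripped  # comment, blank line, or malformed line
    options = [part for part in cols[2].split("|") if part]
    if len(options) < 2:
        return stripped
    cols[2] = choose(cols[1], options)
    note = f"{key}={','.join(options)}"
    cols[9] = note if cols[9] == "_" else f"{cols[9]}|{note}"
    return "\t".join(cols)
```

With a placeholder such as `choose=lambda form, options: options[0]`, a line whose LEMMA is `denken|gedenken` comes out with LEMMA `denken` and `AmbigLemma=denken,gedenken` appended to MISC. Taking the first option is only a stand-in and says nothing about which lemma is actually more likely.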
Feel free to do cleaning on your end. In any case, you are training on the output of an old, pre-neural lemmatizer, you know that? (Although some of the data points have been checked manually, the dataset as a whole is still in the category "Lemmas: automatic".) Picking the most likely one means you know which one is the most likely. In principle, you should answer that question 246 times, separately for each lemma string. I think in the end I will ignore the principle and try some heuristics that will target multiple lexemes at once. But I do not promise that the problem will disappear completely before the next release. |
Down to 122 lemma types, 455 instances. |
Thanks, the progress here is very helpful. In terms of automated lemmas... presumably there was some effort made to make those accurate? The goal is to memorize the known lemmas and try to predict the right lemma for a previously unseen word, a situation which makes the A|B lemmas rather distressing for our users. |
Came across a few words where the lemmas are apparently one of two options. This is a little inconvenient in terms of learning how to lemmatize German. Is there a way to unify these? For example, the ge- form of verbs is usually lemmatized without the ge, but for some of these examples it's allowing forms with or without ge to be the lemma... There are others aside from these.