Feat/pluralization #28
Conversation
- Rename translations to pluralizations
- Add documentation about pluralization
- Add new format for duration translations

jarbas changes:
- revert func rename + revert file deletions (keep fallback logic)
- add enums
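For context, the `PluralCategory.CARDINAL` references in the diff suggest a CLDR-style enum. A minimal sketch of what such an enum and a pluralization helper might look like — names and behavior here are illustrative assumptions, not the PR's actual code:

```python
from enum import Enum

class PluralCategory(str, Enum):
    # CLDR distinguishes cardinal usage ("2 files") from ordinal ("2nd")
    CARDINAL = "cardinal"
    ORDINAL = "ordinal"

def get_plural_form_en(word, amount, type=PluralCategory.CARDINAL):
    """Toy English cardinal pluralizer, for illustration only."""
    if type == PluralCategory.CARDINAL and amount != 1:
        # naive rule; real forms would come from per-language data
        # such as the pluralizations.json files added in this PR
        return word + "s"
    return word
```

For example, `get_plural_form_en("month", 2)` yields `"months"` while `get_plural_form_en("month", 1)` returns the word unchanged.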
Codecov Report
```
@@          Coverage Diff           @@
##            dev     #28    +/-   ##
=====================================
  Coverage      ?   0.00%
=====================================
  Files         ?      65
  Lines         ?   16472
  Branches      ?       0
=====================================
  Hits          ?       0
  Misses        ?   16472
  Partials      ?       0
=====================================
```

Continue to review the full report at Codecov.
Force-pushed 860d1ca to 36c4de1 (Compare)

> port MycroftAI#36 + MycroftAI#37; add pt pluralizations.json; add tests

Force-pushed 36c4de1 to 3e7dd61 (Compare)
Review comment on:

```python
    return word

def get_plural_form_pt(word, amount, type=PluralCategory.CARDINAL):
```
this might be better worded "get_numerus_XX" (although Latin, it hints at the specific task - get me the correct numerus of a word given the amount; in English: grammatical number)
i like that change! if this ever gets added to mycroft we can add back a get_plural_form method for compat with imports
Since grammar is a bitch, the above only works if the sentence is in the nominative case. In English this may not matter much for the numerus (more generally: the flexion) of a noun, but unfortunately in German it does (depending on case, gender, ...).

This is a problem, since we don't know the sentence at this point. It seems to me that a localized "grammar checker" should be part of the dialog renderer, checking the mustache replacements (and their direct dependencies [POS]) at construction time.

To give an example: "während ein(es) Monat(s)" - singular, genitive - "during one month"; the case here is determined by "während".

If the dialog is written like "in {num} {period}", the correct flexion would be "in ein(em) Monat".
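A toy illustration of the point above (assumed data, not LF code): the surface form of German "Monat" depends on case as well as number, so a signature that only receives the word and an amount cannot select the right form.

```python
# Illustrative case/number table for one German noun; a real
# implementation would need full declension data per language.
FORMS = {
    ("Monat", "nominative", "singular"): "Monat",
    ("Monat", "genitive", "singular"): "Monats",   # "während eines Monats"
    ("Monat", "dative", "singular"): "Monat",      # "in einem Monat"
    ("Monat", "nominative", "plural"): "Monate",
}

def inflect(word, case, number):
    # fall back to the bare word when no form is known
    return FORMS.get((word, case, number), word)

print(inflect("Monat", "genitive", "singular"))  # Monats
```

The case argument here is exactly the information that is missing when only `(word, amount)` is passed in.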
these new functions could be used internally by the dialog renderer in core, would you like to open an issue/PR there for that discussion? :)

regarding LF, do you feel we need more helper functions? these more or less map to a standard so they are good to have, but if they are not enough let's explore further!
@NeonMariia is this related to the numeric cases we were talking about last week?
In Ukrainian and Russian, numerals change their form not only depending on the case or the singular/plural form of the dependent word, but also on the numeral itself. There are several numeral groups (7 in Ukrainian, 4 in Russian), each with its own rules for flexions. That's why putting numerals into the correct form in Ukrainian, Russian and some other synthetic languages requires dependency parsing of the sentence and POS tag determination. Below are some examples of those:
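For illustration, the well-known CLDR-style category selection for Russian integers shows how the chosen form depends on the numeral itself (the word forms below are illustrative, not from LF data):

```python
def ru_plural_category(n):
    """CLDR-style plural category for a Russian integer n."""
    if n % 10 == 1 and n % 100 != 11:
        return "one"    # 1, 21, 31, ... but not 11
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"    # 2-4, 22-24, ... but not 12-14
    return "many"       # 0, 5-20, 25-30, ...

# example forms of "minute" for each category
FORMS = {"one": "минута", "few": "минуты", "many": "минут"}

for n in (1, 2, 5, 11, 21):
    print(n, FORMS[ru_plural_category(n)])
```

Note how 11 falls into "many" while 21 is back to "one"; this is the kind of numeral-dependent rule the comment above describes, before case agreement is even considered.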
Getting back to this, I will be refactoring the new methods to have a different signature and naming according to the feedback above. This:

```python
def get_plural_form_XX(word, amount, type=PluralCategory.CARDINAL):
    ...
```

becomes:

```python
def get_numerus_XX(tokens, idx, amount, type=PluralCategory.CARDINAL):
    word = tokens[idx]
    ...
```

@NeonMariia would this be satisfactory? you could then use the tokens to perform postag or anything else you need

initially i thought about sending the raw utterance instead of tokens, but that makes the code a little more complex since we need to account for multiple instances of the same word, find its index and length etc. Is there any caveat to receiving a pre-tokenized utterance?

I want to introduce this pattern in the other places where we pass a single word as argument, so let's discuss what the ideal signature would look like that allows any language to do its own thing. I think tokens + the location of the word allow any utterance parsing needed. but the devil is in the details, what about multi-word arguments?

```python
def get_numerus_XX(utterance, substr, idx, amount, type=PluralCategory.CARDINAL):
    word = utterance[idx:idx + len(substr)]
    ...
```
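A runnable sketch contrasting the two proposed signatures, assuming a trivial English pluralizer (names mirror the discussion but are hypothetical, and `PluralCategory` is stubbed for self-containment):

```python
class PluralCategory:
    CARDINAL = "cardinal"  # stub standing in for the real enum

def get_numerus_en(tokens, idx, amount, type=PluralCategory.CARDINAL):
    # token-based proposal: the target word is tokens[idx]
    word = tokens[idx]
    return word if amount == 1 else word + "s"

def get_numerus_en_substr(utterance, substr, idx, amount,
                          type=PluralCategory.CARDINAL):
    # substring-based proposal: idx is a character offset, so a
    # multi-word argument comes out as utterance[idx:idx + len(substr)]
    word = utterance[idx:idx + len(substr)]
    return word if amount == 1 else word + "s"

print(get_numerus_en(["in", "2", "month"], 2, 2))          # months
print(get_numerus_en_substr("in 2 month", "month", 5, 2))  # months
```

The substring variant handles multi-word arguments naturally, at the cost of character offsets being more fragile than token indices.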
i don't get the last proposal. idx could also be of type …. But another thought.
the aim is to future-proof this for other random languages. is a postag enough? what if we don't have a postag model for this lang? does klingon need the previous 4 words? is there any language or better algorithm that needs the follow-up words, not the previous ones?

passing along utterance + word position provides the language with everything it needs, then it's up to the lang to use postags or something else if it needs it; some langs might get away with much simpler checks. in portuguese and english postag is not needed or can be worked around

related: LF does not have the concept of postag yet, but this is being pluginified and already used in neon, so we will need to discuss OPM integration in LF at some point, probably using the global config to assign a plugin per language i suppose
Yes, I think this is a great implementation (when we have the whole utterance and a span for a certain word) for future grammatical analysis
Certain language implementations would need a dependency model as well. I don't think a span/range is practical, since different languages would need different spans (unless the class Span is also localized - "span" would equate to the dependency-parsed chunk up to the root). How should it be used in skills? The class Token should pipeline existing analytic models. If no plugin is existent, only pass the bare tokens.
postag/dependency/whatever is internal to LF and used by the method, users/skills don't need to care about any of that; a skill only passes along the word. index is needed because the word may appear multiple times in the sentence, we can default index to None and in that case pick the first word. maybe we can reorder args a bit, below could all be valid:

```python
word = ...
utterance = ...
idx = N

plural = get_numerus(word, 2)
plural = get_numerus(word, 2, utterance)
plural = get_numerus(word, 2, utterance, idx)
```

but if we support that then english skill devs will never pass utterance and idx, meaning lang support will be defective until someone sends a patch to the skill. maybe we should force all args even if it's a little more cumbersome? to make things simple maybe the signature should be
I think we should have the option to pass just the word itself for simpler cases?
So you think the snippet above is a good public-facing api with all the optional arguments? It will require more docs and more work in skills for lang support. as a dev i like it, but thinking in a larger scope i'm concerned skill devs will be lazy and this will result in hardcoded english support most of the time.
Idk, to me it seems these optional arguments shouldn't cause any problems (but i don't have much experience with lang support in skills)
port of the following PRs
MycroftAI#167
MycroftAI#36
MycroftAI#37