Feat/pluralization #28
Conversation
- Rename translations to pluralizations
- Add documentation about pluralization
- Add new format for duration translations

jarbas changes:
- revert func rename + revert file deletions (keep fallback logic)
- add enums
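For context, the `PluralCategory.CARDINAL` references in the diff suggest a CLDR-style enum. A minimal sketch of what such an enum and a pluralization helper might look like — names and behavior here are illustrative assumptions, not the PR's actual code:

```python
from enum import Enum

class PluralCategory(str, Enum):
    # CLDR distinguishes cardinal usage ("2 files") from ordinal ("2nd")
    CARDINAL = "cardinal"
    ORDINAL = "ordinal"

def get_plural_form_en(word, amount, type=PluralCategory.CARDINAL):
    """Toy English cardinal pluralizer, for illustration only."""
    if type == PluralCategory.CARDINAL and amount != 1:
        # naive rule; real forms would come from per-language data
        # such as the pluralizations.json files added in this PR
        return word + "s"
    return word
```

For example, `get_plural_form_en("month", 2)` yields `"months"` while `get_plural_form_en("month", 1)` returns the word unchanged.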
Codecov Report
```
@@          Coverage Diff           @@
##            dev     #28    +/-   ##
=====================================
  Coverage      ?   0.00%
=====================================
  Files         ?      65
  Lines         ?   16472
  Branches      ?       0
=====================================
  Hits          ?       0
  Misses        ?   16472
  Partials      ?       0
=====================================
```

Continue to review the full report at Codecov.
Force-pushed 860d1ca to 36c4de1 (Compare)

> port MycroftAI#36 + MycroftAI#37; add pt pluralizations.json; add tests

Force-pushed 36c4de1 to 3e7dd61 (Compare)
Review comment on:

```python
    return word

def get_plural_form_pt(word, amount, type=PluralCategory.CARDINAL):
```
this might be better worded "get_numerus_XX" (although Latin, it hints at the specific task - get me the correct numerus of a word given the amount; in English: grammatical number)
i like that change! if this ever gets added to mycroft we can add back a get_plural_form method for compat with imports
Since grammar is a bitch, the above only works if the sentence is in the nominative case. In English this may not matter much for the numerus (more generally: the flexion) of a noun, but unfortunately in German it does (depending on case, gender, ...).

This is a problem, since we don't know the sentence at this point. It seems to me that a localized "grammar checker" should be part of the dialog renderer, checking the mustache replacements (and their direct dependencies [POS]) at construction time.

To give an example: "während ein(es) Monat(s)" - singular, genitive - "during one month"; the case here is determined by "während".

If the dialog is written like "in {num} {period}", the correct flexion would be "in ein(em) Monat".
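A toy illustration of the point above (assumed data, not LF code): the surface form of German "Monat" depends on case as well as number, so a signature that only receives the word and an amount cannot select the right form.

```python
# Illustrative case/number table for one German noun; a real
# implementation would need full declension data per language.
FORMS = {
    ("Monat", "nominative", "singular"): "Monat",
    ("Monat", "genitive", "singular"): "Monats",   # "während eines Monats"
    ("Monat", "dative", "singular"): "Monat",      # "in einem Monat"
    ("Monat", "nominative", "plural"): "Monate",
}

def inflect(word, case, number):
    # fall back to the bare word when no form is known
    return FORMS.get((word, case, number), word)

print(inflect("Monat", "genitive", "singular"))  # Monats
```

The case argument here is exactly the information that is missing when only `(word, amount)` is passed in.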
these new functions could be used internally by the dialog renderer in core, would you like to open an issue/PR there for that discussion? :)

regarding LF, do you feel we need more helper functions? these more or less map to a standard so they are good to have, but if they are not enough let's explore further!
@NeonMariia is this related to the numeric cases we were talking about last week?
In Ukrainian and Russian, numerals change their form not only depending on the case or the singular/plural form of the dependent word, but also on the numeral itself. There are several numeral groups (7 in Ukrainian, 4 in Russian), each with its own rules for flexions. That's why putting numerals into the correct form in Ukrainian, Russian and some other synthetic languages requires dependency parsing of the sentence and POS tag determination. Below are some examples of those:
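For illustration, the well-known CLDR-style category selection for Russian integers shows how the chosen form depends on the numeral itself (the word forms below are illustrative, not from LF data):

```python
def ru_plural_category(n):
    """CLDR-style plural category for a Russian integer n."""
    if n % 10 == 1 and n % 100 != 11:
        return "one"    # 1, 21, 31, ... but not 11
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"    # 2-4, 22-24, ... but not 12-14
    return "many"       # 0, 5-20, 25-30, ...

# example forms of "minute" for each category
FORMS = {"one": "минута", "few": "минуты", "many": "минут"}

for n in (1, 2, 5, 11, 21):
    print(n, FORMS[ru_plural_category(n)])
```

Note how 11 falls into "many" while 21 is back to "one"; this is the kind of numeral-dependent rule the comment above describes, before case agreement is even considered.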
Getting back to this, I will be refactoring the new methods to have a different signature and naming according to the feedback above. This:

```python
def get_plural_form_XX(word, amount, type=PluralCategory.CARDINAL):
    ...
```

becomes:

```python
def get_numerus_XX(tokens, idx, amount, type=PluralCategory.CARDINAL):
    word = tokens[idx]
    ...
```

@NeonMariia would this be satisfactory? you could then use the tokens to perform postag or anything else you need

initially i thought about sending the raw utterance instead of tokens, but that makes the code a little more complex since we need to account for multiple instances of the same word, find its index and length etc. Is there any caveat to receiving a pre-tokenized utterance?

I want to introduce this pattern in the other places where we pass a single word as argument, so let's discuss what the ideal signature would look like that allows any language to do its own thing. I think tokens + the location of the word allow any utterance parsing needed. but the devil is in the details, what about multi-word arguments?

```python
def get_numerus_XX(utterance, substr, idx, amount, type=PluralCategory.CARDINAL):
    word = utterance[idx:idx + len(substr)]
    ...
```
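A runnable sketch contrasting the two proposed signatures, assuming a trivial English pluralizer (names mirror the discussion but are hypothetical, and `PluralCategory` is stubbed for self-containment):

```python
class PluralCategory:
    CARDINAL = "cardinal"  # stub standing in for the real enum

def get_numerus_en(tokens, idx, amount, type=PluralCategory.CARDINAL):
    # token-based proposal: the target word is tokens[idx]
    word = tokens[idx]
    return word if amount == 1 else word + "s"

def get_numerus_en_substr(utterance, substr, idx, amount,
                          type=PluralCategory.CARDINAL):
    # substring-based proposal: idx is a character offset, so a
    # multi-word argument comes out as utterance[idx:idx + len(substr)]
    word = utterance[idx:idx + len(substr)]
    return word if amount == 1 else word + "s"

print(get_numerus_en(["in", "2", "month"], 2, 2))          # months
print(get_numerus_en_substr("in 2 month", "month", 5, 2))  # months
```

The substring variant handles multi-word arguments naturally, at the cost of character offsets being more fragile than token indices.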
i don't get the last proposal. idx could also be of type …. But another thought.
the aim is to future-proof this for other random languages. is a postag enough? what if we don't have a postag model for this lang? does klingon need the previous 4 words? is there any language or better algorithm that needs the follow-up words, not the previous ones?

passing along utterance + word position provides the language with everything it needs, then it's up to the lang to use postags or something else if it needs it; some langs might get away with much simpler checks. in portuguese and english postag is not needed or can be worked around

related: LF does not have the concept of postag yet, but this is being pluginified and already used in neon, so we will need to discuss OPM integration in LF at some point, probably using the global config to assign a plugin per language i suppose
Yes, I think this is a great implementation (when we have the whole utterance and a span for a certain word) for future grammatical analysis
Certain language implementations would need a dependency model as well. I don't think a span/range is practical, since different languages would need different spans (unless the class Span is also localized - "span" would equate to the dependency-parsed chunk up to the root). How should it be used in skills? The class Token should pipeline existing analytic models. If no plugin is existent, only pass the bare tokens.
postag/dependency/whatever is internal to LF and used by the method, users/skills don't need to care about any of that; a skill only passes along the word. index is needed because the word may appear multiple times in the sentence, we can default index to None and in that case pick the first word. maybe we can reorder args a bit, below could all be valid:

```python
word = ...
utterance = ...
idx = N

plural = get_numerus(word, 2)
plural = get_numerus(word, 2, utterance)
plural = get_numerus(word, 2, utterance, idx)
```

but if we support that then english skill devs will never pass utterance and idx, meaning lang support will be defective until someone sends a patch to the skill. maybe we should force all args even if it's a little more cumbersome? to make things simple maybe the signature should be
I think we should have the option to pass just the word itself for simpler cases?
So you think the snippet above is a good public-facing api with all the optional arguments? It will require more docs and more work in skills for lang support. as a dev i like it, but thinking in a larger scope i'm concerned skill devs will be lazy and this will result in hardcoded english support most of the time.
Idk, to me it seems these optional arguments shouldn't cause any problems (but i don't have much experience with lang support in skills)
port of the following PRs
MycroftAI#167
MycroftAI#36
MycroftAI#37