Individual orthographies not detected by CharsetChecker #185

Open
arialcrime opened this issue Nov 18, 2024 · 1 comment

arialcrime commented Nov 18, 2024

It might very well be that I am misunderstanding/misusing something, but I think there is something odd about hyperglot’s CharsetChecker when it comes to Serbian.

Somehow it does not detect support for the individual orthographies. It only comes back True when both Latin and Cyrillic are supported.

Should I be approaching this somehow differently?
Is there a way for me to get the individual orthographies return True in this case?

Thanks for your help!

Here is some sample code to illustrate the issue I ran into:
(disclaimer: hyperglot feeding itself is for demonstration purposes only here)

from hyperglot.languages import Languages
from hyperglot.checker import CharsetChecker

tag = "srp" # different, but somewhat comparable issue also showed up with "uzn"

char_dict = dict()

for o in Languages()[tag]["orthographies"]:
    if o["script"] not in char_dict:
        char_dict[o["script"]] = ""
    # combine base, aux, and marks into one string to use in checking later
    # (append instead of overwrite, so several orthographies of one script are all kept)
    char_dict[o["script"]] += " " + " ".join([o["base"], o["auxiliary"], o["marks"]])

# go through scripts individually and check for support
for script, characters in char_dict.items():
    print(script.upper(), CharsetChecker(characters).supports_language(tag)) # check_all_orthographies=False does not change things here

# Combining all characters across scripts returns True
all_chars = " ".join([c for s, c in char_dict.items()])
print("ALL", CharsetChecker(all_chars).supports_language(tag))

Hyperglot version: 0.7.1

kontur commented Nov 26, 2024

This is expected behavior. The logic is that both Cyrillic and Latin are required to faithfully support Serbian (see wiki). It is my understanding that either script is valid and that it is up to the publisher or media outlet which one they use. While Serbian can be written with either script, our interpretation is that claiming support for Serbian as a whole requires both scripts. There are only a few cases of such a requirement in the entire database, but where it occurs, it is intentional. Of course, feel free to debate this decision. :)

Technically, it's the preferred_as_group option, as documented in the README here:

preferred_as_group (optional, defaults to false) will combine all orthographies of this language. When used, a language is detected as supported only if all its orthographies with this attribute are supported. This is used for Serbian to require both Cyrillic-script and Latin-script orthographies to be supported and for Japanese to require Hiragana, Katakana, and Kanji orthographies to be supported.

And for Serbian in particular it is set here and here.
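
To illustrate the rule quoted above, here is a minimal sketch in plain Python. It does not use or reproduce hyperglot's internals: the function names `orthography_supported` and `language_supported` are hypothetical, the character sets are abbreviated samples rather than the real Serbian alphabets, and the fallback for languages without the flag (any single orthography suffices) is a simplifying assumption.

```python
# Hypothetical sketch of the preferred_as_group rule; NOT hyperglot's actual code.
# Character sets below are abbreviated samples, not complete Serbian alphabets.

def orthography_supported(orthography, charset):
    """Return True if every base character of one orthography is in the charset."""
    return set(orthography["base"]) <= set(charset)


def language_supported(orthographies, charset):
    """Apply the grouping rule to a list of orthography dicts.

    All orthographies flagged preferred_as_group must be supported together.
    Assumption for this sketch: if none are flagged, any single one suffices.
    """
    grouped = [o for o in orthographies if o.get("preferred_as_group", False)]
    if grouped:
        return all(orthography_supported(o, charset) for o in grouped)
    return any(orthography_supported(o, charset) for o in orthographies)


serbian = [
    {"script": "Cyrillic", "base": "абвгд", "preferred_as_group": True},
    {"script": "Latin", "base": "abcčć", "preferred_as_group": True},
]

latin_only = "abcčćdđ"
both = latin_only + "абвгдђ"

# Check each orthography on its own, then the language as a group.
for o in serbian:
    print(o["script"], orthography_supported(o, latin_only))
print("language, Latin only:", language_supported(serbian, latin_only))
print("language, both scripts:", language_supported(serbian, both))
```

In this sketch a Latin-only character set passes the per-orthography check for Latin but fails the language-level check, which mirrors the behavior reported in the issue. A per-orthography comparison like `orthography_supported` is one possible way to answer the "individual script" question independently of the grouped language check.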

Does this make sense?

kontur self-assigned this Nov 26, 2024
kontur added the question (Further information is requested) label Nov 26, 2024