Clarify how to deal with other characters (symbols, currencies, punctuation) #60
Comments
Could be an alternative orthography. |
I'd rather include them as optional attributes of existing orthographies and follow the same route as for #154, meaning: define them on a script level, inherit on data access, and allow per-language/orthography overrides if set. |
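The script-level definition with per-language overrides suggested here could be sketched roughly as follows. This is only an illustration: the dictionaries, the `punctuation` attribute name, the `resolve` helper, and the symbol lists are all hypothetical and not part of the current data scheme.

```python
# Hypothetical data: neither the attribute names nor the symbol lists
# come from the actual database; they only illustrate the lookup order.
SCRIPT_DEFAULTS = {
    "Latin": {"punctuation": ". , ; : ! ? ( )".split()},
}

ORTHOGRAPHIES = {
    # English sets nothing, so it inherits the script-level list.
    "eng": {"script": "Latin"},
    # Spanish overrides the list to add inverted marks.
    "spa": {"script": "Latin", "punctuation": ". , ; : ! ? ¡ ¿ ( )".split()},
}


def resolve(lang, attribute):
    """Return the orthography's own value if set, else the script default."""
    orthography = ORTHOGRAPHIES[lang]
    if attribute in orthography:
        return orthography[attribute]
    script_level = SCRIPT_DEFAULTS.get(orthography["script"], {})
    return script_level.get(attribute, [])
```

With this lookup order the script-level list is written once, and only languages that deviate from it need to carry their own entry.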
Punctuation and symbols are not an alternative orthography. If you put them there, it will make the actual alternative orthography data harder to use. Also, punctuation creates really tricky situations (even more so than the occasional alphabet snafu) with official vs. traditional data. For example it's pretty obvious now that Basic punctuation like Even more grey areas would be curly braces |
I meant the alternative orthography specifically for Tamil. I will enquire with our Tamil contact about this. He initially suggested putting them in auxiliary, but that seems a bit too much. @alerque thank you for the input. I agree, punctuation gets wild. In #154 (sorry for the partial duplication) I have suggested something that could work: a general definition per script, with the preference (legal standard, dominant practice) clarified per language. We are trying to find a “reasonably regular approach”. If you have suggestions or more tough cases, please drop them there. Currencies are probably best handled by an independent check as they, imho, should be mapped to countries/states rather than languages. |
Yes, currencies are more of a country / locale issue than a language / script one. Obviously there is going to be some crossover. Isn't currency info in CLDR too? Importing those related symbols when the locale and language do map to each other might still be quite useful. And yes, punctuation chaos like the en-dash / ampersand stuff I described is a different kind of issue. |
A couple of rather unsorted thoughts on these issues:
To illustrate the last, let's imagine a "Latin" script level definition like:
and for example for French the primary orthography could look something like:
and have a "historic" orthography that additionally has:
This would be interpreted as:
A Turkish orthography could look something like:
All Latin
How does this sound? We could collect a few more pseudo definitions for other scripts and languages and see if this reveals some limitations. |
This issue of how to deal with non-grapheme characters gets more and more complicated the more I think about it. Some thoughts:
Here's roughly what I'm imagining for extending language definitions, using English as an example:
name: English
orthographies:
- autonym: English
  base: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Æ Œ a b c d e f g h i j k l m n o p q r s t u v w x y z æ œ
  auxiliary: À Á Ç È É Ê Ë Ï Ñ Ô Ö à á ç è é ê ë ï ñ ô ö
  base_numerals: 0 1 2 3 4 5 6 7 8 9
  base_symbols: "` ~ ! @ # $ % ^ & * ( ) - _ = + [ ] { } \\ | ; : ' \" < > , . ? /"
  auxiliary_symbols: ƒ ¶ § © ® ™ ° ¦ † ‡ ¤
  marks: ◌̀ ◌́ ◌̂ ◌̃ ◌̈ ◌̧
  script: Latin
  status: primary |
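To make the proposed fields more concrete, here is a minimal sketch of how a checker might report missing characters per category. It assumes the space-separated notation of the definition above and a plain set of supported characters standing in for a font's character map; the function names `parse_chars` and `missing_by_category` are hypothetical.

```python
DOTTED_CIRCLE = "\u25cc"  # placeholder base used when listing combining marks


def parse_chars(value):
    """Split a space-separated character list, dropping the dotted circle."""
    tokens = {token.replace(DOTTED_CIRCLE, "") for token in value.split()}
    return tokens - {""}


def missing_by_category(orthography, supported, categories):
    """Map each category to the required characters absent from `supported`."""
    report = {}
    for category in categories:
        required = parse_chars(orthography.get(category, ""))
        missing = required - supported
        if missing:
            report[category] = missing
    return report


# A cut-down orthography using the field names proposed above.
orthography = {
    "base_numerals": "0 1 2 3 4 5 6 7 8 9",
    "auxiliary_symbols": "ƒ ¶ § © ® ™ ° ¦ † ‡ ¤",
    "marks": "\u25cc\u0300 \u25cc\u0301",
}

# Stand-in for the set of characters a font covers (an assumption).
supported = set("0123456789ƒ¶§©®™°") | {"\u0300", "\u0301"}

report = missing_by_category(
    orthography, supported, ["base_numerals", "auxiliary_symbols", "marks"]
)
```

Keeping the categories separate lets a tool warn about a missing auxiliary symbol without declaring the whole language unsupported.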
Thanks for the input! It is really hard to come up with a satisfying approach to this, so more opinions certainly are welcome!
Very good point. What I meant by this is that on a conceptual level there could be inheritance from the script. If you look at your example, almost all languages of the Latin script would probably have the exact same Generally speaking I would aim for these categories to list only essential symbols, so no refined splitting into base/aux. For example, for auxiliary symbols it is nearly impossible to draw the line consistently, both with regard to when a symbol is no longer "base" and to when a symbol is obscure enough not to be included in "auxiliary". I'd rather approach this by listing only very certain ones in "base" and leave it to the understanding of the user that there are nearly endless symbols that may somehow be relevant in a language, but that are impossible to list exhaustively. Like with orthographies, HG should list what we are certain of. In your example, I would consider the (It's a question of notation, but we could even allow extending the default, instead of mere overwriting, e.g. define
HG has steered clear of this deliberately, so far. These kinds of definitions get argumentative very quickly, particularly with regard to historical, minority and diaspora types of use cases. I do see the benefit from a user perspective, though. The question of currency symbols remains, however. What I wonder is whether there are false positives when currencies are linked to languages. A naive example would be English, where HG orthographies do not distinguish between British and American English, so an English currency list would include both $ and £ (and ¢). Less obviously, English is in frequent use as an administrative language or lingua franca, e.g. you might expect it to also include ₹ for use throughout India, etc. Again, the most conservative approach would be to include only those symbols of which we have high certainty, so for English I'd argue this would be $ and £ but no other currency symbols. Someone interested in "localized" support of a language, in this case English in use in India, would be "missing" a warning about a font missing ₹, but then again, HG does not support this kind of localized check to begin with, so worrying about currency symbols in this context is moot. Would there be cases where this approach would result in a false positive? Meaning, are there orthographies in use in different locales with distinct currency symbols where including both/several would result in an unacceptable currency symbol requirement for the other locales? |
In #155 I have mentioned this:
I think this is a key decision we need to make. Do we provide:
* Anything else in the Latin script that we know varies across languages? (I am not talking about typesetting rules, which are of course diverse.)
My feeling is similar to @kontur's: symbols belong to a script rather than to a language. Once we decide on that, we can devise a notation and take it from there. I have some thoughts on how to keep the decision practical, but I need to ponder. The base list could be scraped from major corpora for each script with some frequency threshold and if particular. This way we avoid arbitrary subjective decisions.
I consider currencies a separate question, linked to country. I would be happy if we had that list, though. If one wanted to provide an interface, ideally, one would ask two questions: (a) what languages does a user want to support and (b) what countries are they considering? |
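The idea of deriving a base symbol list from corpora with a frequency threshold could look roughly like the sketch below. The function name `frequent_symbols`, the `min_share` parameter, and its default value are assumptions for illustration; the thread does not specify a threshold or a corpus.

```python
import unicodedata
from collections import Counter


def frequent_symbols(text, min_share=0.001):
    """Return punctuation and symbol characters whose share of all
    non-space characters in `text` is at least `min_share`."""
    # Unicode general categories starting with "P" are punctuation,
    # those starting with "S" are symbols (including currency, "Sc").
    counts = Counter(
        ch for ch in text if unicodedata.category(ch)[0] in ("P", "S")
    )
    total = sum(1 for ch in text if not ch.isspace())
    if total == 0:
        return []
    return sorted(ch for ch, n in counts.items() if n / total >= min_share)


# Toy stand-in for a corpus; a real run would read large text files.
sample = "Hello, world! Hello, again. " * 50 + "§"
common = frequent_symbols(sample, min_share=0.01)
```

Since currencies are treated in this thread as a country-level question, a variant of this could skip characters of category `Sc` and leave them to a separate list.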
More as a reminder that we are aware of these and that they are currently not included in the data.
E.g. #58 has Tamil symbols which seem like they would be required, or at least relevant, for language support, but there is no way (other than as note) to include this information in the current data scheme.