Support for punctuation in the data and detection #155

Open
MrBrezina opened this issue Jan 16, 2024 · 1 comment
Labels: enhancement (New feature or request)

@MrBrezina
Member

I feel there is more flexibility in how punctuation is used than there is with numerals (see #154), and therefore it is trickier to state a preference.

Punctuation could be:

  • required/preferred form (base), or
  • optional (auxiliary).

For example, guillemets «» may be preferred in French, but they are only optional in German. Maybe it is not worth collecting the optional punctuation at all (is it useful?).

The idea was to define a general set of punctuation for the script and a preference for the language. A language could inherit, extend, or completely override the script setting.
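To make the inherit/extend/override idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `SCRIPT_PUNCTUATION` table, the `punctuation_extend` and `punctuation_override` keys, and the function name are hypothetical and not part of the existing data format.

```python
# Hypothetical sketch: these names and keys are illustrative only,
# they do not come from the existing database schema.
SCRIPT_PUNCTUATION = {
    "Latin": [".", ",", ";", ":", "!", "?", "(", ")"],
}


def resolve_punctuation(script, language_record):
    """Return the punctuation list that applies to one language.

    - "punctuation_override" replaces the script set completely.
    - "punctuation_extend" adds characters to the script set.
    - With neither key, the script set is inherited unchanged.
    """
    if "punctuation_override" in language_record:
        return list(language_record["punctuation_override"])

    result = list(SCRIPT_PUNCTUATION.get(script, []))
    for char in language_record.get("punctuation_extend", []):
        if char not in result:  # keep order, avoid duplicates
            result.append(char)
    return result


inherits = resolve_punctuation("Latin", {})
extends = resolve_punctuation("Latin", {"punctuation_extend": ["«", "»"]})
overrides = resolve_punctuation("Latin", {"punctuation_override": [".", ","]})
```

Whether "preferred" punctuation would live in the same record or a second one is left open here; the sketch only shows the resolution order.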

It is not clear to me whether these would be two records, e.g. punctuation and preferred punctuation (or even quote preference), or a single record. How is this related to design requirements/alternates?

It is also not clear what still counts as punctuation, or rather, as useful punctuation to keep track of. Period, comma, quotes, parentheses, section sign, … are required to different extents.

@MrBrezina
Member Author

MrBrezina commented Aug 30, 2024

I will dump my thoughts here, and then we can add some of them to the README in a more concise form.

The reason for keeping orthographies minimal is that we want to minimize false negatives, i.e. a font being reported as NOT supporting a language when in fact it can be used for that language. The other day someone suggested to me that what Hyperglot does is in fact easy and can be done by looking at existing fonts that are known to support a particular language. The false negatives are exactly why we cannot do that. Similarly, this is why conjuncts should not be handled by expressing a complete “script grammar”, but rather by analysing actual texts. This also makes Hyperglot somewhat different from quality control, which may want to stick to best practices that are effectively more restrictive.
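A small sketch of why deriving requirements from existing fonts produces false negatives. The character sets below are made up for illustration and are not real language records; `supports` is a hypothetical helper, not a Hyperglot function.

```python
import string

# Illustrative data only: a minimal orthography and two fonts that are
# "known to support" the language but happen to ship extra characters.
minimal_orthography = set(string.ascii_lowercase)
known_fonts = [
    minimal_orthography | set("éß€"),
    minimal_orthography | set("éñ"),
]

# Requirement derived from existing fonts: whatever they all share.
derived_requirement = set.intersection(*known_fonts)

# A candidate font that covers the language but lacks "é".
candidate_font = set(string.ascii_lowercase) | set("0123")


def supports(font_chars, required):
    """A font supports a language if it contains every required character."""
    return required <= font_chars


ok_minimal = supports(candidate_font, minimal_orthography)  # True
ok_derived = supports(candidate_font, derived_requirement)  # False: false negative
```

The derived set picks up "é" only because both reference fonts happen to include it, so a perfectly usable font is rejected.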

The way I have worked with punctuation, numerals, and currency tries to reflect this. For scripts with many languages, there are defaults (stored in a special language default) that get imported into individual languages, which extend them with additional characters if need be. For punctuation in Latin-script languages, for example, quotes are the most variable. Instead of making a large default and importing it everywhere, I plan to have a default selection without any curly quotes or guillemets and import these per language, thus maintaining some notion of recommendation. For Indian scripts, the defaults are set via Devanagari; they include the basic English quotes ‘’“”, which they all seem to use.

I have stayed away from fancier punctuation such as daggers, but included useful marks such as degree, prime, and double prime. Legacy symbols (esp. currency) were avoided.

If a punctuation mark seemed nice to have for a particular language, I have put it in auxiliary.

By default, a font-language check should not fail because the font lacks punctuation, numerals, or currency.
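One way to read this rule is that punctuation, numerals, and currency are opt-in categories for the check. The sketch below assumes a hypothetical record layout with `base`, `punctuation`, `numerals`, and `currency` keys and a hypothetical `check_support` function; it is not the actual Hyperglot API.

```python
# Hypothetical record; keys and characters are illustrative only.
record = {
    "base": "abcdefghijklmnopqrstuvwxyz",
    "punctuation": ".,«»",
    "numerals": "0123456789",
    "currency": "€",
}


def check_support(font_chars, language_record, also_require=()):
    """Return (supported, missing_characters).

    Only "base" is required by default. Categories named in
    `also_require` (e.g. "punctuation") are added on request.
    """
    required = set(language_record["base"])
    for category in also_require:
        required |= set(language_record.get(category, ""))
    missing = required - set(font_chars)
    return (not missing, missing)


font = set("abcdefghijklmnopqrstuvwxyz.,")

default_result = check_support(font, record)
strict_result = check_support(font, record, also_require=("punctuation",))
```

Here `default_result` reports support, while `strict_result` reports the guillemets as missing, so a caller who cares about punctuation can ask for it without changing the default outcome.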

Hopefully, someone with particular language expertise will review these additions.
