Support for punctuation in the data and detection #155

MrBrezina · 2024-01-16T09:58:24Z

I feel there is more flexibility in how punctuation is used (compared with numerals, see #154), and therefore it is trickier to state a preference.

Punctuation could be:

required/preferred form (base), or
optional (auxiliary).

For example, «» may be preferred in French, but they can optionally be used in German. Maybe it is not worth collecting the optional punctuation at all (is it useful?).

The idea was to define a general set of punctuation for the script and a preference for the language. A language could inherit, extend, or completely override the script setting.
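To make the inherit/extend/override idea concrete, here is a rough Python sketch. It is only an illustration: the `resolve_punctuation` function, the `mode` field, and the sample character sets are hypothetical and not part of the Hyperglot data format.

```python
# Hypothetical sketch: none of these names exist in Hyperglot.
# Script-level punctuation set (illustrative selection only).
SCRIPT_PUNCTUATION = {"Latin": set(".,;:!?()-")}


def resolve_punctuation(script, language_record):
    """Combine the script-level set with a language-level record."""
    default = SCRIPT_PUNCTUATION.get(script, set())
    mode = language_record.get("mode", "inherit")
    own = set(language_record.get("punctuation", ""))
    if mode == "inherit":
        # The language adds nothing; use the script setting as-is.
        return set(default)
    if mode == "extend":
        # The language keeps the script setting and adds its own marks.
        return default | own
    if mode == "override":
        # The language ignores the script setting completely.
        return own
    raise ValueError(f"unknown mode: {mode!r}")


french = {"mode": "extend", "punctuation": "«»"}
print(sorted(resolve_punctuation("Latin", french)))
```

Whether the preference lives in the same record or a separate one is exactly the open question below; the sketch assumes a single record per language.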

It is not clear to me whether these would be two records, e.g. punctuation and preferred punctuation (or even quote preference), or a single record. How is this related to design requirements/alternates?

It is also not clear what is still considered punctuation, or rather, useful punctuation to keep track of. Period, comma, quotes, parentheses, section sign, … are required to a different extent.


MrBrezina · 2024-08-30T07:56:36Z

I will dump my thoughts here and then we can add some of it to README in a more concise form.

The reason for the minimality of orthographies is that we want to minimize false negatives, i.e. a font being reported as NOT supporting a language when in fact it can be used for that language. The other day someone suggested to me that what Hyperglot is doing is in fact easy and can be done by looking at existing fonts that are known to support a particular language. The false negatives are exactly why we cannot do that. Similarly, this is also why conjuncts should not be dealt with by expressing a complete “script grammar”, but rather by analysing actual texts. This also makes Hyperglot somewhat different from quality control, which may want to stick to best practices that are effectively more restrictive.

The way I have worked with punctuation, numerals, and currency tries to reflect this. For scripts with many languages, there are defaults (stored in a special language default) that get imported into individual languages, which extend them with additional characters if need be. For punctuation in Latin-script languages, for example, quotes are the most variable. Instead of making a large default and importing it everywhere, I plan to have a default selection without any curly quotes or guillemets and to import these per language, thus maintaining some notion of recommendation. For Indian scripts, the defaults are set via Devanagari; they include the basic English quotes ‘’“”, which they all seem to use.
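As a rough illustration of the defaults being imported and then extended per language, consider the Python sketch below. The dictionary layout, the `DEFAULTS_VIA` mapping, the `punctuation_for` helper, and the character selections are all assumptions made for the example; they do not mirror the actual data files.

```python
# Hypothetical data layout; keys and character selections are illustrative only.
DEFAULTS = {
    # Latin default deliberately carries no curly quotes or guillemets.
    "Latin": {"punctuation": ".,;:!?()"},
    # Devanagari default includes the basic English quotes.
    "Devanagari": {"punctuation": ".,;:!?()‘’“”"},
}

# Assumed mapping: scripts whose defaults are taken from another script.
DEFAULTS_VIA = {"Bengali": "Devanagari"}

LANGUAGES = {
    "fra": {"script": "Latin", "punctuation": "«»"},
    "eng": {"script": "Latin", "punctuation": "‘’“”"},
    "ben": {"script": "Bengali"},
}


def punctuation_for(code):
    """Import the script default, then extend it with the language's own marks."""
    lang = LANGUAGES[code]
    script = DEFAULTS_VIA.get(lang["script"], lang["script"])
    imported = DEFAULTS.get(script, {}).get("punctuation", "")
    return set(imported) | set(lang.get("punctuation", ""))


for code in LANGUAGES:
    print(code, "".join(sorted(punctuation_for(code))))
```

Keeping the quotes out of the Latin default means the per-language entry is the only place a quote style appears, which is what preserves the notion of recommendation.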

I have stayed away from fancier punctuation such as daggers, but included useful marks such as degree, prime, and double prime. Legacy symbols (esp. currency) were avoided.

If a punctuation mark seemed nice to have for a particular language, I have put it in auxiliary.

A font-language check should not fail by not supporting punctuation, numerals, or currency by default.
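A minimal Python sketch of what "not failing by default" could look like, assuming a hypothetical `supports` function with an opt-in `check_punctuation` flag. Neither the function nor the flag is taken from the Hyperglot API, and the sample orthography is made up for the example.

```python
# Hypothetical check; not the Hyperglot API.
def supports(font_chars, orthography, check_punctuation=False):
    """Return (ok, missing) for a font's character set against an orthography.

    Punctuation is only required when the caller opts in, so a font that
    lacks it does not produce a false negative by default.
    """
    required = set(orthography["base"])
    if check_punctuation:
        required |= set(orthography.get("punctuation", ""))
    missing = required - set(font_chars)
    return not missing, missing


font = set("abcdefghijklmnopqrstuvwxyz.,")
orthography = {"base": "abc", "punctuation": ".,«»"}

# Default check: punctuation is ignored, so the font passes.
print(supports(font, orthography))
# Opt-in check: the missing guillemets are reported.
print(supports(font, orthography, check_punctuation=True))
```

The same opt-in pattern would apply to numerals and currency.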

Hopefully, someone with particular language expertise will review these additions.

MrBrezina added the enhancement New feature or request label Jan 16, 2024

MrBrezina assigned MrBrezina and kontur Jan 16, 2024

MrBrezina mentioned this issue Apr 8, 2024

Clarify how to deal with other characters (symbols, currencies, punctuation) #60

Open

kontur added this to the 0.7.0 milestone May 31, 2024

This was referenced Aug 30, 2024

Support for numerals in the data and detection #154

Open

Notation for inline auxiliary for any kind of character set #177

Open

MrBrezina added a commit that referenced this issue Sep 3, 2024

Corrected quotations with respect to #155

afa0066
