Add Tlingit (iso 639-3 tli) #147

Merged
merged 5 commits into from
Dec 11, 2023

@jcrippen jcrippen commented Dec 3, 2023

Covers both current Tlingit orthographies.

kontur commented Dec 6, 2023

Thanks @jcrippen for this very good PR, great to have linguists contribute data directly 👍

Before merging, a few quick questions about the dual orthography. Having both orthographies as primary would indicate to the user that both are indeed used and that a font needs both to support the language. Is this an accurate statement? Could we add a note (either to the language, or to both orthographies) explaining what the differences between the two are?

Unless both are truly an equal de facto standard, I would prefer to a) not have both primary and/or b) not have them preferred_as_group. If they indeed are, that is okay, but we should indicate what their differences are and why both exist and are required. The more common use case for preferred_as_group has been languages where several scripts are considered required, so the overlap of two different Latin orthographies seems a different case, which we have usually described more accurately by having only one primary and giving the other orthographies another status.

@kontur kontur self-requested a review December 6, 2023 07:48

kontur commented Dec 6, 2023

Please also add yourself to the CONTRIBUTORS.md :)

moyogo commented Dec 6, 2023

This is great!

According to Crippen 2019:839-840

> [...] as of 2018 all Inland Tlingit communities and the YNLC have switched back to the Revised Popular orthography. Although no organizations continue to use the Leer orthography, some individuals still use it in unpublished materials such as notes, posters, business cards, and in social media. There are only two major publications that use this orthography: the previously mentioned narrative collection from Seidayáa Elizabeth Nyman (Nyman & Leer 1993) and the noun dictionary compiled by Jeff Leer (Leer, Hitch, & Ritter 2001).

@jcrippen Does that mean the Revised Popular orthography is the primary orthography?

It would be useful to indicate the first orthography is the Revised Popular and the second is Leer’s orthography, if I’m not mistaken.

I’m not sure the following note is useful or necessary, as it applies to every decomposable character in whichever orthography it is used. While it surely is a font, font shaper, or input-related issue occurring in this orthography, it is not specific to these character sequences. @kontur maybe hyperglot should automatically have a general note for when graphemes can be both composed or decomposed, instead of adding such a note to every single language where this does or can occur.

  • The <Ḵ> and <ḵ> may be either precomposed characters (U+1E34 and U+1E35) or the base characters and with the underscore diacritic (U+0331). Both should be supported as they may both be used in the same text. Same for the auxiliary <Ḻ>, <Ṉ>, etc.

moyogo commented Dec 6, 2023

Just wondering about Y̱ y̱ and Ɏ ɏ in the auxiliary, it seems that underlined letters were used for letters with underscore below (here U+0331 macron below), hence the underscore sometimes crossing the descender of g as noted. But in that case, if Ɏ ɏ are taken to represent some generic "y with stroke" here with stroke through descender, given Ǥ ǥ is currently a generic "g with stroke" often with a stroke through descender, then should either Ǥ ǥ be included in the auxiliary like Ɏ ɏ, or, the opposite, Ɏ ɏ be kept out of the auxiliary like Ǥ ǥ? It also seems the uppercase is inadequate for both letters Y and G as the stroke only crossed the descenders of the lowercase.

kontur commented Dec 6, 2023

> maybe hyperglot should automatically have a general note for when graphemes can be both composed or decomposed, instead of adding such a note to every single language where this does or can occur.

I think as design_requirement it is not needed unless either of the two would not be acceptable. Hyperglot has these options for checking:

-m, --marks: Flag to signal a font should also include all combining marks used for a language - by default only those marks are required which are not part of preencoded characters (default is False)
-d, --decomposed: Flag to signal a font should be considered supporting a language as long as it has all base glyphs and marks to write a language - by default also encoded precomposed glyphs are required (default is False)

So this is less specific to the language and more specific to the font and how stringently the user wants to test it.

All combining marks are implicitly required if a base + mark combination does not exist as an encoded character, meaning the mark (and the base, of course) is always required to form the composite. --marks will additionally require any marks in other composites that have their own Unicode code points assigned. The --decomposed flag further interacts with this by telling the checker that encoded composites are not required if base + mark combinations can build all composites. So setting --marks will potentially yield narrower support, while setting --decomposed will potentially yield wider support.
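The interaction of the two flags can be modelled roughly as below. This is only a sketch of the behaviour described above, not Hyperglot's actual code: `grapheme_supported`, `supports`, and the toy "fonts" (plain sets of characters) are hypothetical names made up for illustration, and Python's `unicodedata` stands in for the real checker.

```python
import unicodedata

def grapheme_supported(font_chars, grapheme, marks=False, decomposed=False):
    """Hypothetical model of one grapheme check against a set of characters."""
    nfc = unicodedata.normalize("NFC", grapheme)  # precomposed where one exists
    nfd = unicodedata.normalize("NFD", grapheme)  # base + combining marks
    composed_ok = set(nfc) <= font_chars
    parts_ok = set(nfd) <= font_chars
    # Default: the NFC form is required. With decomposed=True the parts suffice.
    ok = composed_ok or (decomposed and parts_ok)
    if marks:
        # marks=True additionally requires every combining mark on its own.
        ok = ok and all(c in font_chars for c in nfd if unicodedata.combining(c))
    return ok

def supports(font_chars, graphemes, marks=False, decomposed=False):
    return all(grapheme_supported(font_chars, g, marks, decomposed) for g in graphemes)

# The six underscored letters discussed in this thread.
GRAPHEMES = ["\u1E34", "\u1E35", "X\u0331", "x\u0331", "G\u0331", "g\u0331"]
BASES = set("KkXxGg")
marks_only = BASES | {"\u0331"}                  # U+0331 but no precomposed K forms
precomposed_only = BASES | {"\u1E34", "\u1E35"}  # precomposed K forms but no U+0331
both = marks_only | precomposed_only

print(supports(marks_only, GRAPHEMES))                        # False
print(supports(marks_only, GRAPHEMES, decomposed=True))       # True
print(supports(precomposed_only, GRAPHEMES))                  # False
print(supports(both, GRAPHEMES, marks=True))                  # True
print(supports(precomposed_only, GRAPHEMES[:2]))              # True
print(supports(precomposed_only, GRAPHEMES[:2], marks=True))  # False
```

The last two lines show the "narrower" effect of `marks`, and the second line shows the "wider" effect of `decomposed`.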

moyogo commented Dec 6, 2023

I misremembered seeing this only in Haida but it is in Tlingit: Keri Edwards, Dictionary of Tlingit, Juneau: Sealaska Heritage Institute, 2009 uses ɢ̱ (or what looks like it) for the lowercase of G̱ instead of g̱.
For example on p. 17:
[Screenshot of p. 17 of the dictionary]

Is that too rare to be added to the auxiliary or mentioned in the note?

jcrippen commented Dec 6, 2023

kontur:

> Looking at both orthographies being primary would indicate to the user that both are indeed used and both are indeed needed for a font to be supporting the language. Is this an accurate statement? Could we add a note (either to the language, or both orthographies) explaining what the differences between the two are?
>
> Unless truly both are equally de facto standard I would prefer to a) not have both primary and/or b) not have them preferred_as_group. If they indeed are, that is okay, but we should indicate what their differences are and why both exist and are required. The more common use case for preferred_as_group has been languages where several scripts are considered required, so the overlap of two different Latin orthographies seems a different case, which for most times we have more accurately described by having only one primary and giving other orthographies other status.

Even though no community officially uses the Leer orthography, it’s still in use by individuals in various contexts. I see the Leer orthography in use on Facebook, in community posters, and on business cards, for example. I’ve even seen them mixed together in a few cases. There are presumably a bunch of social and personal reasons for using one or the other.

Technically, supporting the Leer orthography is easier than the RP orthography because the diacritics are more common. All of ◌́ ◌̀ ◌̂ ◌̈ ◌̃ are used in western European orthographies and ◌̨ is used for Polish and Lithuanian. So if a font is intended to support these then it also supports the Leer orthography for free.

The real problem in my experience is figuring out whether a font supports the RP orthography, particularly <◌̱> U+0331 and <Ḵḵ> U+1E34 and U+1E35. If it supports these then it most likely supports the ◌́ ◌̀ ◌̂ ◌̈ ◌̃ ◌̨ diacritics as well.

Maybe the best solution is preferred_as_group: true but status: secondary. That seems to fit with the discussion in README_database.md which says:

> Orthographies with secondary status are ignored during language support detection, but used when detecting orthography support.

Then for language detection a font will need to minimally support <ḴḵX̱x̱G̱g̱>. If it also happens to support ◌̨ then the font can handle both orthographies.
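A rough sketch of what that status split would mean in practice. Everything here is hypothetical (the `Orthography` class, the function names, and the tiny character subsets, which are far from complete inventories), and treating "language supported" as "some primary orthography supported" is an assumption drawn from the README sentence quoted above, not from Hyperglot's code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Orthography:
    name: str
    status: str          # "primary" or "secondary"
    required: frozenset  # illustrative subset of required characters only

def orthography_supported(ortho, font_chars):
    return ortho.required <= font_chars

def language_supported(orthographies, font_chars):
    # Assumption: secondary orthographies are skipped for language detection.
    primary = [o for o in orthographies if o.status == "primary"]
    return any(orthography_supported(o, font_chars) for o in primary)

RP = Orthography("Revised Popular", "primary", frozenset({"\u1E34", "\u1E35", "\u0331"}))
LEER = Orthography("Leer", "secondary", frozenset({"\u0328"}))  # ogonek only, for illustration

rp_font = {"\u1E34", "\u1E35", "\u0331"}
full_font = rp_font | {"\u0328"}
ogonek_font = {"\u0328"}

print(language_supported([RP, LEER], rp_font))      # True
print(orthography_supported(LEER, rp_font))         # False
print(orthography_supported(LEER, full_font))       # True
print(language_supported([RP, LEER], ogonek_font))  # False
```

Under this reading, a font with only the ogonek would be reported as supporting the Leer orthography but not the language.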

moyogo:

> It would be useful to indicate the first orthography is the Revised Popular and the second is Leer’s orthography, if I’m not mistaken.

That’s a good point, I’ll add a note to each of them giving names and some common synonyms as well as a quick summary of differences.

moyogo:

> maybe hyperglot should automatically have a general note for when graphemes can be both composed or decomposed, instead of adding such a note to every single language where this does or can occur.

kontur:

> I think as design_requirement it is not needed unless either of the two would not be acceptable.

I agree, and I’ll take out that point. I put it in because I’ve had exactly the problem of many fonts having <◌̱> U+0331 so that <X̱x̱G̱g̱> work fine but then lacking <Ḵḵ> U+1E34 and U+1E35 so that text mixing them (which is Unicode NFC compliant!) is broken. (The reverse where a font only has <Ḵḵ> and no <◌̱> U+0331 is far more common because apparently font designers often think in Unicode code blocks.)
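The NFC point can be checked with Python's standard `unicodedata` module. This is a small illustration that uses only the bare letters, not real Tlingit text: normalizing to NFC composes K + U+0331 into U+1E34, but leaves X + U+0331 and G + U+0331 as sequences because no precomposed forms exist for them.

```python
import unicodedata

# K, X and G, each followed by U+0331 COMBINING MACRON BELOW.
decomposed = "K\u0331X\u0331G\u0331"
nfc = unicodedata.normalize("NFC", decomposed)

codepoints = [f"U+{ord(c):04X}" for c in nfc]
print(codepoints)
# ['U+1E34', 'U+0058', 'U+0331', 'U+0047', 'U+0331']
```

So NFC-normalized text needs both the precomposed U+1E34/U+1E35 and the standalone U+0331 from the same font.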

I will remove the auxiliary <Ḵḵ> and the duplicate auxiliary <ḺḻṈṉ>. Also I’ll only use the precomposed characters.

moyogo:

> Just wondering about Y̱ y̱ and Ɏ ɏ in the auxiliary, it seems that underlined letters were used for letters with underscore below (here U+0331 macron below), hence the underscore sometimes crossing the descender of g as noted. But in that case, if Ɏ ɏ are taken to represent some generic "y with stroke" here with stroke through descender, given Ǥ ǥ is currently a generic "g with stroke" often with a stroke through descender, then should either Ǥ ǥ be included in the auxiliary like Ɏ ɏ, or, the opposite, Ɏ ɏ be kept out of the auxiliary like Ǥ ǥ? It also seems the uppercase is inadequate for both letters Y and G as the stroke only crossed the descenders of the lowercase.

Upon reflection I think that both <Y̱y̱> and <Ɏɏ> should be excluded. They are limited to only a few documents and the <Ÿÿ> has completely replaced them in all cases. I’ve in fact seen both <Y̱y̱> and <Ɏɏ> replaced with <Ÿÿ> in digital text versions of older documents so there is already a precedent for this substitution. People who need them probably already need other specialized font stuff.

Having removed <Ɏɏ> there’s no need to be concerned with <Ǥǥ> which to my knowledge has not been used for Tlingit. Also, both are stroke letters, conceptually base + <◌̵> U+0335 COMBINING SHORT STROKE OVERLAY (though Unicode assigns them no canonical decomposition), rather than base + <◌̱> U+0331 COMBINING MACRON BELOW which is the decomposition of <Ḵḵ>. So they’re not logically a part of the <ḴḵX̱x̱G̱g̱> system and this is another point against including them.
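For reference, the decomposition data can be inspected with Python's standard `unicodedata` module; this is only a quick check, not part of the PR. In the Unicode Character Database the K forms decompose canonically to base + U+0331, while the stroke letters carry no decomposition mapping at all, so the base + U+0335 analysis is a visual one. Either way they sit outside the U+0331 system.

```python
import unicodedata

letters = ["\u1E34", "\u1E35", "\u01E4", "\u01E5", "\u024E", "\u024F"]
decompositions = {ch: unicodedata.decomposition(ch) for ch in letters}

for ch, mapping in decompositions.items():
    # An empty string means the character has no decomposition mapping.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mapping or 'no decomposition'}")
```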

moyogo:

> I misremembered seeing this only in Haida but it is in Tlingit: Keri Edwards, Dictionary of Tlingit, Juneau: Sealaska Heritage Institute, 2009 uses ɢ̱ (or what looks like it) for the lowercase of G̱ instead of g̱.
>
> Is that too rare to be added to the auxiliary or mentioned in the note?

As I recall, Edwards used <ɢ̱ > instead of <g̱> specifically because the available fonts did not support <g̱>. Nobody has done this since, so it’s best considered a stylistic variant rather than a distinct character. I recall seeing <ḡ> U+1E21 used for similar reasons (paired with <G̱> not <Ḡ> U+1E20).

All this trouble because it was easy to type backspace and _ on a typewriter.

Oh, I forgot to include the combinations of <◌̨> and <◌́> etc. I’ll add those to the auxiliary list for the Leer orthography.

One last thought: the website implies a way to make per-character design requirements or notes. Is this possible or planned? Because most of the design_requirements here are for specific characters.

See rosettatype#147 for discussion. Drop duplicates of non-precomposed <ḴḵḺḻṈṉ>. Clarifications to `design_requirements`. Add `note` for each orthography. Add combinations of ogonek and diacritics to `auxiliary` for Leer orthography. Change Leer orthography to `status: secondary` but keep `preferred_as_group: true`.
Add additional `design_requirement` point for U+0331 COMBINING MACRON BELOW. This applies to all uses of U+0331, not just those with base <g>.

jcrippen commented Dec 6, 2023

Naturally I forgot to run hyperglot-validate. When I did it crashed with this traceback:

Malformed yaml:
mapping values are not allowed here
  in "/Users/james/Documents/Mine/hyperglot/lib/hyperglot/data/tli.yaml", line 15, column 188
Traceback (most recent call last):
  File "/Users/james/Library/Python/3.11/bin/hyperglot-validate", line 33, in <module>
    sys.exit(load_entry_point('hyperglot', 'console_scripts', 'hyperglot-validate')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/Library/Python/3.11/lib/python/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/Library/Python/3.11/lib/python/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/james/Library/Python/3.11/lib/python/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/Library/Python/3.11/lib/python/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/Documents/Mine/hyperglot/lib/hyperglot/validate.py", line 358, in validate
    check_types(Langs)
  File "/Users/james/Documents/Mine/hyperglot/lib/hyperglot/validate.py", line 51, in check_types
    for iso, lang in Langs.items():
                     ^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'items'

There’s no indication of what is wrong. But on a hunch I removed the note under each orthography and the crash went away. The README_database.md suggests that a note should be possible under orthographies, so this might be a bug.

The : makes the yaml parser choke.

jcrippen commented Dec 6, 2023

Oh never mind, the problem is <:> in the contents of each note. Changed to <,> and it works.

kontur commented Dec 7, 2023

This looks good to me 👍

As discussed here and in #149, Hyperglot doesn't distinguish between composed and decomposed forms in the data, but does so when checking a font for language support. As I see it, whether to support either or both forms is a concern of the language detection parameters and not of the encoded orthography data. In fact, where a composed form is possible we save it, implicitly noting its existence; it would rather be significant if a composed form were not encoded in Unicode.

The note formatting works like this. You can also wrap values in single or double quotes in YAML to get around the colon issue, which is interpreted as YAML syntax otherwise.
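A sketch of the quoting workaround, as a hypothetical excerpt. The note text is invented, and the surrounding `orthographies` / `status` layout is assumed from this discussion rather than copied from tli.yaml.

```yaml
# Hypothetical excerpt for illustration only.
orthographies:
- status: primary
  # Unquoted, the ": " inside the value is read as another key/value separator
  # and the parser reports "mapping values are not allowed here":
  #   note: Example note: text after a colon
  # Quoted, the whole value is a single string:
  note: "Example note: text after a colon"
```

Single quotes work the same way; double quotes additionally allow backslash escapes.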

moyogo commented Dec 7, 2023

@jcrippen thank you. It looks good to me too.

@kontur kontur merged commit 178505b into rosettatype:dev Dec 11, 2023

kontur commented Dec 11, 2023

Super! Thanks @jcrippen — the new languages will be added to the next release 👍


3 participants