Arabic shaping definitions #39
Conversation
@yanone these are interesting use cases. A few thoughts on the various points. Regarding the exceptions for spaces preceding diacritics, I wonder if this should be fixed in gflanguages. It seems that using the dotted circle before diacritics is the norm for all other profiles. I think @simoncozens would know better.
At the level of Unicode characters, two fathas are certainly not canonically equivalent to fathatan. Similarly for dammatan and kasratan. All three indicate case for a word and are known as "nunation" marks (or tanween) because they combine a short vowel [a, i, or u] with a final nun (-an, -in, -un). Nunation marks appear only at the end of a word, though some people prefer to place the fathatan on the penultimate letter before the final alef.
HarfBuzz will reorder them anyway (following https://www.unicode.org/reports/tr53/), so a check using HarfBuzz for shaping will not catch whether the font does anything here or not.
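To make that concrete, here is a minimal sketch with uharfbuzz, assuming a hypothetical font file `MyArabicFont.ttf`: because HarfBuzz normalizes the mark order before the font's lookups run, shaping beh + shadda + fatha and beh + fatha + shadda produces the same buffer either way, so comparing the two tells you nothing about the font's own reordering rules.

```python
# Sketch only: the font path is a placeholder, not part of this repo.
import uharfbuzz as hb

with open("MyArabicFont.ttf", "rb") as f:
    font = hb.Font(hb.Face(hb.Blob(f.read())))

def shape(text):
    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()
    hb.shape(font, buf)
    # Glyph ids plus offsets, so positioning differences would show up too.
    return [(i.codepoint, p.x_offset, p.y_offset)
            for i, p in zip(buf.glyph_infos, buf.glyph_positions)]

# U+0628 BEH, U+0651 SHADDA, U+064E FATHA, typed in either order.
print(shape("\u0628\u0651\u064E") == shape("\u0628\u064E\u0651"))  # expected: True
```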
Two consecutive fathas should certainly not be replaced by fathatan. Some fonts used to allow two fathas, dammas, and kasras for open fathatan, open dammatan, and open kasratan (U+08F0–08F2) before they were added to Unicode, but Uniscribe does not allow this (it will insert a dotted circle between the marks) and Unicode elected to atomically encode them instead.
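Both points can be checked directly against the Unicode character data; a quick illustration with Python's `unicodedata` (nothing here is specific to this repo):

```python
import unicodedata

# Fathatan (U+064B) has no canonical decomposition, so it is not
# canonically equivalent to fatha + fatha (U+064E U+064E); the same
# holds for dammatan/kasratan versus a doubled damma/kasra.
assert unicodedata.normalize("NFD", "\u064B") == "\u064B"
assert unicodedata.normalize("NFC", "\u064E\u064E") != "\u064B"

# The open forms U+08F0..U+08F2 are likewise atomically encoded,
# with no decomposition into doubled short vowels.
for cp in range(0x08F0, 0x08F3):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)} ccc={unicodedata.combining(ch)}")
    assert unicodedata.decomposition(ch) == ""
```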
That's true and very frustrating. I just recompiled my font without the font-level reordering and confirmed that InDesign 2023 (18.2.1) doesn't reorder them on its own, which is what I always thought anyway. Otherwise, I'll take everyone's advice here and return with newer results in a little while.
I’ve updated the Arabic definition assembler regarding the marks: by default it now checks only for the non-spacing marks that are actually defined for each language, plus a few extra cases that are manually defined per language inside the build script. Please feel free to suggest more combinations.

In parallel, I have a draft PR at the language repo where I added the dotted circle to all marks (among other changes).

I don't see any need for further changes at this point, so I'm removing the "draft" status. Please review.
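For reviewers who want to see what that pruning leaves per language, here is a small sketch of the kind of filtering described, reading straight from gflanguages; stripping a leading dotted circle (U+25CC) assumes the marks carry one, as in the draft PR mentioned above:

```python
import unicodedata
from gflanguages import LoadLanguages

lang = LoadLanguages()["ar_Arab"]

# gflanguages lists exemplar marks as a space-separated string; each
# entry may be prefixed with a dotted circle (U+25CC).
marks = [m.replace("\u25CC", "") for m in lang.exemplar_chars.marks.split()]

# Keep only non-spacing marks (general category Mn), mirroring what the
# assembler is described as doing above.
non_spacing = [m for m in marks
               if m and all(unicodedata.category(c) == "Mn" for c in m)]
print(sorted(f"U+{ord(c):04X}" for m in non_spacing for c in m))
```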
I like the
This looks nice. I'm going to test it with a bunch of fonts and see what it says about them.
Two quick hits:
This brings me back to what I mentioned to you in chat the other day: so far, all definitions in the data folder are pre-generated and stored in the repo. Running the checks live would avoid that round trip.

I still think manual definitions should exist for cases like the Dutch IJ situation, but I would eliminate definitions for all auto-generated checks. Maybe not as part of this PR.
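As a rough illustration of what running the checks live could look like (function and field names here are illustrative, not an existing API in this repo): instead of committing generated definitions, a profile could be expanded from gflanguages at load time.

```python
import unicodedata
from gflanguages import LoadLanguages

def arabic_mark_checks(lang_id):
    """Expand orphaned-mark checks at run time from gflanguages data.
    Purely a sketch; the check name follows the ones used in this thread."""
    lang = LoadLanguages()[lang_id]
    # CLDR-style exemplar syntax wraps multi-character entries in braces.
    bases = [b.strip("{}") for b in lang.exemplar_chars.base.split()]
    marks = [m.replace("\u25CC", "") for m in lang.exemplar_chars.marks.split()]
    for base in bases:
        for mark in marks:
            if mark and all(unicodedata.category(c) == "Mn" for c in mark):
                yield {"check": "no_orphaned_marks", "input": base + mark}

# Built on the fly instead of loaded from a pre-generated file.
checks = list(arabic_mark_checks("ar_Arab"))
```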
Comment from the sidelines... I agree with @yanone's last comment. What if the checks were defined in the library as arbitrary segments relating to a script or typographic feature, each triggered by ranges of unicodes (of the language), and the per-language YAML definitions told the checker which segments of checks to opt in to for that language (e.g. check Arabic shaping, check Latin smallcaps, check fractions...)? The unicodes of each language file would then trigger the specific checks for glyph combinations. You'd still have quite fine-grained control over the check definitions, and the check suites would be more composable on a per-language level. Less brute-force repetition, less overhead.
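One possible shape for that, very roughly (all names here are hypothetical, building on the run-time sketch above): the library keeps a registry of segment generators, and each language definition only names the segments it opts in to.

```python
from typing import Callable, Dict, Iterable

# The library owns the segment implementations; each one expands itself
# against the language's own characters at run time.
SEGMENTS: Dict[str, Callable[[str], Iterable[dict]]] = {
    "arabic_shaping": arabic_mark_checks,  # generator from the previous sketch
    # "latin_smallcaps": ..., "fractions": ...
}

# A per-language definition would then boil down to a list of segment
# names to opt in to, instead of fully spelled-out checks.
OPT_INS = {
    "ar_Arab": ["arabic_shaping"],
}

def checks_for(lang_id: str):
    for segment in OPT_INS.get(lang_id, []):
        yield from SEGMENTS[segment](lang_id)
```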
I think it's a good idea, and I remember that I initially suggested it to Neil when he was working on the African profiles - it seemed a bit odd to have a script pre-generating rules, then store those rules in the repo and load them from a file, instead of just generating the rules at runtime. At the time he was building profiles out of the source repo and it was the easiest way for him to develop and test stuff, so we went with that workflow. I will work on a way for people to dump the generated rules.
I made one minor change after a comment from @moyogo on the lang PR: over there, I moved the Arabic marks into their own group.

We seem to mostly agree on converting the checks to run-time generation. Do we still want to merge this PR right now, so that at least this step is achieved? I would like that, so I can check it off my list. I assume Simon would want to restructure the code to provide the checks separately from this PR, and the code that generated the Arabic checks is sitting right there, ready to be converted.
Sounds good to me. I can clean up later. |
I agree. It's a bit clunky. At the time we were shifting from rules that were built around checking cluster positions to ones built around failure modes. During that phase it was easier to figure out what kinds of tests would work and how to structure them by building them manually. Now that we have figured out what we need and have more use cases, it makes sense to move to run-time checks. |
I accidentally pushed everything in one commit; I wanted to separate it.
Please take a look at this. It works for my own font for "ar_Arab".
I would still change a few things:
* `FAIL: Shaper didn't attach uni0670 to space`: The `no_orphaned_marks` check needs an exception for a preceding space, like there is one already for the dotted circle.
* These are the Arabic marks that are defined in gflanguages in base characters, separated by spaces, plus manually defined combinations such as a double `fatha` that should get replaced to a `fathatan`. (Which I should also add to `shaping_differs`.) A few more combinations are possibly missing here. I would keep these manually defined ones, but prune them to the marks actually defined in each language definition, and also automatically add the defined ones to that list: basically, combine the mark sequences defined in the Python script plus the ones defined in the lang def, then prune to those actually defined in the lang def (see the sketch below). Sorry for the complicated language.
* A new `shaping_identical` check where I can check that `fatha+shadda` (etc.) shapes to the same result as `shadda+fatha`, because people type them in arbitrary order.

This is how far I got. Just wanted to hear your opinion first before I continue.
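Since the combine-and-prune step above is hard to describe in prose, here is a small sketch of what I mean; the variable names and data shapes are purely illustrative:

```python
def mark_sequences_for(lang_marks, script_sequences):
    # 1. Start from the mark sequences manually defined in the Python build script.
    combined = list(script_sequences)
    # 2. Add every single mark defined in the language definition.
    combined += [[m] for m in sorted(lang_marks)]
    # 3. Prune to sequences whose marks are all actually defined in the lang def.
    return [seq for seq in combined if all(m in lang_marks for m in seq)]

# Example with the fatha/fathatan case from above:
lang_marks = {"\u064E", "\u064B", "\u0651"}        # fatha, fathatan, shadda
script_sequences = [["\u064E", "\u064E"],           # fatha fatha
                    ["\u0650", "\u0650"]]           # kasra kasra: pruned, no kasra defined here
print(mark_sequences_for(lang_marks, script_sequences))
```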
Two existing scripts got reformatted by Black because I renamed the data folder. I could change that back if you like.