Arabic shaping definitions #39

yanone · 2023-12-08T11:01:24Z

I accidentally pushed everything in one commit; I wanted to separate it.
Please take a look at this. It works for my own font for "ar_Arab".

I would still change a few things:

* FAIL: Shaper didn't attach uni0670 to space: The no_orphaned_marks check needs an exception for a preceding space, like there is one already for the dotted circle. These are the Arabic marks that are defined in gflanguages in base characters, separated by spaces.
The manually defined marks sequences aren't complete. I defined them manually because they contain a few pitfalls like people typing two consecutive fatha that should get replaced to a fathatan. (Which I should also add to shaping_differs) A few more combinations are possibly missing here. I would keep these manually defined ones but prune them to the marks actually defined in each language definition, and also automatically add the defined ones to that list, basically combine the mark sequences defined in the Python script plus the ones defined in the lang def, then prune to those actually defined in the lang def. Sorry for the complicated language.
I would also add a shaping_identical check where I can check that fatha+shadda (etc.) shapes to the same result as shadda+fatha, because people type them in arbitrary order.

This is how far I got. Just wanted to hear your opinion first before I continue.

Two existing scripts got reformatted by Black because I renamed the data folder. I could change that back if you like.

NeilSureshPatel · 2023-12-08T16:35:03Z

@yanone these are interesting use cases. A few thoughts on the various points.

Regarding the exceptions for spaces preceding diacritics, I wonder if this should be fixed in gflanguages. It seems that using the dotted circle before diacritics is the norm for all other profiles.

I think @simoncozens would know better but I believe that changing two fathas in sequence is not canonically equivalent to a fathathan in Unicode. I have implemented similar functions in fonts but, I think in principle this is something that is supposed to be handled by the input method and not by OpenType substitutions.

@simoncozens for the shadda+fatha sequence is this something that is normalized in Unicode. I can't recall if normalization is processed in shaperglot before testing but I would guess harfbuzz would do that automatically.

Kamal-Mansour · 2023-12-08T19:06:04Z

At the level of Unicode characters, two fathas are certainly not canonically equivalent to fathatan. Similarly for dammatan and kasratan. All three indicate case for a word and are known as "nunation" marks (or, tanween) because they combine a short vowel [a, i, or u] with a final nun (-an, -in, -un). Nunation marks appear only at the end of a word, though some people prefer to place the fathatan on the penultimate letter before the final alef.

khaledhosny · 2023-12-09T07:50:23Z

I would also add a shaping_identical check where I can check that fatha+shadda (etc.) shapes to the same result as shadda+fatha, because people type them in arbitrary order.

HarfBuzz will reorder then anyway (following https://www.unicode.org/reports/tr53/), so a check using HarfBuzz for shaping will not catch whether the font does anything here or not.

khaledhosny · 2023-12-09T07:58:20Z

The manually defined marks sequences aren't complete. I defined them manually because they contain a few pitfalls like people typing two consecutive fatha that should get replaced to a fathatan.

Two consecutive fathas should certainly not be replaced by fathatan. Some fonts used to allow two fathas, dammas, and kasras for open fathatan, dammadan, and kasratan (U+08F0–08F2) before they were added to Unicode, but Uniscribe does not allow this (it will insert a dotted circle between the marks) and Unicode elected to atomically encode them instead.

yanone · 2023-12-09T11:59:02Z

HarfBuzz will reorder then anyway (following https://www.unicode.org/reports/tr53/), so a check using HarfBuzz for shaping will not catch whether the font does anything here or not.

That's true and very frustrating. I just recompiled my font without the font-level reordering and confirmed that Indesign 2023 (18.2.1) doesn't reorder them on its own, which is what I always thought anyway.

Otherwise, I'll take everyone's advice here and return with newer results in a little while.

…ly defined ones

yanone · 2023-12-14T14:05:42Z

I’ve updated the Arabic definition assembler regarding the marks: It now regularly checks only for the non-spacing marks that are actually defined for each language, plus a few extra cases that are manually defined per language inside the build script, such as fatha+shadda and kasra+shadda for ar_Arab, to check for a functioning mkmk feature.

Please feel free to suggest more combinations. I thought having these will suggest that all mkmk cases are working, but it might be good to be exhaustive here, now that we’re already doing the work. I’m not familiar with languages other than Arabic.

In parallel, I have a draft PR at the language repo where I added the dotted circle to all marks (among other changes).

I don't see any need for further changes at this point, so removing the "draft" status. Please review.

NeilSureshPatel · 2023-12-14T14:24:14Z

I like the mkmk test. I am super familiar with the other languages but this will be a good framework for African languages which use extensive vocalization.

simoncozens · 2023-12-14T14:47:51Z

This looks nice. I'm going to test it with a bunch of fonts and see what it says about them.

simoncozens · 2023-12-14T15:00:38Z

Two quick hits:

This also needs your lang PR (Arabic and Pashto improvements lang#136)
strictyaml, at least version 1.7.3, takes 13 seconds on a moderately powerful machine to parse that YAML file. I think we need to reconsider our commitment to strictyaml...

yanone · 2023-12-14T15:44:52Z

2. I think we need to reconsider our commitment to strictyaml

This brings me back to what I mentioned to you in Chat the other day:

So far, all definitions in shaperglot are generated by a script. This makes me question why the existing definitions even exist, and why the checks aren't executed live instead. And the definitions don’t even include c2sc yet for Latin, only smcp, and only for African Latin languages, not even European. All this data is incomplete and unnecessary so far.

Running the checks live would:

Save the yaml loading time
Save disk space for the definitions. The folder is 3.3MB right now, and it's not even complete for Latin as outlined above. Where is this gonna go?
Eliminate the need to regenerate the definitions every time there are changes in gflanguages. This is a biggy IMO, because it would mean that whenever the language definitions are updated, one would only have to update the dependency of gflanguages in fontbakery for example, and could leave shaperglot alone entirely. Otherwise it would mean to recompile the shaperglot definitions for even the tiniest change in gflanguages, and update both dependencies in all apps. This is inefficient.

I still think manual definitions should exist for cases like the Dutch IJ situation. But I would eliminate definitions for all auto-generated checks. Maybe not as part of this PR.

kontur · 2023-12-15T07:30:12Z

Comment from the sidelines... I agree with @yanone's last comment. What if the checks were defined in the library as arbitrary segments relating to a script or typographic feature that would have ranges of unicodes (of the language) that trigger them, and per-language yaml definitions would tell the checker what segments of checks to opt-in to for this language (e.g. check Arabic shaping, check Latin smallcaps, check fractions...), and the unicodes of each language file would trigger the specific checks for glyph combinations? You'd still have quite some fine grained control over the check definitions, and the check suites would be more composable on a per-language level. Less brute force repetition, less overhead.

simoncozens · 2023-12-15T09:40:51Z

I think it's a good idea, and I remember that I initially suggested it to Neil when he was working on the African profiles - it seemed a bit odd to have a script pre-generating rules, then store those rules in the repo, and load in the file, instead of just generating the rules at runtime. At the time he was building profiles out of the source repo and it was the easiest way for him to develop and test stuff, so we went with that workflow at the time. I will work on a way for people to dump profileprovider modules into a directory and have shaperglot run through those and use them to inject checks into one or more language profiles.

yanone · 2023-12-15T11:56:53Z

I made one minor change after a comment of @moyogo on the lang PR:

Over there I moved the Arabic marks into its own marks category, and over here I take those marks from the marks category only. This will force us to cleanly separate the marks from the base for future language definition updates.

We seem to mostly agree to convert checks to run-time. Do we still want to merge this PR right now so that at least this step is achieved? I would like that to be able to check it off my list. I assume Simon would want to restructure the code to provide the checks separate of this PR, and the code that generated the Arabic checks is sitting right there ready to be converted.

simoncozens · 2023-12-15T12:02:49Z

Sounds good to me. I can clean up later.

NeilSureshPatel · 2023-12-15T12:57:36Z

I think it's a good idea, and I remember that I initially suggested it to Neil when he was working on the African profiles - it seemed a bit odd to have a script pre-generating rules, then store those rules in the repo, and load in the file, instead of just generating the rules at runtime. At the time he was building profiles out of the source repo and it was the easiest way for him to develop and test stuff, so we went with that workflow at the time. I will work on a way for people to dump profileprovider modules into a directory and have shaperglot run through those and use them to inject checks into one or more language profiles.

I agree. It's a bit clunky. At the time we were shifting from rules that were built around checking cluster positions to ones built around failure modes. During that phase it was easier to figure out what kinds of tests would work and how to structure them by building them manually. Now that we have figured out what we need and have more use cases, it makes sense to move to run-time checks.

yanone added 2 commits December 8, 2023 11:50

Renamed folder to just "data"

952cfd5

Removed unused code copied from other file

87c48b8

yanone added 3 commits December 14, 2023 14:51

Check only for non-spacing marks defined in language, plus few manual…

6daa997

…ly defined ones

Update sg-bulk-builder-arabic.py

6a855b7

Added additional mark combinations

1504485

yanone marked this pull request as ready for review December 14, 2023 14:05

yanone requested a review from simoncozens December 14, 2023 14:05

Take marks from "marks" category only

9bc4dd1

simoncozens merged commit da7c52d into main Dec 15, 2023
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Arabic shaping definitions #39

Arabic shaping definitions #39

yanone commented Dec 8, 2023 •

edited

Loading

NeilSureshPatel commented Dec 8, 2023 •

edited

Loading

Kamal-Mansour commented Dec 8, 2023

khaledhosny commented Dec 9, 2023

khaledhosny commented Dec 9, 2023 •

edited

Loading

yanone commented Dec 9, 2023

yanone commented Dec 14, 2023 •

edited

Loading

NeilSureshPatel commented Dec 14, 2023

simoncozens commented Dec 14, 2023

simoncozens commented Dec 14, 2023

yanone commented Dec 14, 2023 •

edited

Loading

kontur commented Dec 15, 2023 •

edited

Loading

simoncozens commented Dec 15, 2023

yanone commented Dec 15, 2023

simoncozens commented Dec 15, 2023

NeilSureshPatel commented Dec 15, 2023

Arabic shaping definitions #39

Arabic shaping definitions #39

Conversation

yanone commented Dec 8, 2023 • edited Loading

NeilSureshPatel commented Dec 8, 2023 • edited Loading

Kamal-Mansour commented Dec 8, 2023

khaledhosny commented Dec 9, 2023

khaledhosny commented Dec 9, 2023 • edited Loading

yanone commented Dec 9, 2023

yanone commented Dec 14, 2023 • edited Loading

NeilSureshPatel commented Dec 14, 2023

simoncozens commented Dec 14, 2023

simoncozens commented Dec 14, 2023

yanone commented Dec 14, 2023 • edited Loading

kontur commented Dec 15, 2023 • edited Loading

simoncozens commented Dec 15, 2023

yanone commented Dec 15, 2023

simoncozens commented Dec 15, 2023

NeilSureshPatel commented Dec 15, 2023

yanone commented Dec 8, 2023 •

edited

Loading

NeilSureshPatel commented Dec 8, 2023 •

edited

Loading

khaledhosny commented Dec 9, 2023 •

edited

Loading

yanone commented Dec 14, 2023 •

edited

Loading

yanone commented Dec 14, 2023 •

edited

Loading

kontur commented Dec 15, 2023 •

edited

Loading