Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arabic shaping definitions #39

Merged
merged 6 commits into from
Dec 15, 2023
Merged

Arabic shaping definitions #39

merged 6 commits into from
Dec 15, 2023

Conversation

yanone
Copy link
Contributor

@yanone yanone commented Dec 8, 2023

I accidentally pushed everything in one commit; I wanted to separate it.
Please take a look at this. It works for my own font for "ar_Arab".

I would still change a few things:

  • * FAIL: Shaper didn't attach uni0670 to space: The no_orphaned_marks check needs an exception for a preceding space, like there is one already for the dotted circle. These are the Arabic marks that are defined in gflanguages in base characters, separated by spaces.
  • The manually defined marks sequences aren't complete. I defined them manually because they contain a few pitfalls like people typing two consecutive fatha that should get replaced to a fathatan. (Which I should also add to shaping_differs) A few more combinations are possibly missing here. I would keep these manually defined ones but prune them to the marks actually defined in each language definition, and also automatically add the defined ones to that list, basically combine the mark sequences defined in the Python script plus the ones defined in the lang def, then prune to those actually defined in the lang def. Sorry for the complicated language.
  • I would also add a shaping_identical check where I can check that fatha+shadda (etc.) shapes to the same result as shadda+fatha, because people type them in arbitrary order.

This is how far I got. Just wanted to hear your opinion first before I continue.

Two existing scripts got reformatted by Black because I renamed the data folder. I could change that back if you like.

@NeilSureshPatel
Copy link
Contributor

NeilSureshPatel commented Dec 8, 2023

@yanone these are interesting use cases. A few thoughts on the various points.

Regarding the exceptions for spaces preceding diacritics, I wonder if this should be fixed in gflanguages. It seems that using the dotted circle before diacritics is the norm for all other profiles.

I think @simoncozens would know better but I believe that changing two fathas in sequence is not canonically equivalent to a fathathan in Unicode. I have implemented similar functions in fonts but, I think in principle this is something that is supposed to be handled by the input method and not by OpenType substitutions.

@simoncozens for the shadda+fatha sequence is this something that is normalized in Unicode. I can't recall if normalization is processed in shaperglot before testing but I would guess harfbuzz would do that automatically.

@Kamal-Mansour
Copy link

At the level of Unicode characters, two fathas are certainly not canonically equivalent to fathatan. Similarly for dammatan and kasratan. All three indicate case for a word and are known as "nunation" marks (or, tanween) because they combine a short vowel [a, i, or u] with a final nun (-an, -in, -un). Nunation marks appear only at the end of a word, though some people prefer to place the fathatan on the penultimate letter before the final alef.

@khaledhosny
Copy link

  • I would also add a shaping_identical check where I can check that fatha+shadda (etc.) shapes to the same result as shadda+fatha, because people type them in arbitrary order.

HarfBuzz will reorder then anyway (following https://www.unicode.org/reports/tr53/), so a check using HarfBuzz for shaping will not catch whether the font does anything here or not.

@khaledhosny
Copy link

khaledhosny commented Dec 9, 2023

  • The manually defined marks sequences aren't complete. I defined them manually because they contain a few pitfalls like people typing two consecutive fatha that should get replaced to a fathatan.

Two consecutive fathas should certainly not be replaced by fathatan. Some fonts used to allow two fathas, dammas, and kasras for open fathatan, dammadan, and kasratan (U+08F0–08F2) before they were added to Unicode, but Uniscribe does not allow this (it will insert a dotted circle between the marks) and Unicode elected to atomically encode them instead.

@yanone
Copy link
Contributor Author

yanone commented Dec 9, 2023

HarfBuzz will reorder then anyway (following https://www.unicode.org/reports/tr53/), so a check using HarfBuzz for shaping will not catch whether the font does anything here or not.

That's true and very frustrating. I just recompiled my font without the font-level reordering and confirmed that Indesign 2023 (18.2.1) doesn't reorder them on its own, which is what I always thought anyway.

Otherwise, I'll take everyone's advice here and return with newer results in a little while.

@yanone
Copy link
Contributor Author

yanone commented Dec 14, 2023

I’ve updated the Arabic definition assembler regarding the marks: It now regularly checks only for the non-spacing marks that are actually defined for each language, plus a few extra cases that are manually defined per language inside the build script, such as fatha+shadda and kasra+shadda for ar_Arab, to check for a functioning mkmk feature.

Please feel free to suggest more combinations. I thought having these will suggest that all mkmk cases are working, but it might be good to be exhaustive here, now that we’re already doing the work. I’m not familiar with languages other than Arabic.

In parallel, I have a draft PR at the language repo where I added the dotted circle to all marks (among other changes).

I don't see any need for further changes at this point, so removing the "draft" status. Please review.

@yanone yanone marked this pull request as ready for review December 14, 2023 14:05
@NeilSureshPatel
Copy link
Contributor

I like the mkmk test. I am super familiar with the other languages but this will be a good framework for African languages which use extensive vocalization.

@simoncozens
Copy link
Collaborator

This looks nice. I'm going to test it with a bunch of fonts and see what it says about them.

@simoncozens
Copy link
Collaborator

Two quick hits:

  1. This also needs your lang PR (Arabic and Pashto improvements lang#136)
  2. strictyaml, at least version 1.7.3, takes 13 seconds on a moderately powerful machine to parse that YAML file. I think we need to reconsider our commitment to strictyaml...

@yanone
Copy link
Contributor Author

yanone commented Dec 14, 2023

2. I think we need to reconsider our commitment to strictyaml

This brings me back to what I mentioned to you in Chat the other day:

So far, all definitions in shaperglot are generated by a script. This makes me question why the existing definitions even exist, and why the checks aren't executed live instead. And the definitions don’t even include c2sc yet for Latin, only smcp, and only for African Latin languages, not even European. All this data is incomplete and unnecessary so far.

Running the checks live would:

  • Save the yaml loading time
  • Save disk space for the definitions. The folder is 3.3MB right now, and it's not even complete for Latin as outlined above. Where is this gonna go?
  • Eliminate the need to regenerate the definitions every time there are changes in gflanguages. This is a biggy IMO, because it would mean that whenever the language definitions are updated, one would only have to update the dependency of gflanguages in fontbakery for example, and could leave shaperglot alone entirely. Otherwise it would mean to recompile the shaperglot definitions for even the tiniest change in gflanguages, and update both dependencies in all apps. This is inefficient.

I still think manual definitions should exist for cases like the Dutch IJ situation. But I would eliminate definitions for all auto-generated checks. Maybe not as part of this PR.

@kontur
Copy link

kontur commented Dec 15, 2023

Comment from the sidelines... I agree with @yanone's last comment. What if the checks were defined in the library as arbitrary segments relating to a script or typographic feature that would have ranges of unicodes (of the language) that trigger them, and per-language yaml definitions would tell the checker what segments of checks to opt-in to for this language (e.g. check Arabic shaping, check Latin smallcaps, check fractions...), and the unicodes of each language file would trigger the specific checks for glyph combinations? You'd still have quite some fine grained control over the check definitions, and the check suites would be more composable on a per-language level. Less brute force repetition, less overhead.

@simoncozens
Copy link
Collaborator

I think it's a good idea, and I remember that I initially suggested it to Neil when he was working on the African profiles - it seemed a bit odd to have a script pre-generating rules, then store those rules in the repo, and load in the file, instead of just generating the rules at runtime. At the time he was building profiles out of the source repo and it was the easiest way for him to develop and test stuff, so we went with that workflow at the time. I will work on a way for people to dump profileprovider modules into a directory and have shaperglot run through those and use them to inject checks into one or more language profiles.

@yanone
Copy link
Contributor Author

yanone commented Dec 15, 2023

I made one minor change after a comment of @moyogo on the lang PR:

Over there I moved the Arabic marks into its own marks category, and over here I take those marks from the marks category only. This will force us to cleanly separate the marks from the base for future language definition updates.

We seem to mostly agree to convert checks to run-time. Do we still want to merge this PR right now so that at least this step is achieved? I would like that to be able to check it off my list. I assume Simon would want to restructure the code to provide the checks separate of this PR, and the code that generated the Arabic checks is sitting right there ready to be converted.

@simoncozens
Copy link
Collaborator

Sounds good to me. I can clean up later.

@simoncozens simoncozens merged commit da7c52d into main Dec 15, 2023
4 checks passed
@NeilSureshPatel
Copy link
Contributor

I think it's a good idea, and I remember that I initially suggested it to Neil when he was working on the African profiles - it seemed a bit odd to have a script pre-generating rules, then store those rules in the repo, and load in the file, instead of just generating the rules at runtime. At the time he was building profiles out of the source repo and it was the easiest way for him to develop and test stuff, so we went with that workflow at the time. I will work on a way for people to dump profileprovider modules into a directory and have shaperglot run through those and use them to inject checks into one or more language profiles.

I agree. It's a bit clunky. At the time we were shifting from rules that were built around checking cluster positions to ones built around failure modes. During that phase it was easier to figure out what kinds of tests would work and how to structure them by building them manually. Now that we have figured out what we need and have more use cases, it makes sense to move to run-time checks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants