Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split kerning by script, not by direction (second attempt) #667

Closed
wants to merge 107 commits into from

Conversation

madig
Copy link
Collaborator

@madig madig commented Oct 11, 2022

This pulls the commits for #636 and some follow-ups aside, to unblock releases from main until the code is more thoroughly tested.

Relevant issues:

Todo:

  • Track down the issue in RTL kerning missing from font #658
  • Do a bit of before-after compilation of single UFOs to compare their kerning and see if splitting by script introduces differences when it shouldn't.
  • Figure out what would need to be done to make variable fonts compile and behave correctly when not all kerning pairs are defined across all sources. The issue already existed before, but since we were splitting by direction, which is a much coarser property, it rarely was problematic. separate issue

Closes #658

Base automatically changed from revert-kern-splitter to main October 14, 2022 10:28
@madig
Copy link
Collaborator Author

madig commented Oct 17, 2022

While investigating #658, I stumbled over the following:

  1. The splitter stuffs foreign script kern pairs into lookups. E.g. lookup_Latn may suddenly contain both pos [Y ...] [A ...] and pos [Upsilon ...] [A ...] because the original pos @kern1.Y @kern2.A that contained their union was split up into the same lookup. I'd assume they should land in different ones?
  2. The lookup statements are unsorted, potentially triggering a bug in https://github.com/fonttools/fonttools/blob/64fd837ca1c2669f7e0e9d41f47b376e3fa153c3/Lib/fontTools/otlLib/builder.py#L1252-L1277, where a subtable break is inserted before a value for a class pair is declared but after the classdefs are associated with the subtable already, leading to a kerning value of zero entry being inserted in the earlier subtable (because https://github.com/fonttools/fonttools/blob/64fd837ca1c2669f7e0e9d41f47b376e3fa153c3/Lib/fontTools/otlLib/builder.py#L2131 is None, None), while the next one has the correct one, but is ignored by shapers. Additionally, sorting the statements manually shaved 10 KB off the font size on my test font source. Reverting to fontTools 4.37.1 (before Promote pair-pos value records fonttools/fonttools#2776) does not help.
  3. [otlLib] ClassPairPosSubtableBuilder logic is wrong fonttools/fonttools#2793 says there may be bigger problems lurking?!

@madig
Copy link
Collaborator Author

madig commented Oct 17, 2022

So, in the test font provided by yanone, the feature file does not declare languagesystem grek dflt; explicitly in the features.fea file, but the kern writer uses these declarations to sort kern pairs into script buckets. Greek glyphs are then considered script-less, which is wrong, and they get dumped into random other lookups. Uh. Should the writer even do this when kerning typically touches more scripts that GSUB? I'd derive the scripts in use purely from the glyph set? @simoncozens

@madig
Copy link
Collaborator Author

madig commented Oct 18, 2022

Side-note: the code in

for flags, pairs in sorted(pairsByFlags.items()):
sorts the kerning pairs by type, but this is later undone in
def _makeSplitScriptKernLookups(
when splitting a pair can result in different types of kerning pairs. The sorting can probably be moved down to the split kern emitter.

@madig
Copy link
Collaborator Author

madig commented Oct 18, 2022

From call:

  1. Split kerning pairs by script into buckets
  2. Use fontTools.misc.classifyTools within each bucket to make disjoint kern1 classes so that coverage per PairPos subtable is unique (to avoid potential fontTools otlLib subtable break bug).
    1. Collect all kern1 classes from the bucket, pass them to classifyTools
    2. Get back a mapping of {glyph: {disjoint set for that glyph}}
    3. For each "pos kern1 kern2 value;" in the bucket:
      1. smaller_kern1s = group_by(kern1, glyph => mapping[glyph])
      2. emit ["pos smaller_kern1 kern2 value;" for each smaller_kern1 in smaller_kern1s]
  3. Sort the whole bucket to have:
    • all the enum pos and glyph-to-glyph pos first, then the class-to-class pos second;
    • the "pos small_kern1" with the same kern1 next to each other

@madig
Copy link
Collaborator Author

madig commented Oct 18, 2022

Also, the PairPos coverage must be unique inside a lookup, otherwise things are unreachable. That's an invariant to follow.

@madig
Copy link
Collaborator Author

madig commented Oct 18, 2022

So, regarding #667 (comment), we either:

  1. Tell the designer to add a Greek languagesystem, or
  2. Insert the lookup under DFLT (if exists, otherwise error?), or
  3. Just use the script in the lookup even when the languagesystem is not declared.

Maybe I'll try the latter and see what happens in various shaping engines.

@anthrotype
Copy link
Member

anthrotype commented Oct 18, 2022

the feature file does not declare languagesystem grek dflt; explicitly in the features.fea file, but the kern writer uses these declarations to sort kern pairs into script buckets. Greek glyphs are then considered script-less, which is wrong, and they get dumped into random other lookups.

This thing of using the globally declared default languagesystems pre-dates the split-kerning-by-scripts PR, but maybe doesn't make sense any more and should be revised.

In the previous code in which we'd only split lookups between LTR/RTL, the default languagesystems were parsed and grouped into buckets depending on whether they were associated with "kern" or "dist", and on the script's horizontal writing direction (LTR/RTL), stored in self.context.scriptGroups dict of dicts (see _groupScriptsByTagAndDirection)

Then, if there was no globally declared script associated with "dist" feature, none would be generated.

Then, kerning pairs would be split into three lookups: LTR, RTL and a kern_dflt for pairs whose glyphs' script/bidi properties don't fit either LTR or RTL (e.g. punctuation, numbers, etc.). Finally, when deciding under which script/language/feature (kern vs dist) to register each of these lookups, again these script groupings derived from the global default languagesystems would be used (e.g. see _registerKernLookups). E.g. if kerning.plist contains pairs between LTR glyphs but default languagesystems only declares RTL scripts and no LTR ones, then the kern_ltr lookup would be registered under DFLT/dflt.

What we could have done maybe was to register such kern_ltr, not under DFLT/dflt, but under all the LTR scripts that are actually used therein e.g. by adding script grek; and such statements, even if they are not globally declared in the default languagesystems (which we know it's ok). I'm honestly not sure and can't remember why we didn't actually do that...
Maybe we didn't feel like we could safely infer stuff and preferred to leave it to the font developer to decide about.
Or maybe it was just simpler and OK enough that way.

Even so, imposing the font developer to write explicit default languagesystems to control the behavior of the kern feature writer may not always be desirable, as those apply indiscriminately to both GPOS and GSUB. In the example brought forward by Nikolaus, the GSUB features work fine without Greek being declared among the default languagesystems, but GPOS kerning may need those otherwise ufo2ft places greek kerning under DFLT/dflt; but if greek were added to default languagesystems then the meaning of all the pos/sub rules elsewhere that are un-marked by explicit script/language statements would suddenly also be registered for Greek too - though i guess they'd be no-op since no greek glyphs are probably used in them...

I'm warming up to the idea that for the split-by-script kern writer, we should just ignore the global languagesystem and use whatever scripts that we find among the kerned glyphs.

Thoughts @simoncozens @khaledhosny @madig ?

@simoncozens
Copy link
Contributor

That sounds good to me, the only issue being that we still need to look at the languagesystem to be aware of what shaper the font goes through, so that we can put kerns in kern or dist appropriately. So we still end up looking at the languagesystems...

@anthrotype
Copy link
Member

so that we can put kerns in kern or dist appropriately

once we know what scripts are used in kerning pairs, and have split these accordingly, we also know whether there are dist-enabled scripts or not

@madig
Copy link
Collaborator Author

madig commented Oct 18, 2022

It seems that the new kern writer should be reworked a bit to adjust for our new assumptions. I'll try and draw up an algorithm to better wrap my mind around it.

Also, blindly using unicodedata.script_extension instead of

def knownScriptsPerCodepoint(self, uv):
if not self.context.knownScripts:
# If there are no languagesystems, consider everything common;
# it'll all end in DFLT/dflt anyway
return COMMON_SCRIPT
return [
x
for x in unicodedata.script_extension(chr(uv))
if x in self.context.knownScripts or x in DFLT_SCRIPTS
]
gives me additional lookup kern_Thaa and lookup kern_Yezi blocks for Arabic numerals :) I suppose they don't hurt, but are they useful?

@madig
Copy link
Collaborator Author

madig commented Oct 18, 2022

@anthrotype what's the purpose of dealing with bases and marks in the kern writer?

@anthrotype
Copy link
Member

I am not sure we want to use script_extension.. that may add unwanted/unccessary script records. How about we extend the global default languagesystems with the set of scripts (not script_extensions!) actually used by the kerning pairs. You make a single kern_Arab lookup and also register the same under Thana or Yezi or whatever "extension" scripts
but only iff those are defined in the global languagesystems..

@anthrotype
Copy link
Member

aargh this is getting too complicated now, too much guessing. Maybe using the default languagesystems as we were/are is not so bad after all, at least it's explicit and leaves the responsibility to the font developer. 🤔

@madig
Copy link
Collaborator Author

madig commented Oct 18, 2022

I think we should follow the guessing/inference approach until we hit an actual roadblock. I'll look into using the script rather than script_extension.

Also, I found another problem: enum pos em-cy [oslash.BRACKET.150 oslashacute.BRACKET.150] -10;. I think ufo2ft does not currently access Designspace rules for script discovery and the rules are tacked onto the binary not through features, but by writing into the final binary later in varLib. This particular line was in the feature output of a single UFO, so I think there's little we can do there. We probably need to pass on the info somehow at least when compiling from a Designspace, to avoid merging weirdness later?

@madig
Copy link
Collaborator Author

madig commented Oct 18, 2022

Note to self: look at trying to preserve the kerning pair type (glyph to glyph, class to class, ...) when splitting pairs, by emitting eg pos [a] [b c] instead of enum pos a [b c].

@madig
Copy link
Collaborator Author

madig commented Oct 18, 2022

You make a single kern_Arab lookup and also register the same under Thana or Yezi or whatever "extension" scripts

Maybe one could also classifyTools these specific multi-script pairs, group them into their own lookup and insert references in feature kern for each script 🤔

@madig
Copy link
Collaborator Author

madig commented Oct 18, 2022

Hm, if some codepoints are used in multiple scripts (e.g. Arabic numerals in Arabic and Thaana), is just storing them for Arabic gonna work? Are shapers going to run thaa and arab scripts? I think I'll keep the script_extensions for now and figure something out later.

@madig
Copy link
Collaborator Author

madig commented Oct 19, 2022

More thinking... since the BRACKET glyphs aren't reachable in the UFO, should the kern writer actually do a reachability analysis (look for Unicoded glyphs plus scour GSUB; additionally scan Designspace rules) and ignore glyphs that aren't reachable?

@anthrotype
Copy link
Member

feature writers work on ufos, they don't know about designspace..

@madig
Copy link
Collaborator Author

madig commented Oct 19, 2022

Hm. So, how do we pass in higher level information? Add new attribute additionalGlyphSubstitutions: dict[GlyphName, GlyphName to BaseFeatureCompiler, mirroring Designspace rules? That may not be enough though, because feature writers preceeding the kern writer could add substitutions and thereby influence script discovery. Or, well, such modifications work on the feature file, so compileGSUB would include them. So this may work?

@anthrotype
Copy link
Member

anthrotype commented Oct 19, 2022

it might work but feels iffy

@anthrotype
Copy link
Member

i'm thinking we should provide a way for users to override ufo2ft's inference of unicode properties such as script/bidi category based on cmap+GSUB for specific glyphs, we might need that anyway to support Glyphs.app's RTL kerning #658 (comment)

@madig
Copy link
Collaborator Author

madig commented Oct 19, 2022

So, what do we want to be able to pass in? A glyphUnicodeMapping: dict[str, int] | None where the latter would trigger inference and the former is considered complete?

@madig
Copy link
Collaborator Author

madig commented Oct 19, 2022

Access to the Designspace is solved with a new VariableFeatureCompiler in https://github.com/googlefonts/ufo2ft/pull/635/files#diff-7d78e1bc9030aa197442bd5c7eca11381272a8c20f3467bbd7bea27d12bc17eeR355-R393. Having a separate glyphUnicodeMapping might still be useful.

@anthrotype
Copy link
Member

Having a separate glyphUnicodeMapping might still be useful.

ok let's do that in a different pr though

@madig
Copy link
Collaborator Author

madig commented Oct 19, 2022

I believe my WIP does what we need, all class-to-class pairings now seem to have a disjoint kern1 classes I believe, from looking at output for yanone's private real world font right here.

I did spot two things though in an excerpt, one weird (two kern2 classes for the same kern1 parent that were almost identical except for one additional glyph???) and the other a possible size optimization opportunity. Consider the following class to class pairings:

  • [A B C] [a b] -10
  • [A B C] [c d] -10
  • [A B C] [e f] -20

can we collapse them into the following?

  • [A B C] [a b c d] -10
  • [A B C] [e f] -20

Or is this an optimization that only the otlLib subtable builder could do because kern2s have to be disjoint per subtable?

@madig madig closed this Feb 20, 2023
@khaledhosny khaledhosny deleted the kern-splitter-v2 branch March 8, 2023 21:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RTL kerning missing from font
3 participants