CJK ExtensionB, etc. #13
Comments
Copied assignee over from the original issue. This is a request, essentially, for coverage of every single Unicode CJK point, which as Ken points out would take some time, so could be considered a feature request. But the submitter points out that 'Noto' means 'no tofu', so arguably it's a defect that we don't cover all of CJK even as defined in Unicode 3.1. |
This is working-as-intended for the current stage of the project. We do cover a small subset (< 3,000) of Plane 2 characters (CJK Ext B, C, D), such as those used in Hong Kong or included in JIS X 0213. However, full coverage of Plane 2 CJK Ideographs is beyond the scope of our current Pan CJK fonts. We're keenly aware of the lack of coverage for them. At this point, when and whether we'll cover them is still undecided. |
There are 60 characters in the extension planes that are official Taiwanese (Southern Min Chinese) characters, as designated by the Taiwanese Ministry of Education and listed in their comprehensive dictionary. They are in Unicode but not in any font that I know of (except for Hanazono). It would be great if these were in Noto Sans, as I do a lot of work on Taiwanese language materials and these characters always look different, since they fall back to Hanazono although I prefer to use Noto. There are also 7 characters missing from the Unicode spec, but I'm not sure how one would go about getting those added to the spec before adding them to fonts. Here are the lists: In Unicode but missing from Noto (download Hanazono font HanaMinB to see):
The character missing from HanaMinB, 𬦰, is 足百 put together, it exists in the Unicode spec but not in any fonts that I'm aware of. Missing from Unicode spec (join radicals left to right as single character):
Would be awesome if Noto CJK could include these characters, or at least the 60 that already have designated code points! I would be happy to make the vector images for these characters based on existing Noto CJK characters, if that would make it easier/faster to get them included in the next release. |
Is it possible to create additional fonts to give support for such characters? Hanazono Mincho did it. |
@dougfelt: Here's my response: Apparently, @anyong didn't look hard enough for the ideographs that were reported as unencoded, because all of them, except one that is in the process of being encoded, were lurking in Extension B: ⿰率刂 → U+207A9 𠞩 (Extension B) As to the list of 60 that were reported as missing from Noto Sans CJK, two of them are actually included (because they are within the scope of Hong Kong SCS-2008): U+25565 𥕥 (Extension B) 51 (including the two above) are in Extension B, one is in Extension C, seven are in Extension D, and one is in Extension E. In any case, being included in a dictionary may certainly qualify for encoding a character, particularly if it is a head entry. However, these characters are outside the current scope of the character set. The highest-priority issue for the forthcoming Version 2.000 update is to support Hong Kong SCS-2016 both in terms of characters and appropriate HK glyphs, and the next highest-priority issue is to support the additional ideographs for Korean names in Issue #80. Anything beyond that will depend on whether there are available CIDs, and will need to be considered on a case-by-case basis. There also needs to be mutual agreement between Adobe and Google, because ultimately someone will need to pay $¥₩ to have additional glyphs designed. |
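For anyone who wants to double-check which extension a given code point falls in, here is a minimal Python sketch. The ranges are the Unicode block boundaries for the CJK Unified Ideographs blocks; the names `CJK_BLOCKS` and `cjk_block` are illustrative choices of mine and not part of any Noto or Source Han tooling.

```python
# Unicode block boundaries (inclusive) for CJK Unified Ideographs.
# Note: a block range can contain a few unassigned positions at its end,
# so a hit here means "in this block", not "definitely an encoded character".
CJK_BLOCKS = [
    (0x3400, 0x4DBF, "Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs (URO)"),
    (0x20000, 0x2A6DF, "Extension B"),
    (0x2A700, 0x2B73F, "Extension C"),
    (0x2B740, 0x2B81F, "Extension D"),
    (0x2B820, 0x2CEAF, "Extension E"),
    (0x2CEB0, 0x2EBEF, "Extension F"),
]


def cjk_block(ch):
    """Return the name of the CJK block containing ch, or None if it is in none of them."""
    cp = ord(ch)
    for low, high, name in CJK_BLOCKS:
        if low <= cp <= high:
            return name
    return None


# The three code points mentioned in this thread.
for cp in (0x207A9, 0x25565, 0x24062):
    print(f"U+{cp:04X} {chr(cp)} -> {cjk_block(chr(cp))}")
```

Running it over a whole character list (for example the 60 dictionary characters) would give the per-extension counts directly, rather than tallying them by hand.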
ref: https://github.com/ethantw/moedict-idc/blob/master/cjk-ext.tsv#L103 |
@audreyt: Yep. Will edit my response above, from unencoded to U+24062 𤁢. |
Closing in on 2 years later, any possibility of getting these into Noto CJK? These characters not being available, especially on Android phones, is a major hindrance to writing Taiwanese in everyday conversation and thus further development of the sense of "normalcy" around it. Some characters listed above such as 𠢕 ("to be good at sth") have a very high frequency in Taiwanese, and are often replaced by similar sounding but meaningless-in-context characters such as 猴 (monkey). |
@anyong The short answer is: Don't hold your breath. The long answer is: Extension B is the proverbial "800-pound 🦍" when it comes to CJK Unified Ideographs. In related news, if Taiwan's standards weren't such a mess, it'd be possible to determine which characters beyond CNS 11643 Planes 1 and 2 (aka Big Five) are frequently used. While we eventually plan to support Extension B and beyond, it will take years and possibly a decade or more to achieve that goal. The image below puts Extension B into perspective, in terms of how much of Plane 2 (aka SIP) it occupies: |
@kenlunde I can certainly appreciate the problems presented by the size of Ext. B. On the other hand, I'm asking about a few dozen characters in particular that are defined by the Taiwanese Ministry of Education as part of the Taiwanese orthography. Certainly it should be possible to prioritize some small subsets over the entirety of Ext. B. Could you explain what you mean by Taiwan's standards being "such a mess"? I am sure @audreyt could help resolve any issues to get a definitive answer on what should be prioritized. At the very least, the characters in the MOE's set of frequently used Taiwanese characters should be included as they are indeed frequently used. |
@anyong See Source Han Sans Issue #222. In other words, the dot-release that is currently targeted for mid-April will not include glyphs for these characters, but they will be considered for the next major release. |
@kenlunde thanks! If there is anything I could do to help, I would be happy to. |
@anyong As long as the listing in that Source Han Sans issue is accurate, there's nothing further that you need to do. |
Thanks for discussing this @anyong - I'm keeping this issue open, as it seems it will eventually end up in a future release. |
@davelab6 I suggested closing this issue because 1) supporting additional CJK Unified Ideographs is so blatantly obvious; and 2) supporting Extension B and beyond in their entirety will take many years. |
Is this publicly documented anywhere? :) If not, let's chat about this next time we speak privately. I'd like to better understand any loose ideas about the roadmap. |
@davelab6 Affirmative. See Source Han Sans Issue 222. The reason why I suggest that this issue be closed is because it will take years or decades to fully support the CJK Unified Ideograph extensions. The graphic that I posted on 2019-01-12 illustrates this extraordinarily well. |
According to @kenlunde's image, it's a lot of work! 😮 When you have to render these characters, use other fonts (such as MS Gothic) for now. The number of characters to support is huge: altogether, there are 72,397 characters to design! That's not all: Unicode is growing year by year, and that number will grow together with Unicode. |
The numbers are even larger. Many of these characters need separate traditional and simplified glyph forms, I imagine (and that's what the issue @kenlunde referred to suggests). |
If you need a free font to fill the tofu blanks, you may consider https://github.com/cjkvi/HanaMinAFDKO/releases which is built from http://en.glyphwiki.org/wiki/GlyphWiki:MainPage — it can be used as a bearable fallback for Source Han Serif / Noto CJK Serif. |
For a list of fonts to cover (almost) all of Unicode 13 + ConScript, go here. |
Where do the graphics shown on https://decodeunicode.org/en/u+200D3 come from? They obviously came from somewhere. Is there such a "generic complete CJK Unicode font" out there, and are there software tools (OSX in my case) that can automatically substitute such "generic glyphs" if a particular glyph isn't available in whatever font is selected? My problem is in learning/dabbling with Chinese, and seeing all the "hex-code-boxes" that show up on Wiktionary for unusual glyphs (or for a part of a glyph, for instance showing a radical or a character simplification pattern; some are "standard radicals", some might be other "non-radical graphic patterns" (?), I don't know, obviously). For that matter, how were the code points (ALL of them) defined in the first place? Presumably when they were defined, there was SOME graphical representation used. So why isn't there a "complete CJK" font out there somewhere, even if it doesn't look particularly good? This probably has an obvious answer somewhere. The problem is how to find it. |
Hanazono Mincho has almost all CJK glyphs (or about 110,000 of them, at least). That's too many for one font file, so make sure to install both HanaMinA and HanaMinB. The download link is on the release page. Look for |
Wow, next year is the 10th anniversary of Noto Sans CJK! If Ext-B is the real blocker for increased coverage, is it possible to split Ext-B into various stages? Among the 42,720 characters in Extension B, 17,985 (~42%) are from the KangXi Dictionary and are not supported in Noto Sans CJK (probably a few more are unsupported in Serif CJK, but the figure should be close). Due to the legacy of the KangXi Dictionary, I'd argue that tofu characters from this dictionary are encountered relatively more frequently, and supporting them should benefit the largest community, including regions that did not submit reference glyphs for these characters, i.e. JP and HK. Of these 17,985 code points, 17,978 (~99%) have a G (CN) source, 17,338 (~96%) have a T (TW) source, and 83 (~0.5%) have a K (KR) source. None of them has a J (JP) or H (HK) source. I believe supporting the CN and TW sources is sufficient. My clueless speculation is that about 40-50% of the glyphs could be shared between CN and TW, probably a few more if subtle differences are ignored. For these less frequently used characters, being able to show a meaningful and recognizable glyph instead of a tofu on a device is more important than having a glyph whose strokes exactly match the one in the Unicode code chart. And I wonder if generative AI can play a part in a future iteration of the font... just a wild guess though. |
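The percentages in the comment above can be re-derived with a few lines of Python. This is only a sketch of the arithmetic using the counts quoted in that comment; the helper name `share` is an illustrative choice, and the counts themselves are taken as given rather than recomputed from the Unihan data.

```python
def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


EXT_B_TOTAL = 42_720          # characters in Extension B, as quoted above
KANGXI_UNSUPPORTED = 17_985   # KangXi characters in Ext B not in Noto Sans CJK

# Share of Extension B made up by the unsupported KangXi characters.
print(f"KangXi share of Ext B: {share(KANGXI_UNSUPPORTED, EXT_B_TOTAL):.1f}%")

# Share of those characters that have each regional source, as quoted above.
for label, count in (("G (CN)", 17_978), ("T (TW)", 17_338), ("K (KR)", 83)):
    print(f"{label} source: {share(count, KANGXI_UNSUPPORTED):.2f}%")
```

The G-source share comes out well above 99%, so the "~99%" in the comment is a conservative rounding.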
How about the characters from the following fonts (all sourced from Source Han Sans+Serif aka Noto Sans+Serif CJK) |
What steps will reproduce the problem?
What is the expected output? What do you see instead?
No tofu (it is right in the name of the font).
What version of the product are you using? On what operating system?
MacOS X 10.10.1 (Yosemite)
...
Er, uh, the same comment equally applies to Extensions C through E (Extension E
is being added to Unicode this year as Version 8.0). To play the devil's
advocate, I feel obliged to point out that it takes a non-trivial amount of
time to design glyphs for the tens of thousands of CJK Unified Ideographs that
are in Plane 2, so a great deal of patience is necessary.
...
Well, sure. Things do take time. At least we should probably document where
we are along the continuum from mo-tofu to noto-fu. I suggest a character set
coverage table or something. The project proposes to be a font for all of
Unicode, but if stuff like Extension B (which was in Unicode 3.1, and is 14
years old now) is always going to be on the nice-to-have list, then maybe we
need to call it "lessto", not "noto". I happen to care about Extension B
because my genealogy data includes given names which are out there in ExtB, so
they are (typically) tofu on my family tree website. It's also useful for
testing surrogate pair support with real text.
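
To illustrate the surrogate pair point: every Extension B character lies above U+FFFF, so UTF-16 has to encode it as two 16-bit code units. Below is a minimal Python sketch of the standard UTF-16 calculation; the function name `utf16_surrogates` is hypothetical, and the sample code point U+207A9 is one of the Extension B characters mentioned in this thread.

```python
def utf16_surrogates(ch):
    """Return the (high, low) UTF-16 surrogate pair for ch, or None for BMP characters."""
    cp = ord(ch)
    if cp < 0x10000:
        return None  # BMP characters fit in a single 16-bit code unit
    offset = cp - 0x10000           # 20-bit value split across the two surrogates
    high = 0xD800 + (offset >> 10)  # top 10 bits
    low = 0xDC00 + (offset & 0x3FF) # bottom 10 bits
    return high, low


ch = chr(0x207A9)
high, low = utf16_surrogates(ch)
print(f"U+{ord(ch):04X} -> {high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
assert ch.encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

Software that treats each 16-bit unit as a character will split or miscount such text, which is why real Extension B names make good test data.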
Moved from notofonts/noto-fonts#242