CJK ExtensionB, etc. #13

oliviazhu · 2015-06-09T23:08:42Z

What steps will reproduce the problem?

Install CJK fonts
Look at CJK text from Extension B
See tofu.
Sadness

What is the expected output? What do you see instead?
No tofu (it is right in the name of the font).

What version of the product are you using? On what operating system?
MacOS X 10.10.1 (Yosemite)
...
Er, uh, the same comment equally applies to Extensions C through E (Extension E
is being added to Unicode this year as Version 8.0). To play the devil's
advocate, I feel obliged to point out that it takes a non-trivial amount of
time to design glyphs for the tens of thousands of CJK Unified Ideographs that
are in Plane 2, so a great deal of patience is necessary.
...
Well, sure. Things do take time. At least we should probably document where
we are along the continuum from mo-tofu to noto-fu. I suggest a character set
coverage table or something. The project proposes to be a font for all of
Unicode, but if stuff like Extension B (which was in Unicode 3.1, and is 14
years old now) is always going to be on the nice-to-have list, then maybe we
need to call it "lessto", not "noto". I happen to care about Extension B
because my genealogy data includes given names which are out there in ExtB, so
they are (typically) tofu on my family tree website. It's also useful for
testing surrogate pair support with real text.
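
As a rough illustration of the surrogate-pair point, the sketch below shows how a single Extension B code point splits into two UTF-16 code units. The helper name `to_surrogates` is made up for this example, and U+207A9 is used only because it is cited later in this thread.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogates(0x207A9)

# Cross-check the hand-rolled split against Python's own UTF-16 encoder.
encoded = chr(0x207A9).encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")

print(f"U+207A9 -> {high:04X} {low:04X}")
```

Software that treats each UTF-16 code unit as one character will cut such text in the middle of a pair, which is why Plane 2 ideographs make a handy real-world test case.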

Moved from notofonts/noto-fonts#242

dougfelt · 2015-08-27T21:43:48Z

Copied assignee over from the original issue.

This is a request, essentially, for coverage of every single Unicode CJK code point, which, as Ken points out, would take some time, so it could be considered a feature request. But the submitter points out that 'Noto' means 'no tofu', so arguably it's a defect that we don't cover all of CJK even as defined in Unicode 3.1.

jungshik · 2015-08-28T23:19:35Z

This is working-as-intended for the current stage of the project.

We do cover a small subset (< 3,000) of Plane 2 characters (CJK Ext B, C, D) as used in Hong Kong, included in JIS X 0213 and so forth. However, full coverage of Plane 2 CJK Ideographs is beyond the scope of our current Pan CJK fonts.

We're keenly aware of the lack of coverage for them. At this point, when and whether we'll cover them is still undecided.

anyong · 2016-02-24T17:57:38Z

There are 60 characters in the extension planes that are official Taiwanese (Southern Min Chinese) characters, as designated by the Taiwanese Ministry of Education and listed in its comprehensive dictionary. They are in Unicode but not in any font that I know of (except for Hanazono). It would be great if these were in Noto Sans, as I do a lot of work on Taiwanese-language materials, and these characters always look different because they fall back to Hanazono, although I prefer to use Noto.

There are also 7 characters missing from the Unicode spec, but I'm not sure how one would go about getting those added to the spec before adding them to fonts.

Here are the lists:

In Unicode but missing from Noto (download the Hanazono font HanaMinB to see):

𫝛 𫟊 𪜶 𫞼 𬦰 𠯗 𧉅 𧉟 𠞭 𠢕
𢲸 𨂿 𨃟 𢻷 𢼌 𣮈 𩨑 𡳞 𩵱 𧜞
𪁎 𤖯 𩛩 𠕆 𥰔 𥕥 𥴊 𥽕 𢓜 𪁎
𥍉 𠲿 𤶃 𧮙 𩟗 𪐞 𤲍 𩏠 𦊓 𩸙
𧺤 𩚨 𢄧 𩜇 𣁳 𦟪 𤺅 𧿳 𧌄 𤺪
𧿬 𣍐 𢪱 𩸶 𦜆 𨂾 𫝏 𫝺 𫝻 𫟂

The character missing from HanaMinB, 𬦰, is 足百 put together; it exists in the Unicode spec but not in any fonts that I'm aware of.

Missing from Unicode spec (join radicals left to right as single character):

率刂
歹差
氐頁
氵雀戈
豖殳
疒哥
石匹

Would be awesome if Noto CJK could include these characters, or at least the 60 that already have designated code points!

I would be happy to make the vector images for these characters based on existing Noto CJK characters, if that would make it easier/faster to get them included in the next release.

KrasnayaPloshchad · 2017-05-02T15:44:09Z

Is it possible to create additional fonts to give support for such characters? Hanazono Mincho did it.

dougfelt · 2017-05-02T16:25:26Z

@jungshik, @kenlunde care to comment on the Min characters reportedly listed in the Taiwan MoE dictionary?

kenlunde · 2017-05-02T17:03:53Z

@dougfelt: Here's my response:

Apparently, @anyong didn't look hard enough for the ideographs that were reported as unencoded, because all of them, except one that is in the process of being encoded, were lurking in Extension B:

⿰率刂 → U+207A9 𠞩 (Extension B)
⿰歹差 → U+23A48 𣩈 (Extension B)
⿰氐頁 → U+2947E 𩑾 (Extension B)
⿲氵雀戈, ⿰氵𢧵, or ⿰𣼎戈 → U+24062 𤁢 (Extension B)
⿰豖殳 → U+27C35 𧰵 (Extension B)
⿸疒哥 → UTC-02663 (included in IRG Working Set 2015; aka Extension G)
⿰石匹 → U+25435 𥐵 (Extension B)

As to the list of 60 that were reported as missing from Noto Sans CJK, two of them are actually included (because they are within the scope of Hong Kong SCS-2008):

U+25565 𥕥 (Extension B)
U+280BE 𨂾 (Extension B)

51 (including the two above) are in Extension B, one is in Extension C, seven are in Extension D, and one is in Extension E.
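
A tally like the one above can be reproduced by checking each code point against the block ranges of the CJK Unified Ideographs extensions. The following is a minimal sketch, not the method actually used in this thread; the names `CJK_EXTENSION_BLOCKS` and `cjk_extension` are hypothetical, and the sample code points are only the ones spelled out explicitly in this comment.

```python
from collections import Counter

# Inclusive block ranges for the CJK Unified Ideographs extensions.
CJK_EXTENSION_BLOCKS = [
    ("Extension A", 0x3400, 0x4DBF),
    ("Extension B", 0x20000, 0x2A6DF),
    ("Extension C", 0x2A700, 0x2B73F),
    ("Extension D", 0x2B740, 0x2B81F),
    ("Extension E", 0x2B820, 0x2CEAF),
    ("Extension F", 0x2CEB0, 0x2EBEF),
    ("Extension G", 0x30000, 0x3134F),
]


def cjk_extension(code_point):
    """Return the extension block name for a code point, or None if outside them."""
    for name, first, last in CJK_EXTENSION_BLOCKS:
        if first <= code_point <= last:
            return name
    return None


# Code points cited explicitly in this comment.
cited = [0x207A9, 0x23A48, 0x2947E, 0x24062, 0x27C35, 0x25435, 0x25565, 0x280BE]

tally = Counter(cjk_extension(cp) for cp in cited)
for name, count in tally.items():
    print(f"{name}: {count}")
```

To tally a pasted string of characters instead, `Counter(cjk_extension(ord(ch)) for ch in text if not ch.isspace())` would work the same way.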

In any case, being included in a dictionary may certainly qualify for encoding a character, particularly if it is a head entry. However, these characters are outside the current scope of the character set. The highest-priority issue for the forthcoming Version 2.000 update is to support Hong Kong SCS-2016 both in terms of characters and appropriate HK glyphs, and the next highest-priority issue is to support the additional ideographs for Korean names in Issue #80. Anything beyond that will depend on whether there are available CIDs, and will need to be considered on a case-by-case basis. There also needs to be mutual agreement between Adobe and Google, because ultimately someone will need to pay $¥₩ to have additional glyphs designed.

audreyt · 2017-05-28T09:09:40Z

⿲氵雀戈 is encoded as U+24062 𤁢, no?

ref: https://github.com/ethantw/moedict-idc/blob/master/cjk-ext.tsv#L103

kenlunde · 2017-05-28T12:20:45Z

@audreyt: Yep. Will edit my response above, from unencoded to U+24062 𤁢.

anyong · 2019-01-13T02:20:37Z

Closing in on 2 years later, is there any possibility of getting these into Noto CJK? These characters not being available, especially on Android phones, is a major hindrance to writing Taiwanese in everyday conversation and thus to further development of a sense of "normalcy" around it. Some characters listed above, such as 𠢕 ("to be good at something"), have a very high frequency in Taiwanese, and are often replaced by similar-sounding but meaningless-in-context characters such as 猴 (monkey).

kenlunde · 2019-01-13T03:12:59Z

@anyong The short answer is: Don't hold your breath.

The long answer is: Extension B is the proverbial "800-pound 🦍" when it comes to CJK Unified Ideographs. In related news, if Taiwan's standards weren't such a mess, it'd be possible to determine which characters beyond CNS 11643 Planes 1 and 2 (aka Big Five) are frequently used. While we eventually plan to support Extension B and beyond, it will take years and possibly a decade or more to achieve that goal.

The image below puts Extension B into perspective, in terms of how much of Plane 2 (aka SIP) it occupies:

anyong · 2019-01-13T04:00:02Z

@kenlunde I can certainly appreciate the problems presented by the size of Ext. B. On the other hand, I'm asking about a few dozen characters in particular that are defined by the Taiwanese Ministry of Education as part of the Taiwanese orthography.

Certainly it should be possible to prioritize some small subsets over the entirety of Ext. B. Could you explain what you mean by Taiwan's standards being "such a mess"?

I am sure @audreyt could help resolve any issues to get a definitive answer on what should be prioritized. At the very least, the characters in the MOE's set of frequently used Taiwanese characters should be included as they are indeed frequently used.

kenlunde · 2019-01-13T19:09:33Z

@anyong See Source Han Sans Issue #222. In other words, the dot-release that is currently targeted for mid-April will not include glyphs for these characters, but they will be considered for the next major release.

anyong · 2019-01-17T02:51:46Z

@kenlunde thanks! If there is anything I could do to help, I would be happy to.

kenlunde · 2019-01-17T04:57:42Z

@anyong As long as the listing in that Source Han Sans issue is accurate, there's nothing further that you need to do.

davelab6 · 2019-04-15T22:47:55Z

Thanks for discussing this @anyong - I'm keeping this issue open, as it seems it will eventually end up in a future release.

kenlunde · 2019-04-15T22:58:22Z

@davelab6 The reasons why I suggested closing this issue are that 1) supporting additional CJK Unified Ideographs is so blatantly obvious; and 2) supporting Extension B and beyond in their entirety will take many years.

davelab6 · 2019-04-16T02:19:19Z

"they will be considered for the next major release"

Is this publicly documented anywhere? :) If not, let's chat about this next time we speak privately. I'd like to better understand any loose ideas about the roadmap.

kenlunde · 2019-04-17T03:01:28Z

@davelab6 Affirmative. See Source Han Sans Issue 222.

The reason why I suggest that this issue be closed is because it will take years or decades to fully support the CJK Unified Ideograph extensions. The graphic that I posted on 2019-01-12 illustrates this extraordinarily well.

ghost · 2021-04-16T09:59:39Z

According to @kenlunde's image, it's a bunch of work! 😮

When you have to render these characters, use other fonts (such as MS Gothic) for now.
Please see this.

These are the huge numbers of characters to support

There are:
- 6592 characters in Extension A
- 42718 characters in Extension B (which is the biggest)
- 4149 characters in Extension C
- 222 characters in Extension D
- 5762 characters in Extension E
- 7473 characters in Extension F
- 542 characters in Compatibility Supplement
- 4939 characters in Extension G
(Source: en.wikipedia.org)

Altogether, there are 72397 characters to design!
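
The total can be checked with a few lines of Python. This is just the arithmetic from the list above; the dictionary name `counts` is arbitrary, and the figures are copied from this comment rather than looked up independently.

```python
# Per-block character counts as quoted in this comment.
counts = {
    "Extension A": 6592,
    "Extension B": 42718,
    "Extension C": 4149,
    "Extension D": 222,
    "Extension E": 5762,
    "Extension F": 7473,
    "Compatibility Supplement": 542,
    "Extension G": 4939,
}

total = sum(counts.values())
largest = max(counts, key=counts.get)

print(f"total: {total}")
print(f"largest block: {largest} ({counts[largest] / total:.0%} of the total)")
```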

That's not all: Unicode is growing year by year, and the numbers above will grow together with Unicode.
So, @kenlunde is right.

twardoch · 2021-04-16T10:11:45Z

The numbers are even larger. Many of these characters need separate traditional and simplified glyph forms, I imagine (and that's what the issue @kenlunde referred to suggests).

twardoch · 2021-04-16T10:24:06Z

If you need a free font to fill the tofu blanks, you may consider https://github.com/cjkvi/HanaMinAFDKO/releases which is built from http://en.glyphwiki.org/wiki/GlyphWiki:MainPage — it can be used as a bearable fallback for Source Han Serif / Noto CJK Serif.

ghost · 2021-04-21T08:04:29Z

For a list of fonts to cover (almost) all of Unicode 13 + ConScript, go here.

Jimw338 · 2023-01-11T16:58:03Z

Where do the graphics shown on https://decodeunicode.org/en/u+200D3 come from? They obviously came from somewhere. Is there such a "generic complete CJK Unicode font" out there - and then software tools (OSX in my case) that can automatically substitute such "generic glyphs" if a particular glyph isn't available in whatever font is selected?

My problem is in learning/dabbling with Chinese, and seeing all the "hex-code-boxes" that show up on Wiktionary for unusual glyphs (or for a part of a glyph, for instance showing a radical or a character simplification pattern). Some are "standard radicals", some might be other "non-radical graphic patterns" (?); I don't know (obviously).

For that matter, how were the code-points (ALL of them) defined in the first place? Presumably when they were defined, there was SOME graphical representation used. So why isn't there a "complete CJK" font out there somewhere, even if it doesn't look particularly good.

This probably has an obvious answer.. somewhere.. The problem is how to find it..

anyong · 2023-01-12T16:20:03Z

Hanazono Mincho has almost all CJK glyphs (or about 110,000 of them, at least). That's too many for one font file, so make sure to install both HanaMinA and HanaMinB. The download link is on the release page. Look for hanazono-20170904.zip on that page.

tamcy · 2023-08-09T05:28:36Z

Wow, next year is the 10th anniversary of Noto Sans CJK!

If Ext-B is the real blocker for increased coverage, is it possible to split Ext-B into various stages? Among those 42,720 characters in Extension B, 17,985 (~42%) of them are from the KangXi Dictionary and are not supported in Noto Sans CJK (probably a bit more unsupported in Serif CJK, but the figure should be close).

Due to the legacy of the KangXi Dictionary, I'd argue that tofu characters in this dictionary are relatively more frequently encountered, and supporting them should benefit the largest community, including regions that did not submit reference glyphs for these characters, i.e. JP and HK.

Of these 17,985 code points, 17,978 (~99%) have a G (CN) source, 17,338 (~96%) have a T (TW) source, and 83 (~0.5%) have a K (KR) source. None of them has a J (JP) or H (HK) source. I believe supporting the CN and TW sources is sufficient. My clueless speculation is that about 40-50% of the glyphs could be shared between CN and TW, probably a little more if subtle differences are ignored. For these less frequently used characters, being able to show a meaningful and recognizable glyph instead of tofu on a device is more important than having a glyph whose strokes exactly match the one on the Unicode code chart.
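
For anyone who wants to re-derive the percentages, here is a small sketch of the arithmetic. The counts are taken from this comment as given; the helper `share` is a name invented for the example.

```python
def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole


EXT_B_TOTAL = 42720        # Extension B size quoted in this comment
KANGXI_UNSUPPORTED = 17985  # KangXi-sourced Extension B characters not yet supported

# Counts of those characters that have each regional source, as quoted above.
sources = {"G (CN)": 17978, "T (TW)": 17338, "K (KR)": 83}

print(f"KangXi share of Extension B: {share(KANGXI_UNSUPPORTED, EXT_B_TOTAL):.1f}%")
for name, count in sources.items():
    print(f"{name}: {share(count, KANGXI_UNSUPPORTED):.2f}%")
```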

And I wonder if generative AI can play a part in a future iteration of the font... just a wild guess though.

tommai4881 · 2023-08-14T14:54:42Z

How about the characters from the following fonts (all sourced from Source Han Sans+Serif aka Noto Sans+Serif CJK)
Gothic Nguyên: https://github.com/TKYKmori/Gothic-Nguyen
Minh Nguyên: https://github.com/TKYKmori/Minh-Nguyen

dougfelt added the enhancement label Aug 27, 2015

dougfelt assigned ghost Aug 27, 2015

dougfelt added Type-Defect and removed enhancement labels Aug 27, 2015

dougfelt closed this as completed Aug 27, 2015

dougfelt reopened this Aug 27, 2015

jungshik added Type-Enhancement and removed Type-Defect labels Aug 28, 2015

roozbehp added the Android label Dec 10, 2015

kenlunde mentioned this issue Dec 21, 2015

Does not support GB18030-2005 characters display #54

Closed

roozbehp added the Priority-Medium label Feb 17, 2016

shiami mentioned this issue Jan 29, 2017

臺語的漢字字型仍無法顯示所有字 ChhoeTaigi/TaigiDict_Android#5

Open

kenlunde mentioned this issue Jan 13, 2019

Consolidation of V3 Character/Glyph Addition Suggestions adobe-fonts/source-han-sans#222

Open

davelab6 added Out of Scope and removed Out of Scope labels Apr 15, 2019

tamcy mentioned this issue Feb 19, 2020

Feedback #165

Closed

punchcutter mentioned this issue May 12, 2021

Doesn't render 𤯥 U-24BE5. adobe-fonts/source-han-sans#299

Closed
