CJK Decomposition Data

Note: This repo is unmaintained. For better data in a standard format that is actively maintained, I strongly recommend using cjkvi-ids instead (see ids.txt)

The CJK Decomposition Data File is a graphical analysis of the approximately 75,000 Chinese/Japanese characters in Unicode.

The data file is in UTF-8 encoding and was originally compiled by Gavin Grover (original project). It is distributed under 6 licenses, of which you need to choose only one.

The data comprises the 36 strokes (U+31C0..U+31E3), the 115 radicals (U+2E80..U+2EF3, except U+2E9A), the 20,924 unified characters (U+4E00..U+9FBB), the 12 unique characters from the compatibility range (U+F900..U+FAD9), the 6,582 extension A characters (U+3400..U+4DB5), the 42,711 extension B characters (U+20000..U+2A6D6), the 4,149 extension C characters (U+2A700..U+2B734), and the 222 extension D characters (U+2B740..U+2B81D).
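As a rough illustration of the ranges listed above, the following sketch maps a character to the block it falls in. The block labels and the function names `block_of` and `block_size` are made up for this example and are not part of the data file. Note that the compatibility range test is over-inclusive, since only 12 characters from that range appear in the data.

```python
# Ranges copied from the paragraph above; the labels are illustrative only.
BLOCKS = [
    ("stroke", 0x31C0, 0x31E3),
    ("radical", 0x2E80, 0x2EF3),
    ("unified", 0x4E00, 0x9FBB),
    ("compatibility", 0xF900, 0xFAD9),  # only 12 of these are in the data
    ("extension A", 0x3400, 0x4DB5),
    ("extension B", 0x20000, 0x2A6D6),
    ("extension C", 0x2A700, 0x2B734),
    ("extension D", 0x2B740, 0x2B81D),
]

EXCLUDED = {0x2E9A}  # the one radical code point the text says is left out


def block_of(ch):
    """Return the label of the range containing ch, or None if outside all ranges."""
    cp = ord(ch)
    if cp in EXCLUDED:
        return None
    for label, lo, hi in BLOCKS:
        if lo <= cp <= hi:
            return label
    return None


def block_size(label):
    """Return the number of code points in a labelled range (exclusions not applied)."""
    for name, lo, hi in BLOCKS:
        if name == label:
            return hi - lo + 1
    raise KeyError(label)
```

The range sizes agree with the counts quoted above. The radical range spans 116 code points, which gives 115 once U+2E9A is excluded.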

Each record has 3 fields: the character being defined, the type of decomposition, and a list of zero or more constituent components. For example:

的:a(白,勺)

The character being defined and each constituent component is either a Unihan character, in the basic or a supplementary plane, or a 5-digit number representing an intermediate decomposition not in Unicode. There are approximately 10,000 such intermediate decompositions.
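A minimal sketch of reading one record in this format might look like the following. The names `Record`, `parse_record` and `is_intermediate` are invented for this example. The handling of an empty component list as `()` is an assumption about how zero-constituent records are written, not something the text above confirms.

```python
import re
from typing import NamedTuple

# character ":" type "(" comma-separated components ")"
RECORD_RE = re.compile(r"^([^:]+):([^()]+)\((.*)\)$")
# Intermediate decompositions are described as 5-digit numbers.
INTERMEDIATE_RE = re.compile(r"[0-9]{5}")


class Record(NamedTuple):
    char: str     # character (or 5-digit intermediate) being defined
    kind: str     # type of decomposition, e.g. "a"
    parts: tuple  # zero or more constituent components


def parse_record(line):
    """Split one line such as '的:a(白,勺)' into its three fields."""
    m = RECORD_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a decomposition record: {line!r}")
    char, kind, body = m.groups()
    # Assumption: an empty pair of parentheses means no constituents.
    parts = tuple(body.split(",")) if body else ()
    return Record(char, kind, parts)


def is_intermediate(token):
    """True if token is a 5-digit intermediate decomposition number."""
    return INTERMEDIATE_RE.fullmatch(token) is not None
```

For instance, `parse_record("的:a(白,勺)")` yields the character `的`, the type `a`, and the two components `白` and `勺`.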

If you need a font, you can use the Hanazono font.

Only pictorial configurations are used, not semantic ones. Where characters have typeface differences, I've used the form provided by the Mainland Chinese contribution to Unicode. When there's more than one possible configuration, I've selected only one.

The possible configurations and their meanings are:

| Code regex | Meaning | Number of possible constituents |
| --- | --- | --- |
| `c` | component | 0 |
| `m.*` | modified in some way, e.g. `me`=equivalent, `msp`=special, `mo`=outline, `ml`=left radical version | 1 |
| `w.*` | second constituent contained within first in some way, e.g. `w`=within at the center, `wbl`=within at bottom left | 2 |
| `ba\|d` | second between first moving across or downwards | 2 |
| `lock` | components locked together | 2 |
| `s.*` | first component surrounds second, e.g. `s`=surrounds fully, `str`=surrounds around the top-right | 2 |
| `a` | flows across | >=2 |
| `d` | flows downwards | >=2 |
| `r.*` | repeats and/or reflects in some way, e.g. `refh`=reflect horizontally, `rot`=rotate 180 degrees, `rrefr`=repeat with a reflection rightwards, `ra`=repeat across, `r3d`=repeat 3 times downwards, `r3tr`=repeat in a triangle, `rst`=repeat surrounding around the top | 1 |
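The table can be turned into a small lookup that checks whether a record has a plausible number of constituents for its code. This is only a sketch: the names `FAMILIES`, `family_of` and `count_ok` are hypothetical. The "between" row is read here as the two codes `ba` and `bd`, which is an assumption about the intent of that regex.

```python
import re

# (regex, meaning, minimum constituents, maximum constituents or None for no limit)
FAMILIES = [
    (r"c", "component", 0, 0),
    (r"m.*", "modified", 1, 1),
    (r"w.*", "within", 2, 2),
    (r"ba|bd", "between", 2, 2),  # assumption: the table's regex means ba or bd
    (r"lock", "locked together", 2, 2),
    (r"s.*", "surrounds", 2, 2),
    (r"a", "flows across", 2, None),
    (r"d", "flows downwards", 2, None),
    (r"r.*", "repeats or reflects", 1, 1),
]


def family_of(code):
    """Return (meaning, min_count, max_count) for a configuration code."""
    base = code.split("/")[0]  # ignore any trailing join marker such as /t
    for pattern, meaning, lo, hi in FAMILIES:
        if re.fullmatch(pattern, base):
            return meaning, lo, hi
    raise ValueError(f"unknown configuration code: {code!r}")


def count_ok(code, n):
    """True if n constituents is allowed for this configuration code."""
    _, lo, hi = family_of(code)
    return n >= lo and (hi is None or n <= hi)
```

Because every row starts with a different letter, the order of the rows does not affect which one matches.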

The s, a, d, and r codes may be followed by /t, /m, /s, or /o, to show whether the join touches, molds, snaps together, or overlaps, respectively.
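To make the join markers concrete, here is a hedged sketch that separates a code such as `a/t` into its base code and the kind of join. The function name `split_join` and the `JOINS` mapping are illustrative. The check on which base codes may carry a marker follows the sentence above literally.

```python
# Join markers and their meanings, as listed in the sentence above.
JOINS = {"t": "touches", "m": "molds", "s": "snaps together", "o": "overlaps"}


def split_join(code):
    """Split 'a/t' into ('a', 'touches'); a code without a marker gives (code, None)."""
    base, sep, suffix = code.partition("/")
    if not sep:
        return base, None
    if suffix not in JOINS:
        raise ValueError(f"unknown join marker in {code!r}")
    # Only the s, a, d, and r codes are described as taking a join marker.
    if not (base in ("a", "d") or base[:1] in ("s", "r")):
        raise ValueError(f"code {base!r} is not expected to carry a join marker")
    return base, JOINS[suffix]
```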

Some more work needs to be done, including reducing the quantity of intermediate components by removing duplicates, lowering the number of components in many sequences, reanalysis of decomposition configurations, and of course quality checking and corrections.
