This repository has been archived by the owner on Dec 3, 2018. It is now read-only.

Decomposition data for 75,000 CJK ideographs; fork (with fixes) of


amake/cjk-decomp


Note: This repo is unmaintained. For better data in a standard, actively maintained format, I strongly recommend using cjkvi-ids instead (see ids.txt).

CJK Decomposition Data

The CJK Decomposition Data File is a graphical analysis of the approximately 75,000 Chinese/Japanese characters in Unicode.

The data file is in UTF-8 encoding and was originally compiled by Gavin Grover (original project). It is distributed under 6 licenses, of which you only need choose one:

The data comprises the 36 strokes (U+31C0..U+31E3), the 115 radicals (U+2E80..U+2EF3, except U+2E9A), the 20,924 unified characters (U+4E00..U+9FBB), the 12 unique characters from the compatibility range (U+F900..U+FAD9), the 6,582 extension A characters (U+3400..U+4DB5), the 42,711 extension B characters (U+20000..U+2A6D6), the 4,149 extension C characters (U+2A700..U+2B734), and the 222 extension D characters (U+2B740..U+2B81D).
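The counts above follow directly from the code point ranges, which makes them easy to sanity-check. The sketch below recomputes each block size from its endpoints; the names `BLOCKS`, `COMPATIBILITY_COUNT`, and `block_size` are illustrative and not part of the data file. The 12 compatibility characters are a hand-picked subset of their range, so that figure is taken from the text rather than computed.

```python
# (first code point, last code point, number of excluded code points)
BLOCKS = {
    "strokes": (0x31C0, 0x31E3, 0),
    "radicals": (0x2E80, 0x2EF3, 1),  # U+2E9A is excluded
    "unified": (0x4E00, 0x9FBB, 0),
    "extension A": (0x3400, 0x4DB5, 0),
    "extension B": (0x20000, 0x2A6D6, 0),
    "extension C": (0x2A700, 0x2B734, 0),
    "extension D": (0x2B740, 0x2B81D, 0),
}

# Stated in the text; cannot be derived from the range U+F900..U+FAD9.
COMPATIBILITY_COUNT = 12


def block_size(first, last, excluded=0):
    """Number of code points in the inclusive range, minus exclusions."""
    return last - first + 1 - excluded


sizes = {name: block_size(*spec) for name, spec in BLOCKS.items()}
total = sum(sizes.values()) + COMPATIBILITY_COUNT

for name, size in sizes.items():
    print(f"{name}: {size}")
print(f"total: {total}")  # total: 74751
```

The total of 74,751 is where the "approximately 75,000" figure comes from.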

Each record has 3 fields, namely the character being defined, the type of decomposition, and a list of zero or more constituent components, like so:

的:a(白,勺)

The character being defined and each constituent component is either a Unihan token, in the basic or a supplemental plane, or a 5-digit number representing an intermediate decomposition not in Unicode. There are approximately 10,000 such intermediate decompositions.
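As a rough illustration of the record layout, here is a minimal parsing sketch. The names `Record`, `parse_record`, and `is_intermediate` are hypothetical helpers, not part of this repository. It assumes every record looks like the example above, including that a record with zero components is written with empty parentheses.

```python
import re
from typing import NamedTuple, Tuple

# token ":" type "(" comma-separated components ")"
# Assumption: the component list is always parenthesised, even when empty.
RECORD = re.compile(r"^(?P<token>[^:]+):(?P<kind>[^(]+)\((?P<parts>[^)]*)\)$")


class Record(NamedTuple):
    token: str              # character (or 5-digit number) being defined
    kind: str               # type of decomposition, e.g. "a"
    parts: Tuple[str, ...]  # zero or more constituent components


def parse_record(line: str) -> Record:
    """Split one line of the data file into its three fields."""
    match = RECORD.match(line.strip())
    if match is None:
        raise ValueError(f"not a decomposition record: {line!r}")
    parts = tuple(p for p in match["parts"].split(",") if p)
    return Record(match["token"], match["kind"], parts)


def is_intermediate(token: str) -> bool:
    """True for the 5-digit numbers that stand for non-Unicode components."""
    return len(token) == 5 and token.isascii() and token.isdigit()


record = parse_record("的:a(白,勺)")
print(record.token, record.kind, record.parts)
```

Because supplementary-plane characters are single code points in Python strings, the same function handles tokens from the extension B, C, and D ranges without special treatment.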

If you need a font, you can use the Hanazono font.

Only pictorial configurations are used, not semantic ones. Where characters have typeface differences I've used the one provided by the Mainland Chinese contribution to Unicode. When there's more than one possible configuration, I've selected one only.

The possible configurations and their meanings are:

| Code regex | Meaning | Number of possible constituents |
| --- | --- | --- |
| c | component | 0 |
| m.* | modified in some way, e.g. me=equivalent, msp=special, mo=outline, ml=left radical version | 1 |
| w.* | second constituent contained within first in some way, e.g. w=within at the center, wbl=within at bottom left | 2 |
| b[ad] | second between first, moving across (ba) or downwards (bd) | 2 |
| lock | components locked together | 2 |
| s.* | first component surrounds second, e.g. s=surrounds fully, str=surrounds around the top-right | 2 |
| a | flows across | >=2 |
| d | flows downwards | >=2 |
| r.* | repeats and/or reflects in some way, e.g. refh=reflect horizontally, rot=rotate 180 degrees, rrefr=repeat with a reflection rightwards, ra=repeat across, r3d=repeat 3 times downwards, r3tr=repeat in a triangle, rst=repeat surrounding around the top | 1 |

The s, a, d, and r codes may be followed by /t, /m, /s, or /o, to show whether the join touches, molds, snaps together, or overlaps, respectively.
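A configuration code can be mapped back to its family and join type with a few regular expressions. The sketch below is one possible reading of the rules above, not an official classifier: `FAMILIES`, `JOINS`, and `classify` are illustrative names, the labels are shortened paraphrases, and the "between" row is assumed to mean the two codes `ba` and `bd`.

```python
import re

# One pattern per family of configuration codes; matched against the whole code.
FAMILIES = [
    (re.compile(r"c"), "component"),
    (re.compile(r"m.*"), "modified"),
    (re.compile(r"w.*"), "within"),
    (re.compile(r"b[ad]"), "between"),  # assumption: ba or bd
    (re.compile(r"lock"), "locked together"),
    (re.compile(r"s.*"), "surrounds"),
    (re.compile(r"a"), "flows across"),
    (re.compile(r"d"), "flows downwards"),
    (re.compile(r"r.*"), "repeats or reflects"),
]

# Optional suffix after "/" describing how the components join.
JOINS = {"t": "touches", "m": "molds", "s": "snaps together", "o": "overlaps"}


def classify(code):
    """Return (family label, join label or None) for a configuration code."""
    base, _, suffix = code.partition("/")
    join = JOINS.get(suffix) if suffix else None
    for pattern, label in FAMILIES:
        if pattern.fullmatch(base):
            return label, join
    raise ValueError(f"unknown configuration code: {code!r}")


for code in ["a/t", "wbl", "str/o", "lock", "r3tr"]:
    print(code, classify(code))
```

Using `fullmatch` rather than `match` keeps the single-letter codes `a`, `d`, and `c` from swallowing longer codes, so the order of the entries in `FAMILIES` does not matter.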

Some more work needs to be done, including reducing the quantity of intermediate components by removing duplicates, lowering the number of components in many sequences, reanalysis of decomposition configurations, and of course quality checking and corrections.
