This project consists of a library that provides access to some of the data contained in the Unicode Character Database.
Supported data versions: Unicode 13.0, Emoji 13.0.
UnicodeRadicalStrokeCount.StrokeCount is now of type System.SByte instead of type System.Byte.
Grab the latest version of the package on NuGet: https://www.nuget.org/packages/UnicodeInformation/. Once the library is installed in your project, you will find everything you need in the System.Unicode namespace.
Everything provided by the library is under the namespace System.Unicode.
XML documentation should be complete enough so that you can navigate the API without getting lost.
In its current state, the project is written in C# 7.3, compilable by Roslyn, and targets both .NET Standard 2.0 and .NET Standard 1.1. The library UnicodeInformation includes a (large) subset of the official Unicode Character Database stored in a custom file format.
The following program displays information about a few characters:
    using System;
    using System.Text;
    using System.Unicode;

    namespace Example
    {
        internal static class Program
        {
            private static void Main()
            {
                // Let the console emit non-ASCII characters.
                Console.OutputEncoding = Encoding.Unicode;

                PrintCodePointInfo('A');
                PrintCodePointInfo('∞');
                PrintCodePointInfo(0x1F600); // Outside the BMP, so passed as an int.
            }

            private static void PrintCodePointInfo(int codePoint)
            {
                var charInfo = UnicodeInfo.GetCharInfo(codePoint);

                Console.WriteLine(UnicodeInfo.GetDisplayText(charInfo));
                Console.WriteLine("U+" + codePoint.ToString("X4"));
                // Fall back to the Unicode 1.0 name when there is no current name.
                Console.WriteLine(charInfo.Name ?? charInfo.OldName);
                Console.WriteLine(charInfo.Category);
            }
        }
    }
Explanations:
- UnicodeInfo.GetCharInfo(int) returns a UnicodeCharInfo structure that provides access to various bits of information associated with the specified code point.
- UnicodeInfo.GetDisplayText(UnicodeCharInfo) is a helper method that computes a display text for the specified code point. Since some code points are not designed to be displayed standalone, it tries to make the specified character more displayable. The algorithm is quite simplistic and only affects very specific code points (e.g. control characters). For most code points, it simply returns the direct string representation.
- UnicodeCharInfo.Name returns the name of the code point as specified by the Unicode standard. Note that some characters (e.g. control characters) have, by design, no name assigned to them in the standard. Those characters may, however, have alternate names that you can use as fallbacks (e.g. UnicodeCharInfo.OldName).
- UnicodeCharInfo.OldName returns the name of the character as defined in Unicode 1.0, when applicable and different from the current name.
- UnicodeCharInfo.Category returns the category assigned to the specified code point.
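Since UnicodeInfo.GetCharInfo takes an int code point, text held in a .NET string (UTF-16) has to be split into code points first. The following is a minimal sketch of one way to do that using only the base class library; the class and method names (CodePointDemo, EnumerateCodePoints) are hypothetical and not part of the library.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointDemo
{
    // Enumerates the Unicode code points of a UTF-16 string,
    // combining each valid surrogate pair into a single value.
    public static IEnumerable<int> EnumerateCodePoints(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                yield return char.ConvertToUtf32(text, i);
                i++; // Skip the low surrogate that was just consumed.
            }
            else
            {
                // BMP character, or a lone surrogate returned as-is.
                yield return text[i];
            }
        }
    }

    public static void Main()
    {
        // 'A', U+221E (infinity) and U+1F600, written with escapes.
        foreach (int codePoint in EnumerateCodePoints("A\u221E\U0001F600"))
        {
            // Each value could be passed to UnicodeInfo.GetCharInfo(codePoint).
            Console.WriteLine("U+" + codePoint.ToString("X4"));
        }
    }
}
```

Run as is, this prints U+0041, U+221E and U+1F600, one per line. Lone surrogates are passed through unchanged here; a stricter caller might prefer to reject them.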
Properties included from the UCD:
- Name
- General_Category
- Canonical_Combining_Class
- Bidi_Class
- Decomposition_Type
- Decomposition_Mapping
- Numeric_Type (See also kAccountingNumeric/kOtherNumeric/kPrimaryNumeric. Those will set Numeric_Type to Numeric.)
- Numeric_Value
- Bidi_Mirrored
- Unicode_1_Name
- Simple_Uppercase_Mapping
- Simple_Lowercase_Mapping
- Simple_Titlecase_Mapping
- Name_Alias
- Block
- ASCII_Hex_Digit
- Bidi_Control
- Dash
- Deprecated
- Diacritic
- Extender
- Hex_Digit
- Hyphen
- Ideographic
- IDS_Binary_Operator
- IDS_Trinary_Operator
- Join_Control
- Logical_Order_Exception
- Noncharacter_Code_Point
- Other_Alphabetic
- Other_Default_Ignorable_Code_Point
- Other_Grapheme_Extend
- Other_ID_Continue
- Other_ID_Start
- Other_Lowercase
- Other_Math
- Other_Uppercase
- Pattern_Syntax
- Pattern_White_Space
- Quotation_Mark
- Radical
- Soft_Dotted
- STerm
- Terminal_Punctuation
- Unified_Ideograph
- Variation_Selector
- White_Space
- Lowercase
- Uppercase
- Cased
- Case_Ignorable
- Changes_When_Lowercased
- Changes_When_Uppercased
- Changes_When_Titlecased
- Changes_When_Casefolded
- Changes_When_Casemapped
- Alphabetic
- Default_Ignorable_Code_Point
- Grapheme_Base
- Grapheme_Extend
- Grapheme_Link
- Math
- ID_Start
- ID_Continue
- XID_Start
- XID_Continue
- Unicode_Radical_Stroke (This is actually kRSUnicode from the Unihan database)
- Code point cross references extracted from NamesList.txt
NB: The UCD property ISO_Comment will never be included since this one is empty in all new Unicode versions.
Emoji properties included:
- Emoji
- Emoji_Presentation
- Emoji_Modifier
- Emoji_Modifier_Base
- Emoji_Component
- Extended_Pictographic
Properties included from the Unihan database:
- kAccountingNumeric
- kOtherNumeric
- kPrimaryNumeric
- kRSUnicode
- kDefinition
- kMandarin
- kCantonese
- kJapaneseKun
- kJapaneseOn
- kKorean
- kHangul
- kVietnamese
- kSimplifiedVariant
- kTraditionalVariant
The project UnicodeInformation.Builder takes care of generating a file named ucd.dat. This file contains Unicode data compressed by .NET's deflate algorithm, and should be included in UnicodeInformation.dll at compilation.
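To illustrate what "compressed by .NET's deflate algorithm" means in practice, here is a minimal round-trip sketch using System.IO.Compression.DeflateStream. It says nothing about the actual layout of ucd.dat, which is a custom format; the class name DeflateDemo, its methods, and the sample payload are assumptions made for the example.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class DeflateDemo
{
    // Compresses a byte array with the deflate algorithm.
    public static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            // The DeflateStream must be disposed before reading the result,
            // so that all buffered data is flushed to the output stream.
            using (var deflate = new DeflateStream(output, CompressionLevel.Optimal, leaveOpen: true))
            {
                deflate.Write(data, 0, data.Length);
            }
            return output.ToArray();
        }
    }

    // Decompresses a byte array produced by Compress.
    public static byte[] Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return output.ToArray();
        }
    }

    public static void Main()
    {
        // Hypothetical payload; the real file stores binary records.
        byte[] original = Encoding.UTF8.GetBytes("LATIN CAPITAL LETTER A;LATIN CAPITAL LETTER B");
        byte[] roundTripped = Decompress(Compress(original));
        Console.WriteLine(Encoding.UTF8.GetString(roundTripped));
    }
}
```

A reader of embedded data would typically wrap the resource stream in a DeflateStream in the same way as Decompress does, rather than loading a byte array first.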