A Python library to split a Chinese Pinyin phrase into possible permutations of Chinese Pinyin words

lstrobel/py-pinyin-split

 
 

py-pinyin-split

A Python library for splitting Hanyu Pinyin words into syllables. Built on NLTK's tokenizer interface, it handles standard syllables defined in the Pinyin Table and supports tone marks.

Originally based on pinyinsplit by @tomlee.

PyPI: https://pypi.org/project/py-pinyin-split/

Installation

pip install py-pinyin-split

Usage

Instantiate a tokenizer and split away.

The tokenizer handles standard Hanyu Pinyin, including whitespace and punctuation. Invalid pinyin syllables raise a ValueError.

When a word can be split in more than one way, the tokenizer uses some basic heuristics to pick the most likely split: the number of syllables, the presence of vowels, and syllable frequency data.
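
To make the idea of heuristic splitting concrete, here is a minimal standalone sketch. It is not the library's implementation: the tiny SYLLABLES set, the helper names (strip_tones, all_splits, pick_split), and the scoring rule are all assumptions made for illustration. The sketch enumerates every valid split, then prefers the fewest syllables and, on a tie, the split with the fewest non-initial syllables starting with a, o, or e (standard Pinyin orthography writes an apostrophe before those). The frequency data mentioned above is left out.

```python
import unicodedata

# Toy syllable inventory (assumption, for illustration only).
# The full Pinyin Table has roughly 400 syllables.
SYLLABLES = {
    "ni", "hao", "ke", "ken", "neng", "eng",
    "re", "ren", "nao", "ao", "xi", "an", "xian",
}

# Combining macron, acute, caron, grave: the four Pinyin tone marks.
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}


def strip_tones(text):
    """Remove tone marks and lowercase, keeping one character per letter."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept).lower()


def all_splits(word):
    """Return every way to cut `word` into syllables found in SYLLABLES."""
    word = unicodedata.normalize("NFC", word)
    plain = strip_tones(word)
    results = []

    def walk(start, parts):
        if start == len(plain):
            results.append(parts)
            return
        for end in range(start + 1, len(plain) + 1):
            if plain[start:end] in SYLLABLES:
                # Slice the original word so tone marks and case survive.
                walk(end, parts + [word[start:end]])

    walk(0, [])
    return results


def pick_split(word):
    """Choose one split using the hypothetical scoring rule described above."""
    candidates = all_splits(word)
    if not candidates:
        raise ValueError(f"no valid split for {word!r}")

    def score(parts):
        vowel_initial = sum(
            1 for part in parts[1:] if strip_tones(part)[0] in "aoe"
        )
        return (len(parts), vowel_initial)

    return min(candidates, key=score)


print(all_splits("kěnéng"))  # both candidate splits
print(pick_split("kěnéng"))  # the candidate preferred by the scoring rule
```

With this toy inventory, "kěnéng" has two candidate splits of equal length, and the tie-break rejects the one whose second syllable starts with a vowel.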

from py_pinyin_split import PinyinTokenizer

tokenizer = PinyinTokenizer()

# Basic splitting
tokenizer.tokenize("nǐhǎo")  # ['nǐ', 'hǎo']
tokenizer.tokenize("Běijīng")  # ['Běi', 'jīng']

# Handles whitespace and punctuation
tokenizer.tokenize("Nǐ hǎo ma?")  # ['Nǐ', 'hǎo', 'ma', '?']
tokenizer.tokenize("Wǒ hěn hǎo!")  # ['Wǒ', 'hěn', 'hǎo', '!']

# Handles ambiguous splits using heuristics
tokenizer.tokenize("kěnéng")  # ['kě', 'néng']
tokenizer.tokenize("rènào")  # ['rè', 'nào']
tokenizer.tokenize("xīan")  # ['xī', 'an']
tokenizer.tokenize("xián")  # ['xián']
tokenizer.tokenize("wǎn'ān")  # ['wǎn', "'", 'ān']

# Tone marks or punctuation help resolve ambiguity
tokenizer.tokenize("xīān")  # ['xī', 'ān']
tokenizer.tokenize("xián")  # ['xián']
tokenizer.tokenize("Xī'ān")  # ['Xī', "'", 'ān']

# Raises ValueError for invalid pinyin
tokenizer.tokenize("hello")  # ValueError

# Optional support for non-standard syllables
tokenizer = PinyinTokenizer(include_nonstandard=True)
tokenizer.tokenize("duang")  # ['duang']
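
Because invalid input raises a ValueError, batch processing usually needs to separate inputs that tokenize from inputs that do not. The helper below is a minimal sketch of that pattern. The name tokenize_all is hypothetical and not part of py-pinyin-split; it accepts any callable that raises ValueError on bad input, so it can be tried without the library installed.

```python
def tokenize_all(tokenize, phrases):
    """Tokenize each phrase, collecting failures instead of stopping.

    `tokenize` is any callable that returns a list of tokens and raises
    ValueError on invalid input (hypothetical helper, not a library API).
    Returns (results, failures): a dict of phrase -> tokens, and a list
    of the phrases that raised ValueError.
    """
    results = {}
    failures = []
    for phrase in phrases:
        try:
            results[phrase] = tokenize(phrase)
        except ValueError:
            failures.append(phrase)
    return results, failures
```

With the library installed, it could be called as tokenize_all(PinyinTokenizer().tokenize, ["nǐhǎo", "hello"]), which, going by the examples above, would place "hello" in the failures list.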

Related Projects
