A Python library for splitting Hanyu Pinyin words into syllables. Built on NLTK's tokenizer interface, it handles standard syllables defined in the Pinyin Table and supports tone marks.
Originally based on pinyinsplit by @tomlee.
PyPI: https://pypi.org/project/py-pinyin-split/
pip install py-pinyin-split
Instantiate a tokenizer and split away.
The tokenizer handles standard Hanyu Pinyin, including whitespace and punctuation. Invalid pinyin syllables raise a ValueError.
To pick the most likely split among valid alternatives, the tokenizer uses a few basic heuristics: the number of syllables, the presence of vowels, and syllable frequency data (a toy sketch of this idea follows the examples below).
from py_pinyin_split import PinyinTokenizer
tokenizer = PinyinTokenizer()
# Basic splitting
tokenizer.tokenize("nǐhǎo") # ['nǐ', 'hǎo']
tokenizer.tokenize("Běijīng") # ['Běi', 'jīng']
# Handles whitespace and punctuation
tokenizer.tokenize("Nǐ hǎo ma?") # ['Nǐ', 'hǎo', 'ma', '?']
tokenizer.tokenize("Wǒ hěn hǎo!") # ['Wǒ', 'hěn', 'hǎo', '!']
# Handles ambiguous splits using heuristics
tokenizer.tokenize("kěnéng") # ['kě', 'néng']
tokenizer.tokenize("rènào") # ['rè', 'nào']
# Tone marks or punctuation help resolve ambiguity
tokenizer.tokenize("xīan") # ['xī', 'an']
tokenizer.tokenize("xīān") # ['xī', 'ān']
tokenizer.tokenize("xián") # ['xián']
tokenizer.tokenize("Xī'ān") # ["Xī", "'", "ān"]
tokenizer.tokenize("wǎn'ān") # ["wǎn", "'", "ān"]
# Raises ValueError for invalid pinyin
tokenizer.tokenize("hello") # ValueError
# Optional support for non-standard syllables
tokenizer = PinyinTokenizer(include_nonstandard=True)
tokenizer.tokenize("duang") # ['duang']