jp-broom: A sophisticated Japanese text cleaner

Description

This project provides a simple-to-use text cleaner that removes most of the unnecessary characters and symbols from Japanese text while keeping the original meaning of the text intact.

Japanese text requires different cleaning techniques than conventional Western text, as it tends to include:

A mix of half-width and full-width characters
Symbols not in the western lexicon (e.g. □)
Kaomoji (i.e. Japanese emoticons such as o(ﾟ▽ﾟ)o)
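
To make the first two points concrete, here is a minimal standard-library sketch. It is not taken from jp-broom and does not claim to show how the package works internally; the function names normalize_width and strip_symbols are hypothetical. It assumes that Unicode NFKC normalization is an acceptable way to unify character widths and that "symbol" can be approximated by the Unicode category So.

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Fold full-width ASCII and half-width katakana into standard forms (NFKC)."""
    # Hypothetical helper for illustration; NFKC also rewrites other
    # compatibility characters, which may or may not be desired.
    return unicodedata.normalize("NFKC", text)


def strip_symbols(text: str) -> str:
    """Drop characters in Unicode category 'So' (other symbols), such as □."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "So")


sample = "ＡＢＣ１２３ ｶﾀｶﾅ □"
cleaned = strip_symbols(normalize_width(sample))
print(cleaned)
```

Kaomoji are harder to handle this way, since they are built from ordinary punctuation and letters rather than a single symbol class.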

The original work comes from https://github.com/ku-nlp/text-cleaning, but the code has since been heavily modified. The primary changes include:

Conversion from a script-based to a package-based structure
A shift towards suitability for NLP projects (i.e. spaCy support)
More features and a more modular design. If you don't like the main cleaning functions, you can just use the underlying helper functions.
Support for more modern versions of Python

Usage

from jp_broom.clean_text import clean_light, clean_deep, clean_deep_tokenize

mytest_text = """
ダンジョンの秘匿は罪に当たらないが200、国民の義務に違反しているということでその後も警察の監視がつくらしい……。
"""

text_clean_light = clean_light(mytest_text)
# output = "ダョ秘匿は罪に当たらなが200、国民義務に違反てるとうことでそ後警察監視がつくら……。"
text_clean_deep = clean_deep(mytest_text)
# output = "ダョ秘匿は罪に当たらなが 国民義務に違反てるとうことでそ後警察監視がつくら"
text_clean_deep_with_nlp, tokens = clean_deep_tokenize(mytest_text)
# output = "ダョ 秘匿 罪 当たる 国民 義務 違反 てる う そ 後 警察 監視 つくる"
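
When cleaning many texts at once, a small wrapper can apply one of the cleaning functions to each entry and drop anything that ends up empty. The sketch below is an illustration only: clean_corpus is a hypothetical helper that is not part of jp-broom, and it assumes the cleaner takes a string and returns a string, as clean_light and clean_deep do in the usage above. A stand-in cleaner is used so the example runs without the package installed.

```python
from collections.abc import Callable, Iterable


def clean_corpus(texts: Iterable[str], cleaner: Callable[[str], str]) -> list[str]:
    """Apply `cleaner` to each text and drop entries that end up empty."""
    # Hypothetical helper, not part of jp-broom.
    cleaned = (cleaner(text).strip() for text in texts)
    return [text for text in cleaned if text]


def collapse_whitespace(text: str) -> str:
    """Stand-in cleaner; in practice pass clean_light or clean_deep instead."""
    return " ".join(text.split())


docs = ["  警察の  監視  ", "\n", "国民の義務"]
result = clean_corpus(docs, collapse_whitespace)
print(result)
```

Because clean_deep_tokenize returns two values rather than a single string, it would need its own handling and is left out of this sketch.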

Requirements

Python 3.11+
See pyproject.toml for the required packages

MichaelVerdegaal/jp-broom
