updated readme and toml
Bikatr7 committed Feb 18, 2024
1 parent 06bb247 commit ab0801d
Showing 2 changed files with 40 additions and 12 deletions.
50 changes: 39 additions & 11 deletions README.md
@@ -10,6 +10,13 @@
- [Contact](#contact)
- [Contribution](#contribution)
- [Notes](#notes)
- [Inspirations](#inspirations)

---------------------------------------------------------------------------------------------------------------------------------------------------

## Kairyou

Quickly preprocesses Japanese text using NLP/NER from spaCy for Japanese translation or other NLP tasks.

---------------------------------------------------------------------------------------------------------------------------------------------------
**Quick Start**<a name="quick-start"></a>
@@ -31,7 +38,7 @@ Follow the usage examples provided in the [Usage](#usage) section for detailed i

**Installation**<a name="installation"></a>

Pretty sure it requires 3.8 or higher, I haven't tested it on anything lower. 3.7 might work, but I'm not sure. Feedback is welcome.
Requires Python 3.8 or higher; I haven't tested it on anything lower. Python 3.7 might work, but I'm not sure. Feedback is welcome.

Kairyou can be installed using pip:

@@ -64,21 +71,29 @@ ja_core_news_lg @ https://github.com/explosion/spacy-models/releases/download/ja

**Kairyou**<a name="kairyou"></a>

Kairyou simplifies the preprocessing of Japanese text for NLP and NER tasks. Here's an example of how to use it:
Kairyou is the global preprocessor client. Here's an example of how to use it:

```python
from kairyou import Kairyou

text = "Your Japanese text here."
replacement_json = "path/to/your/replacement_rules.json" ## or a dict of rules
preprocessed_text, log, error_log = Kairyou.preprocess(text, replacement_json)
preprocessed_text, preprocessing_log, error_log = Kairyou.preprocess(text, replacement_json)

print(preprocessed_text)
```

Kairyou is mostly just preprocess() there are other functions available, but they are not intended for direct use. The preprocess() function takes in a string of Japanese text and a JSON file or dictionary of replacement rules. It returns the preprocessed text, a log of the replacements made, and a log of any errors that occurred during the preprocessing (typically none).
Kairyou is mostly just preprocess(); other functions are available, but they are not intended for direct use. The preprocess() function takes a string of Japanese text and either a path to a JSON file or a dictionary of replacement rules. It returns the preprocessed text, a log of the replacements made, and a log of any errors that occurred during preprocessing (typically none).

Currently, Kairyou supports two JSON types, "Kudasai" and "Fukuin". "Kudasai" is the native type and originated from that program; "Fukuin" is what the original Onegai program used, as well as what kroatoan's Fukuin program uses. There are no major differences in replacement behavior between the two.

Note that rules must follow the format of the example JSON file [Blank Format JSON](examples/blank_replacements.json). You can also look at [COTE Replacements JSON](examples/cote_replacements.json) for an example of one that is filled out.
[Blank Kudasai JSON](examples/blank_kudasai.json)

[Example Kudasai JSON](examples/cote_kudasai.json)

[Blank Fukuin JSON](examples/blank_fukuin.json)

[Example Fukuin JSON](examples/cote_fukuin.json)

KatakanaUtil<a name="katakanautil"></a>
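
As a rough illustration of the kind of check `is_katakana_only` performs, the sketch below tests whether every character of a string falls in the Unicode Katakana block (U+30A0 to U+30FF). The function name `looks_katakana_only` and the choice of range are assumptions made for this example. This is not Kairyou's actual implementation, which may treat edge cases such as half-width katakana or the empty string differently.

```python
# Hypothetical stand-in for a katakana-only check; NOT Kairyou's own code.
# Assumption: "katakana" means the Unicode Katakana block, U+30A0..U+30FF,
# which includes the long-vowel mark (U+30FC) and the middle dot (U+30FB).
KATAKANA_START = "\u30a0"
KATAKANA_END = "\u30ff"


def looks_katakana_only(text: str) -> bool:
    """Return True if text is non-empty and every character is in the Katakana block."""
    return bool(text) and all(KATAKANA_START <= ch <= KATAKANA_END for ch in text)


if __name__ == "__main__":
    # Katakana, katakana with a long-vowel mark, hiragana, and mixed script.
    for sample in ["カタカナ", "コーヒー", "かたかな", "カタkana"]:
        print(sample, looks_katakana_only(sample))
```

Treating the empty string as "not katakana" is a design choice of this sketch; a caller that wants the opposite can drop the `bool(text)` guard.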

@@ -92,20 +107,23 @@ if KatakanaUtil.is_katakana_only(katakana_word):
    print(f"{katakana_word} is composed only of Katakana characters.")
```

```
The following functions are available in KatakanaUtil:

is_katakana_only: Returns True if the input string is composed only of katakana characters.

is_actual_word: Returns True if the input string is an actual Japanese Katakana word (not just something made up or a name). List of words can be found [here](src/kairyou/words.py).

is_punctuation: Returns True if the input string is punctuation (both Japanese and English punctuation are supported). List of punctuation can be found [here](src/kairyou/katakana_util.py).
```

---------------------------------------------------------------------------------------------------------------------------------------------------

**License**<a name="license"></a>

This project is licensed under the GNU General Public License v3.0 - see the [LICENSE](LICENSE.md) file for details.
This project (Kairyou) is licensed under the GNU General Public License v3.0 - see the [LICENSE](LICENSE.md) file for details.

The GPL is a copyleft license that promotes the principles of open-source software. It ensures that any derivative works based on this project must also be distributed under the same GPL license. This license grants you the freedom to use, modify, and distribute the software.

Please note that this information is a brief summary of the GPL. For a detailed understanding of your rights and obligations under this license, please refer to the [full license text](LICENSE.md).

---------------------------------------------------------------------------------------------------------------------------------------------------

@@ -127,8 +145,18 @@ Contributions are welcome! I don't have a specific format for contributions, but

**Notes**<a name="notes"></a>

Kairyou was originally developed as a part of [Kudasai](https://github.com/Bikatr7/Kudasai), a Japanese preprocessor turned Machine Translator. It was later split off into its own package to be used independently of Kudasai for multiple reasons.
Kairyou was originally developed as a part of [Kudasai](https://github.com/Bikatr7/Kudasai), a Japanese preprocessor later turned Machine Translator. It was later split off into its own package to be used independently of Kudasai for multiple reasons.

Kairyou gets its name from the Japanese word for "reform" (改良), which is pronounced "Kairyou". The name was chosen for two reasons: first, it was picked during a large Kudasai rework, and second, Kairyou is a Japanese preprocessor, so the name seemed fitting.

This package is also my first serious attempt at creating a Python package, so I'm sure there are some things that could be improved. Feedback is welcomed.

---------------------------------------------------------------------------------------------------------------------------------------------------

**Inspirations**<a name="inspirations"></a>

Kudasai, and by extension Kairyou, was originally derived from [Void's Script](https://github.com/Atreyagaurav/mtl-related-scripts), later [Onegai](https://github.com/Atreyagaurav/onegai).

Kairyou gets its name from the Japanese word "Reform" (改良) which is pronounced "Kairyou". Which was chosen for two reasons, the first being that it was chosen during a large Kudasai rework, and the second being that it is a Japanese preprocessor, and the name "Reform" seemed fitting.
Kairyou also took some inspiration from [Fukuin](https://github.com/kroatoanjp/nlp-mtl-preprocessing-script) and its approach to Katakana.

This package is also my first serious attempt at creating a Python package, so I'm sure there are some things that could be improved. Feedback is welcomed.
Thanks to all of the above for the inspiration and the work they put into their projects.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -11,7 +11,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "kairyou"
version = "v1.0.1"
version = "v1.1.0"
authors = [
{ name="Bikatr7", email="Tetralon07@gmail.com" },
]