GitHub - seanghay/khmercut: A (fast) Khmer word segmentation toolkit.

khmercut

A (fast) Khmer word segmentation toolkit.

Implemented in a single Python file
Uses only pycrfsuite
Includes Khmer text normalization
CLI support
Multiprocessing support

pip install khmercut

Python

from khmercut import tokenize

tokenize("ឃាត់ខ្លួនជនសង្ស័យ០៤នាក់ ករណីលួចខ្សែភ្លើង នៅស្រុកព្រៃនប់")
# => ['ឃាត់ខ្លួន', 'ជនសង្ស័យ', '០៤', 'នាក់', ' ', 'ករណី', 'លួច', 'ខ្សែភ្លើង', ' ', 'នៅ', 'ស្រុក', 'ព្រៃនប់']
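The returned list keeps whitespace as separate tokens, so a common next step is to filter those out before joining. The sketch below works on the token list copied from the example output above, so it runs without khmercut installed. `join_tokens` is a hypothetical helper, not part of the khmercut API, and it does not necessarily match how the CLI's `-s` option treats whitespace.

```python
# Token list copied verbatim from the README example output above.
tokens = ['ឃាត់ខ្លួន', 'ជនសង្ស័យ', '០៤', 'នាក់', ' ', 'ករណី', 'លួច', 'ខ្សែភ្លើង', ' ', 'នៅ', 'ស្រុក', 'ព្រៃនប់']


def join_tokens(tokens, separator="|"):
    """Join tokens with a separator, skipping whitespace-only tokens.

    Hypothetical helper for illustration; not provided by khmercut.
    """
    return separator.join(tok for tok in tokens if tok.strip())


print(join_tokens(tokens))
```

With khmercut installed, the hard-coded list can be replaced by the result of `tokenize(text)`.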

CLI

Example:

khmercut large_km.txt --jobs 20 --normalize -d out/ -s "|"

Available options

usage: khmercut [-h] [-d DIRECTORY] [-s SEPARATOR] [-j JOBS] [-q] [-n] files [files ...]

A fast Khmer word segmentation toolkit.

positional arguments:
  files                 Path to text files

optional arguments:
  -h, --help            show this help message and exit
  -d DIRECTORY, --directory DIRECTORY
                        Output folder
  -s SEPARATOR, --separator SEPARATOR
                        Specify token separator
  -j JOBS, --jobs JOBS  Number of processors
  -q, --quiet           Disable progress output
  -n, --normalize       Normalize input text before processing
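To drive the CLI from a Python script, one option is to assemble the argument list from the flags documented above and hand it to `subprocess`. This is a minimal sketch: `build_khmercut_command` is a hypothetical helper, it uses only the options listed in the help text, and it builds the command without running it.

```python
import shlex


def build_khmercut_command(files, directory=None, separator=None,
                           jobs=None, quiet=False, normalize=False):
    """Build an argument list for the khmercut CLI.

    Hypothetical helper; only flags from the help text above are used.
    """
    cmd = ["khmercut"]
    if directory is not None:
        cmd += ["--directory", directory]
    if separator is not None:
        cmd += ["--separator", separator]
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]
    if quiet:
        cmd.append("--quiet")
    if normalize:
        cmd.append("--normalize")
    # Positional file arguments go last.
    cmd += list(files)
    return cmd


# Same settings as the CLI example above, written with long option names.
cmd = build_khmercut_command(
    ["large_km.txt"], directory="out/", separator="|", jobs=20, normalize=True
)
print(shlex.join(cmd))  # shell-quoted form, for logging or copy-paste
```

Once khmercut is installed, the list can be passed to `subprocess.run(cmd, check=True)`. Passing a list instead of a shell string avoids having to quote the `|` separator by hand.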

Reference

Khmer language processing toolkit

