seanghay/khmernormalizer

Khmer Normalizer

A missing toolkit for Khmer Natural Language Processing.

  • Character reordering
  • Collapse duplicate whitespace
  • Remove zero-width spaces
  • Remove emojis
  • Fix common misspellings
  • Fix Unicode issues
  • Fix Khmer trailing vowels
  • URL replacement
  • Unicode normalization (NFKC)
  • Quote symbol normalization
  • Remove repeated punctuation
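
To make a few of the generic steps above concrete, here is a minimal sketch using only the Python standard library. It is not khmernormalizer's implementation: the function name `sketch_normalize`, the URL pattern, the quote table, and the punctuation set are all assumptions chosen for illustration.

```python
import re
import unicodedata

ZWSP = "\u200b"  # zero-width space

# Hypothetical quote table: curly quotes mapped to straight ones.
QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})

# Assumed punctuation set: ASCII marks plus Khmer khan (U+17D4) and camnuc pii kuuh (U+17D6).
REPEATED_PUNCT = re.compile(r"([!?.,;:\u17d4\u17d6])\1+")


def sketch_normalize(text: str, url_replacement: str = "", remove_zwsp: bool = True) -> str:
    """Illustrative only; the real library may order or implement these steps differently."""
    # Unicode compatibility normalization (NFKC).
    text = unicodedata.normalize("NFKC", text)
    # Replace anything that looks like an http(s) URL.
    text = re.sub(r"https?://\S+", url_replacement, text)
    # Optionally drop zero-width spaces.
    if remove_zwsp:
        text = text.replace(ZWSP, "")
    # Normalize quote symbols.
    text = text.translate(QUOTES)
    # Collapse a run of the same punctuation mark into one.
    text = REPEATED_PUNCT.sub(r"\1", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Calling `sketch_normalize("Hello\u200b   world!!!!")` returns `"Hello world!"`. The Khmer-specific steps (character reordering, trailing vowels, misspellings) are not covered by this sketch.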

Installation

pip install khmernormalizer

Usage

from khmernormalizer import normalize

input_str = """
តាម៖៖​សេចក្តី​រាយ​ការណ៍​​ឲ្យ​ដឹង​ថា!!!!!
https://google.com/a?x=1
កាល 😂 ពីវេលាម៉ោង    ៗ      ប្រមាណ១១យប់ថ្ងៃទី៤ 😂😂😂😂😂 ??
កាាាាត់
មិិិិិន 
មួយរយះះះះះះះ
រយះពេល
""".strip()

result = normalize(input_str,
                   emoji_replacement="",
                   remove_zwsp=True,
                   url_replacement="")
print(result)

Result:

តាម៖សេចក្តីរាយការណ៍ឱ្យដឹងថា!

កាល ពីវេលាម៉ោងៗ ប្រមាណ១១យប់ថ្ងៃទី៤?
កាត់
មិន 
មួយរយៈ
រយៈពេល
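
The last four lines of the result show two Khmer-specific fixes: repeated vowel signs are collapsed (កាាាាត់ becomes កាត់) and common misspellings are corrected (រយះ becomes រយៈ, ឲ្យ becomes ឱ្យ). The sketch below shows one way such fixes could be written. It is an assumption-laden illustration, not the library's code: the name `sketch_fix_khmer`, the code point range, and the two-entry replacement table (taken from the example above) are all hypothetical.

```python
import re

# Assumed range: Khmer dependent vowels and signs, U+17B6..U+17D1.
# The coeng (U+17D2) is deliberately excluded.
REPEATED_SIGN = re.compile(r"([\u17b6-\u17d1])\1+")

# Hypothetical misspelling table, built only from the example above.
MISSPELLINGS = {
    "\u179a\u1799\u17c7": "\u179a\u1799\u17c8",  # រយះ -> រយៈ
    "\u17b2\u17d2\u1799": "\u17b1\u17d2\u1799",  # ឲ្យ -> ឱ្យ
}


def sketch_fix_khmer(text: str) -> str:
    """Illustrative only; the real library may use different rules."""
    # Collapse a run of the same vowel or sign into a single occurrence.
    text = REPEATED_SIGN.sub(r"\1", text)
    # Apply the replacement table after collapsing, so that a misspelling
    # hidden behind repeated signs is also caught.
    for wrong, right in MISSPELLINGS.items():
        text = text.replace(wrong, right)
    return text
```

Collapsing before replacing matters for input such as មួយរយះះះះះះះ: the repeated sign is first reduced to one, and only then does the text match the table entry for រយះ.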