Commit
improve support for news articles
drupchen committed Jul 20, 2023
1 parent 5051708 commit fea72ac
Showing 1 changed file with 3 additions and 2 deletions.
botok/tokenizers/sentencetokenizer.py (5 changes: 3 additions & 2 deletions)
@@ -25,14 +25,15 @@
"བགྱི་",
"བྱ་",
"བཞུགས་",
-"འདུག",
+"འདུག་",
+"སོང་",
]
te_particles = [
"སྟེ་",
"ཏེ་",
"དེ་",
] # separated because these seem to cut long sentences
-clause_boundaries = te_particles + ["ནས་", "ན་"]
+clause_boundaries = te_particles + ["ནས་", "ན་", "ལ་"]
dagdra = ["པ་", "བ་", "པོ་", "བོ་"]

normalization_patterns = [(' <utt>', ''),
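As a rough illustration of why these two edits matter, here is a minimal sketch that assumes the lists are matched against token text by exact string comparison. That assumption, the helper `split_on_boundaries`, and the placeholder tokens `A`, `B`, `C` are all hypothetical and are not taken from botok's code. Under that assumption, an ending stored without the trailing tsek (`་`) never matches a token that carries one, and adding `"ལ་"` yields one more cut point.

```python
# Hypothetical sketch, NOT botok's implementation: it assumes boundary
# detection is an exact string match between token text and list entries.

te_particles = ["སྟེ་", "ཏེ་", "དེ་"]
old_boundaries = te_particles + ["ནས་", "ན་"]
new_boundaries = te_particles + ["ནས་", "ན་", "ལ་"]


def split_on_boundaries(tokens, boundaries):
    """Close a chunk right after every token found in `boundaries`."""
    chunks, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in boundaries:
            chunks.append(current)
            current = []
    if current:  # keep any trailing tokens as a final chunk
        chunks.append(current)
    return chunks


# Placeholder tokens (A, B, C) stand in for real words; only the
# particles are actual Tibetan strings from the diff.
tokens = ["A", "ལ་", "B", "ནས་", "C"]

old_chunks = split_on_boundaries(tokens, old_boundaries)
new_chunks = split_on_boundaries(tokens, new_boundaries)

# The tsek fix: a token ending in a tsek only matches an entry that
# also ends in a tsek.
token_text = "འདུག་"
matches_old_entry = token_text in ["འདུག"]
matches_new_entry = token_text in ["འདུག་"]
```

With the old list the placeholder sequence is cut only after `ནས་`; with the new list it is also cut after `ལ་`, giving shorter clauses.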
