add ཞིང་ as clause boundary
drupchen committed Jul 27, 2023
1 parent fea72ac commit 90bf27e
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion botok/tokenizers/sentencetokenizer.py
@@ -33,7 +33,7 @@
"ཏེ་",
"དེ་",
] # separated because these seem to cut long sentences
-clause_boundaries = te_particles + ["ནས་", "ན་", "ལ་"]
+clause_boundaries = te_particles + ["ནས་", "ན་", "ལ་", "ཞིང་"]
dagdra = ["པ་", "བ་", "པོ་", "བོ་"]

normalization_patterns = [(' <utt>', ''),
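
To illustrate what adding a particle to `clause_boundaries` can change, here is a minimal sketch. It is not botok's actual segmentation logic: `split_after_boundaries` is a hypothetical helper, the tokens are Latin placeholders, and `te_particles` holds only the two entries visible in the hunk above. It assumes a clause ends right after any token found in the boundary list.

```python
# Hypothetical sketch, NOT botok's implementation.
# Only the two te-particles visible in the diff hunk are listed here;
# the real list in sentencetokenizer.py has entries above the hunk.
te_particles = ["ཏེ་", "དེ་"]

old_boundaries = te_particles + ["ནས་", "ན་", "ལ་"]
new_boundaries = old_boundaries + ["ཞིང་"]


def split_after_boundaries(tokens, boundaries):
    """Split a list of token strings after each token listed in `boundaries`."""
    boundary_set = set(boundaries)
    clauses, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in boundary_set:
            # Assumed behaviour: the boundary particle closes the clause.
            clauses.append(current)
            current = []
    if current:
        # Keep any trailing tokens that follow the last boundary.
        clauses.append(current)
    return clauses


# Placeholder tokens "A" and "B" stand in for real words.
tokens = ["A", new_boundaries[-1], "B"]
print(split_after_boundaries(tokens, old_boundaries))  # one clause
print(split_after_boundaries(tokens, new_boundaries))  # two clauses
```

Under these assumptions, the same token sequence stays in one piece with the old list and is cut in two once `ཞིང་` is treated as a boundary.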