Refactor split_text function to handle Chinese text more effectively #67

Open: wants to merge 1 commit into main

Conversation

Glowin commented May 31, 2024

  • Added regular expression to split Chinese text into sentences based on punctuation marks.
  • Ensured that each chunk's length is as close as possible to max_chars without splitting sentences abruptly.
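
A minimal sketch of the approach described in the two points above. This is not the code from the pull request: the names `split_sentences` and `split_text_zh`, the exact punctuation set, and the greedy packing strategy are assumptions made for illustration.

```python
import re

# Assumed set of sentence-ending marks (full-width and ASCII variants).
# First alternative: optional text followed by one or more ending marks.
# Second alternative: trailing text that has no ending mark at all.
_SENTENCE = re.compile(r"[^。！？；!?;]*[。！？；!?;]+|[^。！？；!?;]+")


def split_sentences(text):
    """Return the sentences in text, each keeping its trailing punctuation."""
    return _SENTENCE.findall(text)


def split_text_zh(text, max_chars):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept intact, so that one
    chunk may exceed the limit.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        if current and len(current) + len(sentence) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks


text = "你好。今天天气很好！我们去公园吧？好的。"
chunks = split_text_zh(text, 10)
```

Packing greedily keeps each chunk close to `max_chars` while only ever cutting at a sentence boundary, and joining the chunks reproduces the input unchanged.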

p0n1 (Owner) commented Jun 28, 2024

Thanks! Will take a look at this ASAP.

Bryksin (Collaborator) commented Aug 24, 2024

@p0n1 will leave it to you, have no idea about Chinese punctuation

p0n1 (Owner) commented Aug 26, 2024

> @p0n1 will leave it to you, have no idea about Chinese punctuation

Got it. I will try it this week.

p0n1 (Owner) commented Sep 5, 2024

Good improvements to the Chinese text splitting logic. The regex-based sentence splitting is more appropriate for Chinese language processing. There might be rare edge cases (e.g., very long sentences without punctuation) that could produce unexpected results. I'll test this implementation locally for a while to check for such cases.
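
One way the edge case mentioned above (a very long run of text with no punctuation) could be handled is a hard-split fallback. The sketch below is an assumption for illustration, not part of this pull request: the function name `split_with_fallback` and the punctuation set are hypothetical.

```python
import re

# Assumed sentence pattern: text plus its ending marks, or trailing text
# with no ending mark.
_SENTENCE = re.compile(r"[^。！？；!?;]*[。！？；!?;]+|[^。！？；!?;]+")


def split_with_fallback(text, max_chars):
    """Pack sentences into chunks no longer than max_chars.

    Sentences that exceed max_chars on their own are cut into fixed-size
    pieces, so no chunk is ever longer than the limit.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks, current = [], ""
    for sentence in _SENTENCE.findall(text):
        # A sentence that fits yields one piece; an oversized one is hard-split.
        pieces = [
            sentence[i:i + max_chars]
            for i in range(0, len(sentence), max_chars)
        ]
        for piece in pieces:
            if current and len(current) + len(piece) > max_chars:
                chunks.append(current)
                current = piece
            else:
                current += piece
    if current:
        chunks.append(current)
    return chunks


long_text = "啊" * 25 + "。好。"
parts = split_with_fallback(long_text, 10)
```

The trade-off is that a hard split can land mid-word or mid-phrase, so it is only a last resort for input that offers no punctuation to cut at.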
