Language Not Supported #1783

Open
yusufsyaifudin opened this issue Dec 22, 2024 · 2 comments
Labels
question Further information is requested

Comments

@yusufsyaifudin

[x] I checked the documentation and related resources and couldn't find an answer to my question.

Your Question
I want to use RAGAS as my RAG evaluation framework, but I cannot find any supported languages other than those in RAGAS_SUPPORTED_LANGUAGE_CODES on this line: https://github.com/explodinggradients/ragas/blob/v0.2.8/src/ragas/metrics/base.py#L707

After tracing the code, I found that it comes from here:

pySBD's last commit was 3 years ago, which also makes me wonder: why is that library preferred?

My ultimate question is: how do I add support for a language that is not supported by pySBD (and therefore not by RAGAS)?
The list seems too limited; not a single language from Southeast Asia is supported.

Additional context
If I can extend language support to languages that RAGAS does not support "natively", where can I find an example of how to create a language adapter?

Thank you!

yusufsyaifudin added the "question" label (Further information is requested) on Dec 22, 2024
@jjmachan
Member

jjmachan commented Jan 7, 2025

hey @yusufsyaifudin thanks for sharing this - which language are you planning to use? other than pySBD, which other tools do you work with that have the support you mentioned?

@shahules786 should be able to provide you better information too

@yusufsyaifudin
Author

Thanks @jjmachan for your reply

which language are you planning to use?

I work with Bahasa Indonesia and have tried running RAGAS with the default settings (which I assume are English) on three proprietary models: claude-3-haiku-20240307, claude-3-5-haiku-20241022 and claude-3-5-sonnet-20241022.

claude-3-haiku-20240307 always returns a faithfulness score of 1.0 (I only tested with two data points), though I can confirm it should be close to 0. The other two models return 0.0. At this point I am starting to think that maybe the older Haiku version is just "bad" at reasoning.

But I think that using the same language in the prompt as in the test data (Bahasa Indonesia in my case) would probably give better reasoning.

other than pySBD, which other tools do you work with that have the support you mentioned?

Actually, I don't know of any alternative. Maybe we are still at the stage where no single package supports every language for sentence boundary detection.

But, imho, if we could create some "abstraction" over sentence segmentation and prompts, we could achieve multi-language support easily. Probably using nltk or another package.

For example, in my project I use https://github.com/yusufsyaifudin/id-sentence-segmenter, which is a forked version of https://yudanta.github.io/posts/indonesian-simple-sentence-segmentation/ (which is part of his thesis work: https://etd.repository.ugm.ac.id/penelitian/detail/103174).
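
To make the idea concrete, here is a minimal sketch of what such a segmentation abstraction might look like. This is not RAGAS or pySBD code: the names `Segmenter`, `register_segmenter`, `segment`, `naive_segmenter`, and `make_abbreviation_aware` are all hypothetical, and the abbreviation-based splitter is only a toy stand-in for a real language-specific segmenter.

```python
import re
from typing import Callable, Dict, Iterable, List

# Hypothetical type: a segmenter maps raw text to a list of sentences.
Segmenter = Callable[[str], List[str]]

# Hypothetical registry keyed by lowercase language code (e.g. "id", "en").
_SEGMENTERS: Dict[str, Segmenter] = {}


def register_segmenter(lang: str, segmenter: Segmenter) -> None:
    """Register a sentence segmenter for a language code."""
    _SEGMENTERS[lang.lower()] = segmenter


def naive_segmenter(text: str) -> List[str]:
    """Fallback: split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def make_abbreviation_aware(abbreviations: Iterable[str]) -> Segmenter:
    """Build a toy segmenter that does not split after known abbreviations."""
    abbrevs = {a.lower() for a in abbreviations}

    def _segment(text: str) -> List[str]:
        sentences: List[str] = []
        current: List[str] = []
        for token in text.split():
            current.append(token)
            # End a sentence on terminal punctuation unless the token
            # is a known abbreviation such as "Jl." or "Dr.".
            if token[-1] in ".!?" and token.lower() not in abbrevs:
                sentences.append(" ".join(current))
                current = []
        if current:
            sentences.append(" ".join(current))
        return sentences

    return _segment


def segment(text: str, lang: str) -> List[str]:
    """Dispatch to the registered segmenter, or fall back to the naive one."""
    return _SEGMENTERS.get(lang.lower(), naive_segmenter)(text)


# Assumed, illustrative abbreviation list for Bahasa Indonesia.
register_segmenter("id", make_abbreviation_aware(["Jl.", "Dr.", "Prof."]))

print(segment("Saya tinggal di Jl. Sudirman. Dr. Budi bekerja di sana.", "id"))
# ['Saya tinggal di Jl. Sudirman.', 'Dr. Budi bekerja di sana.']
```

With a registry like this, a language-specific package could be plugged in per language code, while unsupported languages fall back to a simple default instead of raising an error.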

If I want to extend this or create an abstraction for it, which file and line of code should I read as a starting point? Maybe @shahules786 can help point this out.

🙇
