
Explore other LLMs evaluation metrics #590 (Closed)

dcecchini opened this issue Jul 4, 2023 · 6 comments

dcecchini (Contributor) commented Jul 4, 2023

We can explore other evaluation metrics and approaches for LLMs. One source is this blog post.

dcecchini (Contributor, Author) commented
Another source is this new tool: FLASK

ArkajyotiChakraborty (Contributor) commented
Hey @dcecchini, I would like to contribute to this issue.

dcecchini (Contributor, Author) commented
Hi @ArkajyotiChakraborty, good! It seems that some of those approaches may be easy to implement. Let me know how it goes.

ArkajyotiChakraborty (Contributor) commented
Yeah, let me check and start around tomorrow. I'll keep you updated. Are you considering any particular deadline for this?

dcecchini (Contributor, Author) commented
I think the deadline will depend on your findings of what to implement and how complex it is. Let's analyze it and make a plan.

dcecchini added the ⏭️ Next Release label on Aug 7, 2023
ArshaanNazir added the v2.1.0 label and removed the ⏭️ Next Release label on Sep 6, 2023
alytarik (Contributor) commented
@dcecchini I added my findings to the exploration file in Teams, but here are the takeaways as well. I believe QG-QA would be nice to add, and I opened an issue for it (#817). As for the others, the Entailment Score can be used in summarization tasks, and we could make the scoring metric configurable (currently I think the default is ROUGE and it can't be changed).

Main takeaways:

- BERTScore: uses pairwise cosine similarity between reference and candidate token embeddings to compute precision, recall, and F1 (see the usage sketch below this list).
- WER: uses the edit distance between candidate and reference text, counting the insertions, deletions, and substitutions needed to transform one into the other (see the sketch below).
- MoverScore: computed by measuring how far token embeddings must "move" (an earth mover's distance over contextual embeddings) between the source and target texts.
- Entailment Score: useful for ensuring faithfulness in text-grounded generation tasks like text summarization (see the NLI sketch below).
- G-Eval: asks the model directly to score outputs between 0 and 5 on selected aspects; generally biased.
- QG-QA: uses a model to generate questions given a context, e.g. mrm8488/t5-base-finetuned-question-generation-ap; reference implementation: https://github.com/orhonovich/q-squared (see the sketch below).
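
For BERTScore, a minimal usage sketch with the public bert-score package (pip install bert-score); the example sentences are made up for illustration:

```python
# Minimal BERTScore sketch using the bert-score package.
from bert_score import score

candidates = ["The model produces an accurate summary of the document."]
references = ["The system summarizes the document accurately."]

# Returns per-pair precision, recall, and F1 tensors, computed from
# pairwise cosine similarity of contextual token embeddings.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"P={P.item():.3f} R={R.item():.3f} F1={F1.item():.3f}")
```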
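WER needs no external dependencies; here is a sketch of word-level edit distance (the function name and the normalization by reference length are my assumptions, following the common definition):

```python
def wer(reference: str, candidate: str) -> float:
    """Word Error Rate: word-level edit distance, normalized by reference length."""
    ref, cand = reference.split(), candidate.split()
    # dp[i][j] = edits to turn the first i reference words into the first j candidate words
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(cand) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(cand) + 1):
            sub = 0 if ref[i - 1] == cand[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```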
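For the Entailment Score, one way to sketch it is with an off-the-shelf NLI model through the transformers pipeline; roberta-large-mnli is an assumed model choice here, not something fixed by this issue:

```python
from transformers import pipeline

# Any MNLI-style model works here; roberta-large-mnli is just one choice.
nli = pipeline("text-classification", model="roberta-large-mnli")

source = "The company reported a 12% revenue increase in Q2."   # premise (grounding text)
summary = "Revenue grew by 12% in the second quarter."          # hypothesis (generated text)

# The pipeline accepts a premise/hypothesis pair via text / text_pair.
result = nli({"text": source, "text_pair": summary})
print(result)  # e.g. {'label': 'ENTAILMENT', 'score': 0.9...}
```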
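And a sketch of the question-generation half of QG-QA with the model named above; the "answer: ... context: ..." prompt format follows that model's card, and a full Q² pipeline would then answer the generated questions against the source and compare the answers:

```python
from transformers import pipeline

qg = pipeline("text2text-generation",
              model="mrm8488/t5-base-finetuned-question-generation-ap")

# Example context/answer pair (made up for illustration).
context = "Manuel founded HuggingNLP, a company that develops NLP tools."
answer = "HuggingNLP"

prompt = f"answer: {answer}  context: {context}"
print(qg(prompt)[0]["generated_text"])  # e.g. "question: What did Manuel found?"
```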

ArshaanNazir removed the v2.1.0 label on Oct 17, 2023