
Explore other LLMs evaluation metrics #590 (Closed)

dcecchini opened this issue Jul 4, 2023 · 6 comments

dcecchini (Contributor) commented Jul 4, 2023

We can explore other evaluation metrics and approaches for LLMs. One source is this blog post.

dcecchini (Contributor, Author) commented
Another source is this new tool: FLASK

ArkajyotiChakraborty (Contributor) commented
Hey @dcecchini, I would like to contribute to this issue.

dcecchini (Contributor, Author) commented
Hi @ArkajyotiChakraborty, good! It seems that some of those approaches may be easy to implement. Let me know how it goes.

ArkajyotiChakraborty (Contributor) commented
Yeah, let me check and start around tomorrow. I'll keep you updated. Are you considering any particular deadline for this?

dcecchini (Contributor, Author) commented
I think the deadline will depend on your findings of what to implement and how complex it is. Let's analyze it and make a plan.

dcecchini added the ⏭️ Next Release label on Aug 7, 2023
ArshaanNazir added the v2.1.0 label and removed the ⏭️ Next Release label on Sep 6, 2023
alytarik (Contributor) commented
@dcecchini I added my findings to the exploration file in Teams, but here are the takeaways as well. I believe QG-QA would be nice to add, and I opened an issue for it (#817). As for the others, the Entailment Score can be used in summarization tasks, and we could make the scoring metric configurable (currently I think the default is ROUGE and it can't be changed).

Main takeaways:

- BERTScore: uses pairwise cosine similarity between reference and candidate token embeddings to compute precision, recall, and F1 (see the usage sketch below this list).
- WER: uses the edit distance between candidate and reference text, counting the insertions, deletions, and substitutions needed to transform one into the other (see the sketch below).
- MoverScore: computed by measuring how far token embeddings must "move" (an earth mover's distance over contextual embeddings) between the source and target texts.
- Entailment Score: useful for ensuring faithfulness in text-grounded generation tasks like text summarization (see the NLI sketch below).
- G-Eval: asks the model directly to score outputs between 0 and 5 on selected aspects; generally biased.
- QG-QA: uses a model to generate questions given a context, e.g. mrm8488/t5-base-finetuned-question-generation-ap; reference implementation: https://github.com/orhonovich/q-squared (see the sketch below).
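
For BERTScore, a minimal usage sketch with the public bert-score package (pip install bert-score); the example sentences are made up for illustration:

```python
# Minimal BERTScore sketch using the bert-score package.
from bert_score import score

candidates = ["The model produces an accurate summary of the document."]
references = ["The system summarizes the document accurately."]

# Returns per-pair precision, recall, and F1 tensors, computed from
# pairwise cosine similarity of contextual token embeddings.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"P={P.item():.3f} R={R.item():.3f} F1={F1.item():.3f}")
```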
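WER needs no external dependencies; here is a sketch of word-level edit distance (the function name and the normalization by reference length are my assumptions, following the common definition):

```python
def wer(reference: str, candidate: str) -> float:
    """Word Error Rate: word-level edit distance, normalized by reference length."""
    ref, cand = reference.split(), candidate.split()
    # dp[i][j] = edits to turn the first i reference words into the first j candidate words
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(cand) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(cand) + 1):
            sub = 0 if ref[i - 1] == cand[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```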
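For the Entailment Score, one way to sketch it is with an off-the-shelf NLI model through the transformers pipeline; roberta-large-mnli is an assumed model choice here, not something fixed by this issue:

```python
from transformers import pipeline

# Any MNLI-style model works here; roberta-large-mnli is just one choice.
nli = pipeline("text-classification", model="roberta-large-mnli")

source = "The company reported a 12% revenue increase in Q2."   # premise (grounding text)
summary = "Revenue grew by 12% in the second quarter."          # hypothesis (generated text)

# The pipeline accepts a premise/hypothesis pair via text / text_pair.
result = nli({"text": source, "text_pair": summary})
print(result)  # e.g. {'label': 'ENTAILMENT', 'score': 0.9...}
```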
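And a sketch of the question-generation half of QG-QA with the model named above; the "answer: ... context: ..." prompt format follows that model's card, and a full Q² pipeline would then answer the generated questions against the source and compare the answers:

```python
from transformers import pipeline

qg = pipeline("text2text-generation",
              model="mrm8488/t5-base-finetuned-question-generation-ap")

# Example context/answer pair (made up for illustration).
context = "Manuel founded HuggingNLP, a company that develops NLP tools."
answer = "HuggingNLP"

prompt = f"answer: {answer}  context: {context}"
print(qg(prompt)[0]["generated_text"])  # e.g. "question: What did Manuel found?"
```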

ArshaanNazir removed the v2.1.0 label on Oct 17, 2023