rag-bench

RAG-Bench summarizes the datasets used to evaluate RAG, from document retrieval to question answering. For each dataset, we do our best to accumulate all SOTA results from the latest publications and collect them into a coherent table. We hope this helps researchers keep up with the latest developments on each benchmark.

Benchmarks

Here are the datasets with their corresponding metrics; a short sketch of the common EM/F1 computation follows the table.

| Task | Dataset | Pubyear | Documents | Questions | Answers | Metrics |
|---|---|---|---|---|---|---|
| Factoid QA | Natural Questions (NQ) | 2019 | Wikipedia | 323,045 questions, each paired with a Wikipedia page | paragraph/span | Rouge, EM |
| | TriviaQA | 2017 | 662,659 evidence documents | 95,956 QA pairs | text string (92.85% Wikipedia titles) | EM |
| | NarrativeQA (NQA) | 2017 | 1,572 stories (books, movie scripts) & human-generated summaries | 46,765 human-generated questions | human-written, short, averaging 4.73 tokens | Rouge |
| | SQuAD | 2016 | 536 articles | 107,785 question-answer pairs | spans | EM |
| | PopQA | 2023 | Wikipedia | 14k questions | long-tail entities | EM |
| | HellaSwag | 2019 | 25k ActivityNet contexts and 45k WikiHow contexts | 70k examples | classification | Accuracy |
| | StrategyQA | 2021 | Wikipedia (1,799 Wikipedia terms) | 2,780 strategy questions | decompositions and evidence paragraphs | EM |
| | Fermi | 2021 | - | 928 FPs (a question Q, an answer A, supporting facts F, an explanation P) | spans | Accuracy |
| Multi-Hop QA | 2WikiMultihopQA | 2020 | articles from Wikipedia and Wikidata | 192,606 questions, each with a context | textual spans, sentence-level supporting facts, evidence (triples) | F1 |
| | HotpotQA | 2018 | the whole Wikipedia dump | 112,779 question-answer pairs | text spans | F1 |
| Long-Form QA | ELI5 | 2019 | 250 billion pages from Common Crawl | 272K questions | multiple sentences | Citation Recall, Citation Precision, Claim Recall |
| | WikiEval | 2023 | 50 Wikipedia pages | 50 questions | text spans (sentences) | Ragas |
| | ASQA | 2022 | Wikipedia | 6,316 ambiguous factoid questions | long-form answers | Disambig-F1, RougeL, EM |
| | WebGLM-QA | 2023 | - | 44,979 samples | sentences | RougeL, Citation Recall, Citation Precision |
| Multiple Choice QA | TruthfulQA | 2021 | - | 817 questions spanning 38 categories | sentence answer / multiple choice | EM |
| | MMLU | 2021 | - | 15,908 multiple-choice questions | 4-way multiple choice | Accuracy |
| | OpenBookQA | 2018 | 7,326 facts from a book | 5,957 questions | 4-way multiple choice | Accuracy |
| | QuALITY (QLTY) | 2022 | - | 6,737 questions | 4-way multiple choice | Accuracy |
| Open-Domain Summarization | WikiAsp | 2021 | Wikipedia articles from 20 different domains | 320,272 samples | 1) aspect selection (section title), 2) summary generation (section paragraph) | ROUGE, F1, UniEval |
| Fact-checking | SciFact | 2020 | 5,183 abstracts | 1,409 claim-abstract pairs | 3-class classification (supports/refutes/no info) | nDCG@10 |
| | FEVER | 2018 | 50,000 popular pages from Wikipedia | 185,445 claims | 3-class classification | Accuracy |
| | FEVEROUS | 2021 | Wikipedia | 87,026 claims | 3-class classification / evidence retrieval | Accuracy |
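Many of the factoid QA benchmarks above report Exact Match (EM) and token-level F1. As a quick reference, here is a minimal sketch of how these two scores are commonly computed, using the usual SQuAD-style answer normalization (lowercasing, stripping punctuation and articles). This is an illustrative sketch, not the official scoring script of any benchmark listed.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    """EM: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over the bags of normalized tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    print(exact_match("the Eiffel Tower", "Eiffel Tower"))          # 1.0 after normalization
    print(round(token_f1("Eiffel Tower in Paris", "Eiffel Tower"), 3))  # 0.667
```

Benchmarks with multiple references (e.g. TriviaQA, SQuAD) typically take the maximum score over all reference answers per question, then average over questions.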
