RAG-Bench summarizes the datasets used to evaluate RAG, from document retrieval to question answering. For each dataset, we collect the SOTA results reported in recent publications and consolidate them into a coherent table. We hope this helps researchers keep up with the latest developments on each benchmark.

The datasets and their corresponding evaluation metrics are listed below.
| Task | Dataset | Pub. Year | Documents | Questions | Answers | Metrics |
|------|---------|-----------|-----------|-----------|---------|---------|
| Factoid QA | Natural Questions (NQ) | 2019 | Wikipedia | 323,045 questions, each paired with a Wikipedia page | paragraph/span | ROUGE, EM |
| | TriviaQA | 2017 | 662,659 evidence documents | 95,956 QA pairs | text string (92.85% Wikipedia titles) | EM |
| | NarrativeQA (NQA) | 2017 | 1,572 stories (books, movie scripts) & human-written summaries | 46,765 human-generated questions | human-written, short, averaging 4.73 tokens | ROUGE |
| | SQuAD | 2016 | 536 articles | 107,785 question-answer pairs | spans | EM |
| | PopQA | 2023 | Wikipedia | 14k questions | long-tail entities | EM |
| | HellaSwag | 2019 | 25k ActivityNet contexts and 45k WikiHow contexts | 70k examples | classification | Accuracy |
| | StrategyQA | 2021 | Wikipedia (1,799 Wikipedia terms) | 2,780 strategy questions | yes/no, with decomposition and evidence paragraphs | EM |
| | Fermi | 2021 | - | 928 Fermi problems (each a question Q, an answer A, supporting facts F, and an explanation P) | spans | Accuracy |
| Multi-Hop QA | 2WikiMultihopQA | 2020 | articles from Wikipedia and Wikidata | 192,606 questions, each with a context | textual spans, sentence-level supporting facts, evidence (triples) | F1 |
| | HotpotQA | 2018 | the whole Wikipedia dump | 112,779 question-answer pairs | text span | F1 |
| Long-Form QA | ELI5 | 2019 | 250 billion pages from Common Crawl | 272K questions | multiple sentences | Citation Recall, Citation Precision, Claim Recall |
| | WikiEval | 2023 | 50 Wikipedia pages | 50 questions | text spans (sentences) | RAGAS |
| | ASQA | 2022 | Wikipedia | 6,316 ambiguous factoid questions | long-form answers | Disambig-F1, ROUGE-L, EM |
| | WebGLM-QA | 2023 | - | 44,979 samples | sentences | ROUGE-L, Citation Recall, Citation Precision |
| Multiple-Choice QA | TruthfulQA | 2021 | - | 817 questions spanning 38 categories | sentence answer / multiple choice | EM |
| | MMLU | 2021 | - | 15,908 multiple-choice questions | 4-way multiple choice | Accuracy |
| | OpenBookQA | 2018 | 7,326 facts from a book | 5,957 questions | 4-way multiple choice | Accuracy |
| | QuALITY (QLTY) | 2022 | - | 6,737 questions | 4-way multiple choice | Accuracy |
| Open-Domain Summarization | WikiAsp | 2021 | Wikipedia articles from 20 domains | 320,272 samples | 1) aspect selection (section title), 2) summary generation (section paragraph) | ROUGE, F1, UniEval |
| Fact Checking | SciFact | 2020 | 5,183 abstracts | 1,409 claim-abstract pairs | 3-class classification (Support/Refute/NoInfo) | nDCG@10 |
| | FEVER | 2018 | 50,000 popular pages from Wikipedia | 185,445 claims | 3-class classification | Accuracy |
| | FEVEROUS | 2021 | Wikipedia | 87,026 claims | 3-class classification / evidence retrieval | Accuracy |
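
Several metrics in the table are shared across benchmarks. EM and token-level F1 (used by NQ, TriviaQA, SQuAD, HotpotQA, and others) are typically computed SQuAD-style, as in the sketch below; the normalization steps follow the common SQuAD evaluation convention, and the function names are our own.

```python
import re
import string
from collections import Counter


def normalize_answer(s: str) -> str:
    """Lowercase, remove punctuation/articles, and collapse whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction: str, gold: str) -> int:
    """EM: 1 if the normalized prediction equals the normalized gold answer, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower", "Eiffel Tower"))  # 1 (articles are stripped)
    print(round(token_f1("Paris, France", "Paris"), 2))     # 0.67
```

When a question has multiple gold answers (e.g., NQ, TriviaQA), the per-question score is usually the maximum over the gold answers, then averaged over questions.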
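
For retrieval-oriented scoring such as SciFact's nDCG@10, the metric compares the retriever's ranking against relevance judgments. Below is a minimal sketch assuming binary relevance labels; the function and variable names are hypothetical, not taken from any benchmark's official scorer.

```python
import math
from typing import Dict, List


def ndcg_at_k(ranking: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """nDCG@k: DCG of the system ranking divided by the ideal DCG."""
    # DCG: relevance gain at each rank, discounted by log2(rank + 1), ranks starting at 1.
    dcg = sum(
        qrels.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranking[:k])
    )
    # Ideal DCG: the same relevance labels arranged in the best possible order.
    ideal_gains = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranking = ["d3", "d7", "d1", "d9"]   # retriever output, best first
    qrels = {"d1": 1, "d3": 1}           # binary relevance judgments for one claim
    print(round(ndcg_at_k(ranking, qrels, k=10), 3))  # ~0.92
```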