ALCE-ELI5 is different from the original ELI5 dataset. The original paper[1] describes this dataset as follows:

ELI5[2] is a long-form QA dataset built on the Reddit forum “Explain Like I’m Five”, with an average answer length of 131 words. Most ELI5 questions are how/why/what questions that require in-depth long answers and multiple passages as evidence. Due to the diverse range of topics discussed in the questions, we use Sphere[3], a filtered version of Common Crawl, as the corpus. The ELI5 dataset is widely used in related work due to its challenging nature[4][5][6].
We randomly select 1,000 examples from the development set of each dataset for ALCE. Our benchmark primarily assesses the citation capabilities of existing LLMs and does not provide training data, as there are no available examples that provide supervision for citations in these datasets.
The ALCE-ELI5 dataset consists of two versions: BM25 and Oracle. Both versions contain 1,000 data entries with the same questions. In the BM25 version, each entry provides the top-100 passages retrieved from the Sphere corpus using BM25. The Oracle version approximates gold passages through passage reranking, yielding higher-quality passages than the BM25 version. In both versions, each passage is segmented into chunks of 100 words.
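For illustration, the 100-word segmentation convention can be reproduced in a few lines of Python. This is only a sketch of the convention described above, not the exact preprocessing code used to build the dataset:

```python
def chunk_by_words(text: str, size: int = 100) -> list[str]:
    """Split a passage into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Example: a long retrieved passage becomes one or more 100-word chunks.
chunks = chunk_by_words("Ever wondered how highly valued technology giants like Google ...")
```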
The original data only contains long answers. To facilitate evaluation, the authors used InstructGPT (text-davinci-003) to generate three sub-claims for each answer. Additionally, each data entry includes Summary and Snippet annotations generated by ChatGPT (stored in the `summary` and `extraction` fields of each document), as in the example below.
{
"question": "How are firms like snapchat, uber etc valued so highly while still not making a profit? Do venture capitalists not expect some form of repayment within a number of years?",
"question_ctx": "[removed]",
"answer": "Yes. Did you watch The Social Network? They went a while before introducing ads, so they could make money, as they needed to establish their brand and amass users. Once you have dedicated users, introducing ads won't deter most, but if you are still new, having ads will deter a lot. The same goes for Uber, it's not that they aren't making money, it's that they are reinvesting a ton of it to make their service better.",
"claims": [
"Firms like Snapchat and Uber need to establish their brand and amass users before introducing ads.",
"Introducing ads too early can deter potential users.",
"Uber is reinvesting a lot of money to make their service better."
],
"docs": [
{
"title": "Is Snapchat really worth $19 billion? - CSMonitor.com",
"text": "reporting that the Los Angeles-based company is aiming to raise $500 million at a valuation of $16 billion to $19 billion, making it the third most highly valued tech start-up backed by venture capitalists. The Chinese handset maker Xiaomi is valued at $45 billion, while Uber is estimated to be valued at about $40 billion, according to data from CB Insights. Read MoreVC investment hits $86B thanks to Uber, Xiaomi Snapchat was valued at $10 billion in August, according to a Dow Jones report. Some of its investors from previous rounds include Benchmark, Lightspeed Venture Partners and Kleiner Perkins Caufield",
"url": "https://www.csmonitor.com/Business/Latest-News-Wires/2015/0218/Is-Snapchat-really-worth-19-billion",
"summary": "Snapchat is aiming to raise $500 million with a valuation of $16 billion to $19 billion, making it the third most highly valued tech start-up backed by venture capitalists. Other highly valued companies include Xiaomi at $45 billion and Uber at about $40 billion. Snapchat was previously valued at $10 billion, and some of its investors include Benchmark, Lightspeed Venture Partners, and Kleiner Perkins Caufield. The article does not discuss whether venture capitalists expect repayment within a certain timeframe.",
"extraction": "Venture capitalists invest in startups with the expectation of a big payout down the road, either through an initial public offering or an acquisition. This means that the investors expect the company to eventually become profitable and generate returns on their investment. However, in the case of firms like Snapchat and Uber, they are valued highly because of their potential for future growth and dominance in their respective markets, rather than current profitability. Therefore, venture capitalists may not expect repayment within a specific number of years but instead expect a significant"
},
{
"title": "What Are Venture Capital Investments? \u2013 DollarsAndSense.my",
"text": "Ever wondered how highly valued technology giants like Google and Facebook were able to grow so fast and pay their employees so well in such a short amount of time, or how still growing start-ups like Uber are able to lose 1.2 billion US dollars in just the first half of this year alone and still command a valuation upwards of 50 billion US dollars? The answer lies with a special category of investment activity known as venture capital. Venture capitalists are professional investors who invest in a number of highly scalable high-risk technology ventures hoping to make a multi-fold",
"url": "http://dollarsandsense.my/what-are-venture-capital-investments/",
"summary": "Venture capitalists invest in highly scalable high-risk technology ventures, such as Snapchat and Uber, hoping to make a multi-fold return on their investment. This explains how firms can be valued highly despite not making a profit.",
"extraction": "The reason why firms like Uber can command such high valuations despite not making a profit is due to venture capital investments. Venture capitalists are professional investors who invest in high-risk technology ventures with the hopes of making a multi-fold return on their investment. Therefore, they do expect some form of repayment within a number of years but are willing to take on the risk in exchange for the potential high returns."
},
...
]
}
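To inspect an entry programmatically, the data can be loaded as plain JSON. A minimal sketch, assuming the BM25 split is stored as `eli5_eval_bm25_top100.json` under the extracted ALCE-data directory (the exact file name and path may differ in your setup):

```python
import json
from pathlib import Path

# Hypothetical location; adjust to wherever ALCE-data.tar was extracted.
data_path = Path(".rageval/datasets/ALCE-data/eli5_eval_bm25_top100.json")
data = json.loads(data_path.read_text())

example = data[0]
print(example["question"])
print(example["claims"])            # three InstructGPT-generated sub-claims
for doc in example["docs"][:2]:     # each entry carries the top-100 passages
    print(doc["title"])
    print(doc["summary"][:80])      # ChatGPT-generated Summary
    print(doc["extraction"][:80])   # ChatGPT-generated Snippet
```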
In the original paper, the authors evaluated four metrics on the ELI5 dataset: MAUVE, Claim Recall, Citation Recall, and Citation Precision.
- MAUVE[7]: MAUVE measures the similarity between two text distributions. Specifically, we concatenate the question and the model output and compare it to the distribution of question-gold-answer concatenation. We will add this metric in future work.
- Claim Recall: Claim Recall measures the correctness of long-form answers. In the original paper, the authors first used InstructGPT (text-davinci-003) to generate three "sub-claims" (based on the gold answers) and then used a state-of-the-art natural language inference (NLI) model, TRUE[8], to check whether the model output entails each sub-claim.
- Citation Recall: Citation Recall determines whether the model output is entirely supported by the cited passages.
- Citation Precision: Citation Precision detects citations that are irrelevant to the claim; it does not require citing a minimal set and permits citing redundant passages that entail similar claims (a rough sketch of these entailment-based checks is given below).
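A minimal sketch of how such entailment-based metrics can be computed, assuming a placeholder `entails(premise, hypothesis)` function that stands in for the TRUE NLI model. This illustrates the definitions above; it is not the exact ALCE/RagEval implementation:

```python
from typing import Callable, List

# `entails` is a stand-in for the TRUE NLI model: it should return True when
# the premise entails the hypothesis.

def claim_recall(output: str, sub_claims: List[str],
                 entails: Callable[[str, str], bool]) -> float:
    """Fraction of gold sub-claims entailed by the model output."""
    return sum(entails(output, c) for c in sub_claims) / len(sub_claims)

def citation_recall(statements: List[str], citations: List[List[str]],
                    entails: Callable[[str, str], bool]) -> float:
    """Fraction of statements fully supported by their cited passages."""
    supported = [bool(cites) and entails(" ".join(cites), s)
                 for s, cites in zip(statements, citations)]
    return sum(supported) / len(statements)

def citation_precision(statements: List[str], citations: List[List[str]],
                       entails: Callable[[str, str], bool]) -> float:
    """Fraction of citations judged relevant. Here a citation counts as
    irrelevant when it does not entail the statement on its own and the
    remaining citations still fully support the statement (a simplified
    reading of the ALCE definition)."""
    relevant, total = 0, 0
    for s, cites in zip(statements, citations):
        for i, c in enumerate(cites):
            total += 1
            others = " ".join(cites[:i] + cites[i + 1:])
            irrelevant = (not entails(c, s)) and bool(others) and entails(others, s)
            relevant += not irrelevant
    return relevant / total if total else 0.0
```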
Additionally, metrics such as ROUGE are included in the evaluation, but no performance comparisons were made for them.
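For reference, ROUGE-L could be computed along the following lines; the `rouge-score` package used here is an assumption for illustration, not necessarily the implementation RagEval relies on:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

gold_answer = "Yes. They went a while before introducing ads, so they could make money ..."
model_output = "These firms are valued on expected future growth rather than current profit ..."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l_f1 = scorer.score(gold_answer, model_output)["rougeL"].fmeasure
print(f"ROUGE-L F1: {rouge_l_f1:.3f}")
```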
Download and extract the dataset, and install the RagEval tool:
script_dir=$(cd $(dirname $0);pwd)
cache_dir=$(dirname $(dirname $(dirname $script_dir)))/.rageval
wget -cP $cache_dir/datasets https://huggingface.co/datasets/princeton-nlp/ALCE-data/resolve/main/ALCE-data.tar
tar -xvf $cache_dir/datasets/ALCE-data.tar -C $cache_dir/datasets
python3 setup.py install
Replace api_key with your OpenAI API key in run.sh, then run it to generate the gpt-3.5-turbo responses. The command is as follows:
python3 $script_dir/generate.py \
    --cache_path $cache_dir \
    --model gpt-3.5-turbo \
    --api_key "YOUR_API_KEY" \
    --dataset bm25 \
    --method vanilla \
    --ndoc 5 \
    --shot 2
You can also use local models for generation.
python3 $script_dir/generate.py \
    --cache_path $cache_dir \
    --model Llama-2-7b-chat-hf \
    --dataset bm25 \
    --method vanilla \
    --ndoc 5 \
    --shot 2
Arguments:
- `--cache_path`: The script automatically calculates the `cache_path`, so users generally don't need to specify it. The default path is the `.rageval` directory.
- `--model`: The model's name; currently supports OpenAI's API as well as open-source models like LLaMA.
- `--api_key`: When using OpenAI's API, the `api_key` parameter is required; otherwise, please ignore this parameter.
- `--max_length`: The maximum token count supported by the model.
- `--temperature`/`--top_p`: The inference hyperparameters for the LLM.
- `--dataset`: Version of the dataset: `bm25` or `oracle`.
- `--method`: Supports three generation methods: `vanilla`, `summary`, and `snippet`. The difference lies in the processing of the input documents.
- `--ndoc`: The number of documents provided to the LLM during generation.
- `--shot`: The number of in-context examples provided to the LLM during generation.
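For example, combining these flags, the following (illustrative) command generates answers on the oracle split with the summary method and 10 passages:

python3 $script_dir/generate.py \
    --model gpt-3.5-turbo \
    --api_key "YOUR_API_KEY" \
    --dataset oracle \
    --method summary \
    --ndoc 10 \
    --shot 2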
You can download a pre-generated result file from HuggingFace for evaluation:
python3 $script_dir/eli5_benchmark.py \
    --cache_path $cache_dir \
    --remote_split Llama_2_7b_chat_hf_vanilla_shot2_ndoc5
You can also specify a locally saved result file for evaluation:
python3 $script_dir/eli5_benchmark.py \
    --cache_path $cache_dir \
    --local_file "YOUR_LOCAL_FILE"
Arguments:
- `--cache_path`: The script automatically calculates the `cache_path`, so users generally don't need to specify it. The default path is the `.rageval` directory.
- `--remote_split`: Download a split from our HuggingFace dataset to evaluate.
- `--local_file`: Specify a locally saved result file to evaluate.
Model | Method | MAUVE | Claim Recall | Citation Recall | Citation Precision |
---|---|---|---|---|---|
Llama-2-7b-chat | vanilla(5-psg) | -- | 11.50 | 26.62 | 74.55 |
Llama-2-7b-chat | summary(5-psg) | -- | -- | -- | -- |
Llama-2-7b-chat | summary(10-psg) | -- | -- | -- | -- |
Llama-2-7b-chat | snippet(5-psg) | -- | -- | -- | -- |
Llama-2-7b-chat | snippet(10-psg) | -- | -- | -- | -- |
Model | Method | MAUVE | Claim Recall | Citation Recall | Citation Precision |
---|---|---|---|---|---|
Llama-2-7b-chat | vanilla(5-psg) | -- | 17.76 | 34.01 | 75.64 |
[1] Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023. Enabling Large Language Models to Generate Text with Citations. In Empirical Methods in Natural Language Processing (EMNLP).
[2] Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long Form Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, Florence, Italy.
[3] Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Dmytro Okhonko, Samuel Broscheit, Gautier Izacard, Patrick Lewis, Barlas Oğuz, Edouard Grave, Wen-tau Yih and Sebastian Riedel. 2022. The Web Is Your Oyster - Knowledge-Intensive NLP against a Very Large Web Corpus. arXiv preprint arXiv:2112.09924.
[4] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess and John Schulman. 2022. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
[5] Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving and Nat McAleese. 2022. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147.
[6] Nelson F Liu, Tianyi Zhang, and Percy Liang. 2023. Evaluating verifiability in generative search engines. arXiv preprint arXiv:2304.09848.
[7] Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. 2021. MAUVE: Measuring the gap between neural text and human text using divergence frontiers. In Advances in Neural Information Processing Systems.
[8] Or Honovich, Roee Aharoni, Jonathan Herzig, Hagai Taitelbaum, Doron Kukliansy, Vered Cohen, Thomas Scialom, Idan Szpektor, Avinatan Hassidim, and Yossi Matias. 2022. TRUE: Re-evaluating factual consistency evaluation. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 3905–3920.