This page contains instructions for running BM25 baselines on the FEVER fact verification task.
We are going to use the repository's root directory as the working directory.
First, we need to download and extract the FEVER dataset:
mkdir collections/fever
mkdir indexes/fever
wget -P collections/fever
unzip collections/fever/ -d collections/fever
wget -P collections/fever
wget -P collections/fever
To confirm,
should have MD5 checksum of ed8bfd894a2c47045dca61f0c8dc4c07
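The checksum comparison can also be done from Python. The following is a minimal sketch using only the standard library; the helper names `md5_of_file` and `matches_expected` are our own, and the path passed in is whichever downloaded file the checksum above refers to.

```python
import hashlib

# Checksum quoted in the instructions above.
EXPECTED_MD5 = "ed8bfd894a2c47045dca61f0c8dc4c07"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the expected checksum."""
    return md5_of_file(path) == expected
```

Calling `matches_expected(path)` on the downloaded file should return `True` if the download is intact.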
Next, we want to index the Wikipedia dump (
) using Anserini. Note that this Wikipedia dump consists of Wikipedia articles' introductions only, which we will refer to as "paragraphs" from this point onward.
We will consider two variants: (1) Paragraph Indexing and (2) Sentence Indexing.
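To make the difference between the two variants concrete, here is a rough sketch of how one article record could be turned into a single paragraph document or into several sentence documents. It assumes each record carries `id`, `text`, and `lines` fields, with `lines` holding one tab-separated, numbered sentence per line. The `<id>_<n>` sentence identifier is a hypothetical scheme chosen for illustration. This is not the code of the Anserini collection classes.

```python
def paragraph_doc(record):
    """One document per article introduction (paragraph granularity)."""
    return {"id": record["id"], "contents": record["text"]}


def sentence_docs(record):
    """One document per non-empty sentence (sentence granularity).

    Assumes `record["lines"]` looks like "0\\tFirst.\\n1\\tSecond.\\n2\\t".
    The "<id>_<n>" identifier is a hypothetical naming scheme.
    """
    docs = []
    for line in record["lines"].split("\n"):
        fields = line.split("\t")
        if len(fields) < 2 or not fields[1].strip():
            continue  # skip numbered slots that carry no sentence text
        docs.append({"id": f"{record['id']}_{fields[0]}", "contents": fields[1]})
    return docs


# Hypothetical record used only to illustrate the two granularities.
example = {
    "id": "Example_Page",
    "text": "First sentence. Second sentence.",
    "lines": "0\tFirst sentence.\n1\tSecond sentence.\n2\t",
}
```

With this example, `paragraph_doc(example)` yields one document and `sentence_docs(example)` yields two, which mirrors why the sentence index ends up with far more documents than the paragraph index.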
We can index paragraphs with FeverParagraphCollection, as follows:
sh target/appassembler/bin/IndexCollection \
-collection FeverParagraphCollection -generator DefaultLuceneDocumentGenerator \
-threads 9 -input collections/fever/wiki-pages \
-index indexes/fever/lucene-index-fever-paragraph -storePositions -storeDocvectors -storeRaw
Upon completion, we should have an index with 5,396,106 documents (paragraphs).
We can index sentences with FeverSentenceCollection, as follows:
sh target/appassembler/bin/IndexCollection \
-collection FeverSentenceCollection -generator DefaultLuceneDocumentGenerator \
-threads 9 -input collections/fever/wiki-pages \
-index indexes/fever/lucene-index-fever-sentence -storePositions -storeDocvectors -storeRaw
Upon completion, we should have an index with 25,247,887 documents (sentences).
Note that while we use paragraph indexing for this section, these steps can easily be modified for sentence indexing.
Before we can retrieve with our index, we need to generate the queries and qrels files for the dev split of the FEVER dataset:
python src/main/python/fever/ \
--dataset_file collections/fever/paper_dev.jsonl \
--output_queries_file collections/fever/ \
--output_qrels_file collections/fever/ \
--granularity paragraph
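For intuition, the conversion at paragraph granularity can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code. It assumes each dataset record has `id`, `claim`, `label`, and `evidence` fields, where `evidence` is a list of evidence sets whose items end in a page title and a sentence index. It also assumes MS MARCO-style tab-separated output and that claims labelled NOT ENOUGH INFO are skipped.

```python
def queries_and_qrels(records):
    """Build query lines ("id<TAB>claim") and qrels lines ("id<TAB>0<TAB>page<TAB>1").

    Assumptions of this sketch: NOT ENOUGH INFO claims are skipped, and each
    evidence item has the form [annotation_id, evidence_id, page, sentence_idx].
    """
    queries, qrels = [], []
    for record in records:
        if record.get("label") == "NOT ENOUGH INFO":
            continue  # no evidence pages to judge against
        queries.append(f"{record['id']}\t{record['claim']}")
        pages = []
        for evidence_set in record["evidence"]:
            for _annotation_id, _evidence_id, page, _sentence_idx in evidence_set:
                # Paragraph granularity: keep each page once, ignore the sentence index.
                if page is not None and page not in pages:
                    pages.append(page)
        for page in pages:
            qrels.append(f"{record['id']}\t0\t{page}\t1")
    return queries, qrels
```

For sentence granularity, the same loop would keep the sentence index as part of the judged document identifier rather than discarding it.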
We can now perform a retrieval run:
python tools/scripts/msmarco/ \
--hits 1000 --threads 1 \
--index indexes/fever/lucene-index-fever-paragraph \
--queries collections/fever/ \
--output runs/
Note that by default, the above script uses BM25 with tuned parameters k1=0.82, b=0.68.
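As a reminder of what k1 and b control, here is a sketch of a textbook BM25 term score. The function name and argument names are our own, and Lucene's implementation may differ in constants and length encoding, so treat this as an illustration rather than the scoring code used by the script.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=0.82, b=0.68):
    """Textbook BM25 contribution of one query term to one document.

    k1 controls how quickly repeated occurrences of a term saturate;
    b controls how strongly the score is normalized by document length.
    """
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)
```

With b set to 0, document length has no effect on the score; with a larger b, documents longer than average are penalized more.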
Finally, we can evaluate the retrieved documents using the official TREC evaluation tool, trec_eval.
We first need to convert the runs and qrels files to the TREC format:
python tools/scripts/msmarco/ \
--input runs/ \
--output runs/
python tools/scripts/msmarco/ \
--input collections/fever/ \
--output collections/fever/
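The two conversions amount to a change of line layout. The sketch below assumes the input run has tab-separated `qid`, `docid`, `rank` columns and the input qrels has tab-separated `qid`, `0`, `docid`, `relevance` columns. The reciprocal-rank score and the `bm25` run tag are placeholder choices of this sketch, not necessarily what the conversion scripts write.

```python
def msmarco_run_to_trec(lines, tag="bm25"):
    """Convert "qid<TAB>docid<TAB>rank" lines to "qid Q0 docid rank score tag"."""
    converted = []
    for line in lines:
        qid, docid, rank = line.rstrip("\n").split("\t")
        # The input carries no score, so we use 1/rank as a stand-in
        # that preserves the ranking order (an assumption of this sketch).
        score = 1.0 / int(rank)
        converted.append(f"{qid} Q0 {docid} {rank} {score:.6f} {tag}")
    return converted


def msmarco_qrels_to_trec(lines):
    """Convert tab-separated qrels lines to space-separated TREC qrels lines."""
    return [" ".join(line.rstrip("\n").split("\t")) for line in lines]
```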
Then we run trec_eval:
tools/eval/trec_eval.9.0.4/trec_eval -c -m all_trec \
collections/fever/ runs/
Within the output, we should see:
recall_1000 all 0.9417
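For reference, recall at a cutoff k is the fraction of a query's relevant documents that appear in its top k results, averaged over queries. A minimal sketch, assuming the run and qrels have already been loaded into dictionaries (the function and argument names are our own):

```python
def recall_at_k(run, qrels, k=1000):
    """Mean recall@k over all queries that have at least one relevant document.

    run:   dict mapping query id -> ranked list of document ids
    qrels: dict mapping query id -> collection of relevant document ids
    Queries missing from the run count as recall 0, in the spirit of
    trec_eval's -c flag.
    """
    per_query = []
    for qid, relevant in qrels.items():
        relevant = set(relevant)
        if not relevant:
            continue
        retrieved = set(run.get(qid, [])[:k])
        per_query.append(len(retrieved & relevant) / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0
```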
We can also compare our retrieval against the TF-IDF baseline described in the FEVER paper. Specifically, we want to compare the metrics reported in Table 2 of the paper.
We evaluate the run file produced earlier:
python src/main/python/fever/ \
--truth_file collections/fever/paper_dev.jsonl \
--run_file runs/
This run produces the following results:
| k | Fully Supported | Oracle Accuracy |
|---|---|---|
| 1 | 0.3272 | 0.5515 |
| 5 | 0.5656 | 0.7104 |
| 10 | 0.6542 | 0.7695 |
| 25 | 0.7459 | 0.8306 |
| 50 | 0.8098 | 0.8732 |
| 100 | 0.8561 | 0.9041 |
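The sketch below shows one reading of the two metrics that is consistent with the numbers in the table; it is not the evaluation script itself. A verifiable claim counts as fully supported at k if at least one of its complete evidence sets is contained in the top k retrieved documents. Oracle accuracy additionally counts every NOT ENOUGH INFO claim as correct. The `evidence_sets` field (a list of sets of page ids) is a hypothetical preprocessed form of the dataset's evidence annotations.

```python
def fever_retrieval_metrics(claims, run, k):
    """Return (fully_supported, oracle_accuracy) at cutoff k.

    claims: list of dicts with "id", "label", and "evidence_sets"
            (hypothetical field: list of sets of page ids)
    run:    dict mapping claim id -> ranked list of retrieved page ids
    """
    supported = 0
    verifiable = 0
    for claim in claims:
        if claim["label"] == "NOT ENOUGH INFO":
            continue  # needs no evidence; handled in the oracle count below
        verifiable += 1
        top_k = set(run.get(claim["id"], [])[:k])
        # Fully supported: some evidence set lies entirely within the top k.
        if any(pages and pages <= top_k for pages in claim["evidence_sets"]):
            supported += 1
    not_enough_info = len(claims) - verifiable
    fully_supported = supported / verifiable if verifiable else 0.0
    oracle_accuracy = (supported + not_enough_info) / len(claims) if claims else 0.0
    return fully_supported, oracle_accuracy
```

If the three labels are evenly balanced, this reading gives oracle accuracy = (2 × fully supported + 1) / 3. For k = 1, (2 × 0.3272 + 1) / 3 ≈ 0.5515, which agrees with the table.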
The above retrieval uses the MS MARCO default BM25 parameters of k1=0.82, b=0.68. We can tune these parameters to outperform the results of the TF-IDF baseline in the paper.
We tune on a subset of the training split of the dataset. We generate that subset:
python src/main/python/fever/ \
--dataset_file collections/fever/train.jsonl \
--subset_file collections/fever/train-subset.jsonl
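The subset size and sampling strategy used by the script are not stated here. As an illustration only, a uniform random sample with a fixed seed could be drawn as follows; the function name, the `size` argument, and the seed are all assumptions of this sketch.

```python
import random


def sample_subset(lines, size, seed=0):
    """Return `size` lines sampled uniformly without replacement.

    A fixed seed keeps the subset reproducible across runs. If `size`
    is at least the number of lines, all lines are returned unchanged.
    """
    lines = list(lines)
    if size >= len(lines):
        return lines
    return random.Random(seed).sample(lines, size)
```

Tuning on a fixed subset keeps the grid search affordable, since every parameter setting requires a full retrieval run over the subset's queries.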
We then generate the queries and qrels files for this subset.
python src/main/python/fever/ \
--dataset_file collections/fever/train-subset.jsonl \
--output_queries_file collections/fever/queries.paragraph.train-subset.tsv \
--output_qrels_file collections/fever/qrels.paragraph.train-subset.tsv \
--granularity paragraph
We tune the BM25 parameters with a grid search over parameter values in 0.1 increments. We save the run files generated by this process to a new folder, runs/fever-bm25 (do not use runs):
python src/main/python/fever/ \
--runs_folder runs/fever-bm25 \
--index_folder indexes/fever/lucene-index-fever-paragraph \
--queries_file collections/fever/queries.paragraph.train-subset.tsv \
--qrels_file collections/fever/qrels.paragraph.train-subset.tsv
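The structure of such a grid search can be sketched as below. The search ranges are assumptions of this sketch (the script's actual bounds are not given here), and the helper names are our own. Scoring each setting requires a retrieval run and evaluation, which is left out.

```python
from itertools import product


def grid_values(start, stop, step=0.1):
    """Evenly spaced values from start to stop inclusive, rounded to avoid float drift."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 10) for i in range(count)]


def parameter_grid(k1_range=(0.1, 1.0), b_range=(0.1, 1.0), step=0.1):
    """All (k1, b) pairs on the grid; the default ranges are hypothetical."""
    return list(product(grid_values(*k1_range, step), grid_values(*b_range, step)))


def best_setting(scores):
    """Pick the (k1, b) pair with the highest score from a {(k1, b): score} dict."""
    return max(scores, key=scores.get)
```

Each (k1, b) pair yields one run file, which is why the runs are kept in a dedicated folder.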
From the grid search, we observe that the parameters k1=0.6, b=0.5 perform fairly well. If we retrieve on the dev set with these parameters:
python tools/scripts/msmarco/ \
--hits 1000 --threads 1 \
--index indexes/fever/lucene-index-fever-paragraph \
--queries collections/fever/ \
--output runs/ \
--k1 0.6 --b 0.5
and we evaluate this run file:
python src/main/python/fever/ \
--truth_file collections/fever/paper_dev.jsonl \
--run_file runs/
then we can achieve the following results:
| k | Fully Supported | Oracle Accuracy |
|---|---|---|
| 1 | 0.3857 | 0.5905 |
| 5 | 0.6367 | 0.7578 |
| 10 | 0.7193 | 0.8129 |
| 25 | 0.8003 | 0.8669 |
| 50 | 0.8473 | 0.8982 |
| 100 | 0.8804 | 0.9203 |
which outperforms the TF-IDF baseline in the FEVER paper at every tested value of k.
- Results replicated by @LizzyZhang-tutu on 2020-11-26 (commit