This repo provides guidelines for training and testing retrieval-based baselines for NeurIPS Competition on Efficient Open-domain Question Answering.
We provide two retrieval-based baselines:
- TF-IDF: TF-IDF retrieval built on fixed-length passages, adapted from the DrQA system's implementation.
- DPR: A learned dense passage retriever, detailed in Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al, 2020). Our baseline is adapted from the original implementation.
Note that for both baselines, we use text blocks of 100 words as passages and a BERT-base multi-passage reader. See more details in the DPR paper.
We provide two variants for each model, using (1) full Wikipedia (full
) and (2) A subset of Wikipedia articles which are found relevant to the questions on the train data (subset
). In particular, we think of subset
as a naive way to reduce the disk memory usage for the retrieval-based baselines.
Note: If you want to try parameter-only baselines (T5-based) for the competition, please note that implementations on T5-based Closed-book QA model is available here.
Note: If you want simple guidelines on making end-to-end QA predictions using pretrained models, please refer to this tutorial.
git clone https://github.com/facebookresearch/DPR.git # dependency
git clone https://github.com/efficientqa/retrieval-based-baselines.git # this repo
Follow DPR repo in order to download NQ data and Wikipedia DB. Specificially, after running cd DPR
and let base_dir
as your base directory to store data and pretrained models,
- Download QA pairs by
python3 data/download_data.py --resource data.retriever.qas --output_dir ${base_dir}
andpython3 data/download_data.py --resource data.retriever.nq --output_dir ${base_dir}
. - Download wikipedia DB by
python3 data/download_data.py --resource data.wikipedia_split --output_dir ${base_dir}
. - Download gold question-passage pairs by
python3 data/download_data.py --resource data.gold_passages_info --output_dir ${base_dir}
.
Optionally, if you want to try subset
variant, run cd ../retrieval-based-baselines; python3 filter_subset_wiki.py --db_path ${base_dir}/data/wikipedia_split/psgs_w100.tsv --data_path ${base_dir}/data/retriever/nq-train.json
. This script will create a new passage DB containing passages which originated articles are those paired with question on the original NQ data (78,050 unique articles; 1,642,855 unique passages).
This new DB will be stored at ${base_dir}/data/wikipedia_split/psgs_w100_subset.tsv
.
From now on, we will denote Wikipedia DBs (either full or subset) as db_path
.
Make sure to be in retrieval-based-baselines
directory to run scripts for TFIDF (largely adapted from DrQA repo).
Step 1: Run pip install -r requirements.txt
Step 2: Build Sqlite DB via:
mkdir -p {base_dir}/tfidf
python3 build_db.py ${db_path} ${base_dir}/tfidf/db.db --num-workers 60`.
Step 3: Run the following command to build TFIDF index.
python3 build_tfidf.py ${base_dir}/tfidf/db.db ${base_dir}/tfidf
It will save TF-IDF index in ${base_dir}/tfidf
Step 4: Run inference code to save retrieval results.
python3 inference_tfidf.py --qa_file ${base_dir}/data/retriever/qas/nq-{train|dev|test}.csv --db_path ${db_path} --out_file ${base_dir}/tfidf/nq-{train|dev|test}.json --tfidf_path {path_to_tfidf_index}
The resulting files, ${base_dir}/tfidf/nq-{train|dev|test}-tfidf.json
are ready to be fed into the DPR reader.
Follow DPR repo to train DPR retriever and make inference. You can follow steps until Retriever validation.
If you want to use retriever checkpoint provided by DPR, follow these three steps.
Step 1: Make sure to be in DPR
directory, and download retriever checkpoint by python3 data/download_data.py --resource checkpoint.retriever.multiset.bert-base-encoder --output_dir ${base_dir}
.
Step 2: Save passage vectors by following Generating representations. Note that you can replace ctx_file
to your own db_path
if you are trying "seen only" version. In particular, you can do
python3 generate_dense_embeddings.py --model_file ${base_dir}/checkpoint/retriever/multiset/bert-base-encoder.cp --ctx_file ${db_path} --shard_id {0-19} --num_shards 20 --out_file ${base_dir}/dpr_ctx
Step 3: Save retrieval results by following Retriever validation. In particular, you can do
mkdir -p ${base_dir}/dpr_retrieval
python3 dense_retriever.py \
--model_file ${base_dir}/checkpoint/retriever/single/nq/bert-base-encoder.cp \
--ctx_file ${dp_path} \
--qa_file ${base_dir}/data/retriever/qas/nq-{train|dev|test}.csv \
--encoded_ctx_file ${base_dir}/'dpr_ctx*' \
--out_file ${base_dir}/dpr_retrieval/nq-{train|dev|test}.json \
--n-docs 200 \
--save_or_load_index # this to save the dense index if it was built for the first time, and load it next times.
Now, ${base_dir}/dpr_retrieval/nq-{train|dev|test}.json
is ready to be fed into DPR reader.
Note: The following instruction is identical to instructions from DPR README, but we rewrite it with hyperparamters specified for our baselines.
The following instruction is for training the reader using TFIDF results, saved in ${base_dir}/tfidf/nq-{train|dev|test}-tfidf.json
. In order to use DPR retrieval results, simply replace paths to these files to ${base_dir}/dpr_retrieval/nq-{train|dev|test}.json
Step 1: Preprocess data.
python3 preprocess_reader_data.py \
--retriever_results ${base_dir}/tfidf/nq-{train|dev|test}.json \
--gold_passages ${base_dir}/data/gold_passages_info/nq_{train|dev|test}.json \
--do_lower_case \
--pretrained_model_cfg bert-base-uncased \
--encoder_model_type hf_bert \
--out_file ${base_dir}/tfidf/nq-{train|dev|test}-tfidf \
--is_train_set # specify this only when it is train data
Step 2: Train the reader.
python3 train_reader.py \
--encoder_model_type hf_bert \
--pretrained_model_cfg bert-base-uncased \
--train_file ${base_dir}/tfidf/'nq-train*.pkl' \
--dev_file ${base_dir}/tfidf/'nq-dev*.pkl' \
--output_dir ${base_dir}/checkpoints/reader_from_tfidf \
--seed 42 \
--learning_rate 1e-5 \
--eval_step 2000 \
--eval_top_docs 50 \
--warmup_steps 0 \
--sequence_length 350 \
--batch_size 16 \
--passages_per_question 24 \
--num_train_epochs 100000 \
--dev_batch_size 72 \
--passages_per_question_predict 50
Step 3: Test the reader.
python train_reader.py \
--prediction_results_file ${base_dir}/checkpoints/reader_from_tfidf/dev_predictions.json \
--eval_top_docs 10 20 40 50 80 100 \
--dev_file ${base_dir}/tfidf/`nq-dev*.pkl` \
--model_file ${base_dir}/checkpoints/reader_from_tfidf/{checkpoint file} \
--dev_batch_size 80 \
--passages_per_question_predict 100 \
--sequence_length 350
Model | Exact Mach | Disk usage (gb) |
---|---|---|
TFIDF-full | 32.0 | 20.1 |
TFIDF-subset | 31.0 | 2.8 |
DPR-full | 41.0 | 66.4 |
DPR-subset | 34.8 | 5.9 |