This is the repository for the paper: Analyzing the Effectiveness of the Underlying Reasoning Tasks in Multi-hop Question Answering - EACL 2023 (Findings).
We use two datasets in our experiments: 2WikiMultihopQA and HotpotQA-small. This repository provides:
- Preprocessed data (.gz files) for the dev and train sets of 2Wiki (please download the raw data from the GitHub repository of the 2WikiMultihopQA dataset)
- Raw and preprocessed data for the dev and train sets of HotpotQA-small
- Debiased data
- Adversarial data
We follow the steps in https://github.com/yuwfan/HGN to obtain the .gz data files from the raw data.
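If you only want to inspect these preprocessed files, the minimal sketch below shows one way to open them. It assumes the .gz files are gzipped pickles, as in the HGN pipeline, and the file path is hypothetical.

```python
# Minimal sketch (assumption): load a preprocessed .gz file, treating it as a
# gzipped pickle as produced by HGN-style preprocessing.
import gzip
import pickle

def load_gz_pickle(path):
    """Return the object stored in a gzipped pickle file."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

if __name__ == "__main__":
    examples = load_gz_pickle("data/2wiki/dev_example.pkl.gz")  # hypothetical path
    print(type(examples))
```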
bash install_packages.sh
- Download the bigbird-roberta-base model from https://huggingface.co/google/bigbird-roberta-base (a download sketch follows the run commands below)
- Edit the variables data_dir, pretrained_model_dir, and data_file
- Run:
python3 preprocess.py
python3 main.py
python3 predictor.py $checkpoint $data_file
python3 postprocess.py $prediction_file $processed_data_file $original_data_file
python3 official_evaluation.py path/to/prediction path/to/gold
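For the model-download step above, one option is to fetch and save the model with the transformers library. This is a sketch, not the repository's required procedure; the save directory is an assumption and should match pretrained_model_dir.

```python
# Sketch: download google/bigbird-roberta-base via transformers and save it
# locally so that pretrained_model_dir can point to the directory.
from transformers import AutoModel, AutoTokenizer

model_name = "google/bigbird-roberta-base"
save_dir = "pretrained_models/bigbird-roberta-base"  # hypothetical directory

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

tokenizer.save_pretrained(save_dir)
model.save_pretrained(save_dir)
```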
- Download our checkpoints
- Run the script:
predict_dev_all_settings.sh
(Note: if you want to use this script for the test set of 2Wiki, comment out line #25, which runs the evaluation)
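The official evaluation step reports exact match and F1 over answer strings. The sketch below illustrates token-level EM/F1 with the usual HotpotQA-style normalization; it is an illustration of the metrics, not the code in official_evaluation.py.

```python
# Illustrative exact match and token-level F1, assuming HotpotQA-style
# normalization (lowercase, strip punctuation and articles). Not the official script.
import re
import string
from collections import Counter

def normalize_answer(s):
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    return float(normalize_answer(prediction) == normalize_answer(gold))

def f1_score(prediction, gold):
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris."))                   # 1.0
print(round(f1_score("in Paris, France", "Paris"), 2))  # 0.5
```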
- Our data preprocessing is based on HGN.
- We reuse the Example class from the HGN model and update it to work with our datasets (a simplified sketch follows).
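For orientation only, here is a heavily simplified, hypothetical sketch of what such an example container can hold for multi-hop QA with supporting facts and evidence triples; the field names are assumptions and do not reproduce HGN's actual class.

```python
# Hypothetical, simplified example container in the spirit of HGN's Example class.
# Field names are assumptions, not HGN's actual API.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Example:
    qas_id: str              # unique question id
    question_text: str       # the multi-hop question
    doc_tokens: List[str]    # tokens of the concatenated context
    answer_text: str         # gold answer string
    sup_fact_ids: List[Tuple[str, int]] = field(default_factory=list)           # (title, sentence idx)
    evidence_triples: List[Tuple[str, str, str]] = field(default_factory=list)  # (subject, relation, object), 2Wiki only
```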