Paragraph-level Question Generation
Get the data under the data directory.
git clone https://github.com/hongweizeng/paragraph-level-QG
cd paragraph-level-QG
mkdir data
Learning to Ask: Neural Question Generation for Reading Comprehension. ACL 2017. [Github]
Du et al., ACL 2017 (70,484 | 10,570 | 11,877): We use the original dev* set in the SQuAD dataset as our dev set, and split the original train* set into our training set and test set.
Zhao et al., EMNLP 2018 [reversed dev-test setup] ((70,484 | 11,877 | 10,570) + previously dropped samples): We use the dev* set as our test set, and split the train* set into train and dev sets randomly with a 90%-10% ratio. We keep all samples, instead of keeping only the sentence-question pairs that have at least one non-stop-word in common (with 6.7% of pairs dropped) as in (Du et al., 2017). A count sanity check is sketched after the commands below.
cd data
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
mkdir squad_split_v1
git clone https://github.com/xinyadu/nqg
cp nqg/data/raw/* squad_split_v1/
cd squad_split_v1
python convert_squad_split1_qas_id.py
cd ..
python preprocess.py -data_dir data -dataset squad_split_v1
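As a sanity check after downloading, you can count the QA pairs in the raw files directly. This is a minimal sketch (not part of this repo), assuming the standard SQuAD v1.1 JSON layout of data → paragraphs → qas:

```python
import json

def count_qa_pairs(path):
    """Count question-answer pairs in a SQuAD v1.1 JSON file."""
    with open(path) as f:
        squad = json.load(f)
    # SQuAD v1.1 layout: data -> articles -> paragraphs -> qas
    return sum(len(paragraph["qas"])
               for article in squad["data"]
               for paragraph in article["paragraphs"])

print(count_qa_pairs("data/train-v1.1.json"))  # expect 87,599 pairs in train*
print(count_qa_pairs("data/dev-v1.1.json"))    # expect 10,570 pairs in dev*
```

Note that Du et al.'s train + test counts (70,484 + 11,877 = 82,361) are smaller than the 87,599 pairs in train* because of the dropped pairs described above.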
Neural Question Generation from Text: A Preliminary Study. NLPCC 2017. [Github] [Data]
Zhou et al., NLPCC 2017 (86,635 | 8,965 | 8,964): Randomly halve the development set to construct the new development and test sets.
Zhao et al., EMNLP 2018 (? | ? | ?): Similar to (Zhou et al., 2017), we split the dev* set into dev and test sets randomly with a 50%-50% ratio. The split is done at the sentence level.
Tuan et al., AAAI 2020 (87,488 | 5,267 | 5,272): Similar to (Zhao et al., 2018), we keep the SQuAD train set and randomly split the SQuAD dev set into our dev and test sets with a 1:1 ratio. An illustrative split sketch follows the commands below.
cd data
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
mkdir squad_split_v2
wget http://res.qyzhou.me/qas_id_in_squad.zip
unzip qas_id_in_squad.zip
cp qas_id_in_squad/train.txt.id squad_split_v2/
cp qas_id_in_squad/dev.txt.shuffle.dev.id squad_split_v2/dev.txt.id
cp qas_id_in_squad/dev.txt.shuffle.test.id squad_split_v2/test.txt.id
cd ..
python preprocess.py -data_dir data -dataset squad_split_v2
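The id files copied above pin down the exact published split, so use them as-is. Purely for illustration, here is a minimal sketch of how such a 50%-50% split of dev* by QA id can be constructed (the output filenames are hypothetical and the seed is arbitrary, so this will not reproduce the published ids):

```python
import json
import random

with open("data/dev-v1.1.json") as f:
    squad = json.load(f)

# Collect every QA id in the original dev* set (SQuAD v1.1 layout).
qa_ids = [qa["id"]
          for article in squad["data"]
          for paragraph in article["paragraphs"]
          for qa in paragraph["qas"]]

random.seed(0)  # assumption: any fixed seed; the published shuffle is unknown
random.shuffle(qa_ids)
half = len(qa_ids) // 2

# Hypothetical filenames, to avoid clobbering the published id files.
with open("data/squad_split_v2/my_dev.txt.id", "w") as f:
    f.write("\n".join(qa_ids[:half]))
with open("data/squad_split_v2/my_test.txt.id", "w") as f:
    f.write("\n".join(qa_ids[half:]))
```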
NewsQA: A Machine Comprehension Dataset. Rep4NLP@ACL 2017.
Liu et al., WWW 2019 (77,538 | 4,341 | 4,383): In our experiment, we picked a subset of NewsQA where answers are top-ranked and are composed of a contiguous sequence of words within the input sentence of the document.
Tuan et al., AAAI 2020 (76,560 | 4,341 | 4,292): In our experiment, we select the questions in NewsQA where answers are sub-spans within the articles. As a result, we obtain a dataset with 76k questions for the train set, and 4k questions for each of the dev and test sets.
[Ours] (92,549 | 5,166 | 5,126, i.e. 102,841 QA pairs in total under consensus statistics). A sketch of this span-answer filtering follows the split instructions below.
Follow the README.md in https://github.com/Maluuba/newsqa to download newsqa.tar.gz, cnn_stories.tgz, and stanford-postagger-2015-12-09.zip into the maluuba/newsqa folder; then use Maluuba's tool to split the data as follows.
cd data
git clone https://github.com/Maluuba/newsqa
cd newsqa
conda create --name newsqa python=2.7 "pandas>=0.19.2"
conda activate newsqa && pip install --requirement requirements.txt
python maluuba/newsqa/data_generator.py
Then we will have train.tsv, dev.tsv, and test.tsv in the datasets/newsqa/split_data folder.
cd ..
python preprocess.py -data_dir data -dataset newsqa
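The [Ours] counts above come from keeping only questions whose answer is a contiguous span in the article. As a rough illustration only (the path, the answer_char_ranges column name, and the "None" marker are assumptions; check the actual TSV header produced by the Maluuba tool), such a filter over train.tsv might look like:

```python
import csv

kept = 0
total = 0
# Path and column names are assumptions; adjust to the actual split output.
with open("datasets/newsqa/split_data/train.tsv") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        total += 1
        ranges = row.get("answer_char_ranges", "")
        # Rows without a contiguous span answer are assumed to be marked "None".
        if ranges and ranges != "None":
            kept += 1

print(f"kept {kept} of {total} questions with span answers")
```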
Train and test the model with a specified configuration, e.g. the configs/test.yml file.
python main.py -train -test -config configs/test.yml
Or you can test with a specified configuration and checkpoint.
python main.py -test -config *.yml -test_from_model *.ckpt
[1]. https://github.com/magic282/NQG
[2]. https://github.com/seanie12/neural-question-generation
If you find this code helpful, please cite our paper:
@article{zeng-etal-2021-EANQG,
  title = {Improving Paragraph-level Question Generation with Extended Answer Network and Uncertainty-aware Beam Search},
  author = {Zeng, Hongwei and Zhi, Zhuo and Liu, Jun and Wei, Bifan},
  url = {https://github.com/hongweizeng/paragraph-level-QG},
  journal = {Information Sciences},
  year = {2021}
}