Paragraph-level Question Generation
Get the data under the data directory.
git clone https://github.com/hongweizeng/paragraph-level-QG
cd paragraph-level-QG
mkdir data
Learning to Ask: Neural Question Generation for Reading Comprehension. ACL 2017. [Github]
Du et al., ACL 2017 (70,484 | 10,570 | 11,877): We use the original dev* set in the SQuAD dataset as our dev set, and split the original train* set into our training set and test set.
Zhao et al., EMNLP 2018 [reversed dev-test setup] ((70,484 | 11,877 | 10,570) + previously dropped samples): We use the dev* set as our test set, and split the train* set into train and dev sets randomly with a 90%-10% ratio. We keep all samples, instead of keeping only the sentence-question pairs that have at least one non-stop-word in common (with 6.7% of pairs dropped) as in (Du et al., 2017). A count sanity check is sketched after the commands below.
cd data
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
mkdir squad_split_v1
git clone https://github.com/xinyadu/nqg
cp nqg/data/raw/* squad_split_v1/
cd squad_split_v1
python convert_squad_split1_qas_id.py
cd ..
python preprocess.py -data_dir data -dataset squad_split_v1
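As a sanity check after downloading, you can count the QA pairs in the raw files directly. This is a minimal sketch (not part of this repo), assuming the standard SQuAD v1.1 JSON layout of data → paragraphs → qas:

```python
import json

def count_qa_pairs(path):
    """Count question-answer pairs in a SQuAD v1.1 JSON file."""
    with open(path) as f:
        squad = json.load(f)
    # SQuAD v1.1 layout: data -> articles -> paragraphs -> qas
    return sum(len(paragraph["qas"])
               for article in squad["data"]
               for paragraph in article["paragraphs"])

print(count_qa_pairs("data/train-v1.1.json"))  # expect 87,599 pairs in train*
print(count_qa_pairs("data/dev-v1.1.json"))    # expect 10,570 pairs in dev*
```

Note that Du et al.'s train + test counts (70,484 + 11,877 = 82,361) are smaller than the 87,599 pairs in train* because of the dropped pairs described above.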
Neural Question Generation from Text: A Preliminary Study. NLPCC 2017. [Github] [Data]
Zhou et al., NLPCC 2017 (86,635 | 8,965 | 8,964): Randomly halve the development set to construct the new development and test sets.
Zhao et al., EMNLP 2018 (? | ? | ?): Similar to (Zhou et al., 2017), we split the dev* set into dev and test sets randomly with a 50%-50% ratio. The split is done at the sentence level.
Tuan et al., AAAI 2020 (87,488 | 5,267 | 5,272): Similar to (Zhao et al., 2018), we keep the SQuAD train set and randomly split the SQuAD dev set into our dev and test sets with a 1:1 ratio. An illustrative split sketch follows the commands below.
cd data
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
mkdir squad_split_v2
wget http://res.qyzhou.me/qas_id_in_squad.zip
unzip qas_id_in_squad.zip
cp qas_id_in_squad/train.txt.id squad_split_v2/
cp qas_id_in_squad/dev.txt.shuffle.dev.id squad_split_v2/dev.txt.id
cp qas_id_in_squad/dev.txt.shuffle.test.id squad_split_v2/test.txt.id
cd ..
python preprocess.py -data_dir data -dataset squad_split_v2
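The id files copied above pin down the exact published split, so use them as-is. Purely for illustration, here is a minimal sketch of how such a 50%-50% split of dev* by QA id can be constructed (the output filenames are hypothetical and the seed is arbitrary, so this will not reproduce the published ids):

```python
import json
import random

with open("data/dev-v1.1.json") as f:
    squad = json.load(f)

# Collect every QA id in the original dev* set (SQuAD v1.1 layout).
qa_ids = [qa["id"]
          for article in squad["data"]
          for paragraph in article["paragraphs"]
          for qa in paragraph["qas"]]

random.seed(0)  # assumption: any fixed seed; the published shuffle is unknown
random.shuffle(qa_ids)
half = len(qa_ids) // 2

# Hypothetical filenames, to avoid clobbering the published id files.
with open("data/squad_split_v2/my_dev.txt.id", "w") as f:
    f.write("\n".join(qa_ids[:half]))
with open("data/squad_split_v2/my_test.txt.id", "w") as f:
    f.write("\n".join(qa_ids[half:]))
```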
NewsQA: A Machine Comprehension Dataset. Rep4NLP@ACL 2017.
Liu et al., WWW 2019 (77,538 | 4,341 | 4,383): In our experiment, we picked a subset of NewsQA where answers are top-ranked and are composed of a contiguous sequence of words within the input sentence of the document.
Tuan et al., AAAI 2020 (76,560 | 4,341 | 4,292): In our experiment, we select the questions in NewsQA where answers are sub-spans within the articles. As a result, we obtain a dataset with 76k questions for the train set, and 4k questions for each of the dev and test sets.
[Ours] (92,549 | 5,166 | 5,126, i.e. 102,841 QA pairs in total under consensus statistics). A sketch of this span-answer filtering follows the split instructions below.
Follow the README.md in https://github.com/Maluuba/newsqa to download newsqa.tar.gz, cnn_stories.tgz, and stanford-postagger-2015-12-09.zip into the maluuba/newsqa folder; then use Maluuba's tool to split the data as follows.
cd data
git clone https://github.com/Maluuba/newsqa
cd newsqa
conda create --name newsqa python=2.7 "pandas>=0.19.2"
conda activate newsqa && pip install --requirement requirements.txt
python maluuba/newsqa/data_generator.py
Then we will have train.tsv, dev.tsv, and test.tsv in the datasets/newsqa/split_data folder.
cd ..
python preprocess.py -data_dir data -dataset newsqa
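The [Ours] counts above come from keeping only questions whose answer is a contiguous span in the article. As a rough illustration only (the path, the answer_char_ranges column name, and the "None" marker are assumptions; check the actual TSV header produced by the Maluuba tool), such a filter over train.tsv might look like:

```python
import csv

kept = 0
total = 0
# Path and column names are assumptions; adjust to the actual split output.
with open("datasets/newsqa/split_data/train.tsv") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        total += 1
        ranges = row.get("answer_char_ranges", "")
        # Rows without a contiguous span answer are assumed to be marked "None".
        if ranges and ranges != "None":
            kept += 1

print(f"kept {kept} of {total} questions with span answers")
```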
Train and test the model with a specified configuration, e.g. the configs/test.yml file.
python main.py -train -test -config configs/test.yml
Or you can test with a specified configuration and checkpoint.
python main.py -test -config *.yml -test_from_model *.ckpt
[1]. https://github.com/magic282/NQG
[2]. https://github.com/seanie12/neural-question-generation
If you find this code helpful, please cite our paper:
@article{zeng-etal-2021-EANQG,
  title = {Improving Paragraph-level Question Generation with Extended Answer Network and Uncertainty-aware Beam Search},
  author = {Zeng, Hongwei and Zhi, Zhuo and Liu, Jun and Wei, Bifan},
  url = {https://github.com/hongweizeng/paragraph-level-QG},
  journal = {Information Sciences},
  year = {2021}
}