This repo hosts the implementation of our paper: When Chosen Wisely, More Data Is What You Need: A Universal Sample-Efficient Strategy For Data Augmentation, published at ACL 2022 Findings. The codebase follows our previous work, Minimax-kNN.
We present a sample-efficient technique for incorporating augmented data into training. The key ingredient of our approach is finding worst-case examples based on a selection criterion, inspired by adversarial training.
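As a rough sketch of this idea (not the exact implementation in this repo), the snippet below scores each augmented candidate by the teacher-student divergence and keeps only the hardest ones; the KL-based criterion and function names here are illustrative assumptions.

```python
# Minimal sketch of worst-case selection over augmented candidates.
# NOTE: illustrative approximation, not the repo's exact code; the
# KL-divergence criterion mirrors the idea of keeping the augmented
# samples the student currently handles worst.
import torch
import torch.nn.functional as F

def select_worst_case(student_logits, teacher_logits, num_augments, temperature=1.0):
    """Pick the `num_augments` candidates with the largest teacher-student KL.

    student_logits, teacher_logits: (num_candidates, num_classes) tensors
    computed on the augmented candidates of a single training example.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Per-candidate KL(teacher || student): higher = harder for the student.
    kl = (p_teacher * (p_teacher.clamp_min(1e-12).log() - log_p_student)).sum(dim=-1)
    return torch.topk(kl, k=num_augments).indices  # indices of selected candidates

# Example: 8 candidates, 3 classes, keep the 2 hardest.
student = torch.randn(8, 3)
teacher = torch.randn(8, 3)
print(select_worst_case(student, teacher, num_augments=2))
```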
First, create a conda environment (Python 3.6+), and install PyTorch 1.6+. Note that the repo is tested with Python 3.8 and PyTorch 1.7 (CUDA 10.1):
conda create -n glitter python=3.8
conda activate glitter
conda install pytorch==1.7.1 cudatoolkit=10.1 -c pytorch
pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+cu101.html
To install the dependencies:
pip install -e .
Our experiments are conducted on GLUE, SQuAD, and HellaSwag. To prepare the data, simply run the following commands:
python -m glitter.preprocess.glue --dataset glue/rte --output_dir /path/to/RTE
python -m glitter.preprocess.squad --output_dir /path/to/squad
python -m glitter.preprocess.hellaswag --output_dir /path/to/HellaSwag
The data directory should contain three files: train.jsonl, dev.jsonl, and test.jsonl.
For MNLI, the dev and test sets should be named: dev_matched.jsonl, dev_mismatched.jsonl, test_matched.jsonl, and test_mismatched.jsonl.
Tab-separated .tsv files are also supported.
We use three Data Augmentation (DA) strategies:
- Back-translation: fairseq is required to generate augmented data using back-translation. The scripts do not support multiple GPUs.
CUDA_VISIBLE_DEVICES="0" python aug/backt.py --task rte --data_dir /path/to/RTE --output_dir /path/to/RTE_augmented --do_sample
CUDA_VISIBLE_DEVICES="0" python aug/qa_backt.py --data_dir /path/to/squad --output_dir /path/to/squad_augmented --do_sample
CUDA_VISIBLE_DEVICES="0" python aug/multichoice_backt.py --data_dir /path/to/HellaSwag --output_dir /path/to/HellaSwag_augmented --do_sample
- Easy Data Augmentation (Wei & Zou, EMNLP 2019): Additional dependencies to generate EDA examples are nltk and nlpaug:
python aug/eda.py --task rte --data_dir /path/to/RTE --output_dir /path/to/RTE_augmented
- Mask-and-reconstruct (an illustrative sketch follows below):
CUDA_VISIBLE_DEVICES="0,1" python aug/mlm.py --model_name_or_path bert-large-uncased-whole-word-masking --task rte --data_dir /path/to/RTE --output_dir /path/to/RTE_augmented --do_sample --topk 10
The size of the augmented data is controlled by --num_augments (default: 8) in the above commands.
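For reference, here is a minimal mask-and-reconstruct sketch using the Hugging Face fill-mask pipeline. The masking strategy (a single random token) and sampling details are assumptions for illustration; the actual aug/mlm.py script may differ.

```python
# Illustrative mask-and-reconstruct augmentation (NOT aug/mlm.py itself):
# randomly mask one token and let an MLM propose top-k replacements.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-large-uncased-whole-word-masking")

def mask_and_reconstruct(text, topk=10):
    words = text.split()
    i = random.randrange(len(words))
    masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
    # Each prediction fills the mask with a plausible token, yielding one candidate.
    return [pred["sequence"] for pred in fill_mask(masked, top_k=topk)]

print(mask_and_reconstruct("Slate had an opinion on Jackson's findings.", topk=5))
```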
The data should be stored in jsonl format, where each line corresponds to one example in JSON:
{
"guid": "8",
"label": "entailment",
"text_a": "(Read for Slate 's take on Jackson's findings.)",
"text_b": "Slate had an opinion on Jackson's findings.",
"augmented_samples":[
{
"guid":"aug-1",
"text_a": "( Read for Slate's notes on Jackson's findings. )",
"text_b": "Slate had an opinion on Jackson's findings."
},
...
]
}
Note that the above example is reformatted for better visualization.
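A minimal reader for this format might look like the following; the field names match the example above, but the helper itself is hypothetical and not part of the package.

```python
# Hypothetical helper for reading the augmented .jsonl format shown above.
import json

def read_augmented_jsonl(path):
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            # Each example carries its own list of augmented candidates.
            for aug in ex.get("augmented_samples", []):
                assert "text_a" in aug  # text_b is optional for single-sentence tasks
            examples.append(ex)
    return examples

# examples = read_augmented_jsonl("/path/to/RTE_augmented/train.jsonl")
```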
GLUE
CUDA_VISIBLE_DEVICES="0,1" python -m glitter.glue --model_name_or_path distilroberta-base \
--gradient_accumulation_steps 4 --learning_rate 3e-5 --warmup_ratio 0.08 \
--num_train_epochs 20 --train_batch_size 32 --eval_batch_size 32 \
--gpus -1 --fp16 --do_train --max_seq_length 256 --task mnli \
--patience 20 --weight_decay 0.0 --eval_interval 0.5 --overwrite_output_dir \
--data_dir /path/to/MNLI --output_dir /path/to/saved_model
The supported tasks are: cola, sst-2, mrpc, sts-b, qqp, mnli, mnli-mm, qnli, and rte.
SQuAD
CUDA_VISIBLE_DEVICES="0,1" python -m glitter.squad --model_name_or_path distilroberta-base \
--gradient_accumulation_steps 4 --learning_rate 1.5e-5 --warmup_ratio 0.06 \
--num_train_epochs 3 --train_batch_size 8 --eval_batch_size 16 \
--gpus -1 --fp16 --do_train --max_seq_length 512 \
--weight_decay 0.01 --overwrite_output_dir \
--data_dir /path/to/squad --output_dir /path/to/saved_model
HellaSwag
CUDA_VISIBLE_DEVICES="0,1" python -m glitter.mc --model_name_or_path distilroberta-base \
--gradient_accumulation_steps 1 --learning_rate 1.5e-5 --warmup_ratio 0.06 \
--adam_epsilon 1e-6 --adam_beta2 0.98 --padding longest --pad_to_multiple_of 8 \
--num_train_epochs 20 --train_batch_size 16 --eval_batch_size 16 \
--gpus -1 --fp16 --do_train --max_seq_length 512 --task hellaswag \
--weight_decay 0.01 --overwrite_output_dir --save_last \
--data_dir /path/to/HellaSwag --output_dir /path/to/saved_model
Knowledge distillation (KD) uses the same command as fine-tuning, but with an additional argument: --teacher_name_or_path /path/to/teacher_model.
To reduce the runtime, we build a cache of the teacher model's outputs before training starts.
Other options, summarized below, are also available for distillation:
| Argument | Default | Description |
|---|---|---|
| teacher_name_or_path | None | Path to a directory containing the teacher model (i.e., pytorch_model.bin). Setting this argument triggers the distillation process. |
| alpha_ce | 0.5 | Linear weight of the cross-entropy loss between teacher and student |
| alpha_true | 0.5 | Linear weight of the cross-entropy loss between true labels and predictions |
| temperature | 5.0 | Softmax temperature that determines the softness of the output probabilities |
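Putting these arguments together, the distillation objective is, roughly, a weighted sum of a soft teacher-student term and the usual supervised term. The sketch below is our reading of how the three options combine; exact details in the repo may differ.

```python
# Rough sketch of the KD objective implied by the table above
# (alpha_ce, alpha_true, temperature); not the repo's exact code.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels,
            alpha_ce=0.5, alpha_true=0.5, temperature=5.0):
    T = temperature
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    # Soft cross-entropy between teacher and student, scaled by T^2 as is common in KD.
    ce_teacher = -(soft_targets * log_probs).sum(dim=-1).mean() * (T ** 2)
    # Standard cross-entropy against the true labels.
    ce_true = F.cross_entropy(student_logits, labels)
    return alpha_ce * ce_teacher + alpha_true * ce_true
```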
An example command on a GLUE task looks like:
CUDA_VISIBLE_DEVICES="0,1" python -m glitter.glue --model_name_or_path distilroberta-base \
--gradient_accumulation_steps 4 --learning_rate 3e-5 --warmup_ratio 0.08 \
--num_train_epochs 20 --train_batch_size 32 --eval_batch_size 32 \
--gpus -1 --fp16 --do_train --max_seq_length 256 --task mnli \
--patience 20 --weight_decay 0.0 --eval_interval 0.5 --overwrite_output_dir \
--data_dir /path/to/MNLI --output_dir /path/to/saved_model \
--teacher_name_or_path /path/to/teacher/best_tfmr/ --temperature 12.0
Glitter uses the same command as KD, but with the following additional options:
| Argument | Default | Description |
|---|---|---|
| num_augments | 0 | Number of augmentations per training sample used during training |
| num_aug_candidates | 0 | Number of candidates per training sample from which augmentations are selected (works only when --num_augments > 0 and --vanilla_augment is not set) |
| alpha_aug | None | Linear weight of the cross-entropy loss on augmented samples between teacher and student (works only when --num_augments > 0 and --vanilla_augment is not set). Defaults to the same value as --alpha_ce |
| vanilla_augment | False | Ignores Glitter-DA and includes the whole augmented data in training (works only when --num_augments > 0) |
Note that --data_dir should refer to an augmented dataset, as described above.
For Glitter-DA ✨, run the following command:
CUDA_VISIBLE_DEVICES="0,1" python -m glitter.glue --model_name_or_path distilroberta-base \
--gradient_accumulation_steps 4 --learning_rate 3e-5 --warmup_ratio 0.08 \
--num_train_epochs 20 --train_batch_size 32 --eval_batch_size 32 \
--gpus -1 --fp16 --do_train --max_seq_length 256 --task mnli \
--patience 10 --weight_decay 0.0 --eval_interval 0.5 --overwrite_output_dir \
--data_dir /path/to/MNLI_augmented --output_dir /path/to/saved_model \
--teacher_name_or_path /path/to/teacher/best_tfmr/ --temperature 12.0 \
--num_augments 2 --num_aug_candidates 8
For vanilla-DA, simply add --vanilla_augment to the above command (note that in vanilla-DA, --num_aug_candidates has no effect and the size of the augmented data is controlled by --num_augments).
To enable consistency training, run the following command (works only for sequence classification tasks):
CUDA_VISIBLE_DEVICES="0,1,2,3" python -m glitter.semi --model_name_or_path roberta-base \
--gradient_accumulation_steps 4 --learning_rate 3e-5 --warmup_ratio 0.08 \
--num_train_epochs 20 --train_batch_size 16 --eval_batch_size 32 \
--gpus -1 --fp16 --do_train --max_seq_length 256 --task mnli \
--patience 10 --weight_decay 0.0 --eval_interval 0.5 --overwrite_output_dir \
--data_dir /path/to/MNLI_augmented --output_dir /path/to/saved_model \
--num_augments 2 --num_aug_candidates 8 --alpha_uda 1.0 --uda_softmax_temp 5.0
For vanilla-DA, simply add --vanilla_augment to the above command.
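Conceptually, consistency training pushes the model's prediction on an augmented example toward its sharpened prediction on the original example. The sketch below is our reading of how --alpha_uda and --uda_softmax_temp would enter such a loss; it is an assumption-laden illustration, not the exact glitter.semi code.

```python
# Hedged sketch of a UDA-style consistency term controlled by
# --alpha_uda and --uda_softmax_temp; the actual glitter.semi code may differ.
import torch
import torch.nn.functional as F

def consistency_loss(orig_logits, aug_logits, alpha_uda=1.0, uda_softmax_temp=5.0):
    # Sharpened, detached prediction on the original example acts as the target.
    with torch.no_grad():
        target = F.softmax(orig_logits / uda_softmax_temp, dim=-1)
    log_probs = F.log_softmax(aug_logits, dim=-1)
    # KL divergence between the target distribution and the prediction on the augmentation.
    return alpha_uda * F.kl_div(log_probs, target, reduction="batchmean")
```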
We recommend running the evaluation commands on a single GPU. Here is the command for evaluating on the dev set:
python -m glitter.glue --model_name_or_path /path/to/saved_model/best_tfmr \
--do_eval --eval_batch_size 32 \
--task mnli --data_dir /path/to/MNLI
When labels in the test data are known, replace --do_eval with --do_test to run evaluation on the test data. When test labels are not given, run inference by passing --do_infer instead.
If you have any questions or encounter any problems, reach out to kamalloo (at) ualberta dot ca.
This project is licensed under the Apache 2.0 license.
If you find our work useful, please cite us as:
@inproceedings{kamalloo-etal-2022-chosen,
title = "When Chosen Wisely, More Data Is What You Need: A Universal Sample-Efficient Strategy For Data Augmentation",
author = "Kamalloo, Ehsan and
Rezagholizadeh, Mehdi and
Ghodsi, Ali",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-acl.84",
doi = "10.18653/v1/2022.findings-acl.84",
pages = "1048--1062",
}