This is the code and dataset for the paper [ER-Test: Evaluating Explanation Regularization Methods for Language Models](https://aclanthology.org/2022.findings-emnlp.242), accepted at Findings of EMNLP 2022.
If you end up using this code or the data, please cite our paper:
@inproceedings{joshi-etal-2022-er,
title = "{ER}-Test: Evaluating Explanation Regularization Methods for Language Models",
author = "Joshi, Brihi and
Chan, Aaron and
Liu, Ziyi and
Nie, Shaoliang and
Sanjabi, Maziar and
Firooz, Hamed and
Ren, Xiang",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-emnlp.242",
pages = "3315--3336",
abstract = "By explaining how humans would solve a given task, human rationales can provide strong learning signal for neural language models (NLMs). Explanation regularization (ER) aims to improve NLM generalization by pushing the NLM{'}s machine rationales (Which input tokens did the NLM focus on?) to align with human rationales (Which input tokens would humans focus on?). Though prior works primarily study ER via in-distribution (ID) evaluation, out-of-distribution (OOD) generalization is often more critical in real-world scenarios, yet ER{'}s effect on OOD generalization has been underexplored. In this paper, we introduce ER-Test, a framework for evaluating ER models{'} OOD generalization along three dimensions: unseen datasets, contrast set tests, and functional tests. Using ER-Test, we comprehensively analyze how ER models{'} OOD generalization varies with the rationale alignment criterion (loss function), human rationale type (instance-level v/s task-level), number and choice of rationale-annotated instances, and time budget for rationale annotation. Across two tasks and six datasets, we show that ER has little impact on ID performance but yields large OOD performance gains, with the best ER criterion being task-dependent. Also, ER can improve OOD performance even with task-level or few human rationales. Finally, we find that rationale annotation is more time-efficient than label annotation for improving OOD performance. Our results with ER-Test help demonstrate ER{'}s utility and establish best practices for using ER effectively.",
}
- Python 3.5.x

To install the dependencies used in the code, you can use the `requirements.txt` file as follows:
pip install -r requirements.txt
Hydra will change the working directory to the path specified in `configs/hydra/default.yaml`. Therefore, if you save a file to the path `./file.txt`, it will actually be saved somewhere like `logs/runs/xxxx/file.txt`. This is helpful when you want to version control your saved files, but not if you want to save to a global directory. There are two methods to get the "actual" working directory:
- Use the `hydra.utils.get_original_cwd` function call, as in the sketch after this list.
- Use `cfg.work_dir`. To use this in the config, you can do something like `"${data_dir}/${.dataset}/${model.arch}/"`.
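As a minimal sketch of the first method (the `config_path` and `config_name` values here are illustrative, not necessarily this repo's exact entry point):

```python
import os

import hydra
from omegaconf import DictConfig


@hydra.main(config_path="configs", config_name="defaults")
def main(cfg: DictConfig) -> None:
    # By this point Hydra has already chdir'd into the run directory,
    # so a relative path resolves to something like logs/runs/xxxx/file.txt.
    run_local_path = os.path.abspath("./file.txt")

    # To save relative to the directory the script was launched from,
    # resolve against the original working directory instead.
    global_path = os.path.join(hydra.utils.get_original_cwd(), "file.txt")

    print(run_local_path)
    print(global_path)


if __name__ == "__main__":
    main()
```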
- `work_dir`: the current working directory (where `src/` is)
- `data_dir`: where the data folder is
- `log_dir`: where the log folder is (`runs` & `multirun`)
- `root_dir`: where the saved ckpt & hydra config are
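These variables are composed via OmegaConf interpolation, as in the `"${data_dir}/${.dataset}/${model.arch}/"` pattern above. A standalone sketch of how such interpolations resolve (the keys and values here are illustrative; the real ones live in `configs/`):

```python
from omegaconf import OmegaConf

# Illustrative config mirroring the interpolation pattern above.
cfg = OmegaConf.create({
    "work_dir": ".",
    "data_dir": "${work_dir}/../data",
    "model": {"arch": "google/bigbird-roberta-base"},
    "data": {
        "dataset": "sst",
        # "${.dataset}" is a relative interpolation: it resolves to the
        # sibling key "dataset" within the same "data" node.
        "path": "${data_dir}/${.dataset}/${model.arch}/",
    },
})

print(cfg.data.path)  # ./../data/sst/google/bigbird-roberta-base/
```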
In offline mode, results are not logged to Neptune.
python main.py logger.offline=True
In debug mode, results are not logged to Neptune, and we only train/evaluate for a limited number of batches and/or epochs.
python main.py debug=True
Here, we assume the following:
- The `data_dir` is `../data`, which means `data_dir=${work_dir}/../data`.
- The dataset is `sst`.
- The attribution algorithm is `input-x-gradient`.
The commands below are used to build pre-processed datasets, saved as pickle files. The model architecture is specified so that we can use the correct tokenizer for pre-processing.
python scripts/build_dataset.py --data_dir ../data \
--dataset sst --split train --arch google/bigbird-roberta-base
python scripts/build_dataset.py --data_dir ../data \
--dataset sst --split dev --arch google/bigbird-roberta-base
python scripts/build_dataset.py --data_dir ../data \
--dataset sst --split test --arch google/bigbird-roberta-base
If the dataset is very large, you have the option to subsample part of the dataset for smaller-scale experiments. For example, in the command below, we build a train set with only 1000 train examples (sampled with seed 0).
python scripts/build_dataset.py \
--data_dir ../data \
--dataset sst \
--split train \
--arch google/bigbird-roberta-base \
--num_samples 1000 \
--seed 0
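To sanity-check a pre-processed split, you can load the resulting pickle directly. The file path below is hypothetical; check `scripts/build_dataset.py` for the exact naming scheme it uses:

```python
import pickle

# Hypothetical output path; adjust to match build_dataset.py's naming.
path = "../data/sst/train.pkl"

with open(path, "rb") as f:
    dataset = pickle.load(f)

print(type(dataset))
```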
The command below is the most basic way to run `main.py` and will train the Task LM without any explanation regularization (`model=lm`).
However, since all models need to be evaluated w.r.t. explainability metrics, we need to specify an attribution algorithm for computing post-hoc explanations. This is done by setting `model.explainer_type=attr_algo` to specify that we are using an attribution-algorithm-based explainer (as opposed to `lm` or `self_lm`), `model.attr_algo` to specify the attribution algorithm, and `model.attr_pooling` to specify the attribution pooler.
python main.py -m \
data=sst \
model=lm \
model.optimizer.lr=2e-5 \
setup.train_batch_size=32 \
setup.accumulate_grad_batches=1 \
setup.eff_train_batch_size=32 \
setup.eval_batch_size=32 \
setup.num_workers=3 \
seed=0,1,2
By default, checkpoints will not be saved (i.e., `save_checkpoint=False`), so you need to set `save_checkpoint=True` if you want to save the best checkpoint.
python main.py -m \
save_checkpoint=True \
data=sst \
model=lm \
model.optimizer.lr=2e-5 \
setup.train_batch_size=32 \
setup.accumulate_grad_batches=1 \
setup.eff_train_batch_size=32 \
setup.eval_batch_size=32 \
setup.num_workers=3 \
seed=0,1,2
We can also train the Task LM with ER (`model=expl_reg`). ER can be done using pre-annotated gold rationales or human-in-the-loop feedback.
Provide gold rationales for all train instances:
python main.py -m \
save_checkpoint=True \
data=sst \
model=expl_reg \
model.attr_algo=input-x-gradient \
model.task_wt=1.0 \
model.pos_expl_wt=0.5 \
model.pos_expl_criterion=bce \
model.neg_expl_wt=0.5 \
model.neg_expl_criterion=l1 \
model.optimizer.lr=2e-5 \
setup.train_batch_size=32 \
setup.accumulate_grad_batches=1 \
setup.eff_train_batch_size=32 \
setup.eval_batch_size=32 \
setup.num_workers=3 \
seed=0,1,2
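The `model.task_wt`, `model.pos_expl_wt`, and `model.neg_expl_wt` flags above weight the components of the training objective. As a rough sketch of how such a weighted ER objective is typically assembled (an illustration under assumed tensor shapes, not this repo's exact implementation):

```python
import torch.nn.functional as F


def er_loss(task_logits, labels, attr_scores, pos_rationale,
            task_wt=1.0, pos_expl_wt=0.5, neg_expl_wt=0.5):
    """Illustrative ER objective: a weighted sum of the task loss, a BCE
    criterion aligning attributions with human-important tokens, and an
    L1 penalty pushing attributions on the remaining tokens toward zero.

    attr_scores: (batch, seq_len) token attributions, assumed in [0, 1]
    pos_rationale: (batch, seq_len) binary human rationale mask
    """
    task_loss = F.cross_entropy(task_logits, labels)

    # BCE pulls attribution mass onto human-annotated important tokens.
    pos_loss = F.binary_cross_entropy(attr_scores, pos_rationale.float())

    # L1 suppresses attribution mass on tokens humans marked unimportant.
    neg_mask = 1.0 - pos_rationale.float()
    neg_loss = (attr_scores.abs() * neg_mask).mean()

    return task_wt * task_loss + pos_expl_wt * pos_loss + neg_expl_wt * neg_loss
```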