How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives
Initial code release for the ACL 2023 paper *How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives* by Xinpeng Wang, Leonie Weissweiler, Hinrich Schütze, and Barbara Plank.
We build on the Fairseq framework for task-specific distillation of the RoBERTa model. Run `task_specific_distillation/experiments.py` to launch the distillation:

```bash
python experiments.py --task {task} --method {method} -e {experiment} -s {stage} --mapping {mapping} --init {init} --group {group} --seeds {seeds}
```
- `task`: mnli, qnli, sst-2, cola, mrpc, qqp, rte
- `method`: kd, hidden_mse_learn, hidden_mse_token, crd, att_kl_learn, att_mse_learn
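For reference, the `kd` objective is standard soft-target distillation on the output logits. Below is a minimal PyTorch sketch of that loss (temperature-scaled KL between teacher and student predictions); it illustrates the idea and is not necessarily the exact implementation in this repo.

```python
import torch.nn.functional as F

def soft_target_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Hinton-style soft-target KD: KL divergence between temperature-softened
    teacher and student output distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t
```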
The task-agnostic distillation code is based on the work of Izsak et al. (2021). The `dataset` directory includes scripts to pre-process the datasets used in our experiments (Wikipedia, BookCorpus); see its dedicated README for full details.
Run `task_agnostic_distillation/experiments.py` to distill a Transformer model from BERT-large during the pre-training stage:

```bash
python -m torch.distributed.launch run_pretraining.py --method {distillation_objective} --student_initialize ...
```
See `task_agnostic_distillation/README.md` for a complete bash example and a detailed explanation of all training configuration options.
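As an illustration of the hidden-state objectives (e.g. `hidden_mse_learn`), here is a minimal sketch of MSE distillation on hidden states with a learned linear projection from the student's hidden size to the teacher's. The layer pairing, projection, and weighting used in the actual code may differ; `student_dim` and `teacher_dim` are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiddenStateMSE(nn.Module):
    """MSE between (projected) student hidden states and teacher hidden states.

    Illustrative sketch: compares one student layer against one teacher layer;
    in the repo, which layers are paired is controlled by the mapping option.
    """

    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        # Learned projection bridging the student/teacher hidden-size mismatch.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden):
        # student_hidden: (batch, seq_len, student_dim)
        # teacher_hidden: (batch, seq_len, teacher_dim)
        return F.mse_loss(self.proj(student_hidden), teacher_hidden)
```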
Run `task_agnostic_distillation/run_glue.py` to fine-tune a saved checkpoint on GLUE tasks. Example:

```bash
python run_glue.py \
--model_name_or_path <path to model> \
--task_name MRPC \
--max_seq_length 128 \
--output_dir /tmp/finetuning \
--overwrite_output_dir \
--do_train --do_eval \
--evaluation_strategy steps \
--per_device_train_batch_size 32 --gradient_accumulation_steps 1 \
--per_device_eval_batch_size 32 \
--learning_rate 5e-5 \
--weight_decay 0.01 \
--eval_steps 50 \
--max_grad_norm 1.0 \
--num_train_epochs 5 \
--lr_scheduler_type polynomial \
--warmup_steps 50
```
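The checkpoint written to `--output_dir` is a standard Hugging Face Transformers model, so it can be loaded directly for inference afterwards. A minimal sketch, assuming the `/tmp/finetuning` MRPC checkpoint from the example above:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned MRPC checkpoint produced by run_glue.py.
model_dir = "/tmp/finetuning"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

# MRPC is a sentence-pair (paraphrase) task, so encode both sentences together.
inputs = tokenizer(
    "The company said profits rose sharply.",
    "Profits at the company increased significantly, it said.",
    return_tensors="pt",
    truncation=True,
    max_length=128,
)
with torch.no_grad():
    logits = model(**inputs).logits
print("Predicted label:", logits.argmax(dim=-1).item())
```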
```bibtex
@misc{wang2023distill,
      title={How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives},
      author={Xinpeng Wang and Leonie Weissweiler and Hinrich Schütze and Barbara Plank},
      year={2023},
      eprint={2305.15032},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```