This repository contains the implementation of the paper "ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques" (AAAI 2021).
The code for fine-tuning models (w/o knowledge distillation (KD)) is modified from huggingface/transformers.
The code for KD is modified from TinyBERT.
Python3
torch=1.6.0
tqdm
boto3
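A minimal install sketch for these dependencies, assuming a Python 3 environment with pip available (pin or adjust versions as needed):
pip install torch==1.6.0 tqdm boto3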
Step1: Download the GLUE data by running:
python download_glue_data.py --data_dir data --tasks all
The extracted .tsv files for CoLA and SST-2 are given in the data/ folder.
Step2: Download the pre-trained language model BERT (bert-base-uncased) and the GloVe (glove.42B.300d) embeddings to models/bert_pt/ and glove/, respectively. BERT can be downloaded by running:
python download_bert.py
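download_bert.py only covers the BERT download; the GloVe embeddings can be fetched manually. A sketch assuming the standard Stanford NLP download link (verify the URL before use):
wget http://nlp.stanford.edu/data/glove.42B.300d.zip
unzip glove.42B.300d.zip -d glove/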
Step3: Conduct data augmentation by running:
python data_augmentation.py --pretrained_bert_model models/bert_pt \
--glove_embs glove/glove.42B.300d.txt \
--glue_dir data \
--task_name ${TASK_NAME}$
The augmented dataset train_aug.tsv will be automatically saved into data/${TASK_NAME}$.
Step4 (Optional): For QNLI, QQP and MNLI, which have millions of augmented examples, we split train_aug.tsv into subsets to reduce memory consumption during training.
The following (Linux) commands can be used to split the dataset:
cd data/${TASK_NAME}$
split -l 500000 -d train_aug.tsv train_aug
This produces subsets of 500,000 lines each. Rename the subset files to train_aug0.tsv, train_aug1.tsv, and so on.
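A possible renaming sketch, assuming split -d produced the default two-digit numeric suffixes (train_aug00, train_aug01, ...):
# rename train_aug00 -> train_aug0.tsv, train_aug01 -> train_aug1.tsv, ...
i=0
for f in train_aug[0-9][0-9]; do
  mv "$f" "train_aug${i}.tsv"
  i=$((i+1))
done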
To fine-tune the pre-trained BERT model on a downstream task, enter the directory Pruning/ and run:
python run_glue.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--task_name ${TASK_NAME}$ \
--do_train \
--do_eval \
--evaluate_during_training \
--data_dir ../data/${TASK_NAME}$ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 5.0 \
--output_dir ../models/bert_ft/${TASK_NAME}$ \
--logging_dir ../models/bert_ft/${TASK_NAME}$/logging \
--logging_steps 50 \
--save_steps 0 \
--is_prun False
To train the BERT-base(student) with the fine-tuned BERT-base as teacher, enter the directory KD/
and run:
python train.py \
--config_dir configurations/config_bert-student.json \
--task_name ${TASK_NAME}$ \
--do_lower_case \
--aug_train
The trained BERT-base(student) model will be automatically saved into models/bert_student/${TASK_NAME}$.
To train the BERT-8layer with BERT-base(student) as the teacher, run:
python train.py \
--config_dir configurations/config_bert-8layer.json \
--task_name ${TASK_NAME}$ \
--do_lower_case \
--aug_train
In this training process, the top-most layers of BERT-base(student) are iteratively pruned until the number of layers is reduced to 8.
Requirements: Fine-tuned BERT
Step1: To compress the fine-tuned BERT-base model, we first need to determine the importance of the model weights. We use a metric based on the first-order Taylor expansion, which can be computed by entering the directory Pruning/ and running:
python run_glue.py \
--model_type bert \
--task_name ${TASK_NAME}$ \
--data_dir ../data/${TASK_NAME}$ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--num_train_epochs 1.0 \
--save_steps 0 \
--model_name_or_path ../models/bert_ft/${TASK_NAME}$ \
--output_dir ../models/bert_ft/${TASK_NAME}$ \
--compute_taylor True \
--is_prun False
The results will be saved to ../models/bert_ft/${TASK_NAME}$/taylor_score/taylor.pkl
Step2: Now we can conduct model compression by running:
python3 pruning_one_step.py \
-model_path ../models/bert_ft \
-output_dir ../models/prun_bert \
-task ${TASK_NAME}$ \
-keep_heads ${NUM_OF_ATTN_HEADS_TO_KEEP}$ \
-num_layers ${NUM_OF_LAYERS_TO_KEEP}$ \
-ffn_hidden_dim ${HIDDEN_DIM_OF_FFN}$ \
-emb_hidden_dim ${MATRIX_RANK_OF_EMB_FACTORIZATION}$
The four hyperparameters keep_heads, num_layers, ffn_hidden_dim and emb_hidden_dim define the space of candidate model architectures.
In the final setting of ROSITA, keep_heads=2, num_layers=8, ffn_hidden_dim=512 and emb_hidden_dim=128.
Step3: To train the compressed model without KD, run:
python run_glue.py \
--model_type bert \
--model_name_or_path ../models/prun_bert/${TASK_NAME}$/a2_l8_f512_e128 \
--task_name ${TASK_NAME}$ \
--do_train \
--do_eval \
--evaluate_during_training \
--data_dir ../data/${TASK_NAME}$ \
--max_seq_length ${MAX_SEQ_LENGTH}$ \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs ${EPOCH_NUM}$ \
--output_dir ../models/one_step_prun/${TASK_NAME}$/a2_l8_f512_e128 \
--logging_dir ../models/one_step_prun/${TASK_NAME}$/a2_l8_f512_e128/logging \
--logging_steps 100 \
--save_steps 0 \
--is_prun True
where ${MAX_SEQ_LENGTH}$ and ${EPOCH_NUM}$ are set per task as follows:
Dataset | CoLA | SST-2 | QNLI | QQP | MNLI |
---|---|---|---|---|---|
${MAX_SEQ_LENGTH}$ | 64 | 64 | 128 | 128 | 128 |
${EPOCH_NUM}$ | 20 | 5 | 5 | 5 | 5 |
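For instance, the placeholders for CoLA could be set as shell variables before running the command above (assuming the task name matches the GLUE data folder name):
TASK_NAME=CoLA
MAX_SEQ_LENGTH=64
EPOCH_NUM=20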
The following figure illustrates the four different KD settings.
Requirements: Fine-tuned BERT
Step1: Compress BERT as in Step1 and Step2 of One-step pruning + fine-tuning w/o KD.
Step2: To train the compressed model with KD Setting1, enter KD/
and run:
python train.py \
--config_dir configurations/config_setting1.json \
--task_name ${TASK_NAME}$ \
--do_lower_case \
--aug_train
For the augmented datasets of QNLI, QQP and MNLI, we can run train_with_subset.py instead, which loads the training subsets constructed in Data Preparation to reduce memory consumption. train_with_subset.py can also be used in the other KD settings.
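A plausible invocation, assuming train_with_subset.py accepts the same arguments as train.py (the actual interface may differ; check the script):
python train_with_subset.py \
    --config_dir configurations/config_setting1.json \
    --task_name QNLI \
    --do_lower_case \
    --aug_train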
Requirements: Fine-tuned BERT, BERT-base(student)
Step1: Compute the weight importance metric by entering KD/
and running:
python train.py \
--teacher_model ../models/bert_ft/${TASK_NAME}$ \
--student_model ../models/bert_student/${TASK_NAME}$ \
--data_dir ../data/${TASK_NAME}$ \
--task_name ${TASK_NAME}$ \
--output_dir ../models/bert_student/${TASK_NAME}$/taylor_score \
--num_train_epochs 1 \
--do_lower_case \
--pred_distill \
--compute_taylor
Step2: Compress the BERT-base(student) model by running:
python3 pruning_one_step.py \
-model_path ../models/bert_student \
-output_dir ../models/prun_bert_student \
-task ${TASK_NAME}$ \
-keep_heads 2 \
-num_layers 8 \
-ffn_hidden_dim 512 \
-emb_hidden_dim 128
Step3: Train the compressed model by running:
python train.py \
--config_dir configurations/config_setting2.json \
--task_name ${TASK_NAME}$ \
--do_lower_case \
--aug_train
Requirements: BERT-base(student)
Step1: Compress BERT-base(student) to 8 layers by entering KD/
and running:
python3 pruning_one_step.py \
-model_path ../models/bert_student \
-output_dir ../models/bert-8layer/one_step_prun \
-task ${TASK_NAME}$ \
-keep_heads 12 \
-num_layers 8 \
-ffn_hidden_dim 3072 \
-emb_hidden_dim -1
Step2: Train and iteratively compress the BERT-8layer model by running:
python train.py \
--config_dir configurations/config_setting3.json \
--task_name ${TASK_NAME}$ \
--do_lower_case \
--aug_train
Requirements: BERT-8layer trained with iterative depth pruning
Train and iteratively compress the BERT-8layer model by entering KD/
and running:
python train.py \
--config_dir configurations/config_setting4.json \
--task_name ${TASK_NAME}$ \
--do_lower_case \
--aug_train
To evaluate any trained model on the dev and test sets, enter KD/
and run:
python test.py \
--student_model ${PATH_OF_THE_MODEL_TO_EVALUATE}$ \
--output_dir ${PATH_TO_OUTPUT_EVALUATION_RESULTS}$ \
--data_dir ../data/${TASK_NAME}$ \
--task_name ${TASK_NAME}$ \
--do_lower_case \
--do_eval \
--do_predict
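For example, evaluating a model trained under KD Setting4 on SST-2 (the model and output paths here are purely illustrative):
python test.py \
    --student_model ../models/setting4/SST-2 \
    --output_dir ../models/setting4/SST-2/eval_results \
    --data_dir ../data/SST-2 \
    --task_name SST-2 \
    --do_lower_case \
    --do_eval \
    --do_predict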
Here we provide the ROSITA models trained under KD Setting4 for evaluation:
If you use this repository in published research, please cite our paper:
@inproceedings{Liu2021ROSITA,
author = {Yuanxin Liu and Zheng Lin and Fengcheng Yuan},
title = {ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques},
booktitle = {AAAI 2021},
year = {2021}
}