
Investigate BERT on Non-linearity and Layer Commutativity

This repository contains the empirical experiments from our work, implemented on top of the official BERT code.

The accompanying paper is Of Non-Linearity and Commutativity in BERT.

Get started

Code Structure:

  • modeling, optimization, tokenization. Define the BERT model, optimizer, and tokenizer.
  • run_pretraining, run_classifier. Pre-training and fine-tuning on GLUE tasks.
  • create_pretraining_data. Create pre-processed data for pre-training (unlabeled large corpus --> TFRecord files).
  • run_finetune_glue.sh. Script for fine-tuning BERT on all GLUE tasks.
  • data_utils. Data processors for GLUE fine-tuning.
  • graph-mode. Refactored versions of run_pretraining.py and run_classifier.py that run in graph mode instead of using the TensorFlow Estimator API.
  • non-linearity. Experiments including training linear/non-linear approximators, replacing, removing, and freezing layers, and extracting hidden embeddings.
  • layer-commutativity. Experiments including swapping and shuffling layers.
  • comparison. Comparison with simple MLP and CNN models.

Main dependencies:

  • Python 3.7.5
  • TensorFlow 1.14.0
  • PyTorch 1.5.0

For the other dependencies, refer to the requirements file.
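For reference, a minimal environment setup could look like the following. This is only a sketch: the environment name is a placeholder, the versions are those listed above, and the remaining dependencies come from the requirements file.

conda create -n bert-investigate python=3.7.5
conda activate bert-investigate
# GPU build assumed here; plain tensorflow==1.14.0 also matches the listed version
pip install tensorflow-gpu==1.14.0 torch==1.5.0
pip install -r requirements.txt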

Pre-training & Fine-tuning

In this work, we mainly experiment on BERT-base and BERT-small. For BERT-base, we use the official pre-trained weights. For BERT-small, we pre-trained the model ourselves, but you can also use the official pre-trained weights.

Model       Layers  Heads  Hidden Size  Max Seq Length  #Params  Pre-trained Weights
BERT-base   12      12     768          512             110M     Official
BERT-small  6       8      512          128             35M      Official
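For reference, a bert_config.json for the BERT-small setting above might look roughly like this. The field names follow the official BERT config format; any value not given in the table (e.g. intermediate_size, vocab_size, max_position_embeddings, dropout rates) is an assumption here, not the exact configuration we used.

# Sketch only: values not taken from the table above are assumptions.
cat > bert_config_small.json <<'EOF'
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 2048,
  "max_position_embeddings": 512,
  "num_attention_heads": 8,
  "num_hidden_layers": 6,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
EOF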

The official BERT team did not release their pre-processed pre-training data, and the corpora they used (English Wikipedia and BookCorpus) are not readily available. We therefore use OpenWebText, which comprises 38 GB of text, to pre-train BERT-small. Note that pre-training BERT-small took around 8 days on a single NVIDIA Tesla V100 GPU card and around 5 days on two cards.

Converting the original unlabeled corpus to TFRecord files is both time- and resource-consuming. We recommend using the pre-trained weights directly. If you want to pre-train BERT on your own text, you can use the following commands.

To save time, we provide our pre-trained BERT-small weights: Download
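The commands below refer to a few shell variables for paths. A minimal sketch of how they might be set (all paths are placeholders):

export RAW_TEXT_DIR=/path/to/raw_text            # unlabeled plain-text corpus
export TFRECORD_DIR=/path/to/tfrecords           # output of create_pretraining_data.py
export MODEL_DIR=/path/to/bert_model             # vocab.txt, bert_config.json, checkpoints
export GLUE_DIR=/path/to/glue_data               # GLUE task data for fine-tuning
export MODEL_Finetune_DIR=/path/to/finetune_out  # fine-tuning output directories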

Here are the commands for pre-training.

Create the pre-training data:

python create_pretraining_data.py \
  --input_file=$RAW_TEXT_DIR \
  --output_file=$TFRECORD_DIR \
  --vocab_file=$MODEL_DIR/vocab.txt \
  --do_lower_case=true \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5

Pre-train using the Estimator API on a single GPU card:

CUDA_VISIBLE_DEVICES="0" python run_pretraining.py \
  --input_file=$TFRECORD_DIR/*.tfrecord \
  --output_dir=$MODEL_DIR \
  --do_train=true \
  --do_eval=true \
  --bert_config_file=$MODEL_DIR/bert_config.json \
  --train_batch_size=256 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=1000000 \
  --num_warmup_steps=10000 \
  --learning_rate=1e-4 \
  --model_type=origin

On multiple GPU cards, implemented with Horovod:

CUDA_VISIBLE_DEVICES="2,3" mpirun -np 2 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python run_pretraining_hvd.py \
    --input_file=$TFRECORD_DIR/*.tfrecord \
    --output_dir=$MODEL_DIR \
    --do_train=true \
    --do_eval=true \
    --bert_config_file=$MODEL_DIR/bert_config.json \
    --train_batch_size=256 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --num_train_steps=1000000 \
    --num_warmup_steps=10000 \
    --learning_rate=1e-4  

Pre-train using the refactored graph-mode version on a single GPU card:

cd graph-mode/
bash run_pretrain.sh

For fine-tuning, we use the GLUE benchmark. There are several ways to run it.
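The GLUE data itself is not included in this repository. The commands below assume $GLUE_DIR contains one sub-folder per task; the layout sketched here is inferred from the --data_dir arguments and the data must be downloaded separately.

ls $GLUE_DIR
# CoLA  MNLI  MRPC  QNLI  QQP  RTE  SST-2  STS-B  WNLI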

Fine-tune on a single GLUE task (e.g. MNLI-matched) using the Estimator API:

CUDA_VISIBLE_DEVICES="0" python run_classifier.py \
    --task_name=MNLIM  \
    --do_train=true  \
    --do_eval=true  \
    --do_predict=true  \
    --data_dir=$GLUE_DIR/MNLI  \
    --vocab_file=$MODEL_DIR/vocab.txt  \
    --bert_config_file=$MODEL_DIR/bert_config.json \
    --init_checkpoint=$MODEL_DIR/bert_model.ckpt  \
    --max_seq_length=128  \
    --train_batch_size=32  \
    --learning_rate=2e-5  \
    --num_train_epochs=3.0  \
    --output_dir=$MODEL_Finetune_DIR/mnlim_output/

Use the refactored graph-mode version (modify the parameters inside the bash script as needed):

cd graph-mode/
bash run_finetune.sh

Fine-tune on all GLUE tasks:

bash run_finetune_glue.sh

To read the fine-tuning results conveniently:

python glue_result.reader.py --dir=$MODEL_Finetune_DIR
