Master thesis work for investigating if and how well multilingual models in low-resource languages (such as Swedish) can incorporate longer contexts without re-training the models from scratch on long-context datasets in each language. The goal was to investigate if a multilingual model, such as XLM-R, could be extended into a Longformer model and then pre-train only on English, while still having a long-context model in several languages.
The script provided includes the necessary steps to reproduce the result presented in the master thesis. We convert pre-trained monolingual and multilingual language models on English to Longformer models to a maximum model length of 4096.
We call the pre-trained models using the Longformer pre-training:
- RoBERTa-Long
- XLM-Long (weights and config are available on Huggingface here)
Based on a RoBERTa and XLM-R model that has been pre-trained using the Longformer pre-training scheme.
Training of all models are done through docker containers for reproducability
Example of how to build, start, run and shutdown the docker container and the training script
If you encounter problems, toggle the Technical Requirement
and Pre-Requisites
links to verify that you have a sufficiently large GPU and the pre-requisite applications/libraries installed.
Technical Requirements
**Please Note**: Running the following project is quite computationally expensive. It is required to have a Docker container with at least 90GB of RAM allocated for the pre-training and a CUDA enabled GPU with 48GB of memory!
For the Fine-tuning on QA tasks, 32GB of RAM is sufficient and a smaller GPU can be used when fine-tuning on regular or multilingual SQuAD. However, for the datasets created with a longer context, it requires at least 32GB of RAM
Pre-Requisites
The following applications and libraries needs to be installed in order to run the application - [Docker](https://docs.docker.com/get-docker/) - [Docker Compose](https://docs.docker.com/compose/install/) - Miniconda or Anaconda with Python3 - make (terminal command) - wget (terminal command) - unzip (terminal command) - tmux (terminal command) - CUDA enabled GPU (check if set up correctly by entering `nvidia-smi` in your terminal) - [NVIDIA container toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html) installed and linked to your Docker container (Needed if encountering error: ```ERROR: for XXX_markussagen_repl1 Cannot create container for service repl: Unknown runtime specified nvidia```)
-
Download the repo
git@github.com:MarkusSagen/Master-Thesis-Multilingual-Longformer.git cp .env.template .env
-
Download the dataset
Unzip the dataset and then place it in a suitable locationwget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip unzip wikitext-103-raw-v1.zip
-
Change your model and dataset paths
Open the.env
file and change theDATA_DIR
and theMODEL_DIR
to the relative path to where you have you want your models stored and where you downloaded the dataset. Make sure that the folders you set exist on your system.
For instance:DATA_DIR=/Users/admin/data/wikitext-103-raw MODEL_DIR=/Users/admin/model
-
Start the docker container
make build && make up
-
Start tmux
In your terminal start tmux. This will ensure that your runs are not stopped if you are disconnected from an ssh connectiontmux
-
Run the script
Here is an example of how a training script might look like for pre-training a XLM-R model into a Longformer. The general format follows the parameters of Huggingface Transformer's TrainingArgument.export SEED=42 export MAX_LENGTH=4096 export MODEL_NAME_OR_PATH=xlm-roberta-base export MODEL_NAME=xlm-roberta-to-longformer export MODEL_DIR=/workspace/models export DATA_DIR=/workspace/data export LOG_DIR=/workspace/logs make repl run="scripts/run_long_lm.py \ --model_name_or_path $MODEL_NAME_OR_PATH \ --model_name $MODEL_NAME \ --output_dir $MODEL_DIR/$MODEL_NAME \ --logging_dir $LOG_DIR/$MODEL_NAME \ --val_file_path $DATA_DIR/wiki.valid.raw \ --train_file_path $DATA_DIR/wiki.train.raw \ --seed $SEED \ --model_max_length $MAX_LENGTH \ --adam_epsilon 1e-8 \ --warmup_steps 500 \ --learning_rate 3e-5 \ --weight_decay 0.01 \ --max_steps 6000 \ --evaluate_during_training \ --logging_steps 50 \ --eval_steps 50 \ --save_steps 6000 \ --max_grad_norm 1.0 \ --per_device_eval_batch_size 2 \ --per_device_train_batch_size 1 \ --gradient_accumulation_steps 64 \ --overwrite_output_dir \ --fp16 \ --do_train \ --do_eval "
-
Shutdown run and container
make down
-
(Optional) terminate tmux
exit
The training of these models were done in two steps:
- Pre-train
RoBERTa-base
andXLM-R-base
models into Longformer models - Fine-tune regular RoBERTa and XLM-R models on SQuAD formated dataset. Compare the results of these with our Longformer trained models and a Longformer model released by the Longformer authors. We train these models with multiple different seeds, datasets and context length.
We have grouped each model trained and evaluated based on:
- The dataset and language used for each model
- Then based on what model that were trained
The models were trained according to this structure
English Pre-training
|-- Wikitext-103
|-- RoBERTa-Long (4096)
|-- XLM-Long (4096)
Each fine-tuning are grouped based on the dataset, language and context length and then evaluated for each model. For more in-depth explanation of the pre-training script and parameters, see Here.
Runs:
Wikitext-103
export SEED=42
export MAX_LENGTH=4096
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=roberta-base
export MODEL_NAME=$MODEL_NAME_OR_PATH-long
export DATA_DIR=/workspace/data
export LOG_DIR=/workspace/logs
make repl run="scripts/run_long_lm.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--model_name $MODEL_NAME \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--val_file_path $DATA_DIR/wiki.valid.raw \
--train_file_path $DATA_DIR/wiki.train.raw \
--seed $SEED \
--model_max_length $MAX_LENGTH \
--adam_epsilon 1e-8 \
--warmup_steps 500 \
--learning_rate 3e-5 \
--weight_decay 0.01 \
--max_steps 6000 \
--evaluate_during_training \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 500 \
--max_grad_norm 1.0 \
--per_device_eval_batch_size 2 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 64 \
--overwrite_output_dir \
--fp16 \
--do_train \
--do_eval
"
export SEED=42
export MAX_LENGTH=4096
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=xlm-roberta-base
export MODEL_NAME=$MODEL_NAME_OR_PATH-long
export DATA_DIR=/workspace/data
export LOG_DIR=/workspace/logs
make repl run="scripts/run_long_lm.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--model_name $MODEL_NAME \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--val_file_path $DATA_DIR/wiki.valid.raw \
--train_file_path $DATA_DIR/wiki.train.raw \
--seed $SEED \
--model_max_length $MAX_LENGTH \
--adam_epsilon 1e-8 \
--warmup_steps 500 \
--learning_rate 3e-5 \
--weight_decay 0.01 \
--max_steps 6000 \
--evaluate_during_training \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 500 \
--max_grad_norm 1.0 \
--per_device_eval_batch_size 2 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 64 \
--overwrite_output_dir \
--fp16 \
--do_train \
--do_eval
"
English QA Fine-Tuning:
|-- SQuAD
|-- RoBERTa (512)
|-- Longformer (4096)
|-- RoBERTa-Long (4096)
|-- XLM-R (512)
|-- XLM-Long (4096)
|-- SQ3 (512)
|-- RoBERTa (512)
|-- Longformer (4096)
|-- RoBERTa-Long (4096)
|-- XLM-R (512)
|-- XLM-Long (4096)
|-- SQ3 (2048)
|-- RoBERTa (512)
|-- Longformer (4096)
|-- RoBERTa-Long (4096)
|-- XLM-R (512)
|-- XLM-Long (4096)
|-- TriviaQA = TODO
Multilingual QA Fine-Tuning:
|-- XQuAD
|-- RoBERTa (512)
|-- XLM-R (512)
|-- XLM-Long (4096)
|-- XQ3 (512)
|-- XLM-R (512)
|-- XLM-Long (4096)
|-- XQ3 (2048)
|-- XLM-R (512)
|-- XLM-Long (4096)
|-- MLQA
|-- XLM-R (512)
|-- XLM-Long (4096)
We fine-tune the models on SQuAD-formated extractive question-answer datasets in English and multiple other languages. We also create a concatenated dataset with longer context for both the SQuAD and XQuAD (multilingual SQuAD). The datasets are provided through Huggingface's Datasets library.
For more more in-depth information regarding how the fine-tuning scripts, parameters and evaluation setup, see Here
Runs:
Each fine-tuning are grouped based on the dataset, language and context length and then evaluated for each model.
SQuAD
export SEED=42
export DATASET=squad
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=roberta-base
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--max_length 512 \
"
export SEED=42
export DATASET=squad
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=allenai/longformer-base-4096
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--max_length 512 \
"
export SEED=42
export DATASET=squad
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=$MODEL_DIR/roberta-base-long
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--max_length 512 \
"
export SEED=42
export DATASET=squad
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=xlm-roberta-base
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--max_length 512 \
"
export SEED=42
export DATASET=squad
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=$MODEL_DIR/xlm-roberta-base-long
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--max_length 512 \
"
SQ3 (512)
export SEED=42
export MAX_LENGTH=512
export NR_CONCATS=1
export DATASET=squad_long
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=roberta-base
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--nr_concats $NR_CONCATS \
--max_length $MAX_LENGTH \
"
export SEED=42
export MAX_LENGTH=512
export NR_CONCATS=1
export DATASET=squad_long
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=allenai/longformer-base-4096
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--nr_concats $NR_CONCATS \
--max_length $MAX_LENGTH \
"
export SEED=42
export MAX_LENGTH=512
export NR_CONCATS=1
export DATASET=squad_long
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=$MODEL_DIR/roberta-base-long
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--nr_concats $NR_CONCATS \
--max_length $MAX_LENGTH \
"
export SEED=42
export MAX_LENGTH=512
export NR_CONCATS=1
export DATASET=squad_long
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=xlm-roberta-base
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--nr_concats $NR_CONCATS \
--max_length $MAX_LENGTH \
"
export SEED=42
export MAX_LENGTH=512
export NR_CONCATS=1
export DATASET=squad_long
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=$MODEL_DIR/xlm-roberta-base-long
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--nr_concats $NR_CONCATS \
--max_length $MAX_LENGTH \
"
SQ3 (4096)
export SEED=42
export MAX_LENGTH=2048
export NR_CONCATS=3
export DATASET=squad_long
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=allenai/longformer-base-4096
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 32 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--nr_concats $NR_CONCATS \
--max_length $MAX_LENGTH \
"
export SEED=42
export MAX_LENGTH=2048
export NR_CONCATS=3
export DATASET=squad_long
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=$MODEL_DIR/roberta-base-long
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 32 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--nr_concats $NR_CONCATS \
--max_length $MAX_LENGTH \
"
export SEED=42
export MAX_LENGTH=2048
export NR_CONCATS=3
export DATASET=squad_long
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=$MODEL_DIR/xlm-roberta-base-long
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 32 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--nr_concats $NR_CONCATS \
--max_length $MAX_LENGTH \
"
TODO TriviaQA (4096)
XQuAD
export SEED=42
export DATASET=xquad
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=roberta-base
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--max_length 512 \
"
export SEED=42
export DATASET=xquad
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=xlm-roberta-base
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--max_length 512 \
"
export SEED=42
export DATASET=xquad
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=$MODEL_DIR/xlm-roberta-base-long
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--max_length 512 \
"
XQ3 (512)
export SEED=42
export MAX_LENGTH=512
export NR_CONCATS=1
export DATASET=xquad_long
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=xlm-roberta-base
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--nr_concats $NR_CONCATS \
--max_length $MAX_LENGTH \
"
export SEED=42
export MAX_LENGTH=512
export NR_CONCATS=1
export DATASET=xquad_long
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=$MODEL_DIR/xlm-roberta-base-long
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--nr_concats $NR_CONCATS \
--max_length $MAX_LENGTH \
"
XQ3 (4096)
export SEED=42
export MAX_LENGTH=2048
export NR_CONCATS=3
export DATASET=xquad_long
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=$MODEL_DIR/xlm-roberta-base-long
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 32 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--nr_concats $NR_CONCATS \
--max_length $MAX_LENGTH \
"
MLQA
export SEED=42
export DATASET=mlqa
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=xlm-roberta-base
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--max_length 512 \
"
export SEED=42
export DATASET=mlqa
export MODEL_DIR=/workspace/models
export MODEL_NAME_OR_PATH=$MODEL_DIR/xlm-roberta-base-long
export MODEL_NAME=$MODEL_NAME_OR_PATH-seed-$SEED-on-$DATASET
export LOG_DIR=/workspace/logs
export DATA_DIR=/workspace/data
# Debugging
CUDA_LAUNCH_BLOCKING=1
# model args
make repl run="scripts/finetune_qa_models.py \
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $MODEL_DIR/$MODEL_NAME \
--logging_dir $LOG_DIR/$MODEL_NAME \
--dataset $DATASET \
--data_dir $DATA_DIR \
--seed $SEED \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--logging_steps 50 \
--eval_steps 50 \
--save_steps 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 8 \
--overwrite_output_dir \
--evaluate_during_training \
--fp16 \
--do_train \
--do_eval \
--do_lowercase \
--max_length 512 \
"
Many thanks to the Longformer Authors for providing reproducible training scripts and Huggingface for open-sourcing their models and frameworks. I would like to thank my supervisor at Peltarion Philipp Eisen for his invaluable feedback, insight and availability. Thank you Professor Joakim Nivre for insightful and thorough feedback and for taking the time out of your busy schedule. A massive thank you to all the wonderful people at Peltarion for the opportunity to work on such an interesting project.
You can read the report here
@mastersthesis{Sagen1545786,
author = {Sagen, Markus},
institution = {Uppsala University, Department of Information Technology},
pages = {45},
school = {Uppsala University, Department of Information Technology},
title = {Large-Context Question Answering with Cross-Lingual Transfer},
series = {UPTEC IT},
ISSN = {1401-5749},
number = {21003},
year = {2021}
}
The model weights and config for the XLM-Long are available at Huggingface.
Import as model_name:markussagen/xlm-roberta-longformer-base-4096
For questions regarding the code or the master thesis in general add an issue in the repo or contact:
markus.john.sagen@gmail.com
- Include plots and table for the evaluations
- Create bash scripts to fine-tune models on all seeds. Just send in model name