GnanaPrakashSG2004/Attention_Is_All_You_Need_From_Scratch

Implementation of the NeurIPS 2017 paper "Attention Is All You Need" from scratch

File Structure:

.
├── Assignment_2.pdf
├── corpus
│   ├── dev.en
│   ├── dev.fr
│   ├── processed_dev.en
│   ├── processed_dev.fr
│   ├── processed_test.en
│   ├── processed_test.fr
│   ├── processed_train.en
│   ├── processed_train.fr
│   ├── test.en
│   ├── test.fr
│   ├── train.en
│   └── train.fr
├── metrics
│   └── testbleu.txt
├── plots.ipynb
├── README.md
├── Report.pdf
└── src
    ├── decoder.py
    ├── encoder.py
    ├── __init__.py
    ├── test.py
    ├── train.py
    └── utils.py

Instructions to run the code:

  • All scripts should be run from the repository root directory.

Processing the corpus for the first time:

  • Run the following command to process the corpus for the first time:
python -m src.utils
  • This will create the processed files in the corpus directory with a processed_ prefix (an illustrative sketch of this step follows).
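What src/utils.py actually does to the text is not spelled out here; the following is only a hypothetical sketch of the kind of cleaning such a step typically performs. The lowercasing and whitespace normalization are assumptions, not the repo's confirmed pipeline:

import re

def preprocess_line(line: str) -> str:
    line = line.lower().strip()          # assumption: lowercase the text
    return re.sub(r"\s+", " ", line)     # assumption: collapse repeated whitespace

# Example for one of the corpus files; the real script processes all splits.
with open("corpus/train.en", encoding="utf-8") as src, \
     open("corpus/processed_train.en", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(preprocess_line(line) + "\n")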

Training the model:

  • Run the following command to train the model:
python -m src.train
  • This will train the model and save its weights in the weights directory.
  • To change the hyperparameters, you can pass them as arguments to the above command. For example:
python -m src.train --epochs 10 --batch_size 64
  • To see all the available hyperparameters, you can run the following (an illustrative sketch of how these flags might be defined appears below):
python -m src.train --help
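A minimal sketch of how src/train.py might expose these hyperparameters via argparse. Only --epochs and --batch_size appear in the example above; the remaining flag names are assumptions, with defaults taken from the main configuration described later in this README (6 blocks, 8 heads, 512-dim embeddings, 0.1 dropout, learning rate 0.0001):

import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Train the Transformer from scratch")
    parser.add_argument("--epochs", type=int, default=10)        # shown in the example above
    parser.add_argument("--batch_size", type=int, default=64)    # shown in the example above
    parser.add_argument("--lr", type=float, default=1e-4)        # Adam learning rate (see Implementation Assumptions)
    parser.add_argument("--num_blocks", type=int, default=6)     # assumed flag name
    parser.add_argument("--num_heads", type=int, default=8)      # assumed flag name
    parser.add_argument("--embed_dim", type=int, default=512)    # assumed flag name
    parser.add_argument("--dropout", type=float, default=0.1)    # assumed flag name
    return parser.parse_args()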

Testing the model:

  • Run the following command to test the model:
python -m src.test
  • This will evaluate the model and save the test-set BLEU score in the metrics directory as testbleu.txt (a scoring sketch follows).
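As an illustration of the scoring step, here is a minimal corpus-level BLEU sketch using NLTK; whether src/test.py uses NLTK or its own BLEU implementation is not confirmed, and model_outputs is a hypothetical placeholder for the decoded translations:

from nltk.translate.bleu_score import corpus_bleu

# One reference per sentence; corpus_bleu expects a list of reference lists.
references = [[line.split()] for line in open("corpus/processed_test.fr", encoding="utf-8")]
hypotheses = [line.split() for line in model_outputs]  # model_outputs: decoded translations, one string per test sentence (hypothetical)

bleu = corpus_bleu(references, hypotheses)
with open("metrics/testbleu.txt", "w", encoding="utf-8") as f:
    f.write(f"{bleu:.4f}\n")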

Loading the model:

  • The file weights/transformer.pt is a checkpoint dictionary containing the model's state dict along with other information computed during training.
  • Its keys are as follows:
    • args: Hyperparameters of the training run.
    • model_state_dict: The state dict of the model.
    • optimizer_state_dict: The state dict of the optimizer.
    • en_seq_len: The sequence length of the English train dataset.
    • fr_seq_len: The sequence length of the French train dataset.
    • word2idx_en: The word to index mapping of the English train dataset.
    • word2idx_fr: The word to index mapping of the French train dataset.
    • idx2word_en: The index to word mapping of the English train dataset.
    • idx2word_fr: The index to word mapping of the French train dataset.
    • train_losses: The training losses.
    • dev_losses: The validation losses.
    • dev_bleu: The BLEU score on the validation set.
    • dev_rouge: The ROUGE score on the validation set.
  • To load the model, you can use the following code:
import torch

checkpoint = torch.load('weights/transformer.pt')
model = Transformer(args)  # pass the necessary args (also stored under checkpoint['args'])
model.load_state_dict(checkpoint['model_state_dict'])
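The same checkpoint also carries the vocabularies and sequence lengths listed above; a minimal sketch of restoring them for inference (the comments reflect how these values appear to be used, per the Implementation Assumptions below):

word2idx_en = checkpoint['word2idx_en']  # English word-to-index mapping
idx2word_fr = checkpoint['idx2word_fr']  # French index-to-word mapping, for decoding model outputs
en_seq_len = checkpoint['en_seq_len']    # pad/truncate English inputs to this length
fr_seq_len = checkpoint['fr_seq_len']    # maximum French output length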

Implementation Assumptions:

  • A sequence length of 77 was used for faster training during hyperparameter tuning.
  • With a sequence length of 77, the model assumes that, apart from the <BOS> and <EOS> tokens, sentences are at most 77 tokens long.
  • Sentences shorter than 77 tokens are padded to length 77, and longer sentences are truncated to 77.
  • The padding token is not learned.
  • The position embeddings are not learned (a sketch of the paper's fixed sinusoidal encoding follows this list).
  • The model uses the Adam optimizer with a learning rate of 0.0001.
  • The model uses the CrossEntropyLoss loss function.
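Since the position embeddings are not learned, here is a minimal sketch of the fixed sinusoidal encoding from the original paper; the function name and tensor layout are illustrative, not necessarily what src/encoder.py uses:

import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)       # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                   # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                                           # added to the token embeddings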

Pretrained Model weights:

  • The main model, with 6 blocks, 8 heads, 512-dim embeddings, and a 0.1 dropout rate, has been trained for 10 epochs on the maximum sequence length of the training data.
    • The model can be downloaded from here
  • Some more models saved during the hyperparameter tuning process can be downloaded from the following links:
    • 6 blocks, 8 heads, 512-dim embeddings and 0.1 dropout rate trained for 10 epochs with a sequence length of 77:
    • 6 blocks, 8 heads, 640-dim embeddings and 0.2 dropout rate trained for 10 epochs with a sequence length of 77:
    • 8 blocks, 10 heads, 640-dim embeddings and 0.1 dropout rate trained for 10 epochs with a sequence length of 77:
