Masked Language Modeling (MLM) is a pre-training technique that teaches a transformer the semantics of a language by essentially asking it to fill in the blanks.
The pre-trained model can then be fine-tuned for downstream tasks such as classification, generation, etc.
Kaggle Notebook: Masked Language Modeling from Scratch
>>> predict_mask('hello! how are you?')
masked: you predicted: you
ACTUAL: hello! how are you?
MASKED: hello! how are [MASK]?
MODEL: hello! how are you?
- Implementing Encoder-Only Transformer model ✅️
- Preparing the dataset from scratch ✅️
- Training BERT-like tokenizer from scratch ✅️
- Training from scratch with MLM objective ✅️
- Trained on Wikipedia dataset from scratch ✅️
- in Masked Language Modeling, the loss calculation is similar to that of causal LMs
- the inputs and the labels are aligned position by position (there is no one-token shift as in causal LMs)
- in the inputs:
  - 15% of the tokens are randomly replaced by the [MASK] token
  - this 15% doesn't include the pad tokens
- in the labels:
  - the ground-truth tokens that were masked in the inputs are kept
  - all other positions are set to -100, which nn.CrossEntropyLoss ignores by default (ignore_index=-100)
- 1 is the [PAD] token
- 2 is the [MASK] token
- model max length = 128

Example input_ids (top) and the corresponding labels (bottom); a minimal sketch of this masking step follows the tensors:
tensor([5680, 10, 313, 2, 2, 4541, 14, 5393, 5404, 70, 11, 153,
40, 2319, 2, 7560, 14, 1681, 3534, 148, 1649, 16, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1])
tensor([-100, -100, -100, 939, 1058, -100, -100, -100, -100, -100, -100, -100,
-100, -100, 1246, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100])
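For reference, here is a minimal sketch of how such an (input_ids, labels) pair can be built. This is not the notebook's exact collator; `PAD_ID` and `MASK_ID` are assumed from the notes above (1 and 2), and the 15% selection is done with a simple Bernoulli draw over the non-pad positions:

```python
import torch

PAD_ID = 1    # [PAD] token id (per the notes above)
MASK_ID = 2   # [MASK] token id (per the notes above)

def mask_tokens(input_ids: torch.Tensor, mask_prob: float = 0.15):
    """Build (masked_inputs, labels) for the MLM objective.

    Positions chosen for masking keep their ground-truth id in `labels`;
    every other position is set to -100 so nn.CrossEntropyLoss ignores it.
    """
    labels = input_ids.clone()

    # Candidate positions: real tokens only, never pad tokens.
    candidates = input_ids != PAD_ID

    # Sample ~15% of the candidate positions.
    probs = torch.full(input_ids.shape, mask_prob)
    masked = torch.bernoulli(probs).bool() & candidates

    labels[~masked] = -100          # ignored by the loss
    inputs = input_ids.clone()
    inputs[masked] = MASK_ID        # replace the chosen tokens with [MASK]
    return inputs, labels
```

With a batch like the tensors above, the loss can then be computed as e.g. `nn.CrossEntropyLoss()(logits.view(-1, vocab_size), labels.view(-1))`, where the positions labeled -100 contribute nothing.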
- A straightforward, simple implementation.
- nn.LayerNorm is replaced with RMSNorm, which many prefer. Code from here. (A sketch of RMSNorm and the padding mask appears below, after the inference example.)
- It looks like BERT, but it is not BERT; BERT is more complicated than this.
- Only the MLM part of BERT is implemented, so there is no need for the [CLS] and [SEP] tokens.
- Learned positional embeddings, as in BERT, instead of the sinusoidal encodings from the original Transformer.
- The encoder self-attention can also take a mask that hides the pad tokens, so the attention layers ignore the padding.
- Inference currently only supports a batch size of 1.
- After the encoder outputs pass through the dim->vocab Linear layer, the logits at the masked position are softmaxed, and argmax over them gives the predicted token (sketched below).
out: 1 x 128 x 256
if the input sequence for inference was masked at position 4, we extract 1 x 256 at index 4:
preds: out[:,4,:]
softmax -> argmax
preds: predicted token
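A rough sketch of that prediction step. It is not the notebook's actual `predict_mask`; it assumes a Hugging Face-style tokenizer (with `mask_token_id` and max-length padding) and a model that returns per-position logits, so the names and call signatures here are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_mask_sketch(text: str, model, tokenizer):
    """Predict the token at the single [MASK] position of `text` (batch size 1)."""
    enc = tokenizer(text, return_tensors="pt", padding="max_length",
                    truncation=True, max_length=128)
    input_ids = enc["input_ids"]                     # 1 x 128

    logits = model(input_ids)                        # 1 x 128 x vocab_size (assumed)

    # Index of the first [MASK] token in the sequence.
    mask_pos = (input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1][0]

    probs = F.softmax(logits[:, mask_pos, :], dim=-1)  # 1 x vocab_size
    pred_id = probs.argmax(dim=-1)                     # predicted token id
    return tokenizer.decode(pred_id)
```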
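And a minimal sketch of the two model details mentioned in the notes above: an RMSNorm module as a drop-in replacement for nn.LayerNorm, and a key-padding mask built from the pad token id. `PAD_ID = 1` is assumed from the notes; the notebook's actual code may differ:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm: scales by the RMS of the features, no mean-centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

PAD_ID = 1  # assumed [PAD] id, per the notes above

def key_padding_mask(input_ids: torch.Tensor) -> torch.Tensor:
    """True at pad positions; can be passed as `key_padding_mask` to
    nn.MultiheadAttention (or used to set pad logits to -inf in a custom
    attention) so the attention layers ignore the padding."""
    return input_ids == PAD_ID
```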
While the predicted word may not be an exact match, it usually fits the meaning of the sentence well, since the model attends to the context of the entire sentence.
masked: feed predicted: feed
ACTUAL: The larvae are black and flattened and feed on snails as well.
MASKED: the larvae are black and flattened and [MASK] on snails as well.
MODEL: the larvae are black and flattened and feed on snails as well.
---
masked: facility predicted: school
ACTUAL: Throughout the year, the facility hosts a variety of educational programs.
MASKED: throughout the year, the [MASK] hosts a variety of educational programs.
MODEL: throughout the year, the school hosts a variety of educational programs.
---
masked: provide predicted: create
ACTUAL: IRIS has partnered with other NGOs to provide funding for services, such as with The Vision Charity in Sri Lanka.
MASKED: iris has partnered with other ngos to [MASK] funding for services, such as with the vision charity in sri lanka.
MODEL: iris has partnered with other ngos to create funding for services, such as with the vision charity in sri lanka.
You can check the notebook for more output examples.
- Needs evaluation metrics better suited to this task (e.g. masked-token prediction accuracy).
- May try fine-tuning this model for downstream tasks in the future.
- Inference should ideally support batch_size > 1, but I have limited time to work on this; the current version gets the job done.
Your word is a lamp to my feet and a light to my path. Psalm 119:105