Masked Language Modeling (MLM) is a pre-training technique that teaches a transformer the semantics of a language by essentially asking it to fill in the blanks.
The pre-trained model can then be fine-tuned for downstream tasks such as classification, generation, etc.
Kaggle Notebook: Masked Language Modeling from Scratch
>>> predict_mask('hello! how are you?')
masked: you predicted: you
ACTUAL: hello! how are you?
MASKED: hello! how are [MASK]?
MODEL: hello! how are you?
- Implementing Encoder-Only Transformer model ✅️
- Preparing the dataset from scratch ✅️
- Training BERT-like tokenizer from scratch ✅️
- Training from scratch with MLM objective ✅️
- Trained on Wikipedia dataset from scratch ✅️
- in Masked Language Modeling, the loss calculation is similar to that of causal LMs
- the inputs and the labels are aligned position by position (there is no one-token shift as in causal LMs)
- in the inputs:
  - 15% of the tokens are randomly replaced by the [MASK] token
  - this 15% doesn't include the pad tokens
- in the labels:
  - the ground-truth tokens that were masked in the inputs are kept
  - all other positions are set to -100, which nn.CrossEntropyLoss ignores by default (ignore_index=-100)
- 1 is the [PAD] token
- 2 is the [MASK] token
- model max length = 128

Example input_ids (top) and the corresponding labels (bottom); a minimal sketch of this masking step follows the tensors:
tensor([5680, 10, 313, 2, 2, 4541, 14, 5393, 5404, 70, 11, 153,
40, 2319, 2, 7560, 14, 1681, 3534, 148, 1649, 16, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1])
tensor([-100, -100, -100, 939, 1058, -100, -100, -100, -100, -100, -100, -100,
-100, -100, 1246, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100])
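For reference, here is a minimal sketch of how such an (input_ids, labels) pair can be built. This is not the notebook's exact collator; `PAD_ID` and `MASK_ID` are assumed from the notes above (1 and 2), and the 15% selection is done with a simple Bernoulli draw over the non-pad positions:

```python
import torch

PAD_ID = 1    # [PAD] token id (per the notes above)
MASK_ID = 2   # [MASK] token id (per the notes above)

def mask_tokens(input_ids: torch.Tensor, mask_prob: float = 0.15):
    """Build (masked_inputs, labels) for the MLM objective.

    Positions chosen for masking keep their ground-truth id in `labels`;
    every other position is set to -100 so nn.CrossEntropyLoss ignores it.
    """
    labels = input_ids.clone()

    # Candidate positions: real tokens only, never pad tokens.
    candidates = input_ids != PAD_ID

    # Sample ~15% of the candidate positions.
    probs = torch.full(input_ids.shape, mask_prob)
    masked = torch.bernoulli(probs).bool() & candidates

    labels[~masked] = -100          # ignored by the loss
    inputs = input_ids.clone()
    inputs[masked] = MASK_ID        # replace the chosen tokens with [MASK]
    return inputs, labels
```

With a batch like the tensors above, the loss can then be computed as e.g. `nn.CrossEntropyLoss()(logits.view(-1, vocab_size), labels.view(-1))`, where the positions labeled -100 contribute nothing.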
- A straightforward, simple implementation.
- nn.LayerNorm is replaced with RMSNorm, which many prefer. Code from here. (A sketch of RMSNorm and the padding mask appears below, after the inference example.)
- It looks like BERT, but it is not BERT; BERT is more complicated than this.
- Only the MLM part of BERT is implemented, so there is no need for the [CLS] and [SEP] tokens.
- Learned positional embeddings, as in BERT, instead of the sinusoidal encodings from the original Transformer.
- The encoder self-attention can also take a mask that hides the pad tokens, so the attention layers ignore the padding.
- Inference currently only supports a batch size of 1.
- After the encoder outputs pass through the dim->vocab Linear layer, the logits at the masked position are softmaxed, and argmax over them gives the predicted token (sketched below).
out: 1 x 128 x 256
if the input sequence for inference was masked at position 4, we extract 1 x 256 at index 4:
preds: out[:,4,:]
softmax -> argmax
preds: predicted token
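A rough sketch of that prediction step. It is not the notebook's actual `predict_mask`; it assumes a Hugging Face-style tokenizer (with `mask_token_id` and max-length padding) and a model that returns per-position logits, so the names and call signatures here are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_mask_sketch(text: str, model, tokenizer):
    """Predict the token at the single [MASK] position of `text` (batch size 1)."""
    enc = tokenizer(text, return_tensors="pt", padding="max_length",
                    truncation=True, max_length=128)
    input_ids = enc["input_ids"]                     # 1 x 128

    logits = model(input_ids)                        # 1 x 128 x vocab_size (assumed)

    # Index of the first [MASK] token in the sequence.
    mask_pos = (input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1][0]

    probs = F.softmax(logits[:, mask_pos, :], dim=-1)  # 1 x vocab_size
    pred_id = probs.argmax(dim=-1)                     # predicted token id
    return tokenizer.decode(pred_id)
```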
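And a minimal sketch of the two model details mentioned in the notes above: an RMSNorm module as a drop-in replacement for nn.LayerNorm, and a key-padding mask built from the pad token id. `PAD_ID = 1` is assumed from the notes; the notebook's actual code may differ:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm: scales by the RMS of the features, no mean-centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

PAD_ID = 1  # assumed [PAD] id, per the notes above

def key_padding_mask(input_ids: torch.Tensor) -> torch.Tensor:
    """True at pad positions; can be passed as `key_padding_mask` to
    nn.MultiheadAttention (or used to set pad logits to -inf in a custom
    attention) so the attention layers ignore the padding."""
    return input_ids == PAD_ID
```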
While the predicted word may not be an exact match, it usually fits the meaning of the sentence well, since the model attends to the context of the entire sentence.
masked: feed predicted: feed
ACTUAL: The larvae are black and flattened and feed on snails as well.
MASKED: the larvae are black and flattened and [MASK] on snails as well.
MODEL: the larvae are black and flattened and feed on snails as well.
---
masked: facility predicted: school
ACTUAL: Throughout the year, the facility hosts a variety of educational programs.
MASKED: throughout the year, the [MASK] hosts a variety of educational programs.
MODEL: throughout the year, the school hosts a variety of educational programs.
---
masked: provide predicted: create
ACTUAL: IRIS has partnered with other NGOs to provide funding for services, such as with The Vision Charity in Sri Lanka.
MASKED: iris has partnered with other ngos to [MASK] funding for services, such as with the vision charity in sri lanka.
MODEL: iris has partnered with other ngos to create funding for services, such as with the vision charity in sri lanka.
You can check the notebook for more output examples.
- Needs evaluation metrics better suited to this task (e.g. masked-token prediction accuracy).
- May try fine-tuning this model for downstream tasks in the future.
- Inference should ideally support batch_size > 1, but I have limited time to work on this; the current version gets the job done.
Your word is a lamp to my feet and a light to my path. Psalm 119:105