The transformer model was first introduced in the paper:
- "Attention is all you need" by Vaswani et. al., (here)
For a thorough, in-depth discussion of the Transformer implementation, you can check out a blog post that I wrote about it.
The transformer is used to model sequence-to-sequence tasks (like machine translation), where the input sequence is encoded by an encoder and the output sequence is then produced by a decoder.
Before feeding the sequence to the transformer, we pass the tokens through a standard embedding layer. We also need to encode the order of the sequence, since order information is not built into the architecture. Thus, we use a position embedding layer that maps each sequence index to a vector. The word embedding and the position embedding are added, and dropout is applied to produce the final token embedding.
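A minimal PyTorch sketch of this embedding step might look as follows (the class and parameter names, like `TokenEmbedding` and `max_len`, are illustrative placeholders, not code from the post):

```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Word embedding + learned position embedding, followed by dropout."""

    def __init__(self, vocab_size, max_len, d_model, dropout=0.1):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # one vector per sequence index
        self.dropout = nn.Dropout(dropout)

    def forward(self, tokens):  # tokens: (batch, seq_len) of token ids
        positions = torch.arange(tokens.size(1), device=tokens.device)  # (seq_len,)
        # Add word and position embeddings (positions broadcast over the batch).
        x = self.word_emb(tokens) + self.pos_emb(positions)
        return self.dropout(x)
```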
The query, key and value linear layers project each token embedding into a query, a key and a value vector. Each position's attention output is a weighted sum of the value vectors, where the weights come from the scaled dot products between its query and all of the keys.
Each self-attention block has several sets of these query, key and value projections, one per attention head; the heads attend in parallel and their outputs are concatenated and projected back to the model dimension.
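A minimal sketch of such a multi-head attention layer is shown below. The per-head projections are fused into one linear layer each and then reshaped into heads, which is equivalent to keeping separate per-head matrices; all names and signatures are illustrative:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Scaled dot-product attention with num_heads parallel heads."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Query, key and value projections for all heads, fused into one layer each.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch, q_len, _ = query.shape
        k_len = key.size(1)
        # Project, then split into heads: (batch, num_heads, len, d_head).
        q = self.q_proj(query).view(batch, q_len, self.num_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(key).view(batch, k_len, self.num_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(value).view(batch, k_len, self.num_heads, self.d_head).transpose(1, 2)
        # Attention weights from scaled dot products of queries and keys.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = attn @ v                                       # (batch, num_heads, q_len, d_head)
        out = out.transpose(1, 2).reshape(batch, q_len, -1)  # concatenate the heads
        return self.out_proj(out)
```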
The encoder consists of a stack of identical blocks, each containing two sub-layers: a multi-head self-attention layer and a position-wise feed-forward network, with a residual connection and layer normalization around each sub-layer.
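Assuming the `MultiHeadAttention` sketch above, one encoder block could be written like this (post-norm residuals, as in the original paper; `d_ff` and the other hyperparameter names are my own):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Self-attention + position-wise feed-forward, each wrapped in a
    residual connection followed by layer normalization."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        x = self.norm1(x + self.dropout(self.attn(x, x, x, src_mask)))
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x
```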
The decoder is also composed of a stack of identical blocks, with two key differences (a sketch follows the list):
- The decoder uses masked self-attention, meaning that current sequence elements cannot attend to future elements.
- In addition to the two sub-layers, each decoder block has a third sub-layer that performs cross-attention between the decoded sequence and the outputs of the encoder.
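Putting both differences together, a decoder block sketch might look like the following. It reuses the `MultiHeadAttention` module above for both the masked self-attention and the cross-attention sub-layers, and includes a common lower-triangular mask helper (again, names are illustrative):

```python
import torch
import torch.nn as nn

def causal_mask(seq_len):
    """Lower-triangular mask: position i may only attend to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

class DecoderBlock(nn.Module):
    """Masked self-attention, cross-attention over the encoder outputs,
    then a position-wise feed-forward sub-layer."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, memory, tgt_mask=None, src_mask=None):
        # Masked self-attention: tgt_mask hides future positions.
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, tgt_mask)))
        # Cross-attention: queries from the decoder, keys/values from the encoder.
        x = self.norm2(x + self.dropout(self.cross_attn(x, memory, memory, src_mask)))
        x = self.norm3(x + self.dropout(self.ff(x)))
        return x
```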
The full transformer model is constructed by feeding the outputs of the final encoder block into the cross-attention layer of every decoder block. Finally, the outputs of the final decoder block are passed through a linear layer to produce scores over the target vocabulary.
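Under the same assumptions as the sketches above, the full model can be wired up like this (the default sizes follow the base configuration from the paper):

```python
import torch.nn as nn

class Transformer(nn.Module):
    """Encoder-decoder stack with a final projection onto the target vocabulary."""

    def __init__(self, src_vocab, tgt_vocab, num_layers=6,
                 d_model=512, num_heads=8, d_ff=2048, max_len=512):
        super().__init__()
        self.src_embed = TokenEmbedding(src_vocab, max_len, d_model)
        self.tgt_embed = TokenEmbedding(tgt_vocab, max_len, d_model)
        self.encoder = nn.ModuleList(
            [EncoderBlock(d_model, num_heads, d_ff) for _ in range(num_layers)]
        )
        self.decoder = nn.ModuleList(
            [DecoderBlock(d_model, num_heads, d_ff) for _ in range(num_layers)]
        )
        self.generator = nn.Linear(d_model, tgt_vocab)  # scores over target vocabulary

    def forward(self, src, tgt):
        memory = self.src_embed(src)
        for block in self.encoder:
            memory = block(memory)
        x = self.tgt_embed(tgt)
        tgt_mask = causal_mask(tgt.size(1)).to(tgt.device)
        for block in self.decoder:
            # Every decoder block cross-attends to the final encoder output.
            x = block(x, memory, tgt_mask=tgt_mask)
        return self.generator(x)  # (batch, tgt_len, tgt_vocab)
```

Note that every decoder block cross-attends to the same `memory` tensor, the output of the final encoder block, which is exactly the wiring described above.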