Explanations, formulas, visualisations:
- Jay Alammar's blog: The Illustrated Transformer
- Jurafsky & Martin (Speech and Language Processing), chapter 10
- Lena Voita's blog: Sequence to Sequence (seq2seq) and Attention
- One huge neural network
- The encoder part provides the representation of the input
- The decoder part generates a new sequence, given the representation of the input
- Masked language modelling as a training goal (objective, task)
- Cross-entropy (comparing probability distributions) as a loss function
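A minimal numerical sketch (toy numbers of my own, not from the lecture) of the cross-entropy loss for a single masked position: the predicted distribution over a tiny subword vocabulary is compared with the one-hot target distribution.

```python
import torch
import torch.nn.functional as F

# Toy setup: a vocabulary of 5 subwords, one masked position to predict.
logits = torch.tensor([[2.0, 0.5, -1.0, 0.1, 0.3]])  # raw model scores (1 position x 5 subwords)
target = torch.tensor([0])                           # index of the correct subword

# Cross-entropy compares the predicted distribution (softmax over the logits)
# with the target distribution (a one-hot vector for the correct subword).
loss = F.cross_entropy(logits, target)

# The same value computed explicitly: -log(probability assigned to the correct subword)
probs = F.softmax(logits, dim=-1)
manual = -torch.log(probs[0, target[0]])
print(loss.item(), manual.item())  # the two numbers match
```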
- We represent sequences of subword units (symbols), not single words
- The result of encoding a text with a Transformer is a representation for each subword segment in the given sentence. This is a dynamic (contextual) representation because it depends on the sentence, as opposed to "static" representations (e.g. word2vec).
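As an illustration (not part of the lecture materials), the sketch below uses the Hugging Face `transformers` library; the model name `bert-base-uncased` is just one possible choice. It shows that the same word gets a different vector in different sentences, unlike a static word2vec embedding.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual vector of `word` (assumed to stay a single subword)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]         # (sequence length, hidden size)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = vector_for("I sat on the bank of the river.", "bank")
v2 = vector_for("I deposited money at the bank.", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0: the two vectors differ
```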
- With the self-attention mechanism, we can extract more information from the context by selecting the most relevant context positions (a minimal sketch follows after this list)
- Instead of extracting one vector per word (as in word2vec), we store and reuse the whole network, i.e. all its weights
- Multi-head attention: the attention mechanism is repeated several times in parallel, each head with its own parameter initialisation
- Stacked FFNN encoders: the whole encoding process is repeated several times, with layers stacked on top of each other, to achieve good results
(More details in the following lectures)
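As a preview of those details, here is a minimal sketch (my own illustration, not the lecture's code) of scaled dot-product self-attention for one sentence; multi-head attention simply runs several such heads in parallel, each with its own randomly initialised projection matrices.

```python
import torch

def self_attention(x: torch.Tensor, w_q, w_k, w_v) -> torch.Tensor:
    """x: (sequence length, model dim); w_*: (model dim, head dim) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # queries, keys, values
    scores = q @ k.T / k.shape[-1] ** 0.5     # relevance of every position to every other
    weights = torch.softmax(scores, dim=-1)   # attention weights sum to 1 per position
    return weights @ v                        # weighted mix of the value vectors

seq_len, d_model, d_head = 6, 16, 8
x = torch.randn(seq_len, d_model)                           # subword vectors of one sentence
heads = [[torch.randn(d_model, d_head) for _ in range(3)]   # per head: its own W_Q, W_K, W_V
         for _ in range(2)]                                 # 2 heads, different initialisations
multi_head = torch.cat([self_attention(x, *w) for w in heads], dim=-1)
print(multi_head.shape)  # (6, 16): one output vector per subword, heads concatenated
```

A real Transformer additionally projects the concatenated heads with an output matrix and stacks several such encoder layers (each followed by a feed-forward network), which is what the stacked-encoders bullet above refers to.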
- Control over the size of the vocabulary
- Dealing with unknown (out-of-vocabulary) words; a small tokenisation example follows after this list
(More details in the following lectures)
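As a quick illustration of these two points (library and model name chosen only for the example, not prescribed by the lecture): a rare or made-up word is split into known subword pieces instead of becoming an unknown token, so the vocabulary stays at a fixed, bounded size.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.vocab_size)               # fixed vocabulary size chosen in advance (~30k here)
print(tokenizer.tokenize("transformers")) # frequent words stay (almost) whole
print(tokenizer.tokenize("blorptastic"))  # a made-up word is segmented into known pieces
                                          # (the exact split depends on the model's vocabulary)
```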
- The notion of attention comes from encoder-decoder RNNs built for machine translation: it allows the decoder to select the most relevant encoder states when generating the output.
- Generalised as self-attention, this mechanism allows the model to find the most relevant contexts for encoding the input.
- It also increases parallel computation because the input sequence (e.g. a sentence) is broken down into many pairs of words that can be processed at the same time; the order of the words can be disregarded.
- Positional encoding: an additional function needed to make up for disregarding the order of words
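A small sketch of the sinusoidal positional encoding used in the original Transformer paper: each position receives a fixed vector that is added to the corresponding subword embedding, reinjecting word-order information (dimensions chosen arbitrarily for the example).

```python
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len).unsqueeze(1).float()   # (seq_len, 1) positions 0..seq_len-1
    dims = torch.arange(0, d_model, 2).float()         # even embedding dimensions
    angles = pos / 10000 ** (dims / d_model)           # (seq_len, d_model / 2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)                    # even dimensions: sine
    pe[:, 1::2] = torch.cos(angles)                    # odd dimensions: cosine
    return pe

embeddings = torch.randn(6, 16)                   # 6 subwords, model dimension 16
inputs = embeddings + positional_encoding(6, 16)  # word-order information added back in
print(inputs.shape)  # torch.Size([6, 16])
```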