6. Text encoding with Transformers NNs

Explanations, formulas, visualisations:

 

Encoder-decoder framework (similar to earlier seq2seq)

  • One huge neural network
  • The encoder part provides the representation of the input
  • The decoder part generates a new sequence, given the representation of the input (a minimal sketch follows this list)
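
A minimal sketch of this two-part structure, using PyTorch's generic nn.Transformer rather than any specific model from the lecture; all sizes are toy values:

```python
import torch
import torch.nn as nn

# One network with two parts: an encoder and a decoder.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 1, 512)   # input sequence: 10 positions, batch of 1, 512-dim vectors
tgt = torch.rand(7, 1, 512)    # output sequence generated so far: 7 positions

memory = model.encoder(src)        # encoder: representation of the input
out = model.decoder(tgt, memory)   # decoder: new sequence, conditioned on that representation

print(memory.shape, out.shape)     # torch.Size([10, 1, 512]) torch.Size([7, 1, 512])
```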

 

Similar to word2vec: Training with self-supervision

  • Masked language modelling as a training goal (objective, task)
  • Cross-entropy (comparing probability distributions) as a loss function (a toy example follows this list)
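
A toy illustration of the objective, with a made-up vocabulary size and random scores: the model predicts a probability distribution over the vocabulary for a masked position, and cross-entropy compares that distribution with the true one.

```python
import torch
import torch.nn.functional as F

vocab_size = 10                      # made-up tiny vocabulary
# Raw scores (logits) over the vocabulary for one masked position,
# e.g. the [MASK] in "The cat [MASK] on the mat".
logits = torch.randn(1, vocab_size)
# Index of the subword that was actually hidden, e.g. "sat".
target = torch.tensor([3])

# Cross-entropy compares softmax(logits) with the true one-hot distribution.
loss = F.cross_entropy(logits, target)
print(loss.item())
```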

 

Difference from word2vec: Better, contextual, "dynamic" (sub)word vectors

  • We basically represent sequences of symbols (subwords), not single words
  • The result of text encoding with Transformers is a representation for each subword segment in the given sentence. This representation is "dynamic" because it depends on the sentence, unlike the "static" representations of word2vec.
  • With the self-attention mechanism we can extract more information from the context and select the most relevant parts of it (see the sketch after this list).
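
A small sketch of what "dynamic" means in practice, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint (any Transformer encoder would do): the same word receives a different vector in different sentences.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(word, sentence):
    """Contextual vector of `word` as it appears in `sentence`."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # one vector per subword
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = vector_for("bank", "She sat on the bank of the river.")
v2 = vector_for("bank", "He deposited money at the bank.")
# The similarity is clearly below 1: "bank" gets a different vector in each sentence.
print(F.cosine_similarity(v1, v2, dim=0).item())
```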

 

Reasons for the large number of parameters

  • Instead of extracting one vector per word (as in word2vec), we store and reuse the whole network with all its weights
  • Multi-head attention: the attention mechanism is repeated several times in parallel, each head with its own parameter initialisation
  • Stacked encoder layers (self-attention + FFNNs): the whole encoding process is repeated several times (layer upon layer) to achieve good results (a parameter-counting sketch follows this list)
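
A rough way to see where the parameters go, using PyTorch's stock encoder layers with BERT-base-like sizes (12 layers, 768 dimensions, 12 heads); embedding tables are not included here:

```python
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072)
encoder = nn.TransformerEncoder(layer, num_layers=12)   # 12 stacked copies

per_layer = sum(p.numel() for p in layer.parameters())
total = sum(p.numel() for p in encoder.parameters())
print(f"one encoder layer: {per_layer:,} parameters")   # roughly 7 million
print(f"12 stacked layers: {total:,} parameters")       # roughly 85 million
```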

 

Subword tokenization: similar to previous seq2seq, different from word2vec

(More details in the following lectures)

  • Control over the size of the vocabulary
  • Dealing with unknown words (see the tokenizer sketch below)
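
A quick way to see subword tokenization in action, again assuming the Hugging Face transformers library and the bert-base-uncased vocabulary (the exact splits depend on the vocabulary the tokenizer was trained with):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["cat", "tokenization", "transformerological"]:
    # Rare or invented words are broken into known subword pieces,
    # so the vocabulary stays fixed and no word is truly "unknown".
    print(word, "->", tok.tokenize(word))
```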

 

Generalised attention: different from previous seq2seq

(More details in the following lectures)

  • The notion of attention comes from encoder-decoder RNNs built for machine translation: it allows the decoder to select the most relevant encoder states when generating the output.
  • Generalised as self-attention, this mechanism allows the model to find the most relevant contexts for encoding the input.
  • It helps increase parallel computation because the input sequence (e.g. a sentence) is broken down into many pairs of words, so we can disregard the order of words.
  • Positional encoding: an additional function needed to make up for disregarding the order of words (both ideas are sketched below)
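
A minimal sketch of both ideas: single-head scaled dot-product self-attention (without the learned query/key/value projections and multiple heads that real models add) and sinusoidal positional encoding. All sizes are toy values.

```python
import math
import torch

def self_attention(x):
    """x: (seq_len, d) matrix of subword vectors."""
    d = x.size(-1)
    scores = x @ x.T / math.sqrt(d)          # relevance of every position to every other
    weights = torch.softmax(scores, dim=-1)  # attention weights, one distribution per position
    return weights @ x                       # context-aware vectors

def positional_encoding(seq_len, d):
    """Sinusoidal position vectors, added to make word order visible again."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d, 2).float()
    angles = pos / (10000 ** (i / d))
    pe = torch.zeros(seq_len, d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

x = torch.randn(5, 16)                           # 5 subwords, 16-dim vectors
out = self_attention(x + positional_encoding(5, 16))
print(out.shape)                                 # torch.Size([5, 16])
```

Note that nothing inside self_attention looks at positions: all word pairs are scored in one matrix product, which is what makes the computation parallel and order-blind; positional_encoding puts the order information back into the input vectors.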