
Commit

lecture 6
tsamardzic committed Oct 26, 2023
1 parent 08bf169 commit aed5e41
Showing 2 changed files with 70 additions and 1 deletion.
69 changes: 69 additions & 0 deletions 6.md
@@ -0,0 +1,69 @@
## 6. Text encoding with Transformers NNs


> Explanations, formulas, visualisations:
> - Jay Alammar's blog: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
> - Jurafsky-Martin [10](https://web.stanford.edu/~jurafsky/slp3/10.pdf)
> - Lena Voita's blog: [Sequence to Sequence (seq2seq) and Attention](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html)
 


### Encoder-decoder framework (similar to earlier seq2seq)

- One huge neural network
- The encoder part provides the representation of the input
- The decoder part generates a new sequence, given the representation of the input
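
A minimal sketch of this encoder-decoder pattern, using PyTorch's `nn.Transformer` module purely for illustration; the sizes, layer counts and random inputs below are made-up assumptions, not values from the lecture:

```python
# One network with two parts: the encoder represents the input sequence,
# the decoder generates a new sequence conditioned on that representation.
import torch
import torch.nn as nn

d_model = 64                  # size of each (sub)word vector (illustrative)
model = nn.Transformer(
    d_model=d_model,
    nhead=4,                  # multihead attention: 4 parallel heads
    num_encoder_layers=2,     # stacked encoder layers
    num_decoder_layers=2,     # stacked decoder layers
    batch_first=True,
)

src = torch.randn(1, 10, d_model)   # encoder input: 10 source subword vectors
tgt = torch.randn(1, 7, d_model)    # decoder input: 7 target subword vectors

out = model(src, tgt)               # decoder output, conditioned on the encoded input
print(out.shape)                    # torch.Size([1, 7, 64])
```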


### Difference from word2vec: Better, contextual, "dynamic" (sub)word vectors

- We represent sequences of symbols (subwords), not single words in isolation
- The result of text encoding with Transformers is a representation for each subword segment in the given sentence. This representation is dynamic because it depends on the sentence, as opposed to "static" representations (e.g. word2vec); see the sketch below.
- With the self-attention mechanism, we can extract more information from the context and select the most relevant context words.
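
A small sketch of what "dynamic" means in practice, assuming the Hugging Face `transformers` library and the public `bert-base-uncased` checkpoint (both are assumptions, used only for illustration): the same word receives a different vector in each sentence it appears in.

```python
# Encode two sentences and compare the vectors produced for the word "bank".
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_of(sentence, word):
    """Return the contextual vector of `word` (assumed to be a single subword in the vocabulary)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # one vector per subword
    position = inputs.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[position]

v1 = vector_of("the bank raised interest rates", "bank")
v2 = vector_of("we walked along the river bank", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))   # below 1: same word, different vectors
```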

 

### Similar to word2vec: Training with self-supervision

- Masked language modelling as a training goal (objective, task)
- Cross-entropy (comparing probability distributions) as a loss function
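
A toy sketch of this objective; the vocabulary, logits and masked position are invented for illustration. The model outputs a probability distribution over the vocabulary for the masked subword, and cross-entropy measures how far that distribution is from the correct answer.

```python
# Masked language modelling in miniature: predict the hidden word, score with cross-entropy.
import torch
import torch.nn.functional as F

vocab = ["[MASK]", "the", "cat", "sat", "mat"]   # toy vocabulary
target = torch.tensor([2])                       # the masked word was "cat" (index 2)

# Pretend these are the model's output scores (logits) for the masked position.
logits = torch.tensor([[0.1, 0.5, 2.0, 0.3, -1.0]])

loss = F.cross_entropy(logits, target)           # equals -log P("cat" | context)
print(loss.item())                               # smaller when the model puts more mass on "cat"
```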

 

### Reasons for the large number of parameters

- Instead of extracting one vector per word (like in word2vec), we store and reuse the whole network with all its weights
- Multihead attention: the attention mechanism is repeated several times in parallel, each head with its own parameter initialisation
- Stacked FFNN encoder layers: the whole encoding process is repeated several times (in a stack of layers) to achieve good results; a rough parameter count is sketched below
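
A back-of-the-envelope sketch of where the parameters go. The sizes below are the published BERT-base ones (12 layers, hidden size 768, FFNN size 3072, ~30k subword vocabulary), used here as an assumed example; biases and layer normalisation are ignored.

```python
# Rough parameter count for a BERT-base-sized Transformer encoder.
d_model, d_ff, n_layers, vocab_size = 768, 3072, 12, 30522

attention_per_layer = 4 * d_model * d_model   # Q, K, V and output projection matrices
ffnn_per_layer = 2 * d_model * d_ff           # the two FFNN weight matrices
embeddings = vocab_size * d_model             # one stored vector per subword type

total = n_layers * (attention_per_layer + ffnn_per_layer) + embeddings
print(f"{total / 1e6:.0f}M parameters")       # about 108M, close to the ~110M usually reported
```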



### Difference from previous seq2seq: Generalised attention

(More details in the following lectures)

- The notion of attention comes from encoder-decoder RNNs built for machine translation: it allows the decoder to select the most relevant encoder states when generating the output.
- Generalised as self-attention, this mechanism allows the model to find the most relevant contexts for encoding the input.
- It increases parallel computation because the input sequence (e.g. a sentence) is broken down into many pairs of words, so the order of words can be disregarded.
- Positional encoding: an additional function needed to make up for disregarding the order of words (see the sketch below)
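
A minimal NumPy sketch of scaled dot-product self-attention and sinusoidal positional encoding, following the formulas in the readings linked above; the matrix sizes and random weights are illustrative assumptions.

```python
# Self-attention: every position attends to every other position, so word order
# is lost unless we add positional information to the input vectors.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X has one row per subword; returns one contextualised vector per subword."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])                                # relevance of every word pair
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V

def positional_encoding(n_positions, d_model):
    """Sinusoidal vectors that reintroduce word order."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d)) + positional_encoding(5, d)     # 5 subwords, dimension 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)                  # (5, 8)
```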

 

### Difference from word2vec, similarity to previous seq2seq: Subword tokenization

(More details in the following lectures)

- Control over the size of the vocabulary
- Dealing with unknown words
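
A short sketch, again assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint, of how a subword tokenizer keeps the vocabulary bounded and splits unseen words into known pieces instead of mapping them to an unknown-word symbol.

```python
# Subword tokenization: a fixed vocabulary of ~30k pieces covers arbitrary words.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(len(tokenizer))                        # about 30k subword types, not millions of word types
print(tokenizer.tokenize("tokenization"))    # split into known pieces, e.g. ['token', '##ization']
print(tokenizer.tokenize("uncopyrightable")) # rare word: several pieces, but no unknown-word token
```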

 




--------------



 
2 changes: 1 addition & 1 deletion README.md
@@ -35,7 +35,7 @@ These notes should be used as a guide for acquiring the most important notions a

#### 5. [History of language modelling](https://tsamardzic.github.io/nlp_intro/5.html)

- #### 6. Language modelling with Transformers NNs
+ #### 6. [Language modelling with Transformers NNs](https://tsamardzic.github.io/nlp_intro/6.html)

#### 7. Attention in language modelling

