
Commit

lecture 6
tsamardzic committed Oct 26, 2023
1 parent 08bf169 commit aed5e41
Showing 2 changed files with 70 additions and 1 deletion.
69 changes: 69 additions & 0 deletions 6.md
@@ -0,0 +1,69 @@
## 6. Text encoding with Transformers NNs


> Explanations, formulas, visualisations:
> - Jay Alammar's blog: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
> - Jurafsky-Martin [10](https://web.stanford.edu/~jurafsky/slp3/10.pdf)
> - Lena Voita's blog: [Sequence to Sequence (seq2seq) and Attention](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html)
 


### Encoder-decoder framework (similar to earlier seq2seq)

- One huge neural network
- The encoder part provides the representation of the input
- The decoder part generates a new sequence, given the representation of the input
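
A minimal sketch of this encoder-decoder pattern, using PyTorch's `nn.Transformer` module purely for illustration; the sizes, layer counts and random inputs below are made-up assumptions, not values from the lecture:

```python
# One network with two parts: the encoder represents the input sequence,
# the decoder generates a new sequence conditioned on that representation.
import torch
import torch.nn as nn

d_model = 64                  # size of each (sub)word vector (illustrative)
model = nn.Transformer(
    d_model=d_model,
    nhead=4,                  # multihead attention: 4 parallel heads
    num_encoder_layers=2,     # stacked encoder layers
    num_decoder_layers=2,     # stacked decoder layers
    batch_first=True,
)

src = torch.randn(1, 10, d_model)   # encoder input: 10 source subword vectors
tgt = torch.randn(1, 7, d_model)    # decoder input: 7 target subword vectors

out = model(src, tgt)               # decoder output, conditioned on the encoded input
print(out.shape)                    # torch.Size([1, 7, 64])
```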


### Difference from word2vec: Better, contextual, "dynamic" (sub)word vectors

- We represent sequences of symbols (subwords), not single words in isolation
- The result of text encoding with Transformers is a representation for each subword segment in the given sentence. This representation is dynamic because it depends on the sentence, as opposed to "static" representations (e.g. word2vec); see the sketch below.
- With the self-attention mechanism, we can extract more information from the context and select the most relevant context words.
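
A small sketch of what "dynamic" means in practice, assuming the Hugging Face `transformers` library and the public `bert-base-uncased` checkpoint (both are assumptions, used only for illustration): the same word receives a different vector in each sentence it appears in.

```python
# Encode two sentences and compare the vectors produced for the word "bank".
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_of(sentence, word):
    """Return the contextual vector of `word` (assumed to be a single subword in the vocabulary)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # one vector per subword
    position = inputs.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[position]

v1 = vector_of("the bank raised interest rates", "bank")
v2 = vector_of("we walked along the river bank", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))   # below 1: same word, different vectors
```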

 

### Similar to word2vec: Training with self-supervision

- Masked language modelling as a training goal (objective, task)
- Cross-entropy (comparing probability distributions) as a loss function
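
A toy sketch of this objective; the vocabulary, logits and masked position are invented for illustration. The model outputs a probability distribution over the vocabulary for the masked subword, and cross-entropy measures how far that distribution is from the correct answer.

```python
# Masked language modelling in miniature: predict the hidden word, score with cross-entropy.
import torch
import torch.nn.functional as F

vocab = ["[MASK]", "the", "cat", "sat", "mat"]   # toy vocabulary
target = torch.tensor([2])                       # the masked word was "cat" (index 2)

# Pretend these are the model's output scores (logits) for the masked position.
logits = torch.tensor([[0.1, 0.5, 2.0, 0.3, -1.0]])

loss = F.cross_entropy(logits, target)           # equals -log P("cat" | context)
print(loss.item())                               # smaller when the model puts more mass on "cat"
```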

 

### Reasons for the large number of parameters

- Instead of extracting one vector per word (like in word2vec), we store and reuse the whole network with all its weights
- Multihead attention: the attention mechanism is repeated several times in parallel, each head with its own parameter initialisation
- Stacked FFNN encoder layers: the whole encoding process is repeated several times (in a stack of layers) to achieve good results; a rough parameter count is sketched below
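
A back-of-the-envelope sketch of where the parameters go. The sizes below are the published BERT-base ones (12 layers, hidden size 768, FFNN size 3072, ~30k subword vocabulary), used here as an assumed example; biases and layer normalisation are ignored.

```python
# Rough parameter count for a BERT-base-sized Transformer encoder.
d_model, d_ff, n_layers, vocab_size = 768, 3072, 12, 30522

attention_per_layer = 4 * d_model * d_model   # Q, K, V and output projection matrices
ffnn_per_layer = 2 * d_model * d_ff           # the two FFNN weight matrices
embeddings = vocab_size * d_model             # one stored vector per subword type

total = n_layers * (attention_per_layer + ffnn_per_layer) + embeddings
print(f"{total / 1e6:.0f}M parameters")       # about 108M, close to the ~110M usually reported
```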



### Difference from previous seq2seq: Generalised attention

(More details in the following lectures)

- The notion of attention comes from encoder-decoder RNNs built for machine translation: it allows the decoder to select the most relevant encoder states when generating the output.
- Generalised as self-attention, this mechanism allows the model to find the most relevant contexts for encoding the input.
- It increases parallel computation because the input sequence (e.g. a sentence) is broken down into many pairs of words, so the order of words can be disregarded.
- Positional encoding: an additional function needed to make up for disregarding the order of words (see the sketch below)
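
A minimal NumPy sketch of scaled dot-product self-attention and sinusoidal positional encoding, following the formulas in the readings linked above; the matrix sizes and random weights are illustrative assumptions.

```python
# Self-attention: every position attends to every other position, so word order
# is lost unless we add positional information to the input vectors.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X has one row per subword; returns one contextualised vector per subword."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])                                # relevance of every word pair
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V

def positional_encoding(n_positions, d_model):
    """Sinusoidal vectors that reintroduce word order."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d)) + positional_encoding(5, d)     # 5 subwords, dimension 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)                  # (5, 8)
```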

 

### Difference from word2vec, similarity to previous seq2seq: Subword tokenization

(More details in the following lectures)

- Control over the size of the vocabulary
- Dealing with unknown words
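
A short sketch, again assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint, of how a subword tokenizer keeps the vocabulary bounded and splits unseen words into known pieces instead of mapping them to an unknown-word symbol.

```python
# Subword tokenization: a fixed vocabulary of ~30k pieces covers arbitrary words.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(len(tokenizer))                        # about 30k subword types, not millions of word types
print(tokenizer.tokenize("tokenization"))    # split into known pieces, e.g. ['token', '##ization']
print(tokenizer.tokenize("uncopyrightable")) # rare word: several pieces, but no unknown-word token
```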

 




--------------



 
2 changes: 1 addition & 1 deletion README.md
@@ -35,7 +35,7 @@ These notes should be used as a guide for acquiring the most important notions a

#### 5. [History of language modelling](https://tsamardzic.github.io/nlp_intro/5.html)

- #### 6. Language modelling with Transformers NNs
+ #### 6. [Language modelling with Transformers NNs](https://tsamardzic.github.io/nlp_intro/6.html)

#### 7. Attention in language modelling

