## 6. Text encoding with Transformers NNs

> Explanations, formulas, visualisations:
> - Jay Alammar's blog: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
> - Jurafsky-Martin [10](https://web.stanford.edu/~jurafsky/slp3/10.pdf)
> - Lena Voita's blog: [Sequence to Sequence (seq2seq) and Attention](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html)

### Encoder-decoder framework (similar to earlier seq2seq)

- One huge neural network
- The encoder part provides the representation of the input
- The decoder part generates a new sequence, given the representation of the input (a minimal sketch follows below)
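
As a rough sketch of this division of labour (toy sizes, PyTorch's generic `nn.Transformer` rather than any particular pretrained model, and omitting positional encodings and the output projection):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64               # arbitrary toy sizes
embed = nn.Embedding(vocab_size, d_model)

# nn.Transformer bundles a stack of encoder layers and a stack of decoder layers.
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randint(0, vocab_size, (1, 7))   # input sequence: 7 (sub)word ids
tgt = torch.randint(0, vocab_size, (1, 5))   # output sequence generated so far

# The encoder produces the representation of the input ("memory");
# the decoder builds representations of the output conditioned on that memory.
memory = model.encoder(embed(src))           # shape: (1, 7, 64)
out = model.decoder(embed(tgt), memory)      # shape: (1, 5, 64)
print(memory.shape, out.shape)
```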

### Difference with word2vec: Better, contextual, "dynamic" (sub)word vectors

- We represent sequences of symbols (subwords), not single words
- The result of text encoding with Transformers is a representation for each subword segment in the given sentence. This representation is dynamic because it depends on the sentence, as opposed to "static" representations (e.g. word2vec).
- With the self-attention mechanism we can extract more information from the context and select the most relevant contexts (illustrated below)
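
What "dynamic" means in practice, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (any pretrained encoder would illustrate the same point): the vector for the same word differs from sentence to sentence.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    """Contextual vector of `word` (assumed to be a single subword) in `sentence`."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, hidden_size)
    pos = enc["input_ids"][0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[pos]

a = vector_for("He sat on the river bank.", "bank")
b = vector_for("She deposited cash at the bank.", "bank")

# A static word2vec vector for "bank" would be identical in both sentences;
# here the two contextual vectors differ.
print(torch.cosine_similarity(a, b, dim=0))
```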

### Similar to word2vec: Training with self-supervision

- Masked language modelling as the training goal (objective, task)
- Cross-entropy (comparing probability distributions) as the loss function (a worked sketch follows)
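
A small sketch of the objective, again assuming `bert-base-uncased`: one position is masked, the model predicts a probability distribution over the vocabulary for that position, and cross-entropy compares it with the true token.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

enc = tok("The cat sat on the [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = mlm(**enc).logits                   # (1, seq_len, vocab_size)

mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero().item()
target = torch.tensor([tok.convert_tokens_to_ids("mat")])    # the "true" token

# Cross-entropy between the predicted distribution at the masked position
# and the one-hot distribution of the true token.
loss = F.cross_entropy(logits[0, mask_pos].unsqueeze(0), target)
print(loss.item())
print(tok.decode([logits[0, mask_pos].argmax().item()]))     # model's best guess
```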

### Reasons for the large number of parameters

- Instead of extracting one vector per word (as in word2vec), we store and reuse the whole network, all of its weights
- Multi-head attention: the attention mechanism is repeated several times, with different parameter initialisations
- Stacked encoder layers (self-attention + FFNN): the whole encoding process is repeated several times to achieve good results (a rough count is given below)
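
A rough count for one common encoder, assuming `bert-base-uncased` (12 stacked encoder layers, 12 attention heads, hidden size 768):

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

total = sum(p.numel() for p in model.parameters())
one_layer = sum(p.numel() for p in model.encoder.layer[0].parameters())
embeddings = sum(p.numel() for p in model.embeddings.parameters())

print(f"total parameters:  {total:,}")        # roughly 110 million
print(f"one encoder layer: {one_layer:,}")    # roughly 7 million, repeated 12 times
print(f"embedding tables:  {embeddings:,}")   # roughly 24 million
```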

### Difference with previous seq2seq: Generalised attention

(More details in the following lectures)

- The notion of attention comes from encoder-decoder RNNs built for machine translation: it allows the decoder to select the most relevant encoder states when generating the output.
- Generalised as self-attention, this mechanism finds the most relevant contexts for encoding the input.
- It increases the scope for parallel computation because the input sequence (e.g. a sentence) is broken down into many pairs of words; we can disregard the order of words.
- Positional encoding: an additional function needed to make up for disregarding the order of words (both are sketched below)
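
A compact sketch of both ideas with toy dimensions: scaled dot-product self-attention (here without the learned query/key/value projections, just to show the mechanism) and the sinusoidal positional encoding of the original Transformer.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a sequence X of shape (n, d)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)     # relevance score for every pair of positions
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
    return weights @ X                # each position: weighted mix of all positions

def positional_encoding(n, d):
    """Sinusoidal positional encoding: re-injects word-order information."""
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

n, d = 5, 8                           # toy sequence length and embedding size
X = np.random.randn(n, d)             # stand-in for subword embeddings
out = self_attention(X + positional_encoding(n, d))
print(out.shape)                      # (5, 8): one contextual vector per position
```

Note that all pairwise scores come out of a single matrix product, which is what makes the computation easy to parallelise.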

### Difference with word2vec, similar to previous seq2seq: Subword tokenization

(More details in the following lectures)

- Control over the size of the vocabulary
- Dealing with unknown words (example below)
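
A quick illustration with the `bert-base-uncased` WordPiece tokenizer (an assumed example; BPE, WordPiece and similar schemes all behave this way): the vocabulary size is fixed in advance, and unseen words are split into known subword pieces instead of becoming unknown tokens.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tok.vocab_size)                       # fixed, chosen in advance (30522)
print(tok.tokenize("unhappiness"))          # frequent words stay (mostly) whole
print(tok.tokenize("transformerization"))   # a made-up word is split into known
                                            # subword pieces, never an [UNK]
```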

--------------