Explanations, formulas, visualisations:
- Lena Voita's blog: Text Classification
- Goldberg 9, 10, 11
- Eisenstein 3.4, 6
- Jurafsky-Martin 9
- RNNs are designed to process sequences where the order of symbols is important
- The hidden layer of an RNN models the whole sequence: each token in the input is represented by a hidden state
- The hidden states (often called units) can be seen as the layers of a deep network, connected via the RNN abstraction
- The RNN abstraction: the current state is a function of the current input and the previous state (see the sketch after this group of notes)
- Gating was introduced to control how much history is allowed to pass on to the next state; gated RNN units have more internal structure than the simple RNN unit
- RNNs are unidirectional by design; in NLP they are mostly used bidirectionally: the representations from the left-to-right and right-to-left passes are concatenated
- An RNN can output a symbol at each step or only at the end of the sequence
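A minimal NumPy sketch of the points above; the tanh nonlinearity, the toy shapes, and the reuse of one parameter set for both directions are simplifying assumptions for illustration, not taken from the readings:

```python
import numpy as np

def rnn_forward(X, W_x, W_h, b, h0=None):
    """Simple (Elman) RNN: h_t = tanh(W_x x_t + W_h h_{t-1} + b).
    X has shape (seq_len, input_dim); returns all hidden states, (seq_len, hidden_dim)."""
    h = np.zeros(W_h.shape[0]) if h0 is None else h0
    states = []
    for x_t in X:  # current state = function of current input and previous state
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

# Toy example: 5 tokens, 4-dim embeddings, 3-dim hidden states.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
W_x, W_h, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)

fwd = rnn_forward(X, W_x, W_h, b)               # left-to-right pass
bwd = rnn_forward(X[::-1], W_x, W_h, b)[::-1]   # right-to-left pass, re-aligned
# (in practice each direction has its own parameters)
bi = np.concatenate([fwd, bwd], axis=-1)        # bidirectional states, one per token

per_token_outputs = bi                               # an output at each step, or ...
sequence_vector = np.concatenate([fwd[-1], bwd[0]])  # ... one output for the whole sequence
```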
- The LSTM is one kind of gated unit: each state consists of a hidden component and a memory (also called context) component
- The hidden component is the main output; the memory component is an additional state that supports the gating
- Gates: input, forget, output (see the step-by-step sketch below)
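A NumPy sketch of a single LSTM step with the three gates; the parameter names (W, U, b per gate) and the toy dimensions are assumptions made for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the parameters of the
    input (i), forget (f), output (o) gates and the candidate update (g)."""
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate: how much new input to write
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate: how much memory to keep
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate: how much memory to expose
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])  # candidate update
    c = f * c_prev + i * g        # memory (context) component
    h = o * np.tanh(c)            # hidden component, the main output
    return h, c

# Toy dimensions: 4-dim input, 3-dim hidden/memory state.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 4)) for k in "ifog"}
U = {k: rng.normal(size=(3, 3)) for k in "ifog"}
b = {k: np.zeros(3) for k in "ifog"}
h, c = np.zeros(3), np.zeros(3)
for x_t in rng.normal(size=(5, 4)):   # run over a 5-token sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
```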
- CNNs are designed to identify relevant input features
- Specifically in NLP: relevant segments of text are found with 1D convolutions
- The convolution produces one vector representation for each n-gram
- Each filter is responsible for one cell of that vector, so the dimensionality of the n-gram vector equals the number of filters
- The n-gram representations are then pooled into a single vector representing the whole text
- This vector is the input to a classifier
- n-gram size ≠ filter size! (the sketch below makes the shapes explicit)
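A NumPy sketch of this pipeline with toy shapes (7 tokens, 5-dim embeddings, 6 filters over trigrams); the ReLU and max-pooling choices are common defaults assumed here, not prescribed by the notes:

```python
import numpy as np

def text_cnn(E, filters, bias):
    """E: (seq_len, emb_dim) token embeddings.
    filters: (num_filters, n, emb_dim), each filter spans an n-gram window.
    Returns one vector per n-gram and a pooled vector for the whole text."""
    num_filters, n, emb_dim = filters.shape
    seq_len = E.shape[0]
    ngram_vecs = []
    for start in range(seq_len - n + 1):          # slide the window over the text
        window = E[start:start + n]               # one n-gram, shape (n, emb_dim)
        # each of the num_filters filters produces one cell of the n-gram vector
        v = np.tensordot(filters, window, axes=([1, 2], [0, 1])) + bias
        ngram_vecs.append(np.maximum(v, 0.0))     # ReLU
    ngram_vecs = np.stack(ngram_vecs)             # (num_ngrams, num_filters)
    text_vec = ngram_vecs.max(axis=0)             # max-pool into one vector for the whole text
    return ngram_vecs, text_vec

# Toy example: 7 tokens, 5-dim embeddings, 6 filters over trigrams (n = 3).
rng = np.random.default_rng(0)
E = rng.normal(size=(7, 5))
filters = rng.normal(size=(6, 3, 5))
ngram_vecs, text_vec = text_cnn(E, filters, bias=np.zeros(6))
# text_vec (6-dim, one cell per filter) is what gets fed to the classifier
```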
- Typically, CNNs are better suited to sentence-level tasks (e.g. spam filtering, sentiment analysis), while LSTMs are better suited to token-level tasks (e.g. NER, POS tagging); the sketch below contrasts the two output types
- LSTMs are also often used for sentence-level tasks: the last state serves as a representation of the whole sequence
- The use of CNNs for token-level tasks is possible but not common
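A small sketch of the two usage patterns, assuming some encoder (e.g. the bidirectional RNN or the CNN above) has already produced one vector per token; the classifier weights and sizes are toy assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 6))          # 5 tokens, 6-dim states from some encoder

# Token-level task (e.g. POS tagging, NER): one prediction per hidden state.
W_tag = rng.normal(size=(6, 10))     # 10 tags, toy numbers
tag_probs = softmax(H @ W_tag)       # shape (5, 10): a distribution per token

# Sentence-level task (e.g. sentiment): one prediction from a single summary vector,
# here the last state (pooling over all states is also common).
W_cls = rng.normal(size=(6, 2))      # 2 classes
sent_probs = softmax(H[-1] @ W_cls)  # shape (2,)
```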