- Two novel architectures for computing continuous vector representations of words from large datasets
- Current NLP models treat words as atomic - no notion of similarity between words
- Simplicity rules all, e.g. n-grams, which are just indices over a vocabulary
- Huge limitation if no meaning at all is conveyed in the representation
- Goal: techniques for learning high-quality word vectors from datasets with BILLIONS of words
- Previous state of the art is in the low hundreds of millions of words, with a dimensionality of ~100
- Proposed technique to measure similarity between words
- Moreover, words can have multiple degrees of similarity
- Algebraic operations like
vector('King') - vector('Man') + vector('Woman') ~= vector('Queen')
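A minimal sketch of that analogy arithmetic, assuming a small dictionary of already-trained vectors; the words and numbers below are made-up placeholders, not real embeddings:

```python
import numpy as np

# Hypothetical pre-trained vectors; values are placeholders for illustration only.
vectors = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "man":   np.array([0.70, 0.10, 0.05]),
    "woman": np.array([0.70, 0.15, 0.80]),
    "queen": np.array([0.80, 0.70, 0.85]),
    "apple": np.array([0.10, 0.90, 0.20]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# vector('King') - vector('Man') + vector('Woman') should land closest to vector('Queen')
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # queen
```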
- Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are discarded as they are computationally very expensive
- Maximize the accuracy while minimizing the complexity
- Feedforward NNLM input: the N previous words are encoded using 1-of-V encoding, where V is the size of the vocabulary
- Projected to a projection layer P with dimensionality N × D
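A sketch of how the 1-of-V input and the N × D projection fit together; the vocabulary size, context length, and dimensionality below are illustrative choices, not the paper's:

```python
import numpy as np

V, N, D = 10_000, 4, 300        # vocabulary size, context length, vector dimensionality (illustrative)
rng = np.random.default_rng(0)
P = rng.normal(size=(V, D))     # projection matrix: one D-dimensional row per vocabulary word

context_ids = [12, 847, 3, 991]            # indices of the N previous words
one_hot = np.zeros((N, V))
one_hot[np.arange(N), context_ids] = 1.0   # 1-of-V encoding of each context word

# Multiplying a 1-of-V vector by P is just a row lookup, giving the N x D projection layer.
projected = one_hot @ P                    # shape (N, D)
assert np.allclose(projected, P[context_ids])
```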
- H × V dominates the computation (H being the hidden layer size)
- Q = N × D + N × D × H + H × V
- Hierarchical softmax over a Huffman-coded vocabulary makes this fast, reducing the H × V output term to roughly H × log2(V)
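A quick numeric check of where the cost goes, using ballpark sizes (N = 10, D = 500, H = 500, V = 1,000,000 are illustrative choices, not the paper's exact settings):

```python
import math

N, D, H, V = 10, 500, 500, 1_000_000    # illustrative sizes

projection = N * D                      #       5,000
hidden     = N * D * H                  #   2,500,000
output     = H * V                      # 500,000,000  <- dominates with a full softmax
Q_full = projection + hidden + output

# Hierarchical softmax over a Huffman tree cuts the output term to roughly H * log2(V),
# after which N * D * H becomes the dominant term.
output_hs = H * math.ceil(math.log2(V))   # ~10,000
Q_hs = projection + hidden + output_hs

print(f"{Q_full=:,} {Q_hs=:,}")
```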
- Recurrent NNLM (RNNLM): no need for projection layer P
- Q = H × H + H × V
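The same back-of-the-envelope for the recurrent model, reusing the illustrative H and V from above:

```python
import math

H, V = 500, 1_000_000
Q_rnn    = H * H + H * V                        # full softmax: H * V dominates again
Q_rnn_hs = H * H + H * math.ceil(math.log2(V))  # hierarchical softmax: H * H is now the main cost
print(f"{Q_rnn=:,} {Q_rnn_hs=:,}")
```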
- Continuous Bag Of Words model (CBOW)
- Similar to the feedforward NNLM, but the non-linear hidden layer is removed and the projection layer is shared for all words (word order in the context does not matter); complexity Q = N × D + D × log2(V)
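A minimal sketch of the CBOW forward pass: average the projections of the context words and score every vocabulary word against that average. The plain softmax and the sizes here are simplifications for illustration (the paper uses hierarchical softmax):

```python
import numpy as np

V, D = 10_000, 300                            # illustrative vocabulary and vector sizes
rng = np.random.default_rng(0)
W_in  = rng.normal(scale=0.1, size=(V, D))    # shared projection ("input") vectors
W_out = rng.normal(scale=0.1, size=(V, D))    # output vectors

def cbow_probs(context_ids):
    """Predict the current word from the average of its context word vectors."""
    h = W_in[context_ids].mean(axis=0)        # shared projection: word order is ignored
    scores = W_out @ h                        # one score per vocabulary word
    scores -= scores.max()                    # numerical stability
    p = np.exp(scores)
    return p / p.sum()

probs = cbow_probs([12, 847, 3, 991])         # e.g. two words before and two after
print(probs.shape, probs.sum())               # (10000,) 1.0
```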
- Continuous skip-gram
- Instead of predicting the current word based on context, maximize classification of word based on another word in the same sentence
- Use each current word as input to a log-linear classifier with a continuous projection layer
- Predict words within a certain range C before and after the current word; complexity Q = C × (D + D × log2(V))
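And a sketch of the skip-gram side: each current word is paired with the words within a window around it and used to classify them with a log-linear score. The window size, token ids, and plain dot-product score are illustrative assumptions:

```python
import numpy as np

def skipgram_pairs(token_ids, window=2):
    """Yield (center, context) pairs for every word within `window` positions of the center."""
    for i, center in enumerate(token_ids):
        lo, hi = max(0, i - window), min(len(token_ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, token_ids[j]

V, D = 10_000, 300
rng = np.random.default_rng(0)
W_in, W_out = rng.normal(scale=0.1, size=(2, V, D))   # input (projection) and output vectors

sentence = [5, 12, 847, 3, 991]                       # made-up word indices
pairs = list(skipgram_pairs(sentence, window=2))
print(pairs[:4])                                      # [(5, 12), (5, 847), (12, 5), (12, 847)]

center, ctx = pairs[0]
score = W_in[center] @ W_out[ctx]   # log-linear score for classifying `ctx` given `center`
print(score)                        # training would raise this relative to other words' scores
```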