
Natural Language Processing

TODO: Organize the code like https://github.com/graykode/nlp-tutorial

Post with useful links: Transformers are GNNs

Index

  • Theory
  • Applications
  • Resources


Theory

🛠 Pipeline

  1. Preprocess
    • Tokenization: Split the text into sentences and the sentences into words.
    • Lowercasing: Usually done during tokenization.
    • Punctuation removal: Remove tokens like ., ,, :. Usually done during tokenization.
    • Stopwords removal: Remove words like and, the, him. Common in the past, rarely needed today.
    • Lemmatization: Reduce words to their dictionary form (lemma): organizes, will organize, organizing → organize. This is more accurate.
    • Stemming: Crudely chop word endings to reduce related forms (democratic, democratization) to a common stem. This is faster.
  2. Extract features
    • Document features
      • Bag of Words (BoW): Counts how many times a word appears in a text. (It can be normalized by text length.) See the scikit-learn sketch after this pipeline.
      • TF-IDF: Measures the relevance of each word in a document, not just its frequency like BoW.
      • N-gram: Probability of N words together.
      • Sentence and document vectors. paper2014, paper2017
    • Word features
      • Word Vectors: Unique representation for every word (independent of its context).
        • Word2Vec: By Google in 2013
        • GloVe: By Stanford
        • FastText: By Facebook
      • Contextualized Word Vectors: Good for polysemic words (the meaning depends on the context).
        • CoVe: By Salesforce in 2017
        • ELMo: Done with bidirectional LSTMs. By the Allen Institute in 2018
        • Transformer encoder: Done with self-attention. ⭐
  3. Build model
    • Bag of Embeddings
    • Linear algebra/matrix decomposition
      • Latent Semantic Analysis (LSA) that uses Singular Value Decomposition (SVD).
      • Non-negative Matrix Factorization (NMF)
      • Latent Dirichlet Allocation (LDA): Good for BoW
    • Neural nets
      • Recurrent NNs (LSTM, GRU)
      • Transformers (BERT, GPT, ...) ⭐
    • Hidden Markov Models
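
Both document-feature extractors above (BoW and TF-IDF) can be tried in a few lines with scikit-learn. A minimal sketch, assuming scikit-learn is installed and using a made-up three-sentence corpus:

```python
# Minimal sketch of the "extract features" step with scikit-learn.
# The toy corpus below is invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "I like apples",
    "I like oranges",
    "I do not like broccoli",
]

# Bag of Words: raw counts of each word per document.
bow = CountVectorizer()
counts = bow.fit_transform(corpus)           # shape: (3 documents, vocab size)
print(bow.get_feature_names_out())           # (get_feature_names() on old scikit-learn)
print(counts.toarray())

# TF-IDF: counts reweighted by how rare each word is across documents.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)
print(weights.toarray().round(2))
```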

Others

  • Regular expressions (Regex): Find patterns.
  • Parse trees: Syntax of a sentence.

🔤 Tokenization: The input representation

  • Character tokenization
  • Subword tokenization: The best option, used in recent models (see the BPE sketch below). ⭐
  • Word tokenization: Used in traditional NLP.

Figure: BPE tokenization of the word _subwords
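
A minimal sketch of the merge loop behind BPE, loosely following Sennrich et al. 2016; the word frequencies are invented (not any model's real vocabulary) and only show how frequent character pairs get merged into subwords:

```python
# Toy BPE (byte-pair encoding) learning: repeatedly merge the most frequent
# adjacent symbol pair.  "_" marks the word boundary.
from collections import Counter

vocab = {("_", "s", "u", "b", "w", "o", "r", "d", "s"): 5,
         ("_", "s", "u", "b", "s", "e", "t"): 3,
         ("_", "w", "o", "r", "d"): 8}

def most_frequent_pair(vocab):
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(vocab, pair):
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:          # replace the pair by one symbol
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for step in range(6):                             # learn 6 merges
    pair = most_frequent_pair(vocab)
    vocab = merge(vocab, pair)
    print(step, pair, list(vocab))
```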

N-gram

Probability of a sequence of N words, estimated from counts in a corpus. Read this.

Example

Toy corpus:

  • <start> I like apples <end>
  • <start> I like oranges <end>
  • <start> I do not like broccoli <end>

Then:

  • P(<start> I like) = P(I | <start>) * P(like | I) = 1 * 0.66 = 0.66
  • P(<start> I like apples) = P(I | <start>) * P(like | I) * P(apples | like) = 1 * 0.66 * 0.33 ≈ 0.22 (see the code sketch below)
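
The bigram counts behind these numbers can be reproduced with a few lines of Python (the corpus is the toy one above):

```python
# Bigram model: P(w2 | w1) = count(w1 w2) / count(w1 followed by anything).
from collections import Counter

corpus = [
    "<start> I like apples <end>",
    "<start> I like oranges <end>",
    "<start> I do not like broccoli <end>",
]

bigrams, unigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens[:-1])                  # words that have a successor
    bigrams.update(zip(tokens, tokens[1:]))

def p(word, prev):
    return bigrams[(prev, word)] / unigrams[prev]

print(p("I", "<start>"))                          # 3/3 = 1.0
print(p("like", "I"))                             # 2/3 ≈ 0.66
print(p("apples", "like"))                        # 1/3 ≈ 0.33
print(p("I", "<start>") * p("like", "I") * p("apples", "like"))   # ≈ 0.22
```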

🔮 Recurrent & Convolutional models

  • RNN: Recurrent Nets. Tokens processed sequentially, no parallelism ☹️
    • GRU
    • LSTM
      • AWD-LSTM: regular LSTM with weight-dropped regularization and tuned dropout hyper-parameters.
  • CNN: Convolutional Nets. Tokens processed in parallel 🙂

  • Tricks
    • Teacher forcing: Feed the decoder the correct previous word instead of its own predicted previous word (at the beginning of training).
    • Attention: Learns weights to perform a weighted average of the word embeddings (see the sketch below).
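
A minimal numpy sketch of the attention trick above (scaled dot-product self-attention over random token embeddings; shapes and data are illustrative, not from any specific model):

```python
# Attention as a weighted average: weights come from query/key similarity.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity between queries and keys
    weights = softmax(scores, axis=-1)    # one weight distribution per query
    return weights @ V, weights           # weighted average of the values

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 16))     # 5 tokens, 16-dim embeddings
out, w = attention(embeddings, embeddings, embeddings)   # self-attention
print(out.shape, w.shape)                 # (5, 16) (5, 5)
print(w.sum(axis=-1))                     # each row of weights sums to 1
```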

🔮 Transformer models

|              | Self-Attention (Transformer Encoder) | Masked Self-Attention (Transformer Decoder) |
|--------------|--------------------------------------|---------------------------------------------|
| Advantage    | Context on both sides                | Auto-regression                             |
| Pretraining  | Bidirectional LM (better)            | Unidirectional LM                           |
| Examples     | BERT                                 | GPT, GPT-2                                  |
| Best one     | ALBERT?                              | T5, Meena?                                  |
| Applications | Classification                       | Text generation                             |

Notes

  • Auto-regression means the final output token is fed back as input.
  • The original Transformer combines both an encoder and a decoder; most later models keep only one of the two.
  • Transformer-XL is a recurrent transformer decoder.
  • XLNet has both context on both sides and auto-regression.
  • 🤗 Huggingface transformers is a package with pretrained transformer models (PyTorch & TensorFlow).
| Model           | Creator    | Date      | Brief description                   | 🤗 |
|-----------------|------------|-----------|-------------------------------------|----|
| 1st Transformer | Google     | Jun. 2017 | Transformer encoder & decoder       |    |
| ULMFiT          | Fast.ai    | Jan. 2018 | Regular LSTM                        |    |
| ELMo            | AllenNLP   | Feb. 2018 | Bidirectional LSTM                  |    |
| GPT             | OpenAI     | Jun. 2018 | Transformer decoder on LM           | ✔  |
| BERT            | Google     | Oct. 2018 | Transformer encoder on MLM (& NSP)  | ✔  |
| Transformer-XL  | Google     | Jan. 2019 | Recurrent transformer decoder       | ✔  |
| XLM/mBERT       | Facebook   | Jan. 2019 | Multilingual LM                     | ✔  |
| Transf. ELMo    | AllenNLP   | Jan. 2019 |                                     |    |
| GPT-2           | OpenAI     | Feb. 2019 | Good text generation                | ✔  |
| ERNIE           | Baidu      | Apr. 2019 |                                     |    |
| ERNIE           | Tsinghua   | May 2019  | Transformer with Knowledge Graph    |    |
| XLNet           | Google     | Jun. 2019 | BERT + Transformer-XL               | ✔  |
| RoBERTa         | Facebook   | Jul. 2019 | BERT without NSP                    | ✔  |
| DistilBERT      | Hug. Face  | Aug. 2019 | Compressed BERT                     | ✔  |
| MiniBERT        | Google     | Aug. 2019 | Compressed BERT                     |    |
| MultiFiT        | Fast.ai    | Sep. 2019 | Multi-lingual ULMFiT (QRNN) post    |    |
| CTRL            | Salesforce | Sep. 2019 | Controllable text generation        | ✔  |
| MegatronLM      | Nvidia     | Sep. 2019 | Big models with parallel training   |    |
| ALBERT          | Google     | Sep. 2019 | Reduced BERT params (param sharing) | ✔  |
| DistilGPT-2     | Hug. Face  | Oct. 2019 | Compressed GPT-2                    | ✔  |
| T5              | Google     | Oct. 2019 | Text-to-Text Transfer Transformer   | ✔  |
| ELECTRA         | ?          | Dec. 2019 | An efficient LM pretraining         |    |
| Reformer        | Google     | Jan. 2020 | The Efficient Transformer           |    |
| Meena           | Google     | Jan. 2020 | A Human-like Open-Domain Chatbot    |    |
Model sizes (number of parameters) by number of layers:

| Model           | 2L  | 3L  | 6L  | 12L  | 18L  | 24L  | 36L  | 48L   | 54L   | 72L   |
|-----------------|-----|-----|-----|------|------|------|------|-------|-------|-------|
| 1st Transformer |     |     | yes |      |      |      |      |       |       |       |
| ULMFiT          |     | yes |     |      |      |      |      |       |       |       |
| ELMo            | yes |     |     |      |      |      |      |       |       |       |
| GPT             |     |     |     | 110M |      |      |      |       |       |       |
| BERT            |     |     |     | 110M |      | 340M |      |       |       |       |
| Transformer-XL  |     |     |     |      | 257M |      |      |       |       |       |
| XLM/mBERT       |     |     |     | Yes  |      | Yes  |      |       |       |       |
| Transf. ELMo    |     |     |     |      |      |      |      |       |       |       |
| GPT-2           |     |     |     | 117M |      | 345M | 762M | 1542M |       |       |
| ERNIE           |     |     |     | Yes  |      |      |      |       |       |       |
| XLNet           |     |     |     | 110M |      | 340M |      |       |       |       |
| RoBERTa         |     |     |     | 125M |      | 355M |      |       |       |       |
| MegatronLM      |     |     |     |      |      | 355M |      |       | 2500M | 8300M |
| DistilBERT      |     |     | 66M |      |      |      |      |       |       |       |
| MiniBERT        |     | Yes |     |      |      |      |      |       |       |       |
| ALBERT          |     |     |     |      |      |      |      |       |       |       |
| CTRL            |     |     |     |      |      |      |      | 1630M |       |       |
| DistilGPT-2     |     |     | 82M |      |      |      |      |       |       |       |

URGENT:

Transformer architecture

Transformer input

  1. Tokenizer: Create subword tokens. Methods: BPE...
  2. Embedding: Create vectors for each token. Sum of:
    • Token Embedding
    • Positional Encoding: Information about tokens order (e.g. sinusoidal function).
  3. Dropout
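
A numpy sketch of steps 2-3 (token embedding plus sinusoidal positional encoding); the vocabulary size, dimensions and token ids are made up for illustration:

```python
# Transformer input: subword ids -> token embedding + positional encoding.
import numpy as np

vocab_size, d_model, max_len = 1000, 64, 50
rng = np.random.default_rng(0)
token_embedding = rng.normal(scale=0.02, size=(vocab_size, d_model))

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                     # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                          # even dimensions
    pe[:, 1::2] = np.cos(angles)                          # odd dimensions
    return pe

token_ids = np.array([12, 7, 256, 3])                     # output of the tokenizer
x = token_embedding[token_ids] + positional_encoding(max_len, d_model)[:len(token_ids)]
# (in training, dropout would be applied to x here)
print(x.shape)                                            # (4, 64)
```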

Transformer blocks (6, 12, 24,...)

  1. Normalization
  2. Multi-head attention layer (with a left-to-right attention mask)
    • Each attention head uses self attention to process each token input conditioned on the other input tokens.
    • The left-to-right attention mask ensures that each position only attends to the positions that precede it (to its left).
  3. Normalization
  4. Feed forward layers:
    1. Linear Hโ†’4H
    2. GeLU activation func
    3. Linear 4Hโ†’H
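
A compact numpy sketch of one such block (pre-norm, masked multi-head self-attention, then the H→4H→H feed-forward with GELU); the weights are random stand-ins, so this only shows the data flow, not a trained model:

```python
# One GPT-style transformer block: norm -> masked multi-head attention -> norm -> FFN.
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def block(x, params, n_heads=4):
    T, H = x.shape
    hd = H // n_heads
    # 1-2) normalization + masked multi-head self-attention
    h = layer_norm(x)
    q, k, v = (h @ params[w] for w in ("wq", "wk", "wv"))
    q, k, v = (m.reshape(T, n_heads, hd).transpose(1, 0, 2) for m in (q, k, v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)        # (heads, T, T)
    mask = np.triu(np.ones((T, T)), k=1) * -1e9            # left-to-right mask
    att = softmax(scores + mask) @ v                       # (heads, T, hd)
    att = att.transpose(1, 0, 2).reshape(T, H) @ params["wo"]
    x = x + att                                            # residual connection
    # 3-4) normalization + feed forward H -> 4H -> H
    h = layer_norm(x)
    x = x + gelu(h @ params["w1"]) @ params["w2"]          # residual connection
    return x

rng = np.random.default_rng(0)
H = 64
params = {k: rng.normal(scale=0.02, size=s) for k, s in
          {"wq": (H, H), "wk": (H, H), "wv": (H, H), "wo": (H, H),
           "w1": (H, 4 * H), "w2": (4 * H, H)}.items()}
tokens = rng.normal(size=(10, H))                          # 10 token embeddings
print(block(tokens, params).shape)                         # (10, 64)
```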

Transformer output

  1. Normalization
  2. Output embedding
  3. Softmax
  4. Label smoothing: Ground truth → 90% on the correct word, and the remaining 10% divided over the other words.
  • Lowest layers: morphology
  • Middle layers: syntax
  • Highest layers: Task-specific semantics
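
A tiny sketch of the label-smoothing target described above (90% on the correct word, 10% spread over the rest); the vocabulary size is arbitrary:

```python
# Build a label-smoothed target distribution for one position.
import numpy as np

def smoothed_target(correct_id, vocab_size, confidence=0.9):
    target = np.full(vocab_size, (1 - confidence) / (vocab_size - 1))
    target[correct_id] = confidence
    return target

t = smoothed_target(correct_id=3, vocab_size=10)
print(t.round(3))        # 0.9 on index 3, ~0.011 everywhere else
print(t.sum())           # 1.0
```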

👨🏻‍🏫 Transfer Learning

| Step | Task                       | Data                                  | Who does this?        |
|------|----------------------------|---------------------------------------|-----------------------|
| 1    | Language Model Pretraining | 📚 Large text corpus (e.g. Wikipedia) | 🏭 Google or Facebook |
| 2    | Language Model Fine-tuning | 📗 Your domain text corpus            | 💻 You                |
| 3    | Your supervised task       | 📗🏷️ Your labeled domain text         | 💻 You                |

📉 Losses

  • Language modeling: we project the hidden-state on the word embedding matrix to get logits and apply a cross-entropy loss on the portion of the target corresponding to the gold reply.
  • Next-sentence prediction: we pass the hidden-state of the last token (the end-of-sequence token) through a linear layer to get a score and apply a cross-entropy loss to classify correctly a gold answer among distractors.
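
A numpy sketch of the language-modeling loss as described above (project the hidden states onto the word-embedding matrix to get logits, then cross-entropy against the gold tokens); all tensors are random stand-ins:

```python
# Language-modeling loss with tied word embeddings (toy shapes, random data).
import numpy as np

rng = np.random.default_rng(0)
T, H, V = 6, 64, 1000                          # tokens, hidden size, vocab size
hidden = rng.normal(size=(T, H))               # transformer hidden states
embedding_matrix = rng.normal(size=(V, H))     # word embedding matrix
gold = rng.integers(0, V, size=T)              # gold next-token ids

logits = hidden @ embedding_matrix.T           # (T, V)
m = logits.max(-1, keepdims=True)              # numerically stable log-softmax
log_probs = logits - m - np.log(np.exp(logits - m).sum(-1, keepdims=True))
loss = -log_probs[np.arange(T), gold].mean()   # cross-entropy
print(loss)
```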

📏 Metrics

| Score      | Description                                                             | Interpretation         |
|------------|-------------------------------------------------------------------------|------------------------|
| Perplexity | Language modeling (LM)                                                  | The lower the better.  |
| GLUE       | An average of different scores for NLU                                  | The higher the better. |
| BLEU       | For translation. Compares generated and reference sentences (N-grams). | The higher the better. |
| RACE       | ReAding Comprehension dataset collected from English Examinations       | The higher the better. |
| SQuAD      | Stanford Question Answering Dataset                                     | The higher the better. |

BLEU limitation

"He ate the apple" & "He ate the potato" has the same BLEU score.

BLEU at your own risk
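
This can be checked with NLTK's sentence_bleu. The reference sentence below ("He ate the fruit") is invented for the illustration: it contains neither "apple" nor "potato", so both candidates get identical n-gram overlap even though one is semantically much closer:

```python
# BLEU cannot tell a semantically close candidate from a far one here.
from nltk.translate.bleu_score import sentence_bleu

reference = [["he", "ate", "the", "fruit"]]
good = ["he", "ate", "the", "apple"]          # semantically close to the reference
bad = ["he", "ate", "the", "potato"]          # semantically far from the reference

weights = (0.5, 0.5)                          # BLEU-2, so the toy scores are non-zero
print(sentence_bleu(reference, good, weights=weights))   # ~0.71
print(sentence_bleu(reference, bad, weights=weights))    # ~0.71 (identical)
```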


Applications

| Application | Description | Type |
|---|---|---|
| 🏷️ Part-of-speech tagging (POS) | Identify nouns, verbs, adjectives, etc. (aka grammatical tagging). | 🔤 |
| 📍 Named entity recognition (NER) | Identify names, organizations, locations, medical codes, etc. | 🔤 |
| 👦🏻❓ Coreference resolution | Identify several occurrences of the same person/object, like he, she. | 🔤 |
| 🔍 Text categorization | Identify topics present in a text (sports, politics, etc.). | 🔤 |
| ❓ Question answering | Answer questions about a given text (SQuAD, DROP datasets). | 💭 |
| 👍🏼👎🏼 Sentiment analysis | Positive or negative comment/review classification. | 💭 |
| 🔮 Language Modeling (LM) | Predict the next word. Unsupervised. | 💭 |
| 🔮 Masked Language Modeling (MLM) | Predict the omitted words. Unsupervised. | 💭 |
| 📗→📄 Summarization | Create a short version of a text. | 💭 |
| 🈯→🆗 Translation | Translate into a different language. | 💭 |
| 🆓→🆒 Chatbot | Interact in a conversation. | 💭 |
| 💁🏻→🔠 Speech recognition | Speech to text. See the AUDIO cheatsheet. | 🗣️ |
| 🔠→💁🏻 Speech generation | Text to speech. See the AUDIO cheatsheet. | 🗣️ |

  • 🔤: Natural Language Processing (NLP)
  • 💭: Natural Language Understanding (NLU)
  • 🗣️: Speech and sound (speak and listen)

🈯 Translation

📋 Summarization

🤖 Chatbot

Model backbone: Transformer decoder like GPT or GPT2 (pretrained for LM).

Input data

  1. Persona: One or several personality sentences. (BLUE)
  2. History: The history of the dialog. (PINK)
  3. Reply: The tokens of the current answer. (GREEN)

Embeddings

  • Word embedding: Information about word semantics.
  • Position embedding: Information about word order.
  • Segment embedding: Information about the segment type (personality, history or reply).
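
A pure-Python sketch of how the three input segments and the three parallel embedding sequences could be laid out; the special tokens <persona>, <user> and <bot> are placeholder names for illustration, not the exact delimiters of any particular model:

```python
# Lay out the chatbot input as three parallel sequences: words, positions, segments.
persona = ["i", "like", "playing", "football"]
history = ["hello", "how", "are", "you", "?"]
reply   = ["i", "am", "fine", "thanks", "."]

words    = ["<persona>"] + persona + ["<user>"] + history + ["<bot>"] + reply
segments = (["<persona>"] * (len(persona) + 1)
            + ["<user>"] * (len(history) + 1)
            + ["<bot>"] * (len(reply) + 1))
positions = list(range(len(words)))

# Each token's final embedding would be the sum of the three embeddings:
# word_emb[words[i]] + pos_emb[positions[i]] + segment_emb[segments[i]]
for w, s, p in zip(words, segments, positions):
    print(f"{p:2d}  {s:10s}  {w}")
```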

Double Heads Model for multi-task loss

  • One head for language modeling loss.
  • Other head for next-sentence classification loss.

References
