NLP

  • ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations (2020)

    • Proposes two parameter-reduction techniques for BERT and achieves better performance:
      • Factorized embedding parameterization: decomposing the embedding matrix into two smaller matrices (see the parameter-count sketch below)
      • Cross-layer parameter sharing: sharing parameters across FFN layers and attention layers
    • Instead of BERT's next-sentence prediction, proposes sentence-order prediction
      • two consecutive segments from the same document as the positive example
      • the swapped version of the positive example as the negative
    • Overall: with fewer parameters than BERT-large (about 70% of them), improvements on SQuAD, MNLI, SST-2, RACE
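
A quick way to see the effect of factorized embedding parameterization is to count parameters. The sketch below compares a full V x H embedding table with the V x E + E x H factorization; the vocabulary, hidden, and embedding sizes are illustrative assumptions roughly matching the BERT-base / ALBERT setup, not figures from the paper.

```python
# Illustrative parameter count for ALBERT-style factorized embedding parameterization.
# V, H, E are assumptions (roughly BERT-base vocabulary/hidden size, ALBERT-style E).
V = 30_000   # vocabulary size
H = 768      # hidden size of the Transformer layers
E = 128      # low-rank embedding size

full_embedding = V * H            # BERT-style: one V x H embedding table
factorized = V * E + E * H        # ALBERT-style: V x E table, then E x H projection

print(f"V x H table      : {full_embedding:,} parameters")   # 23,040,000
print(f"V x E + E x H    : {factorized:,} parameters")        # 3,938,304
print(f"reduction factor : {full_embedding / factorized:.1f}x")
```
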
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)

    • Language representation model based on a multi-layer bidirectional Transformer encoder
    • Two steps: pre-training, fine-tuning
    • Pre-training: train the model on unlabeled data over two tasks
      • Masked LM: randomly mask tokens and predict the masked ones (see the masking sketch below)
      • Next Sentence Prediction: to capture sentence relations, predict whether the second sentence actually follows the first
    • Fine-tuning: for each task, plug in the task-specific inputs/outputs and fine-tune all parameters
      • QA: question and passage are packed into a single input sequence; tested on SQuAD v1.1 and v2.0
      • GLUE, SWAG
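
As a concrete illustration of the masked LM objective, the sketch below applies BERT's published masking recipe (each token selected with probability 15%; of the selected tokens, 80% become [MASK], 10% a random token, 10% stay unchanged) to a toy sentence. The vocabulary and tokens are made up for the example.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """BERT-style masking: each token is chosen as a prediction target with
    probability 15%; of those, 80% -> [MASK], 10% -> random token, 10% unchanged."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                      # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                masked[i] = mask_token            # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = rng.choice(vocab)     # 10%: replace with a random token
            # else: 10% keep the original token unchanged
    return masked, targets

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"], vocab))
```
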
  • Improving Language Understanding by Generative Pre-Training (2018)

    • Semi-supervised learning for language understanding with multi-layer Transformer decoders
    • Unsupervised pre-training with multi-layer Transformer decoders instead of LSTMs, to capture longer-range linguistic structure
    • Supervised fine-tuning on specific tasks such as QA, NLI, classification, etc.
  • Efficient Estimation of Word Representations in Vector Space (2013)

    • Learning vector representations of words with neural network architectures
    • Two novel architectures are proposed (see the training-pair sketch below):
      • CBOW: predicts the current word based on the context
      • SkipGram: predicts the surrounding words given the current word
    • Tested on several semantic and syntactic tasks
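
The difference between CBOW and Skip-gram is easiest to see in how training pairs are built from the window around each position. The sketch below only generates those (input, target) pairs; the window size and toy sentence are illustrative, and the actual models then learn embeddings from such pairs.

```python
def training_pairs(tokens, window=2):
    """Generate (input, target) training pairs for CBOW and Skip-gram from one sentence."""
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        cbow.append((context, center))                 # CBOW: context words -> current word
        skipgram.extend((center, c) for c in context)  # Skip-gram: current word -> each context word
    return cbow, skipgram

cbow, skipgram = training_pairs("the quick brown fox jumps".split())
print(cbow[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram[:3])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
```
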

Machine Translation

  • Neural machine translation by jointly learning to align and translate (2016)

    • Previous models: RNN encoder-decoder
      • An encoder reads the input into a fixed-length vector c
      • A decoder predicts the next word given the context vector c and the previously generated words
    • Proposed model: bidirectional RNN encoder with an attention-based decoder
      • Encoder: BiRNN; the annotation of each word is the concatenation of the forward and backward hidden states
      • Decoder: RNN with attention
        • The probability of each target word y_i is conditioned on a distinct context vector c_i
        • c_i depends on the sequence of annotations, each of which contains information about the whole input sequence
        • Alignment model: the score is based on the decoder's previous hidden state and the annotation of each input word (see the sketch below)
    • Experiment: WMT'14 English-French
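
The core of the paper is how the context vector c_i is built from the encoder annotations: one alignment score per annotation, a softmax over the scores, then a weighted sum. The numpy sketch below follows that recipe with an additive (MLP) scoring function; the annotations, decoder state, and weight matrices W, U, v are random stand-ins with illustrative dimensions, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, m = 6, 8, 8            # source length, annotation dim, decoder hidden dim (illustrative)
h = rng.normal(size=(T, n))  # encoder annotations h_1..h_T (forward + backward states concatenated)
s_prev = rng.normal(size=m)  # previous decoder hidden state s_{i-1}

# Additive alignment model: e_ij = v^T tanh(W s_{i-1} + U h_j)
W = rng.normal(size=(m, m)); U = rng.normal(size=(m, n)); v = rng.normal(size=m)
e = np.tanh(s_prev @ W.T + h @ U.T) @ v            # shape (T,): one score per annotation

alpha = np.exp(e - e.max()); alpha /= alpha.sum()  # softmax -> attention weights alpha_ij
c_i = alpha @ h                                    # context vector: weighted sum of annotations

print(alpha.round(3), c_i.shape)                   # weights sum to 1; c_i has shape (n,)
```
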
  • Pointing the Unknown Words (2016)

    • Proposed method: attention-based model with two softmax layers to deal with rare/unknown words
    • Baseline: Neural Translation Model with attention
    • Pointer Softmax (PS):
      • can predict whether pointing is necessary at each step
      • can point to any location of the context sequence (whose length varies)
      • Two softmax layers (see the sketch below):
        • Shortlist layer: typical softmax layer over a shortlist vocabulary
        • Location layer: a pointer network that points to a location in the context
      • At each time step, if the model chooses the shortlist layer a word is generated; if it chooses the location layer a context word's location is obtained
      • A switching network, an MLP producing a binary variable, decides which layer to use
    • Experiments: summarization with Gigaword, translation with Europarl
    • Slight improvements on both tasks
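
A minimal numpy sketch of the pointer-softmax idea: a switch probability decides how much probability mass goes to the shortlist softmax versus the location (pointer) softmax over source positions. The scores and the switch logit here are random stand-ins for the real decoder states; sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

vocab_size, src_len = 10, 7                        # shortlist size and source length (illustrative)
shortlist_scores = rng.normal(size=vocab_size)     # decoder scores over the shortlist vocabulary
location_scores = rng.normal(size=src_len)         # pointer scores over source positions

p_shortlist = softmax(shortlist_scores)
p_location = softmax(location_scores)

# Switching network output: probability of generating from the shortlist (vs. pointing).
z = 1.0 / (1.0 + np.exp(-rng.normal()))            # sigmoid of a (random stand-in) switch logit

# Final distribution over [shortlist words | source positions]; it still sums to 1.
p_final = np.concatenate([z * p_shortlist, (1 - z) * p_location])
print(p_final.round(3), p_final.sum())             # sums to 1.0
```
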

Text Summarization

  • Get To The Point: Summarization with Pointer-Generator Networks (2017)

    • Pointer generator model with coverage
    • Baseline: a sequence-to-sequence model with attention
    • The pointer-generator model is a hybrid between the baseline and a pointer network (see the sketch below)
    • A coverage mechanism is added to overcome the repetition problem of seq2seq models
    • Dataset: CNN/Daily Mail
    • The pointer-generator outperforms the baseline and previous abstractive models, but extractive models still score higher on ROUGE
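
A simplified numpy sketch of one decoding step of a pointer-generator: the final distribution mixes the vocabulary distribution (weighted by p_gen) with the attention distribution copied onto source tokens (weighted by 1 - p_gen), over a vocabulary extended with the source's out-of-vocabulary words; coverage is the running sum of past attention, and the per-step coverage loss penalizes re-attending to covered positions. The toy vocabulary, source sentence, and p_gen value are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

vocab = ["the", "cat", "sat", "germany", "<unk>"]   # toy vocabulary (illustrative)
source = ["germany", "beat", "argentina"]           # source tokens; "beat"/"argentina" are OOV here

attn = softmax(rng.normal(size=len(source)))        # attention a_t over source positions (stand-in)
p_vocab = softmax(rng.normal(size=len(vocab)))      # generator distribution over the vocabulary
p_gen = 0.6                                         # generation probability from the switch (stand-in)

# Final distribution: extend the vocabulary with source OOVs, then mix generation and copying.
extended = vocab + [w for w in source if w not in vocab]
p_final = np.zeros(len(extended))
p_final[:len(vocab)] = p_gen * p_vocab
for pos, word in enumerate(source):                 # add copy probability mass per source position
    p_final[extended.index(word)] += (1 - p_gen) * attn[pos]
print(dict(zip(extended, p_final.round(3))), p_final.sum())   # sums to 1

# Coverage: running sum of past attention; the step's coverage loss penalizes re-attending.
coverage = np.zeros(len(source))
cov_loss = np.minimum(attn, coverage).sum()         # 0 at the first step, grows with repetition
coverage += attn                                    # update coverage for the next step
```
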
  • Abstractive Sentence Summarization with Attentive Recurrent Neural Networks (2016)

    • A convolutional attention-based conditional RNN
    • The model is called Recurrent Attentive Summarizer (RAS)
    • The model can be seen as an extension of Rush et al. 2015 (ABS)
      • Different from ABS, the encoder is a RNN.
    • RAS has a recurrent decoder and an attentive encoder, and uses beam search for decoding
    • Dataset: Gigaword, DUC 2004
    • RAS achieves better results than ABS
  • Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond (2016)

    • Attentional encoder-decoder RNN
    • To handle the softmax bottleneck, the decoder vocabulary is restricted to the words in the source documents of each minibatch (see the sketch below)
    • Feature-rich encoder: TF, IDF, POS, and NER features are added to the word embeddings
    • A switch is added to decide whether to copy a word from the source document or to generate one from the vocabulary
    • Dataset: Gigaword, DUC, CNN/Daily Mail
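
The per-minibatch vocabulary restriction (the large-vocabulary trick) is easy to sketch: the decoder softmax is taken only over words appearing in the batch's source documents, topped up with frequent words until a size cap is reached. The batch contents, frequent-word list, and cap below are illustrative.

```python
def batch_decoder_vocab(source_docs, frequent_words, cap=8):
    """Restrict the decoder vocabulary to the words of this batch's source documents,
    then fill the remaining slots with the most frequent corpus words up to the cap."""
    vocab = []
    for doc in source_docs:
        for w in doc.split():
            if w not in vocab:
                vocab.append(w)
    for w in frequent_words:                  # top up with frequent words
        if len(vocab) >= cap:
            break
        if w not in vocab:
            vocab.append(w)
    return vocab

batch = ["police arrest suspect", "court fines company"]
print(batch_decoder_vocab(batch, ["the", "a", "of", "in", "police"]))
```
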
  • A Neural Attention Model for Abstractive Sentence Summarization (2015)

    • Model generates each word of the summary conditioned on the input sentence.
    • From all possible summaries, the model finds the one with maximum probability given the previously generated words and the input.
    • The model consists of three components: a neural language model, an encoder, and a summary generator
      • Neural Language Model: adapted from standard feed-forward LM.
      • Encoder: Attention-based encoder
      • Generation: beam search (sketched below)
      • Training: loss function is negative log likelihood
    • Experiments: DUC 2003-2004, Gigaword
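
Since decoding is done with beam search, a compact generic sketch is included below: keep the highest-scoring partial summaries, extend each with every candidate word, and re-prune at each step. The toy scorer returns a fixed distribution and is only a stand-in for the ABS model's conditional probabilities.

```python
import math

def beam_search(score_fn, vocab, beam_size=2, max_len=3):
    """Generic beam search: keep the beam_size highest-scoring partial sequences,
    extend each with every vocabulary word, and re-prune at every step."""
    beams = [([], 0.0)]                                   # (tokens so far, log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            for word, word_logp in score_fn(tokens, vocab).items():
                candidates.append((tokens + [word], logp + word_logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# Stand-in scorer: a fixed, context-independent distribution (a real model would
# condition on the input sentence and the previously generated words).
def toy_scorer(tokens, vocab):
    probs = {"dog": 0.5, "bites": 0.3, "man": 0.2}
    return {w: math.log(probs[w]) for w in vocab}

print(beam_search(toy_scorer, ["dog", "bites", "man"]))
```
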

Question-Answering / Reading Comprehension

  • R-NET: Machine Reading Comprehension with Self-matching Networks (2017)

    • An end-to-end neural network for reading comprehension and question answering
    • Model consists of:
      • An RNN encoder to build representations for the question and passage: a bi-RNN with GRUs; character and word embeddings are concatenated as the input vectors
      • Gated attention-based RNN: matches the question and passage, generating a question-aware passage representation
      • Self-matching attention: matches the question-aware passage representation against itself (see the simplified sketch below)
      • Output layer: a pointer network predicts the boundary of the answer in the passage
    • Training: initialization with GloVe, 1-layer bi-GRU for character embeddings, 3-layer bi-GRU for encoding the question and passage
    • Datasets: SQuAD, MS MARCO
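
The self-matching idea, stripped to its core: attend from each passage position over the whole passage and concatenate the resulting summary to that position's representation. The numpy sketch below uses plain dot-product attention over random stand-in vectors; R-NET itself uses a gated additive attention whose output feeds another recurrent layer, so this is only a simplified illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
P, d = 5, 4                                    # passage length and representation size (illustrative)
v = rng.normal(size=(P, d))                    # question-aware passage representation v_1..v_P

scores = v @ v.T                               # each position scores every passage position
scores -= scores.max(axis=1, keepdims=True)
alpha = np.exp(scores)
alpha /= alpha.sum(axis=1, keepdims=True)      # row-wise softmax: attention over the whole passage

c = alpha @ v                                  # per-position summary of the passage
self_matched = np.concatenate([v, c], axis=1)  # [v_t ; c_t] would feed the next recurrent layer
print(self_matched.shape)                      # (5, 8)
```
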
  • Attention-over-Attention Neural Networks for Reading Comprehension (2017)

    • Task: cloze-style RC; data comes as (Document, Query, Answer) triples
    • Attention over document-level attention
    • Model consists of:
      • Embedding: shared between document and query; a bi-RNN with GRUs
      • Matching between context vectors: pairwise matching between each document word and each query word via dot product
      • Individual attentions: column-wise softmax over the matching matrix gives a document-level attention for each query word
      • Attention over attention: row-wise softmax gives query-level attentions, which are averaged and used to weight the column-wise attentions (see the sketch below)
      • N-best re-ranking as a post-processing step
    • Datasets: CNN/Daily Mail, Children's Book Test
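
A compact numpy sketch of the attention-over-attention computation: build the pairwise matching matrix, take a column-wise softmax (document-level attention per query word) and a row-wise softmax (query-level attention per document word), average the latter, and use it to weight the former. The sizes and random "hidden states" are illustrative stand-ins for the GRU outputs.

```python
import numpy as np

rng = np.random.default_rng(4)
D, Q, d = 6, 3, 4                            # document length, query length, hidden size (illustrative)
h_doc = rng.normal(size=(D, d))              # stand-in document representations
h_query = rng.normal(size=(Q, d))            # stand-in query representations

M = h_doc @ h_query.T                        # pairwise matching matrix, shape (D, Q)

def softmax(x, axis):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

alpha = softmax(M, axis=0)                   # column-wise: attention over document words per query word
beta = softmax(M, axis=1)                    # row-wise: attention over query words per document word
beta_avg = beta.mean(axis=0)                 # averaged query-level attention, shape (Q,)

s = alpha @ beta_avg                         # attention over attention: final document-level attention
print(s.shape, s.sum())                      # (6,) and ~1.0
```
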
  • Machine Comprehension Using Match-LSTM and Answer Pointer (2017)

    • Data: a passage, embedded into a d x P matrix where P is the passage length; a question, embedded into a d x Q matrix where Q is the question length and d is the embedding dimension. The answer can be:
      • a sequence of integers indicating the positions of the answer's words in the passage ==> Sequence Model
      • two integers indicating the start and end positions of the answer in the passage ==> Boundary Model (see the sketch below)
    • Model has 3 layers: LSTM preprocessing (for embeddings), Match-LSTM (captures the degree of matching between a passage token and a question token), Answer Pointer (based on Pointer Networks)
    • Experiments: initialization with GloVe; embeddings are not updated during training
    • Dataset: SQuAD ==> Results: F1 77%, Exact Match 67.6%
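
A small sketch of one common way to decode the Boundary Model: given per-position start and end scores from the Answer Pointer (random stand-ins here), pick the (start, end) pair with maximal joint probability subject to end >= start. This is only an illustration of span selection, not the paper's exact search procedure.

```python
import numpy as np

rng = np.random.default_rng(5)
P = 8                                         # passage length (illustrative)
start_scores = rng.normal(size=P)             # stand-in Answer Pointer scores for the start position
end_scores = rng.normal(size=P)               # stand-in Answer Pointer scores for the end position

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

p_start, p_end = softmax(start_scores), softmax(end_scores)

# Boundary model decoding: the (start, end) pair with maximal joint probability, end >= start.
best = max(((s, e) for s in range(P) for e in range(s, P)),
           key=lambda se: p_start[se[0]] * p_end[se[1]])
print("answer span:", best)
```
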
  • Pointer Networks (2015)

    • PtrNet is a variation of sequence-to-sequence models with attention
    • Baseline: seq2seq and input-attention models
    • The output of PtrNet is discrete, and its length depends on the input's length (see the sketch below)
    • Problems: Convex Hull, Delaunay Triangulation, TSP
    • Additional Info: Introduction to pointer networks
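
The defining property of a pointer network is that the attention weights over input positions are themselves the output distribution, so the output "vocabulary" grows and shrinks with the input. The sketch below shows one decoding step with a plain dot-product score over random stand-in encoder/decoder states; the actual Ptr-Net uses a learned additive attention.

```python
import numpy as np

rng = np.random.default_rng(6)

def pointer_step(inputs_enc, query):
    """One pointer-network decoding step: the attention weights over the input
    positions ARE the output distribution, so its size equals the input length."""
    scores = inputs_enc @ query                 # one score per input position
    z = np.exp(scores - scores.max())
    return z / z.sum()

d = 4
for n in (5, 9):                                # two inputs of different lengths
    inputs_enc = rng.normal(size=(n, d))        # stand-in encoder states for n input elements
    query = rng.normal(size=d)                  # stand-in decoder state
    dist = pointer_step(inputs_enc, query)
    print(n, dist.shape, int(dist.argmax()))    # output distribution size tracks input length
```
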