-
ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations (2020)
- proposed two parameter-reduction techniques for BERT and achieved better performance
- Factorized embedding parameterization: decomposing the embedding matrix into two smaller matrices (see the sketch after this entry)
- Cross-layer parameter sharing: sharing parameters across FFN layers and attention layers
- Instead of BERT's next-sentence prediction, proposed sentence-order prediction (SOP)
- two consecutive segments from the same document as positive
- swapped version of the positive example as negative
- Overall: with fewer parameters (about 70% of BERT-large's), improvements on SQuAD, MNLI, SST-2, and RACE
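- A minimal sketch of the factorized embedding parameterization; V, E, H below are illustrative assumptions (roughly BERT/ALBERT-base sized), not the paper's exact configurations:
```python
# Factorized embedding: instead of one V x H embedding table,
# use a V x E lookup followed by an E x H projection, with E << H.
V, H, E = 30000, 768, 128  # vocab size, hidden size, low-rank embedding size (assumed)

bert_style_params = V * H              # single V x H embedding table
albert_style_params = V * E + E * H    # V x E table + E x H projection

print(f"V x H embedding:        {bert_style_params:,} parameters")   # 23,040,000
print(f"V x E + E x H factored: {albert_style_params:,} parameters") #  3,938,304
```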
-
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
- Language representation model based on a multi-layer bidirectional Transformer encoder
- Two steps: pre-training, fine-tuning
- Pre-training: train the model on unlabeled data
- Masked LM: randomly mask tokens and predict the masked ones (see the sketch after this entry)
- Next Sentence Prediction: to learn sentence relations, predict whether the second sentence actually follows the first
- Fine-tuning: for each task, plug in the task-specific inputs/outputs and fine-tune all parameters
- QA: input the question and passage as a single packed sequence, tested on SQuAD v1.1 and v2.0
- GLUE, SWAG
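- A minimal sketch of the masked-LM corruption described above, assuming the 15% masking rate and the 80/10/10 mask/random/keep split from the paper; the tokens and tiny vocabulary are made up for illustration:
```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:   # ~15% of tokens become prediction targets
            labels.append(tok)
            r = rng.random()
            if r < 0.8:                # 80%: replace with [MASK]
                corrupted.append("[MASK]")
            elif r < 0.9:              # 10%: replace with a random vocabulary token
                corrupted.append(rng.choice(vocab))
            else:                      # 10%: keep the original token
                corrupted.append(tok)
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

tokens = "the cat sat on the mat".split()
print(mask_tokens(tokens, vocab=tokens))
```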
-
Improving Language Understanding by Generative Pre-Training (2018)
- Semi-supervised learning for language understanding with multi-layer Transformer decoders
- Unsupervised pre-training with multi-layer Transformers instead of LSTMs, to capture longer-range linguistic structure
- Supervised fine-tuning on a specific task such as QA, NLI, classification, etc.
-
Efficient Estimation of Word Representations in Vector Space (2013)
- Learning vector representations of words with neural network architectures
- Two novel architectures are proposed:
- CBOW: predicts the current word based on the context
- SkipGram: predicts the surrounding words given the current word (see the sketch after this entry)
- Tested on several semantic and syntactic tasks
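- A minimal sketch of how Skip-gram forms (center, context) training pairs; the window size and toy sentence are assumptions:
```python
def skipgram_pairs(tokens, window=2):
    """For each center word, emit (center, context) pairs within +/- window positions."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "efficient estimation of word representations".split()
print(skipgram_pairs(sentence, window=2))
# CBOW reverses the direction: it predicts the center word from the surrounding context.
```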
-
Neural Machine Translation by Jointly Learning to Align and Translate (2016)
- Previous models: RNN encoder-decoder
- An encoder reads the input into a vector c
- A decoder predicts the next word given the context vector c and previously generated words
- Proposed model: BiRNN encoder with an attention-based RNN decoder
- Encoder: BiRNN. The annotation of each word is the concatenation of the forward and backward hidden states
- Decoder: RNN with attention.
- The probability is conditioned on a distinct vector c_i for each target word y_i
- A vector c_i depends on a sequence of annotations, each annotation contains information about the whole input sequence
- Alignment model: the score is based on the decoder's previous hidden state and the annotation of each input word (see the sketch after this entry)
- Experiment: WMT'14 English-French
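- A minimal numpy sketch of the additive alignment score and the per-step context vector c_i; the dimensions and random weights are assumptions:
```python
import numpy as np

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, att_dim = 5, 8, 6, 4       # toy sizes (assumed)

h = rng.normal(size=(T, enc_dim))               # encoder annotations h_1..h_T (BiRNN states)
s_prev = rng.normal(size=dec_dim)               # previous decoder hidden state s_{i-1}
W = rng.normal(size=(att_dim, dec_dim))
U = rng.normal(size=(att_dim, enc_dim))
v = rng.normal(size=att_dim)

# alignment scores e_ij = v^T tanh(W s_{i-1} + U h_j)
e = np.array([v @ np.tanh(W @ s_prev + U @ h_j) for h_j in h])
alpha = np.exp(e - e.max()); alpha /= alpha.sum()   # softmax over source positions
c_i = alpha @ h                                      # context vector c_i = sum_j alpha_ij h_j
print(alpha.round(3), c_i.shape)
```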
-
Pointing the Unknown Words (2016)
- Proposed method: attention-based model with two softmax layers to deal with rare/unknown words
- Baseline: Neural Translation Model with attention
- Pointer Softmax (PS):
- can predict whether it is necessary to use the pointing mechanism
- can point any location of the context sequence (length varies)
- Two softmax layers:
- Shortlist Layer: Typical softmax layer with shortlist
- Location Layer: Pointer network, points location in the context
- At each time step, if the model chooses the shortlist layer a word is generated; if the model chooses the location layer a context word's location is obtained.
- A switching network, an MLP trained to output a binary variable, decides which layer to use (see the sketch after this entry)
- Experiments: Summarization with Gigaword, Translation with Europarl
- Slight improvements on both tasks
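- A minimal numpy sketch of the two-softmax idea: a switch probability mixes a shortlist (vocabulary) softmax with a pointer distribution over source positions; all numbers are toy assumptions:
```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

shortlist_logits = np.array([1.0, 0.2, -0.5, 0.3])      # scores over the shortlist vocabulary
location_logits = np.array([0.1, 2.0, -1.0])            # scores over source-sequence positions
p_switch = 0.7                                          # P(use shortlist layer), from a small MLP

p_shortlist = p_switch * softmax(shortlist_logits)      # generate a shortlist word...
p_location = (1 - p_switch) * softmax(location_logits)  # ...or point to a source position
print(p_shortlist.round(3), p_location.round(3), p_shortlist.sum() + p_location.sum())  # sums to 1
```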
-
Get To The Point: Summarization with Pointer-Generator Networks (2017)
- Pointer generator model with coverage
- Baseline is a sequence-to-sequence attention model
- The Pointer-Generator model is a hybrid between the baseline and pointer networks (see the sketch after this entry)
- Coverage is added to overcome the repetition problem of seq2seq models
- Dataset: CNN/Daily Mail
- Pointer-Generator is better than the baseline and prior abstractive models, but extractive models still score higher on ROUGE.
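- A minimal numpy sketch of the pointer-generator mixture (a gate p_gen blends generating from the vocabulary with copying via attention) and of the coverage penalty; all values are toy assumptions:
```python
import numpy as np

vocab = ["the", "cat", "sat", "<unk>"]
source = ["cat", "mat"]                            # source words, aligned with the attention below
p_vocab = np.array([0.5, 0.3, 0.15, 0.05])         # generator distribution over the vocabulary
attention = np.array([0.8, 0.2])                   # attention over source positions
p_gen = 0.6                                        # generation probability from a learned gate

# Final distribution: generate from the vocabulary with prob p_gen, copy otherwise.
p_final = dict(zip(vocab, p_gen * p_vocab))
for word, a in zip(source, attention):
    p_final[word] = p_final.get(word, 0.0) + (1 - p_gen) * a
print(p_final)                                     # "mat" is reachable even though it is OOV

# Coverage: running sum of past attention; overlap with the new attention is penalized.
coverage = np.array([0.9, 0.1])                    # attention accumulated over previous steps
cov_loss = np.minimum(coverage, attention).sum()   # discourages re-attending to the same words
print(cov_loss)
```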
-
Abstractive Sentence Summarization with Attentive Recurrent Neural Networks (2016)
- A convolutional attention-based conditional RNN
- The model is called the Recurrent Attentive Summarizer (RAS)
- The model can be seen as an extension of Rush et al. 2015 (ABS)
- Different from ABS, the decoder is an RNN (ABS uses a feed-forward decoder)
- RAS has a recurrent decoder, an attentive encoder, and beam-search decoding
- Dataset: Gigaword, DUC 2004
- RAS achieves better results than ABS
-
Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond (2016)
- Attentional encoder-decoder RNN
- To handle the softmax bottleneck, the decoder vocabulary is restricted to the words of the source documents in each minibatch (see the sketch after this entry)
- Feature-rich encoder: TF, IDF, POS, and NER features added to the word embeddings
- A switch is added to decide whether to point to the source document or to generate from the vocabulary
- Dataset: Gigaword, DUC, CNN/Daily Mail
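- A minimal sketch of the restricted decoder vocabulary described above: for each minibatch, the softmax is taken only over the source-document words, topped up with the most frequent words up to a size limit; the token lists here are assumptions:
```python
def minibatch_decoder_vocab(source_docs, frequent_words, limit=50000):
    """Decoder softmax vocabulary for one minibatch: source words first,
    then frequent corpus words until the size limit is reached."""
    vocab, seen = [], set()
    for doc in source_docs:
        for tok in doc:
            if tok not in seen:
                seen.add(tok)
                vocab.append(tok)
    for tok in frequent_words:
        if len(vocab) >= limit:
            break
        if tok not in seen:
            seen.add(tok)
            vocab.append(tok)
    return vocab

batch = [["markets", "fell", "sharply"], ["rates", "rose"]]
print(minibatch_decoder_vocab(batch, ["the", "a", "fell"], limit=8))
```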
-
A Neural Attention Model for Abstractive Sentence Summarization (2015)
- Model generates each word of the summary conditioned on the input sentence.
- From all possible summaries, the model finds the one with the maximum probability given the input and the previously generated words.
- Model consists of three components: neural language model, encoder, and summary generator
- Neural Language Model: adapted from standard feed-forward LM.
- Encoder: Attention-based encoder
- Generation: beam-search
- Training: the loss function is negative log-likelihood (see the sketch after this entry)
- Experiments: DUC 2003-2004, Gigaword
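- A minimal sketch of the negative log-likelihood objective: the loss of a summary is the sum of -log p(y_t | y_<t, x) over its tokens; the per-step probabilities below are toy assumptions:
```python
import math

# toy per-step probabilities p(y_t | y_<t, x) assigned by some model to the gold summary
gold_token_probs = [0.4, 0.7, 0.9, 0.5]

nll = -sum(math.log(p) for p in gold_token_probs)
print(f"negative log-likelihood of the summary: {nll:.3f}")
# At generation time, beam search keeps the k highest-scoring partial summaries per step.
```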
-
R-NET: Machine Reading Comprehension with Self-matching Networks (2017)
- An end-to-end neural network for reading comprehension and question answering
- Model consists of :
- An RNN encoder to build representations of the question and passage: biRNN with GRU; character- and word-level embeddings are concatenated for the vector representations
- Gated attention-based RNN: matches the question and passage, generating a question-aware passage representation
- Self-matching attention: matches the question-aware passage representation against itself (see the sketch after this entry)
- Output layer: a pointer network predicts the boundary of the answer in the passage
- Training: initialization with GloVe, 1-layer biGRU for character embeddings, 3-layer RNN for word embeddings
- Datasets: SQuAD, MS MARCO
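- A minimal numpy sketch of the self-matching idea: each passage position attends over the whole question-aware passage representation and is summarized by the attended context; R-NET's gate and RNN are omitted, the dot-product scoring is a simplification, and all dimensions are assumptions:
```python
import numpy as np

rng = np.random.default_rng(0)
P, d = 6, 8                                    # passage length, hidden size (assumed)
v = rng.normal(size=(P, d))                    # question-aware passage representation

scores = v @ v.T                               # each position scored against every position
scores -= scores.max(axis=1, keepdims=True)
attn = np.exp(scores)
attn /= attn.sum(axis=1, keepdims=True)        # row-wise softmax: attention over the passage
self_matched = attn @ v                        # attended passage context for each position
print(self_matched.shape)                      # (P, d): the passage matched against itself
```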
-
Attention-over-Attention Neural Networks for Reading Comprehension (2017)
- Task: Cloze-style RC, where the data are (Document, Query, Answer) triples
- Attention over document-level attention (see the sketch after this entry)
- Model consists of:
- Embedding: shared between document and query, biRNN with GRU
- Matching between context vectors: pairwise matching between each document word and each query word via dot product
- Document-to-query attention: column-wise softmax applied on the matching matrix
- Query-to-document attention: row-wise softmax applied on the matching matrix
- N-best reranking
- Datasets: CNN/Daily Mail, Children's Book Test
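- A minimal numpy sketch of attention-over-attention: a pairwise matching matrix, column-wise and row-wise softmax, then the averaged query-to-document attention reweights the document-to-query columns; the dimensions are toy assumptions:
```python
import numpy as np

def softmax(x, axis):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
D, Q, d = 7, 3, 5                       # document length, query length, hidden size (assumed)
doc = rng.normal(size=(D, d))           # contextual embeddings of the document words
qry = rng.normal(size=(Q, d))           # contextual embeddings of the query words

M = doc @ qry.T                         # (D, Q) pairwise matching matrix (dot products)
alpha = softmax(M, axis=0)              # document-to-query attention (column-wise)
beta = softmax(M, axis=1)               # query-to-document attention (row-wise)
beta_avg = beta.mean(axis=0)            # (Q,) query-level importance
s = alpha @ beta_avg                    # (D,) attended document-level attention
print(s.round(3), s.sum())              # a distribution over document words (sums to 1)
```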
-
Machine Comprehension Using Match-LSTM and Answer Pointer (2017)
- Data: a passage, embedded into a d x P matrix where P is the passage length, and a question, embedded into a d x Q matrix where Q is the question length and d is the embedding dimension. The answer can be:
- A sequence of integers which indicate the positions of the answer's words in the passage ==> Sequence Model
- Two integers which indicate the start and end positions of the answer in the passage ==> Boundary Model (see the sketch after this entry)
- Model has 3 layers: LSTM preprocessing (for embeddings), Match-LSTM (degree of matching between a passage token and the question), Answer Pointer (based on Pointer Networks)
- Experiments: initialization with GloVe; embeddings are not updated during training
- Dataset: SQuAD ==> Results: F1 77%, Exact Match 67.6%
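- A minimal sketch of how a boundary model's output can be decoded: given start and end probability distributions over passage positions, pick the span (s, e) with s <= e maximizing p_start[s] * p_end[e]; the distributions below are toy assumptions and this is generic span decoding, not the paper's exact search:
```python
import numpy as np

def best_span(p_start, p_end, max_len=None):
    """Return the (start, end) pair maximizing p_start[s] * p_end[e] with s <= e."""
    best, best_score = (0, 0), -1.0
    for s in range(len(p_start)):
        stop = len(p_end) if max_len is None else min(len(p_end), s + max_len)
        for e in range(s, stop):
            score = p_start[s] * p_end[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score

p_start = np.array([0.1, 0.6, 0.1, 0.1, 0.1])    # toy start distribution over passage tokens
p_end = np.array([0.05, 0.1, 0.2, 0.6, 0.05])    # toy end distribution
print(best_span(p_start, p_end))                 # best span is (1, 3)
```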
-
Pointer Networks (2015)
- PtrNet is a variation of sequence-to-sequence models with attention (see the sketch after this entry)
- Baseline: seq2seq and input-attention models
- The output of PtrNet is discrete and the length depends on the input's length
- Problems: Convex Hull, Delaunay Triangulation, TSP
- Additional Info: Introduction to pointer networks
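- A minimal numpy sketch of the pointer mechanism: attention scores over the encoder inputs are used directly as the output distribution, so the model points at an input position instead of emitting a fixed-vocabulary token; the dimensions and random weights are assumptions:
```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 6                                     # number of input elements, hidden size (assumed)
enc = rng.normal(size=(n, d))                   # encoder states e_1..e_n
dec = rng.normal(size=d)                        # current decoder state
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
v = rng.normal(size=d)

# u_j = v^T tanh(W1 e_j + W2 dec): additive attention score for every input position
u = np.array([v @ np.tanh(W1 @ e_j + W2 @ dec) for e_j in enc])
p = np.exp(u - u.max()); p /= p.sum()           # the output distribution is over input positions
print(p.round(3), "-> points to input index", int(p.argmax()))
```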