Natural language inference is the task of determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise".
Example:
Premise | Label | Hypothesis |
---|---|---|
A man inspects the uniform of a figure in some East Asian country. | contradiction | The man is sleeping. |
An older and younger man smiling. | neutral | Two men are smiling and laughing at the cats playing on the floor. |
A soccer game with multiple males playing. | entailment | Some men are playing a sport. |
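To make the three-way label scheme concrete, below is a minimal sketch that classifies the entailment example above with a publicly released RoBERTa checkpoint fine-tuned on MultiNLI (`roberta-large-mnli` via the Hugging Face `transformers` library; the checkpoint and library choice are illustrative, not an exact reproduction of any leaderboard entry):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Publicly released RoBERTa checkpoint fine-tuned on MultiNLI (illustrative choice).
tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

# Encode the premise/hypothesis pair as a single sequence-pair input.
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# id2label for this checkpoint maps to CONTRADICTION / NEUTRAL / ENTAILMENT.
print(model.config.id2label[logits.argmax(dim=-1).item()])  # ENTAILMENT
```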
The Stanford Natural Language Inference (SNLI) Corpus contains around 550k hypothesis/premise pairs. Models are evaluated based on accuracy.
State-of-the-art results can be seen on the SNLI website.
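Since the benchmarks below all report accuracy, evaluation reduces to exact label match over the test set. A minimal sketch, assuming the `snli` dataset identifier on the Hugging Face hub, where pairs without a gold label are marked `-1` and are conventionally dropped:

```python
from datasets import load_dataset

snli = load_dataset("snli")                               # splits: train / validation / test
test = snli["test"].filter(lambda ex: ex["label"] != -1)  # drop pairs without a gold label

# Integer labels in this release: 0 = entailment, 1 = neutral, 2 = contradiction.
def accuracy(predictions, gold):
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Trivial constant baseline, just to exercise the metric.
print(accuracy([0] * len(test), test["label"]))           # ~1/3 on the balanced test set
```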
The Multi-Genre Natural Language Inference (MultiNLI) corpus contains around 433k hypothesis/premise pairs. It is similar to the SNLI corpus, but covers a range of genres of spoken and written text and supports cross-genre evaluation. The data can be downloaded from the MultiNLI website.
Public leaderboards for in-genre (matched) and cross-genre (mismatched) evaluation are available, but entries do not correspond to published models. A sketch for loading both evaluation sets follows the results table below.
Model | Matched | Mismatched | Paper / Source | Code |
---|---|---|---|---|
RoBERTa (Liu et al., 2019) | 90.8 | 90.2 | RoBERTa: A Robustly Optimized BERT Pretraining Approach | Official |
XLNet-Large (ensemble) (Yang et al., 2019) | 90.2 | 89.8 | XLNet: Generalized Autoregressive Pretraining for Language Understanding | Official |
MT-DNN-ensemble (Liu et al., 2019) | 87.9 | 87.4 | Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding | Official |
Snorkel MeTaL (ensemble) (Ratner et al., 2018) | 87.6 | 87.2 | Training Complex Models with Multi-Task Weak Supervision | Official |
Finetuned Transformer LM (Radford et al., 2018) | 82.1 | 81.4 | Improving Language Understanding by Generative Pre-Training | |
Multi-task BiLSTM + Attn (Wang et al., 2018) | 72.2 | 72.1 | GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding | |
GenSen (Subramanian et al., 2018) | 71.4 | 71.3 | Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning | |
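A minimal loading sketch for the two MultiNLI evaluation sets, assuming the `multi_nli` identifier on the Hugging Face hub, where the development sets are exposed as `validation_matched` and `validation_mismatched`:

```python
from datasets import load_dataset

mnli = load_dataset("multi_nli")
matched = mnli["validation_matched"]        # genres also seen in training
mismatched = mnli["validation_mismatched"]  # held-out genres

# Each example carries its genre, which enables the cross-genre breakdown.
print(matched[0]["genre"], matched[0]["premise"][:60])
```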
The SciTail entailment dataset consists of 27k premise-hypothesis pairs. In contrast to SNLI and MultiNLI, it was not crowd-sourced but created from sentences that already exist "in the wild". Hypotheses were created from science questions and the corresponding answer candidates, while relevant web sentences from a large corpus were used as premises. Models are evaluated based on accuracy.
Model | Accuracy | Paper / Source |
---|---|---|
Finetuned Transformer LM (Radford et al., 2018) | 88.3 | Improving Language Understanding by Generative Pre-Training |
Hierarchical BiLSTM Max Pooling (Talman et al., 2018) | 86.0 | Natural Language Inference with Hierarchical BiLSTM Max Pooling |
CAFE (Tay et al., 2018) | 83.3 | A Compare-Propagate Architecture with Alignment Factorization for Natural Language Inference |
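Note that SciTail is a two-way task: because the hypotheses derive from multiple-choice science questions, each pair is labeled either entails or neutral, with no contradiction class. A loading sketch, assuming the `scitail` identifier and its `tsv_format` configuration on the Hugging Face hub:

```python
from datasets import load_dataset

scitail = load_dataset("scitail", "tsv_format")
example = scitail["train"][0]

print(example["premise"])
print(example["hypothesis"])
print(example["label"])  # "entails" or "neutral"; SciTail has no contradiction class
```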