-
GLUE (General Language Understanding Evaluation benchmark)
-
MRPC (Microsoft Research Paraphrase Corpus)
The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically retrieved from online news sources, with human annotations indicating whether the sentences in the pair are semantically equivalent.
-
CoLA (The Corpus of Linguistic Acceptability)
The Corpus of Linguistic Acceptability consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each example is a sequence of words annotated with whether it is a grammatical English sentence.
-
QQP (Quora Question Pairs)
The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether the two questions in a pair are semantically equivalent.
-
STS-B (The Semantic Textual Similarity Benchmark)
The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is human-annotated with a similarity score from 0 to 5.
-
PAWS (Paraphrase Adversaries from Word Scrambling)
This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification. The dataset has two subsets, one based on Wikipedia and the other one based on the Quora Question Pairs (QQP) dataset.
-
PAWS-X
This dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki.
-
PIT (Paraphrase and Semantic Similarity in Twitter)
Paraphrase and Semantic Similarity in Twitter (PIT) presents a constructed Twitter Paraphrase Corpus that contains 18,762 sentence pairs.
-
SciTail
The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis.
-
TURL (Twitter News URL Corpus)
Requires Access
The Twitter News URL Corpus is the largest human-labeled paraphrase corpus to date, with 51,524 sentence pairs, and the first cross-domain benchmark for automatic paraphrase identification.
-
CQADupStack
CQADupStack is a benchmark dataset for community question-answering research. It contains threads from twelve StackExchange subforums, annotated with duplicate question information. Pre-defined training and test splits are provided, both for retrieval and classification experiments, to ensure maximum comparability between different studies using the set. Furthermore, it comes with a script to manipulate the data in various ways.
-
Paralex
Paralex learns from a collection of 18 million question-paraphrase pairs scraped from WikiAnswers.
-
Benchmark for Neural Paraphrase Detection
This is a benchmark for neural paraphrase detection, to differentiate between original and machine-generated content.
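As a rough illustration of why the PAWS entry above stresses structure and word order, the following minimal sketch scores a sentence pair with plain word-set overlap (Jaccard similarity). The `jaccard` helper and the example pair are assumptions made for illustration; they are not taken from any of the listed datasets or their tooling.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercased word sets of two sentences."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty sentences are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical pair in the style of a word-scrambling adversarial example:
# the same words appear in both sentences, but the meaning differs.
s1 = "Flights from New York to Florida"
s2 = "Flights from Florida to New York"

score = jaccard(s1, s2)
print(score)  # 1.0: word overlap alone cannot tell these two apart
```

A bag-of-words score treats this pair as identical even though the two sentences describe opposite routes, which is the kind of case that word-order-sensitive datasets are meant to expose.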
otanadzetsotne/paraphrase_datasets
About
Paraphrase Datasets: contains research and links to datasets that can be used to train sentence paraphrase models.