Paraphrasing datasets

GLUE (General Language Understanding Evaluation benchmark)

Home page ->

tensorflow ->

github ->
MRPC (Microsoft Research Paraphrase Corpus)

The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically retrieved from online news sources, with human annotations indicating whether the sentences in the pair are semantically equivalent.

Home page ->

Download ->
CoLA (The Corpus of Linguistic Acceptability)

The Corpus of Linguistic Acceptability consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each example is a sequence of words annotated with whether it is a grammatical English sentence.

Home page ->

Download ->
QQP (Quora Question Pairs)

The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether the two questions in a pair are semantically equivalent.

Kaggle ->
STS (The Semantic Textual Similarity Benchmark)

The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is human-annotated with a similarity score from 0 to 5.
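
Because the labels are graded scores rather than binary classes, predictions are usually compared to the gold scores with a correlation measure. The following is a minimal sketch, assuming Pearson correlation as the metric and a model that outputs values in [0, 1]; the scores below are made up for illustration and are not taken from the dataset.

```python
import math


def normalize(score: float, lo: float = 0.0, hi: float = 5.0) -> float:
    """Rescale a 0-5 similarity score to the [0, 1] range."""
    return (score - lo) / (hi - lo)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gold annotations on the 0-5 scale.
gold = [0.0, 2.5, 5.0, 4.0]
# Hypothetical model predictions in [0, 1].
pred = [0.1, 0.4, 0.9, 0.8]

gold_scaled = [normalize(g) for g in gold]
r = pearson(gold_scaled, pred)
print(gold_scaled, round(r, 3))
```

Rescaling matters when the scores are used as regression targets; Pearson correlation itself is unchanged by it, so the metric can be computed on either scale.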

Home page ->

Download ->
PAWS (Paraphrase Adversaries from Word Scrambling)

This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification. The dataset has two subsets, one based on Wikipedia and the other one based on the Quora Question Pairs (QQP) dataset.
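
To illustrate why word order matters here, the sketch below compares an order-insensitive word-overlap score with a bigram-overlap score on a word-swapped pair. The sentence pair and both helper functions are written for this example and are assumptions, not part of the PAWS code or data files.

```python
def word_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over word sets; ignores word order entirely."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def bigram_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over adjacent word pairs; sensitive to word order."""
    tokens_a = a.lower().split()
    tokens_b = b.lower().split()
    bigrams_a = set(zip(tokens_a, tokens_a[1:]))
    bigrams_b = set(zip(tokens_b, tokens_b[1:]))
    if not bigrams_a and not bigrams_b:
        return 1.0
    return len(bigrams_a & bigrams_b) / len(bigrams_a | bigrams_b)


# Illustrative pair: identical vocabulary, different meaning.
s1 = "flights from New York to Florida"
s2 = "flights from Florida to New York"

print(word_jaccard(s1, s2))    # 1.0: a bag-of-words view sees no difference
print(bigram_jaccard(s1, s2))  # 0.25: word order exposes the difference
```

A classifier that relies only on lexical overlap would label this pair a paraphrase, which is the kind of error the dataset is designed to expose.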

paper ->

github ->

Download (Wiki) (labeled) ->

Download (Wiki) (labeled, swap only) ->
PAWS-X

This dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki.

github ->

Download ->
PIT (Paraphrase and Semantic Similarity in Twitter)

Paraphrase and Semantic Similarity in Twitter (PIT) presents a constructed Twitter Paraphrase Corpus that contains 18,762 sentence pairs.

github ->
SciTail

The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis.

Home page ->

Paper ->

Download ->
TURL (Twitter News URL Corpus)

Requires Access

The Twitter News URL Corpus is a human-labeled paraphrase corpus of 51,524 sentence pairs, described by its authors as the largest such corpus at the time of release and as the first cross-domain benchmark for automatic paraphrase identification.

github ->
CQADupStack

CQADupStack is a benchmark dataset for community question-answering research. It contains threads from twelve StackExchange subforums, annotated with duplicate question information. Pre-defined training and test splits are provided, both for retrieval and classification experiments, to ensure maximum comparability between different studies using the set. Furthermore, it comes with a script to manipulate the data in various ways.

Home page ->

github ->

Download ->
Paralex

Paralex learns from a collection of 18 million question-paraphrase pairs scraped from WikiAnswers.

Home page ->

Download ->
Benchmark for Neural Paraphrase Detection

This is a benchmark for neural paraphrase detection, to differentiate between original and machine-generated content.

Home page ->

Download ->
