jonas-becker/pd-human-vs-machine-content

Paraphrase Detection: Human vs. Machine Content

This is the official repository for the paper Paraphrase Detection: Human vs. Machine Content.

Setup

We recommend using Python 3.10 for this project.

First install the requirements: pip install -r requirements.txt


To use GloVe and Fasttext, you need to place their corresponding pre-trained word vectors into the models directory.

  • GloVe: Get the glove.6B.11d.txt from here.
  • Fasttext: Get the cc.en.300.bin from here.
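
As a quick sanity check after placing the files, the GloVe text file can be read with a few lines of Python. This is a minimal sketch, assuming the standard GloVe text layout (one token per line followed by its space-separated vector components). The function name `load_glove_vectors` and the `limit` parameter are hypothetical and not part of this repository. The Fasttext `.bin` file is a binary model and would typically be loaded with the `fasttext` package (`fasttext.load_model`) instead.

```python
from pathlib import Path


def load_glove_vectors(path, limit=None):
    """Read a GloVe-style text file: one token per line, followed by its floats.

    Hypothetical helper for illustration; not part of the repository.
    """
    vectors = {}
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            if limit is not None and line_number >= limit:
                break  # only inspect the first `limit` lines
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors
```

Calling the function with the path of the GloVe file inside the models directory and a small `limit` is usually enough to confirm that the file is readable and that every vector has the expected length.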

Experiments

The project includes several scripts, each covering a separate part of the experiments.

  1. Parse datasets from the datasets folder to a unified json format: parse.py
  2. Create the BERT embeddings for text pairs in true_data.json and visualize them with t-SNE: embedding_handler.py
  3. Apply detection methods (training & testing): detect_paraphrases.py
  4. Evaluate the detection results: evaluate.py
  5. Get examples sorted by best / worst / random performance: get_examples.py
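
The scripts listed above are meant to be run in that order. The following is a minimal sketch of chaining them from Python, assuming each script can be started without command-line arguments from the repository root (an assumption; check each script for its actual options). The helper names `build_commands` and `run_pipeline` are hypothetical.

```python
import subprocess
import sys

# Script names and their order are taken from the list above.
PIPELINE = [
    "parse.py",
    "embedding_handler.py",
    "detect_paraphrases.py",
    "evaluate.py",
    "get_examples.py",
]


def build_commands(scripts=PIPELINE, python=sys.executable):
    """Return one command per script, with no extra arguments (an assumption)."""
    return [[python, script] for script in scripts]


def run_pipeline(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it.

    check=True makes subprocess.run raise on a non-zero exit code,
    so the pipeline stops at the first failing step.
    """
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
```

With `dry_run=True` (the default) the commands are only printed, which is a cheap way to review the order before starting a long run with `run_pipeline(build_commands(), dry_run=False)`.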

Datasets

Not all datasets used in the paper are freely available to the public, which is why we do not offer the prediction results on text pairs from these datasets for download. However, you are free to reproduce the experiments with all datasets from the paper once you have obtained access to them.

This study includes twelve datasets (seven human-generated and five machine-generated). For further information, please refer to the paper.

Human-generated datasets: ETPC, QQP, TURL, SaR, MSCOCO, ParaSCI, APH

Machine-generated datasets: MPC, SAv2, ParaNMT-50M, PAWS-Wiki, APT

Results

The results of our experiments are evaluated in the paper linked above. Here we provide additional material that was not included in the final version of the paper.

t-SNE visualizations of each dataset's BERT embeddings

| Dataset | Acquisition Type | Mixed | Paraphrases Only |
| --- | --- | --- | --- |
| APH | Human | Live View | Live View |
| APT | Machine | Live View | Live View |
| ETPC | Human | Live View | Live View |
| MPC | Machine | Live View | Live View |
| MSCOCO | Human | Live View | Live View |
| PAWS-Wiki | Machine | Live View | Live View |
| ParaNMT-50M | Machine | Live View | Live View |
| ParaSCI | Human | Live View | Live View |
| QQP | Human | Live View | Live View |
| SAv2 | Machine | Live View | Live View |
| SaR | Human | Live View | Live View |
| TURL | Human | Live View | Live View |
| *All Datasets* | Mixed | Live View | Live View |

Grid Search Results

We performed a 2-fold randomized grid search with 25 iterations, once per detection method. The grid search results can be found in this directory.
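
To illustrate what a 2-fold randomized search with 25 iterations does, here is a minimal standard-library sketch. It is not the code used for the paper: the function names, the `score_fn` callback, and the parameter grid are hypothetical. With scikit-learn, a comparable setup would typically use `RandomizedSearchCV` with `n_iter=25` and `cv=2`.

```python
import itertools
import random


def two_fold_splits(n_items, rng):
    """Shuffle indices and yield (train, test) pairs for 2-fold cross-validation."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    half = n_items // 2
    first, second = indices[:half], indices[half:]
    yield first, second
    yield second, first


def randomized_search(param_grid, score_fn, n_items, n_iter=25, seed=0):
    """Sample up to n_iter parameter combinations and keep the best mean score.

    score_fn(params, train_idx, test_idx) is a hypothetical callback that fits
    on the training indices and returns a score on the test indices.
    """
    rng = random.Random(seed)
    keys = sorted(param_grid)
    # Enumerate the full grid, then draw a random subset without replacement.
    combos = [dict(zip(keys, values))
              for values in itertools.product(*(param_grid[k] for k in keys))]
    candidates = rng.sample(combos, min(n_iter, len(combos)))
    splits = list(two_fold_splits(n_items, rng))
    results = []
    for params in candidates:
        scores = [score_fn(params, train, test) for train, test in splits]
        results.append((sum(scores) / len(scores), params))
    best_score, best_params = max(results, key=lambda item: item[0])
    return best_params, best_score, results
```

Each sampled combination is scored on both folds and the mean is kept, so at most 25 configurations are evaluated per method regardless of how large the full grid is.
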

One-on-one correlation graphs of detection methods

For a detailed view of each one-on-one correlation, please refer to this directory.

Citation

If you use this repository or our paper for your research work, please cite us in the following way.

@misc{becker2023paraphrase,
      title={Paraphrase Detection: Human vs. Machine Content}, 
      author={Jonas Becker and Jan Philip Wahle and Terry Ruas and Bela Gipp},
      year={2023},
      eprint={2303.13989},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}