This project implements the research methodology described in the following paper:
https://arxiv.org/pdf/1412.7782.pdf
Develop an effective plagiarism detection tool for text-based assignments by comparing the unigram, bigram, and trigram variants of the vector space model with cosine and Jaccard similarity measures.
Requirements:
- Python 2.7
- scikit-learn
- NLTK
Several important files / directories:
- main.py
Main file containing the whole source code
- docs
A directory containing the students' answers. Each answer is stored in a document named assignment_index, where the word assignment is fixed and index is an integer that is incremented each time a new student is added
- combined_docs
All student answers are combined into one document called the MASTER DOCUMENT. The detection process runs on this combined document
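As a rough illustration of how the answers in docs could be merged into the combined document, here is a minimal sketch. It is not taken from main.py: the function name combine_documents, the one-answer-per-line layout, and the output file name in the usage note are assumptions made for this example.

```python
import os
import re


def combine_documents(docs_dir, master_path):
    """Concatenate every assignment_<index> file in docs_dir into one file.

    Hypothetical helper: the real main.py may combine the answers differently.
    """
    pattern = re.compile(r"^assignment_(\d+)")
    indexed = []
    for name in os.listdir(docs_dir):
        match = pattern.match(name)
        if match:
            indexed.append((int(match.group(1)), name))
    # Sort numerically so that assignment_10 comes after assignment_9.
    indexed.sort()
    with open(master_path, "w") as master:
        for _, name in indexed:
            with open(os.path.join(docs_dir, name)) as answer:
                # Assumption: one student answer per line of the master file.
                master.write(answer.read().strip() + "\n")
    return [name for _, name in indexed]
```

For example, combine_documents("docs", "combined_docs/master.txt") would write the answers in index order, where master.txt is a placeholder name chosen for this sketch.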
To run the program, execute the following command:
python main.py
The program performs the following steps:
- Combine the students' answers into one single answer file (MASTER DOCUMENT)
- Extract unique terms (unigrams, bigrams, trigrams) from the MASTER DOCUMENT
- Eliminate stopwords
- Compute the Document Frequency (DF) and Inverse Document Frequency (IDF) for each term
- Compute the TF-IDF weight vector for each document
- Compare each pair of assignments using Cosine Similarity
- Compare each pair of assignments using Jaccard Similarity
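The steps above can be sketched with the standard library alone. This is an illustrative approximation, not the code in main.py: the tiny STOPWORDS set stands in for a full stopword list, the IDF formula log(N / DF) + 1 is an assumed smoothing choice, and all function names are hypothetical.

```python
import math
import re
from itertools import combinations

# Tiny illustrative stopword list (assumption; a real run would use a full list).
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def ngrams(text, n):
    """Return the n-grams of text as space-joined strings, stopwords removed."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if t not in STOPWORDS]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, n):
    """Build one sparse TF-IDF vector (dict: term -> weight) per document."""
    term_lists = [ngrams(doc, n) for doc in docs]
    df = {}  # document frequency of each term
    for terms in term_lists:
        for term in set(terms):
            df[term] = df.get(term, 0) + 1
    total = float(len(docs))
    vectors = []
    for terms in term_lists:
        vec = {}
        for term in terms:
            vec[term] = vec.get(term, 0.0) + 1.0  # raw term frequency
        for term in vec:
            # Assumed IDF variant: log(N / DF) + 1 keeps shared terms non-zero.
            vec[term] *= math.log(total / df[term]) + 1.0
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values())) *
            math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def jaccard(u, v):
    """Jaccard similarity between the term sets of two vectors."""
    a, b = set(u), set(v)
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0


def compare_all(docs, n):
    """Return {(i, j): (cosine, jaccard)} for every pair of documents."""
    vectors = tfidf_vectors(docs, n)
    return {(i, j): (cosine(vectors[i], vectors[j]),
                     jaccard(vectors[i], vectors[j]))
            for i, j in combinations(range(len(docs)), 2)}
```

Calling compare_all(answers, 1), compare_all(answers, 2), and compare_all(answers, 3) on the same list of answer strings gives the unigram, bigram, and trigram scores side by side, which is the comparison the project is built around.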
Albertus Kelvin
Bandung Institute of Technology
Code was developed on January 20th, 2018
Code was made publicly available on January 31st, 2018