This project addresses the challenge of distinguishing AI-generated text from human-written content, amid concerns about academic integrity and misinformation raised by the rise of generative AI and large language models such as ChatGPT and GPT-4. It entails developing and optimizing neural networks, with a focus on fine-tuning three RoBERTa-based transformer models from Hugging Face to accurately identify AI-generated text. This work is aimed at combating AI's potential for spreading false information, ensuring the credibility of online content, promoting responsible AI use, and advancing detection methods in the digital realm.
To run the main notebook, open `main.ipynb` in a Jupyter environment with the following command:

    jupyter notebook main.ipynb
You can also try our fine-tuned and ensembled model directly on Hugging Face:
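The exact ensembling used in the notebook is not described here; one common approach for combining several fine-tuned classifiers is soft voting, i.e. averaging the per-class probabilities of each model. A minimal sketch with hypothetical logits (the function names and example numbers are illustrative, not from the repository):

```python
import numpy as np

def softmax(logits):
    """Convert raw logits to class probabilities (numerically stable)."""
    e = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(per_model_logits):
    """Soft voting: average the class probabilities of several models."""
    probs = [softmax(np.asarray(l, dtype=float)) for l in per_model_logits]
    return np.mean(probs, axis=0)

# Hypothetical logits from three detectors for one sentence,
# class 0 = human-written, class 1 = AI-generated.
logits = [[2.0, 0.5], [1.2, 1.0], [0.3, 1.8]]
avg = ensemble_predict(logits)
label = int(np.argmax(avg))  # 0 -> human-written, 1 -> AI-generated
```

Averaging probabilities rather than hard labels lets a confident model outvote two uncertain ones, which is usually why soft voting is preferred for small ensembles.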
To install the project's dependencies, run the following command:

    pip install -r requirements.txt
The base models download automatically when you run the main notebook. If you prefer to download them manually, click on the model names below:
Alternatively, you can download them from the following Google Drive link:
Additionally, here is a link to the pre-made data we used:
The data used in this project is located in the `Data/` directory. Here's a brief description of each file:

- `ai.csv`: data generated by an AI (GPT-3.5).
- `data-prepare.ipynb`: a Jupyter notebook used for data preparation.
- `final_data.csv`: the final processed data.
- `human.csv`: data written by humans.
- `pre-made.csv`: pre-made data collected from the internet.
- `youtube_comments.csv`: data extracted from YouTube comments.
- `The Da Vinci Code.pdf` and `The Diary of a Young Girl.pdf`: examples of the PDF sources we used.
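The actual preparation steps live in `data-prepare.ipynb`; a minimal sketch of the labeling-and-merging idea, with hypothetical column names and in-memory samples standing in for `ai.csv` and `human.csv`:

```python
import pandas as pd

# Stand-ins for Data/ai.csv and Data/human.csv; the real column
# names are defined in data-prepare.ipynb and may differ.
ai = pd.DataFrame({"text": ["Sample AI sentence.", "Another generated line."]})
human = pd.DataFrame({"text": ["A sentence a person wrote.", "More human text."]})

# Label each source (1 = AI-generated, 0 = human-written), merge, and shuffle.
ai["label"] = 1
human["label"] = 0
final = pd.concat([ai, human], ignore_index=True)
final = final.sample(frac=1, random_state=42).reset_index(drop=True)
# final.to_csv("Data/final_data.csv", index=False)  # matching the repo layout
```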
The `Scripts/` directory contains utility scripts used in this project:

- `utils.py`: utility functions for handling transformer models. It includes functions to download, load, and save transformer models and their tokenizers (`get_model`), make predictions on a batch of sentences (`model_predict`), and evaluate model performance by computing accuracy, recall, precision, and F1-score (`eval_model`).
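The real `eval_model` in `Scripts/utils.py` also runs the model to obtain predictions; the metric computation alone can be sketched with scikit-learn like this (the label arrays below are hypothetical examples, not project results):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def eval_metrics(y_true, y_pred):
    """Compute the four metrics eval_model reports, given true and
    predicted labels (1 = AI-generated, 0 = human-written)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

# Hypothetical labels for five sentences.
metrics = eval_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
```

Reporting precision and recall separately matters here: for a detector, a false positive (flagging a human as AI) and a false negative (missing AI text) have very different costs.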
This project is not open to contributions at this time. It was created by Ruşen Birben and Burak Ercan.