
EssayAuthorshipDetector

EssayAuthorshipDetector is a deep learning project designed to classify whether an essay was written by a student or generated by a large language model (LLM). This model was developed as part of the Kaggle competition LLM - Detect AI Generated Text. The project combines advanced natural language processing and machine learning techniques to achieve robust authorship classification.

For a detailed explanation of the methodologies, preprocessing steps, model architecture, and training process, please refer to the report included in this repository.

Table of Contents

  1. Problem Statement
  2. Dataset
  3. Data Preprocessing
  4. Model Architecture
  5. Training and Experiments
  6. Results
  7. Conclusion

Problem Statement

The increasing use of large language models in content generation has created a need for tools that distinguish human-written text from AI-generated text. This project addresses that challenge with a binary classifier that identifies the origin of an essay, a capability with clear applications in education and beyond.

Dataset

The model was trained and evaluated on the DAIGT-V4-TRAIN-DATASET from Kaggle. The dataset comprises 73,573 text samples:

  • 46,203 AI-generated texts
  • 27,370 human-written texts

Text lengths mostly range between 300 and 400 words. Only the text and class-label columns were used for training.

Figure: Class Distribution

Data Preprocessing

To convert text into a numerical representation suitable for model training, two distinct approaches were used:

1. TF-IDF Encoding

  • Texts were transformed into 20,000-dimensional vectors using bigrams and the TF-IDF (Term Frequency-Inverse Document Frequency) technique.
  • This approach focuses on statistical regularities in the text (see the sketch below).
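
A minimal sketch of this encoding step, assuming scikit-learn's `TfidfVectorizer` (the exact n-gram range and other parameters are assumptions; the report documents the configuration actually used):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# train_texts / test_texts are hypothetical lists of essay strings.
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),   # include bigrams, as described above
    max_features=20_000,  # 20,000-dimensional output vectors
)
X_train = vectorizer.fit_transform(train_texts)  # fit on training essays only
X_test = vectorizer.transform(test_texts)        # reuse the fitted vocabulary
```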

2. Sequence Model Encoding

  • Texts were tokenized into word-level sequences and truncated to a length of 350 tokens.
  • Pre-trained GloVe embeddings were used to represent each word, combined with positional embeddings to preserve word order (see the sketch below).
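
A rough sketch of this encoding, assuming Keras' `TextVectorization` layer and a GloVe file such as `glove.6B.100d.txt` (the vocabulary size and embedding dimension are assumptions):

```python
import numpy as np
import tensorflow as tf

MAX_LEN = 350  # truncation length described above

# Word-level tokenization, padded/truncated to MAX_LEN tokens.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=50_000, output_sequence_length=MAX_LEN)
vectorizer.adapt(train_texts)  # train_texts: hypothetical list of essays

# Build an embedding matrix from the pre-trained GloVe vectors.
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *vec = line.split()
        glove[word] = np.asarray(vec, dtype="float32")

vocab = vectorizer.get_vocabulary()
embedding_matrix = np.zeros((len(vocab), 100))
for i, word in enumerate(vocab):
    if word in glove:
        embedding_matrix[i] = glove[word]  # out-of-vocabulary rows stay zero
```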

Preprocessing Pipeline

Model Architecture

Two models were used in this project:

1. TF-IDF Classifier

  • A fully connected neural network with dropout layers to prevent overfitting.
  • Processes TF-IDF-encoded text and outputs class probabilities (an illustrative definition follows the figure).

Figure: TF-IDF Classifier Architecture
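
An illustrative Keras definition of such a classifier (layer widths and dropout rates are placeholders, not the tuned values from the report):

```python
import tensorflow as tf

tfidf_classifier = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20_000,)),          # TF-IDF vector
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),                    # regularization
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(AI-generated)
])
```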

2. Attentioned CNN-BiLSTM

  • Combines convolutional layers for extracting local features (n-grams) and a BiLSTM for capturing long-range dependencies.
  • Incorporates GloVe embeddings, positional encoding, and self-attention for context-aware token representations (see the sketch after the figure).

Figure: Attentioned CNN-BiLSTM Architecture
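
A rough functional-API sketch of this architecture, reusing `vocab` and `embedding_matrix` from the encoding sketch above (filter counts, kernel size, attention heads, and the exact layer ordering are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN, EMB_DIM = 350, 100

class PositionalEmbedding(layers.Layer):
    """Adds a learned positional embedding to preserve word order."""
    def __init__(self, max_len, dim, **kwargs):
        super().__init__(**kwargs)
        self.pos = layers.Embedding(max_len, dim)

    def call(self, x):
        return x + self.pos(tf.range(tf.shape(x)[1]))

tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(
    len(vocab), EMB_DIM, trainable=False,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
)(tokens)                                                        # GloVe vectors
x = PositionalEmbedding(MAX_LEN, EMB_DIM)(x)
x = layers.Conv1D(128, 5, padding="same", activation="relu")(x)  # local n-gram features
x = layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)     # self-attention
x = layers.Bidirectional(layers.LSTM(64))(x)                     # long-range dependencies
output = layers.Dense(1, activation="sigmoid")(x)
cnn_bilstm = tf.keras.Model(tokens, output)
```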

Final Ensemble Model

  • Merges the predictions of the TF-IDF Classifier and the Attentioned CNN-BiLSTM using a dense layer trained to perform weighted averaging (a wiring sketch follows the figure).

Figure: Ensemble Model
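
One way to realize such a head, sketched under the assumption that each base model contributes a single probability (the exact wiring is not specified here):

```python
import tensorflow as tf

p_tfidf = tf.keras.Input(shape=(1,))  # TF-IDF classifier probability
p_seq = tf.keras.Input(shape=(1,))    # Attentioned CNN-BiLSTM probability
merged = tf.keras.layers.Concatenate()([p_tfidf, p_seq])
# A single dense unit learns how to weight the two base predictions.
ensemble = tf.keras.Model(
    [p_tfidf, p_seq],
    tf.keras.layers.Dense(1, activation="sigmoid")(merged))
```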

Training and Experiments

Data Split

  • 80% training data, 20% test data
  • A validation set (20% of the training set) was used for hyperparameter tuning (see the split sketch below).
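
The split can be reproduced along these lines with scikit-learn (stratification and the seed are assumptions):

```python
from sklearn.model_selection import train_test_split

# 80% train / 20% test, then 20% of the training portion for validation.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.20, stratify=y_train, random_state=42)
```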

Hyperparameter Tuning

  • Grid search was conducted with the KerasTuner library (a minimal sketch follows).
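
A minimal KerasTuner grid-search sketch (the search space shown is illustrative, not the grid actually used):

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(20_000,)),
        tf.keras.layers.Dense(hp.Choice("units", [64, 128, 256]),
                              activation="relu"),
        tf.keras.layers.Dropout(hp.Choice("dropout", [0.3, 0.5])),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

tuner = kt.GridSearch(build_model, objective="val_accuracy")
tuner.search(X_train, y_train, validation_data=(X_val, y_val), epochs=5)
```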

Training Process

  • Models were trained with binary cross-entropy loss and the Adam optimizer.
  • Early stopping was employed to prevent overfitting.
  • Accuracy and loss metrics were logged to monitor performance (the setup is sketched below).
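
A sketch of this training setup in standard Keras calls (the patience and epoch count below are assumptions):

```python
import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="binary_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

# fit() logs accuracy and loss per epoch; `history` keeps them for inspection.
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=50, callbacks=[early_stop])
```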

Results

Final Ensemble Model Evaluation

  • ROC Curve:

    Figure: ROC Curve (Ensemble)

  • Precision-Recall Curve:

    Figure: Precision-Recall Curve (Ensemble)

  • Confusion Matrix:

    Figure: Confusion Matrix
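
For reference, curves and a matrix of this kind can be computed with scikit-learn from the ensemble's test-set probabilities (the variable names here are placeholders):

```python
from sklearn.metrics import roc_curve, precision_recall_curve, confusion_matrix

probs = ensemble.predict([p_tfidf_test, p_seq_test]).ravel()  # hypothetical inputs
fpr, tpr, _ = roc_curve(y_test, probs)
precision, recall, _ = precision_recall_curve(y_test, probs)
cm = confusion_matrix(y_test, (probs >= 0.5).astype(int))  # threshold at 0.5
```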

Conclusion

This project demonstrates the potential of deep learning for authorship detection. While the results are promising, future improvements could include:

  • Expanding dataset size and diversity.
  • Using pre-trained language models like BERT or GPT.
  • Exploring alternative embedding techniques.
  • Experimenting with advanced regularization and preprocessing methods.
