EssayAuthorshipDetector is a deep learning project that classifies whether an essay was written by a student or generated by a large language model (LLM). The model was developed for the Kaggle competition *LLM - Detect AI Generated Text*. The project combines natural language processing and deep learning techniques to achieve robust authorship classification.
For a detailed explanation of the methodologies, preprocessing steps, model architecture, and training process, please refer to the report included in this repository.
## Table of Contents

- Problem Statement
- Dataset
- Data Preprocessing
- Model Architecture
- Training and Experiments
- Results
- Conclusion
## Problem Statement

The increasing use of large language models for content generation has created a need for tools that can distinguish human-written text from AI-generated text. This project addresses that challenge by building a binary classifier that identifies the origin of an essay, a capability with direct relevance to education and other settings where text authenticity matters.
## Dataset

The model was trained and evaluated on the DAIGT-V4-TRAIN-DATASET from Kaggle, which comprises 73,573 text samples:
- 46,203 AI-generated texts
- 27,370 human-written texts
Most texts are between 300 and 400 words long; only the text and class-label columns were used for training.
## Data Preprocessing

Two distinct approaches were used to convert the raw text into numerical representations suitable for model training.

### TF-IDF Vectorization

- Texts were transformed into 20,000-dimensional vectors using bigrams and TF-IDF (Term Frequency-Inverse Document Frequency) weighting.
- This approach captures statistical regularities in the text and feeds the TF-IDF Classifier described below.
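As a rough illustration, the following sketch shows how such an encoding could be produced with Keras' `TextVectorization` layer. The layer choice and the toy corpus are assumptions for illustration; the report documents the actual implementation.

```python
import tensorflow as tf

# Toy corpus standing in for the real training essays (illustrative only).
train_texts = [
    "The industrial revolution changed how people lived and worked.",
    "In conclusion, renewable energy offers a sustainable path forward.",
]

# 20,000-dimensional TF-IDF vectors over unigrams and bigrams.
tfidf_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=20_000,       # vocabulary cap = output dimensionality
    ngrams=2,                # include bigrams alongside unigrams
    output_mode="tf_idf",    # TF-IDF weighting instead of raw counts
    pad_to_max_tokens=True,  # always emit 20,000-dim vectors
)

# adapt() builds the vocabulary and computes IDF weights from the corpus.
tfidf_vectorizer.adapt(train_texts)
x_tfidf = tfidf_vectorizer(train_texts)  # shape: (num_samples, 20000)
```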
### Sequence Tokenization with GloVe Embeddings

- Texts were tokenized into word-level sequences and truncated to 350 tokens.
- Each word is represented by a pre-trained GloVe embedding, combined with a positional embedding to preserve word order. This representation feeds the Attentioned CNN-BiLSTM described below.
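A sketch of this pipeline follows, reusing `train_texts` from the previous sketch. The 20,000-word vocabulary and the 100-dimensional `glove.6B.100d` vectors are assumptions; the report states the actual settings.

```python
import numpy as np
import tensorflow as tf

MAX_LEN = 350        # truncation length from the report
VOCAB_SIZE = 20_000  # assumed vocabulary size
EMBED_DIM = 100      # assumed GloVe dimensionality (glove.6B.100d)

# Word-level tokenization, padded/truncated to MAX_LEN tokens.
tokenizer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE, output_sequence_length=MAX_LEN)
tokenizer.adapt(train_texts)

# Load GloVe vectors into a matrix aligned with the tokenizer vocabulary
# (the file path is illustrative).
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *vec = line.split()
        glove[word] = np.asarray(vec, dtype="float32")

embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM), dtype="float32")
for i, word in enumerate(tokenizer.get_vocabulary()):
    if word in glove:
        embedding_matrix[i] = glove[word]

class PositionalEmbedding(tf.keras.layers.Layer):
    """Frozen GloVe token embeddings summed with learned position embeddings."""

    def __init__(self, matrix, max_len):
        super().__init__()
        vocab_size, embed_dim = matrix.shape
        self.token_emb = tf.keras.layers.Embedding(
            vocab_size, embed_dim,
            embeddings_initializer=tf.keras.initializers.Constant(matrix),
            trainable=False)
        self.pos_emb = tf.keras.layers.Embedding(max_len, embed_dim)
        self.max_len = max_len

    def call(self, token_ids):
        positions = tf.range(self.max_len)  # one position index per slot
        return self.token_emb(token_ids) + self.pos_emb(positions)
```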
## Model Architecture

Two base models were trained and then combined in an ensemble.

### TF-IDF Classifier

- A fully connected neural network with dropout layers to prevent overfitting.
- Processes TF-IDF-encoded text and outputs class probabilities.
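A minimal sketch of such a network is shown below; the layer widths and dropout rates are illustrative assumptions, not values from the report.

```python
import tensorflow as tf

# Fully connected classifier over the 20,000-dim TF-IDF vectors.
tfidf_classifier = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20_000,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # regularization against overfitting
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(AI-generated)
])
```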
### Attentioned CNN-BiLSTM

- Combines convolutional layers for extracting local (n-gram) features with a BiLSTM for capturing long-range dependencies.
- Incorporates GloVe embeddings, positional encoding, and self-attention for context-aware token representations.
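The sketch below assembles these pieces with the Keras functional API. The filter count, kernel size, and LSTM units are assumptions, and `PositionalEmbedding` and `embedding_matrix` come from the preprocessing sketch above.

```python
import tensorflow as tf

# Token ids in, probability of "AI-generated" out.
inputs = tf.keras.Input(shape=(350,), dtype="int64")
x = PositionalEmbedding(embedding_matrix, 350)(inputs)   # GloVe + positions
x = tf.keras.layers.Conv1D(128, 3, padding="same",
                           activation="relu")(x)         # local n-gram features
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(x)  # long-range context
x = tf.keras.layers.Attention()([x, x])                  # self-attention (query = value)
x = tf.keras.layers.GlobalMaxPooling1D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
cnn_bilstm = tf.keras.Model(inputs, outputs)
```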
### Ensemble

- Merges the predictions of the TF-IDF Classifier and the Attentioned CNN-BiLSTM using a dense layer trained to perform a weighted averaging.
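One plausible reading of this merging step is a single trained dense unit over the two base-model probabilities, sketched here:

```python
import tensorflow as tf

# The two base-model probabilities enter as separate inputs.
p_tfidf = tf.keras.Input(shape=(1,))  # TF-IDF Classifier output
p_lstm = tf.keras.Input(shape=(1,))   # Attentioned CNN-BiLSTM output

merged = tf.keras.layers.Concatenate()([p_tfidf, p_lstm])
# A single trained dense unit learns the weighting between the two models.
ensemble_out = tf.keras.layers.Dense(1, activation="sigmoid")(merged)
ensemble = tf.keras.Model([p_tfidf, p_lstm], ensemble_out)
```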
## Training and Experiments

### Data Split

- 80% training data, 20% test data.
- A validation set (20% of the training set) was used for hyperparameter tuning.
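With scikit-learn this split could look as follows; the variable names, stratification, and random seed are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split

# 80/20 train/test split; a further 20% of the training data is held out
# for validation during hyperparameter tuning.
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)
x_tr, x_val, y_tr, y_val = train_test_split(
    x_train, y_train, test_size=0.2, random_state=42, stratify=y_train)
```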
### Hyperparameter Tuning

- Grid search was conducted with the Keras Tuner library, as sketched below.
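A sketch of the tuning setup using Keras Tuner's `GridSearch` (available in recent Keras Tuner versions); the searched hyperparameters and their ranges are assumptions.

```python
import keras_tuner
import tensorflow as tf

def build_model(hp):
    """Build the TF-IDF classifier with tunable width and dropout."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(20_000,)),
        tf.keras.layers.Dense(hp.Choice("units", [128, 256, 512]),
                              activation="relu"),
        tf.keras.layers.Dropout(hp.Choice("dropout", [0.3, 0.5])),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

tuner = keras_tuner.GridSearch(build_model, objective="val_accuracy",
                               directory="tuning", project_name="tfidf")
tuner.search(x_tr, y_tr, validation_data=(x_val, y_val), epochs=5)
```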
### Training

- Models were trained with binary cross-entropy loss and the Adam optimizer.
- Early stopping was employed to prevent overfitting.
- Accuracy and loss were logged to monitor performance.
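Put together, a training run consistent with the bullets above could look like this (the patience and epoch counts are assumptions):

```python
import tensorflow as tf

model = tfidf_classifier  # either base model trains the same way

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Stop when validation loss stops improving and keep the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

history = model.fit(x_tr, y_tr,
                    validation_data=(x_val, y_val),
                    epochs=50,
                    callbacks=[early_stop])
# history.history contains the logged accuracy and loss per epoch.
```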
## Results

Detailed evaluation results are provided in the report included in this repository.

## Conclusion

This project demonstrates the potential of deep learning for authorship detection. While the results achieved are promising, future improvements could include:
- Expanding dataset size and diversity.
- Using pre-trained language models like BERT or GPT.
- Exploring alternative embedding techniques.
- Experimenting with advanced regularization and preprocessing methods.