Authorship Attribution with Random Forests and TF-IDF Scores
This repository contains the code for the blog post Large Scale Authorship Attribution with Machine Learning. It trains a Random Forest model on TF-IDF scores to classify articles among n authors.
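As a rough illustration of the approach described above, the following sketch chains scikit-learn's `TfidfVectorizer` and `RandomForestClassifier` into a single pipeline. It is not taken from `attribution_model.py`: the helper name `build_attribution_pipeline`, the toy corpus, and the hyperparameters are all assumptions made for this example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def build_attribution_pipeline(n_trees=100, seed=0):
    """Hypothetical helper: TF-IDF features feeding a Random Forest."""
    return make_pipeline(
        TfidfVectorizer(),  # converts raw text into TF-IDF scores
        RandomForestClassifier(n_estimators=n_trees, random_state=seed),
    )


# Toy corpus invented for illustration: one author label per text.
texts = [
    "the sea was calm and the ship sailed on",
    "the ship met a storm upon the open sea",
    "stock markets fell as interest rates rose",
    "interest rates and markets moved investors",
]
authors = ["author_a", "author_a", "author_b", "author_b"]

model = build_attribution_pipeline()
model.fit(texts, authors)

# Predict the author of an unseen snippet.
predicted = model.predict(["a storm hit the ship at sea"])
print(predicted)
```

With real data, the texts would be the article contents and the labels would be the author folder names.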
Path | Description |
---|---|
Authorship-Attribution | Main folder. |
└ sample_data | Folder containing data for authors. |
├ authors_folders | One folder for each author. |
├ authors_article_0.txt | First article of the author. |
├ authors_article_1.txt | Second article. |
├ ... authors_article_n.txt | ... Last article. |
├ attribution_model.py | Authorship attribution model. |
You will need to install the following package to run the authorship attribution model.
- Scikit-learn
To run the model, use the following command:
python3 attribution_model.py --articles_per_author 250 --authors_to_keep 5 --data_folder sample_data
The script takes three parameters as inputs:
- articles_per_author: Number of articles to use per author. Valid range: from 10 up to the maximum number of articles available for any author.
- authors_to_keep: Number of authors to include in the attribution classifier. Valid range: from 2 up to the total number of authors.
- data_folder: Data folder containing a single directory for each author.
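The sketch below shows one way the three parameters above could be parsed and applied to a folder with one directory per author. It is an assumption-laden illustration, not the actual contents of `attribution_model.py`: the function names `parse_args` and `load_articles` are hypothetical, and so is the selection policy (skip authors with too few articles, then keep the first `authors_to_keep` authors in alphabetical order).

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the three command-line parameters listed above."""
    parser = argparse.ArgumentParser(description="Authorship attribution (sketch)")
    parser.add_argument("--articles_per_author", type=int, required=True)
    parser.add_argument("--authors_to_keep", type=int, required=True)
    parser.add_argument("--data_folder", type=str, required=True)
    return parser.parse_args(argv)


def load_articles(data_folder, articles_per_author, authors_to_keep):
    """Return (texts, labels) read from one sub-directory per author.

    Assumed policy (not confirmed by the repository): authors with fewer
    than `articles_per_author` .txt files are skipped, and the first
    `authors_to_keep` remaining authors (sorted by name) are kept.
    """
    texts, labels = [], []
    kept = 0
    for author_dir in sorted(Path(data_folder).iterdir()):
        if not author_dir.is_dir():
            continue
        files = sorted(author_dir.glob("*.txt"))
        if len(files) < articles_per_author:
            continue  # not enough articles for this author
        for path in files[:articles_per_author]:
            texts.append(path.read_text(encoding="utf-8", errors="ignore"))
            labels.append(author_dir.name)
        kept += 1
        if kept == authors_to_keep:
            break
    return texts, labels
```

For example, `load_articles("sample_data", 250, 5)` would return up to 1,250 texts along with the matching author folder names as labels.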
Copyright (c) 2020-present, Faizan Ahmad