author-prediction-full-project

Author prediction using the Gutenberg Database

The goal is to create an algorithm that is able to predict the author of a text.

This repository is organized into submodules, one for each step of the project:

1. Download the data. We use ebooks from the Gutenberg database. The download_gutenberg repository reads the database index and downloads specific files (for example, the works of specific authors) from the Gutenberg website. You can find here (link) the result of this script. The data/ folder should sit next to the download_gutenberg folder.
2. Split the books into blocks and extract features from the text. We use the frequencies of the most common words and the lengths of sentences. The code for this is in the author_classification submodule (an example of extracted features is provided in the file apprentissage_10.txt).
3. The author_classification_fs submodule applies PCA to the previous output for feature selection.
4. The ml_on_author_prediction submodule applies several machine learning methods (KNN and Random Forest) to the resulting dataset. The Camembert/ folder creates a small chart of the results.
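As a rough illustration of the feature extraction step (frequency of common words and sentence length), here is a minimal C++ sketch. It is not the code from the author_classification submodule: the function name extractFeatures, the vocabulary argument, and the tokenization rules (letters only, sentences ending at '.', '!' or '?') are assumptions made for this example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not taken from the submodule): builds one feature
// vector for a block of text.
// Layout: relative frequency of each vocabulary word, followed by the
// mean sentence length in words.
std::vector<double> extractFeatures(const std::string& block,
                                    const std::vector<std::string>& vocabulary) {
    std::unordered_map<std::string, std::size_t> counts;
    std::size_t wordCount = 0;
    std::size_t sentenceCount = 0;
    std::size_t wordsInSentence = 0;
    std::string word;

    // Registers the word being built, if any, and resets the buffer.
    auto flushWord = [&]() {
        if (!word.empty()) {
            ++counts[word];
            ++wordCount;
            ++wordsInSentence;
            word.clear();
        }
    };

    for (unsigned char c : block) {
        if (std::isalpha(c)) {
            // Assumption: words are runs of letters, compared in lower case.
            word.push_back(static_cast<char>(std::tolower(c)));
            continue;
        }
        flushWord();
        // Assumption: '.', '!' and '?' close a sentence.
        if ((c == '.' || c == '!' || c == '?') && wordsInSentence > 0) {
            ++sentenceCount;
            wordsInSentence = 0;
        }
    }
    flushWord();
    if (wordsInSentence > 0) ++sentenceCount;  // trailing sentence without punctuation

    std::vector<double> features;
    features.reserve(vocabulary.size() + 1);
    for (const std::string& v : vocabulary) {
        auto it = counts.find(v);
        double n = (it == counts.end()) ? 0.0 : static_cast<double>(it->second);
        features.push_back(wordCount ? n / wordCount : 0.0);
    }
    features.push_back(sentenceCount
                           ? static_cast<double>(wordCount) / sentenceCount
                           : 0.0);
    return features;
}
```

With this layout every block yields a vector of the same length, so blocks from different books can be stacked into one dataset. A real implementation would probably handle apostrophes and abbreviations more carefully than this sketch does.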

The demo:

The demo runs part of the pipeline. We assume that features have already been extracted from the database and that we are given a new text (the test text) whose author we want to predict.

1. We extract the features from the new text.
2. We fit the PCA on the dataset and apply the same transformation to the test data.
3. The machine learning algorithms are trained on the dataset and predict the author of the test text.
4. The results are displayed in a chart.
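The key point of the demo steps is that the transformation is fitted on the dataset only and then reused unchanged on the test text. The C++ sketch below illustrates this with a KNN vote. It is not the code from the submodules: the Projection struct and the functions project and predictKnn are hypothetical names, and the principal components are assumed to be already computed (the PCA fitting itself is left out).

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Vec = std::vector<double>;

// Hypothetical container for an already fitted PCA: the mean of the
// training features and the retained principal axes (one row each).
struct Projection {
    Vec mean;
    std::vector<Vec> components;
};

// Centers a sample with the training mean and projects it on the axes.
// The same function is used for dataset samples and for the test text.
Vec project(const Projection& p, const Vec& x) {
    Vec out(p.components.size(), 0.0);
    for (std::size_t k = 0; k < p.components.size(); ++k)
        for (std::size_t j = 0; j < x.size(); ++j)
            out[k] += p.components[k][j] * (x[j] - p.mean[j]);
    return out;
}

// k-nearest-neighbours majority vote with squared Euclidean distance.
// Returns an empty string if the training set is empty.
std::string predictKnn(const std::vector<Vec>& train,
                       const std::vector<std::string>& labels,
                       const Vec& query, std::size_t k) {
    std::vector<std::pair<double, std::size_t>> dist;
    dist.reserve(train.size());
    for (std::size_t i = 0; i < train.size(); ++i) {
        double d = 0.0;
        for (std::size_t j = 0; j < query.size(); ++j) {
            double diff = train[i][j] - query[j];
            d += diff * diff;
        }
        dist.emplace_back(d, i);
    }
    k = std::min(k, dist.size());
    // Only the k smallest distances need to be ordered.
    std::partial_sort(dist.begin(),
                      dist.begin() + static_cast<std::ptrdiff_t>(k),
                      dist.end());

    std::map<std::string, std::size_t> votes;
    for (std::size_t i = 0; i < k; ++i) ++votes[labels[dist[i].second]];

    std::string best;
    std::size_t bestVotes = 0;
    for (const auto& v : votes)
        if (v.second > bestVotes) { best = v.first; bestVotes = v.second; }
    return best;
}
```

In this sketch, each dataset row would first go through project, the features of the test text would go through project with the same Projection, and predictKnn would then be called on the projected vectors. Ties between authors are broken by alphabetical order of the label, which is an arbitrary choice.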

How to run the demo?

1. Clone this repository with: git clone --recursive https://github.com/raph-m/author-prediction-full-project.git
2. Move the files apprentissage_10.txt and 1013.txt to the parent directory (that is, next to this repository's folder).
3. Open the full-project.pro file with Qt and run the project.
