Identifying the author of a text is challenging for humans, but algorithms can help by detecting patterns in writing style. This study uses natural language processing (NLP) and machine learning to predict the author of each of 100 works by 10 authors, achieving 89% accuracy on the test data.
A total of 100 English works by 10 authors were downloaded from
Project Gutenberg using the gutenbergr package,
as detailed in 1-gutenberg_download.R. The 10
authors are: Jane Austen, Agatha Christie, Charles Dickens, Daniel
Defoe, Arthur Conan Doyle, George Eliot, Jack London, William
Shakespeare, Mark Twain, and Oscar Wilde.
Data preprocessing was conducted using NLP techniques and the NLTK
package, as
outlined in 2-NLP.ipynb. The following steps were applied:
- Lowercasing: Converting all letters to lowercase, as uppercase and lowercase letters are typically treated the same.
  - Example: “Retailers and 10 Ice-creams” becomes “retailers and 10 ice-creams”.
- Tokenization: Splitting text into individual words and chunks.
  - Example: “retailers and 10 ice-creams” becomes “retailers”, “and”, “10”, “ice-creams”.
- Stop Word Removal: Removing common words such as “the”, “and”, “in”, as they carry little useful information.
  - Example: “retailers”, “10”, “ice-creams” remain.
- Regular Expressions: Retaining only tokens made of English letters, removing numbers and punctuation.
  - Example: “retailers” remains.
- Stemming and Lemmatization: Reducing words to their root forms for easier analysis.
  - Example: “retailers” becomes “retail”.
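The steps above can be sketched in a few lines. This is a minimal illustration, not the code in 2-NLP.ipynb: the function name `preprocess` and the tiny `STOP_WORDS` set are assumptions made for the example (NLTK ships a much longer stop word list), tokenization is a plain whitespace split, and only stemming (NLTK's `PorterStemmer`) is shown, not lemmatization.

```python
import re

from nltk.stem import PorterStemmer

# Tiny illustrative stop word list (an assumption for this sketch).
STOP_WORDS = {"the", "and", "in", "a", "of", "to"}


def preprocess(text):
    """Lowercase, tokenize, drop stop words, keep letter-only tokens, stem."""
    tokens = text.lower().split()                                # lowercasing + whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]          # stop word removal
    tokens = [t for t in tokens if re.fullmatch(r"[a-z]+", t)]   # keep tokens made only of letters
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]                     # stemming


print(preprocess("Retailers and 10 Ice-creams"))
```

On the running example, only “retailers” survives the filters and is stemmed to “retail”.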
I converted the textual data to numerical features using the sklearn
package so that machine learning algorithms can process it.
Bag-of-Words (BoW) counts the occurrences of each word, representing a text as a wide vector in which each column corresponds to a word. When the vectors of multiple texts are stacked together, BoW forms a wide matrix with one row per text. Below are its first rows and columns:
| aar | aaron | ab | aba | aback | abaft | abalon | abandon | abari | abash | abat | abb | abbalac | abbaratta | abbay | abbess | abbey | abbeyland | abbia | abbiamo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 23 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 40 | 0 | 0 | 0 |
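A BoW matrix of this shape can be built with sklearn's `CountVectorizer`. The sketch below uses two made-up toy documents (an assumption for illustration) rather than the actual corpus, and default vectorizer settings, which may differ from those used in the project.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, already preprocessed (hypothetical examples).
docs = ["abbey abbey abandon", "abandon abash"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)   # sparse matrix: rows = documents, columns = words

vocab = vectorizer.vocabulary_         # maps each word to its column index
print(sorted(vocab, key=vocab.get))    # column order of the matrix
print(bow.toarray())                   # dense view of the word counts
```

Each row is one document and each column one word of the vocabulary, so the matrix grows wider as more distinct words appear in the corpus.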
An improvement on BoW is TF-IDF (term frequency–inverse document frequency). Since documents vary in length, 10 occurrences of “apples” in a 1000-word document may not be more important than 8 occurrences in a 10-word document. Therefore, term frequency measures how often a word appears in a document relative to the document's length. Moreover, “car” may appear very often in every car review, so it carries little additional information. Thus, words that are common across many documents get a lower inverse document frequency score, while rare words get a higher score. In summary, TF-IDF emphasizes words that are both frequent within a specific document and relatively uncommon across the whole set of documents. Below are its first rows and columns:
| aar | aaron | ab | aba | aback | abaft | abalon | abandon | abari | abash | abat | abb | abbalac | abbaratta | abbay | abbess | abbey | abbeyland | abbia | abbiamo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0.0087742 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0065871 | 0 | 0.0031459 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0.0053308 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0 | 0.0005131 | 0.0015409 | 0 | 0 | 0 | 0 | 0 | 0.0004347 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0 | 0.0000000 | 0.0007997 | 0 | 0 | 0 | 0 | 0 | 0.0360961 | 0 | 0 | 0 |
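The “car” intuition can be checked with sklearn's `TfidfVectorizer`. This is a sketch on three invented car-review snippets (an assumption for illustration). It relies on sklearn's defaults, a smoothed IDF and L2-normalized rows, which may not match the exact settings used in this project.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "car" occurs in every document, "apples" in one.
docs = [
    "car engine fast car",
    "car seats comfortable",
    "car price apples",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)   # rows = documents, columns = words

# Look up the inverse document frequency of each word.
idf_of = {word: tfidf.idf_[col] for word, col in tfidf.vocabulary_.items()}
print(idf_of["car"], idf_of["apples"])

# With the default settings each document vector has unit L2 norm.
print(np.linalg.norm(X.toarray(), axis=1))
```

Because “car” appears in every document it receives the lowest possible IDF, while “apples”, which appears in only one, is weighted more heavily.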
Principal Component Analysis (PCA) was then applied to the TF-IDF matrix, reducing its dimensionality from around 35,000 unique words to 20 principal components.
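A minimal sketch of this reduction with sklearn's `PCA` follows. The random matrix is only a stand-in for the real TF-IDF features (500 columns instead of roughly 35,000, to keep the example fast), and the variable names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_tfidf = rng.random((100, 500))   # stand-in for the 100-document TF-IDF matrix

pca = PCA(n_components=20, random_state=0)
X_pca = pca.fit_transform(X_tfidf)             # 100 rows, 20 principal components

print(X_pca.shape)
print(pca.explained_variance_ratio_.sum())     # share of variance kept by the 20 components
```

Note that `PCA` expects a dense array, so a sparse TF-IDF matrix would first need to be converted (for example with `.toarray()`).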
The outcome variable (y) is the author, with 10 classes, and the feature variables (X) are the 20 principal components. The dataset was split into 80% training data and 20% test data. Seven classification models were evaluated: logistic regression, Lasso, Naive Bayes, KNN, random forest, GBM, and XGBoost. These models were trained on the training data, and their predictions were compared to the actual authors in the test data.
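The split-train-evaluate workflow can be sketched as follows, using a random forest as one example model. The synthetic data from `make_blobs` stands in for the 20 principal components, and the stratified split and the hyperparameters are assumptions, not necessarily what was used here.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 "works", 20 features, 10 "authors".
X, y = make_blobs(n_samples=100, centers=10, n_features=20, random_state=0)

# 80% training data, 20% test data (stratified by author: an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)                            # train on the training data only

acc = accuracy_score(y_test, model.predict(X_test))    # compare predictions to true labels
print(f"test accuracy: {acc:.4f}")
```

The same pattern applies to the other models: only the estimator object changes, while the split and the accuracy computation stay the same.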
In logistic regression, I used the softmax function to handle the multiple classes of the outcome variable.
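For reference, the softmax function turns one linear score per class into probabilities that sum to 1. The sketch below is a generic NumPy implementation with made-up scores, not code taken from this project.

```python
import numpy as np


def softmax(scores):
    """Convert per-class scores into probabilities along the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


scores = np.array([[2.0, 1.0, 0.1]])   # hypothetical scores for 3 classes
probs = softmax(scores)
print(probs, probs.sum())
```

The predicted class is the one with the highest probability, which is also the one with the highest score.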
Logistic regression with many features may overfit, so I regularized it with a Lasso (L1) penalty. I used 10-fold cross-validation on the training data to find the optimal regularization parameter λ.
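One way to tune an L1-penalized logistic regression by cross-validation is sklearn's `LogisticRegressionCV`. This is a sketch under assumptions: the data is synthetic, the candidate grid `Cs` is invented, and sklearn parametrizes the penalty by `C`, the inverse of the regularization strength, so a large λ corresponds to a small `C`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in for the training data.
X, y = make_blobs(n_samples=200, centers=10, n_features=20, random_state=0)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize so the penalty treats features equally

Cs = [0.01, 0.1, 1.0, 10.0]                # candidate inverse regularization strengths (assumed grid)
lasso = LogisticRegressionCV(
    Cs=Cs, cv=10, penalty="l1", solver="saga", max_iter=2000, random_state=0
)
lasso.fit(X, y)

best_C = float(np.ravel(lasso.C_)[0])      # value selected by 10-fold cross-validation
print("selected C:", best_C)
```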
Naive Bayes assumes every feature is independent of all other features, conditional on the class labels of the outcome variable.
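The independence assumption is visible in how a fitted model stores its parameters. The sketch below uses `GaussianNB`, which is an assumption on my part since the Naive Bayes variant is not stated above. It suits continuous features such as principal components, and the data is synthetic.

```python
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in: 10 classes, 20 continuous features.
X, y = make_blobs(n_samples=200, centers=10, n_features=20, random_state=0)

nb = GaussianNB().fit(X, y)

# One mean per (class, feature) pair: each feature is modelled separately
# within each class, with no covariance terms between features.
print(nb.theta_.shape)
print(nb.class_prior_)    # estimated prior probability of each class
```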
KNN classifies a work by the authors of its k nearest training works, where nearness is measured by distance in feature space. I used 10-fold cross-validation on the training data to find the optimal number of neighbors k.
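Choosing k by 10-fold cross-validation can be sketched with `GridSearchCV`. The candidate range 1 to 15 and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data.
X, y = make_blobs(n_samples=200, centers=10, n_features=20, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 16))},   # candidate values of k (assumed range)
    cv=10,                                            # 10-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```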
In the gradient boosting models (GBM and XGBoost), I used grid search with 5-fold cross-validation on the training data to tune the hyperparameters.
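A grid search of this kind can be sketched as below. To keep the example dependent on sklearn only, it uses `GradientBoostingClassifier` in place of the GBM and XGBoost implementations used in the project. The parameter grid and the synthetic data are likewise assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_blobs(n_samples=150, centers=10, n_features=20, random_state=0)

# Hypothetical grid: every combination is scored by 5-fold cross-validation.
param_grid = {"n_estimators": [20, 40], "learning_rate": [0.05, 0.1]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```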
Below is the overall accuracy of each model, measuring the proportion of correct predictions on the test data. Random forest achieved the highest accuracy at 0.8947, far above the 0.10 expected from guessing uniformly at random among 10 authors.
| model | overall_accuracy |
| --- | --- |
| Logistic regression | 0.7368 |
| Lasso | 0.7895 |
| Naive Bayes | 0.5789 |
| KNN | 0.7895 |
| Random Forest | 0.8947 |
| GBM | 0.8421 |
| XGBoost | 0.8421 |
I used K-means to group the 100 works into 12 clusters. There is no single “optimal” number of clusters k, so I tried values from 2 to 20 and found that 12 performed better than the others. In the table below, each row represents a cluster and each column an author. Notably, Cluster 4 mainly contains works by Daniel Defoe, suggesting that a new work falling into this cluster is likely to be his.
| Cluster | Austen, Jane | Christie, Agatha | Defoe, Daniel | Dickens, Charles | Doyle, Arthur Conan | Eliot, George | London, Jack | Shakespeare, William | Twain, Mark | Wilde, Oscar |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 3 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 10 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 1 |
| 6 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 9 | 9 | 8 | 0 | 7 | 5 | 3 | 5 | 15 | 5 | 9 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 12 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
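A cluster-by-author table like this one can be produced by cross-tabulating K-means labels against the known authors. The sketch below uses synthetic features and placeholder author names (`author_0` to `author_9`), both assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in: 100 "works" from 10 "authors".
X, author_id = make_blobs(n_samples=100, centers=10, n_features=20, random_state=0)
authors = pd.Series([f"author_{i}" for i in author_id], name="author")

# Group the works into 12 clusters without using the author labels.
kmeans = KMeans(n_clusters=12, n_init=10, random_state=0)
clusters = pd.Series(kmeans.fit_predict(X), name="cluster")

# Rows = clusters, columns = authors, cells = number of works.
table = pd.crosstab(clusters, authors)
print(table)
```

A cluster whose row is dominated by a single author, like Cluster 4 above, indicates that the unsupervised grouping has picked up that author's style.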
In a sentiment lexicon like AFINN, positive words have higher sentiment
scores. For example, “good” scores 3 while “bad” scores -3. For each
document, I multiplied the sentiment score of each word by its
term frequency, which adjusts for differences in document length, then
summed the products to get one sentiment score per document.
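The calculation can be sketched in plain Python. The two-word `LEXICON` is a hypothetical excerpt in the style of AFINN, and term frequency is assumed here to mean a word's count divided by the document's total number of tokens.

```python
from collections import Counter

# Hypothetical mini-lexicon using the two scores quoted above.
LEXICON = {"good": 3, "bad": -3}


def sentiment_score(tokens):
    """Sum of (word score x term frequency) over the words of one document."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    # Words missing from the lexicon contribute a score of 0.
    return sum(LEXICON.get(word, 0) * (count / total) for word, count in counts.items())


score = sentiment_score(["good", "good", "bad", "day"])
print(score)   # 3 * 2/4 + (-3) * 1/4 = 0.75
```

Dividing by the document length keeps a long novel from scoring higher than a short play simply because it contains more words.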
As the figure below shows, my selection of Jane Austen’s works is more
positive, while my selection of Agatha Christie’s works is more negative.
Word clouds visualize word frequencies for each author (generated with a
word cloud package and saved in the wordcloud folder). The word “one”
is the most frequent across all authors. Notably, the famous detective
“Poirot” appears frequently in Agatha Christie’s works. Below is the
word cloud of her works.