Spooky Author Identification

Kaggle Notebook

Spooky Author Identification (GloVe + LSTM)

Overview

Suppose that we are given a text and know only that its author is one of Edgar Allan Poe $(\text{EAP})$, H. P. Lovecraft $(\text{HPL})$ and Mary Shelley $(\text{MWS})$. How do we predict who wrote the text? More specifically, how do we predict the probability that the text was written by Edgar Allan Poe, and likewise for the other two authors?

In this work, we have a large dataset of texts, each labeled with its true author, who is one of $\text{EAP}$, $\text{HPL}$ and $\text{MWS}$. The data is provided in the Dataset folder of the repository; in the notebook, however, it is imported directly from the Spooky Author Identification Kaggle competition. The objective is to train a model that predicts, for a given new text, the probability that it was written by $X$, for each $X \in \{\text{EAP}, \text{HPL}, \text{MWS}\}$. We assume that the new text was indeed written by one of the three authors, so that the three probabilities add up to $1$. These probabilities also let us classify the text directly: for instance, we can predict the author with the highest probability of having written it.
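As a small illustration of the last point, the sketch below turns three raw scores into probabilities that sum to $1$ and then picks the most probable author. It is plain Python rather than the notebook's code; the names `AUTHORS`, `softmax` and `predict_author`, as well as the example scores, are assumptions made for this illustration.

```python
import math

# Author codes used throughout this README.
AUTHORS = ["EAP", "HPL", "MWS"]


def softmax(scores):
    """Turn real-valued scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_author(probabilities):
    """Return the author code with the highest predicted probability."""
    best = max(range(len(AUTHORS)), key=lambda i: probabilities[i])
    return AUTHORS[best]


# Hypothetical scores for one text, one per author (illustrative values only).
probs = softmax([2.0, 0.5, 1.0])
print(dict(zip(AUTHORS, probs)))
print(predict_author(probs))
```

Because the softmax is monotone in the scores, the predicted author is simply the one with the largest score.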

We use this problem to illustrate two relevant techniques: the GloVe model for word vectorization and a long short-term memory (LSTM) neural network for model building. The notebook proceeds towards the stated objective in the following steps:

  • We define the multiclass_log_loss function, which takes a matrix of binarized true target classes y_true_binarized, a matrix of predicted class probabilities y_pred_probabilities and a clipping parameter epsilon, and produces the multiclass version of the log loss metric between y_true_binarized and y_pred_probabilities. So that the function can serve as the loss in model compilation, we write it with TensorFlow and Keras backend functions instead of the standard NumPy functions.

  • We split the data in an $80:20$ ratio (the training set consists of $80\%$ of the data, and the validation set of the rest). We stratify the split by label, so that the proportion of each label remains roughly the same in the training set and the validation set.

  • We encode the labels $\text{EAP}$, $\text{HPL}$ and $\text{MWS}$ by mapping them, through a dictionary, to the integer values $0$, $1$ and $2$, respectively. We then convert the integer label vectors to binary class matrices, in which each row is a one-hot vector corresponding to one integer component of the label vector.

  • We fit the Keras tokenizer on the combined list of texts from the training set and the validation set. The resulting words are indexed through the tokenizer's word_index attribute, and the texts are converted to sequences of integers using the texts_to_sequences method. We then use the Keras pad_sequences function to pad the sequences to a common length, equal to the smallest integer greater than $m + 2s$, where $m$ and $s$ respectively denote the mean and standard deviation of the text lengths over the combined training and validation texts. Finally, we construct a matrix of vector representations of the words found in the training set and the validation set by mapping each word to a $100$-dimensional vector space through the GloVe embedding.

  • We build a sequential model consisting of an embedding layer whose weights are given by the matrix of word vectors constructed previously; a SpatialDropout1D layer; an LSTM layer with as many units as the length of the GloVe vectors; two dense hidden layers with ReLU activation, each followed by a dropout layer; and an output layer of three neurons with softmax activation, corresponding to the three probabilities for the three authors. The model is compiled with the manually defined multiclass_log_loss function as the loss and the Adam optimizer with an initial learning rate of $0.001$. A learning rate scheduler callback then applies the manually defined schedule function scheduler_modified_exponential to update the optimizer's learning rate at each epoch.

  • We fit the model on the padded sequences generated from the training texts and the binary class matrix generated from the training labels for a set number of epochs. The training loss and the validation loss are monitored at each epoch, and an early stopping callback halts training once the validation loss stops improving. We produce a plot showing how the training loss and the validation loss evolve over the epochs, giving an overall picture of the model building procedure.

  • We employ the trained model to predict, for each text in the training set and the validation set, the probabilities of it being written by each of the three authors, and obtain a training log loss of $0.391$ and a validation log loss of $0.581$. The predicted probabilities are then converted to labels by picking, for each text, the author with the highest predicted probability, which gives a training accuracy of $0.846$ and a validation accuracy of $0.764$. Finally, a confusion matrix gives a complete picture of the performance of the trained model on the validation set for the task of classifying the texts as written by one of the three authors.
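Some of the computations in the steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the notebook's implementation, which writes the loss with TensorFlow and Keras backend functions. The helper names `one_hot`, `padded_length` and `accuracy`, the default value of `epsilon`, the choice to measure text length in whitespace-separated words, and the use of the population standard deviation are all assumptions made here.

```python
import math
from statistics import mean, pstdev

# Label encoding described above: EAP -> 0, HPL -> 1, MWS -> 2.
LABELS = {"EAP": 0, "HPL": 1, "MWS": 2}


def one_hot(label, num_classes=3):
    """Return the one-hot row for an author code."""
    row = [0.0] * num_classes
    row[LABELS[label]] = 1.0
    return row


def padded_length(texts):
    """Smallest integer greater than m + 2s.

    Assumption: length is counted in whitespace-separated words and
    s is the population standard deviation.
    """
    lengths = [len(text.split()) for text in texts]
    return math.floor(mean(lengths) + 2 * pstdev(lengths)) + 1


def multiclass_log_loss(y_true_binarized, y_pred_probabilities, epsilon=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for true_row, pred_row in zip(y_true_binarized, y_pred_probabilities):
        for t, p in zip(true_row, pred_row):
            p = min(max(p, epsilon), 1 - epsilon)  # clip to avoid log(0)
            total -= t * math.log(p)
    return total / len(y_true_binarized)


def accuracy(y_true_binarized, y_pred_probabilities):
    """Fraction of rows whose most probable class is the true class."""
    hits = 0
    for true_row, pred_row in zip(y_true_binarized, y_pred_probabilities):
        if true_row.index(max(true_row)) == pred_row.index(max(pred_row)):
            hits += 1
    return hits / len(y_true_binarized)


# Toy data (illustrative values only, not results from the notebook).
y_true = [one_hot("EAP"), one_hot("HPL"), one_hot("MWS")]
y_pred = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.5, 0.3, 0.2]]
print(multiclass_log_loss(y_true, y_pred))
print(accuracy(y_true, y_pred))
print(padded_length(["a b", "a b c d"]))
```

In the toy data the third text is misclassified, so the accuracy is two thirds, while the log loss also penalizes how little probability was assigned to the true author in each row.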

Acknowledgements

Further Reading