Identifying Crosswriters’ Altering Style in Books for Children and Adults Using Supervised Machine Learning
This repo contains the code (not the data!) written as part of the Computational Literary Studies (CLS) final project at the University of Antwerp.
The objective was to identify the differences (if any) in the writing style of authors who write books for both children and adults ("crosswriters"), focusing only on content words.
The paper is available in PDF format.
Dependencies
- Standard:
  - __future__
  - os
  - re
  - glob
  - typing
  - pprint
- Third-party:
  - NumPy
  - pandas
  - Matplotlib
  - seaborn
  - scikit-learn
  - Transformers
Environment
- Windows 11 + WSL
- Python 3.9.12 (virtualenv)
Stylometry is the quantitative study of literary style through computational distant reading methods. It is based on the observation that authors tend to write in relatively consistent, recognisable, and unique ways (Laramée, 2018). Identifying the similarities and differences in style, content, and genre between literature intended for children and literature intended for adults has long been on the radar of researchers in the field of Computational Literary Studies. Only recently, however, have the implications of cross-writing (i.e., writing works for different readership age groups) received attention. In this study, supervised machine learning methods were applied to better understand whether and how such authors (“crosswriters”) alter their style when targeting a different age group, based entirely on content words. The study was conducted on five English authors. The SVM models reach a macro F1 score of .73 when predicting the age group using all texts, and .93 on average for each of the authors individually. Achieving these results required overcoming the issue of overfitting on the characters of the stories, which was addressed by (a) adding a Named Entity Recognition (NER) step to the preprocessing pipeline; and (b) leaving at least one book by each author out of the training set entirely in each fold during cross-validation.
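The NER step in (a) can be pictured as replacing every detected person name with a placeholder before features are extracted, so that a classifier cannot key on character names. The sketch below is illustrative and not the repository's code: `mask_entities` and the `[PER]` placeholder are hypothetical names, and the entity spans are assumed to be dicts with `start`, `end`, and `entity_group` keys, the shape a Transformers token-classification pipeline returns when aggregation is enabled. The spans are hard-coded so the example runs without downloading a model.

```python
from typing import Dict, List, Union

Span = Dict[str, Union[int, str]]


def mask_entities(text: str, spans: List[Span], label: str = "PER",
                  placeholder: str = "[PER]") -> str:
    """Replace every span tagged `label` with `placeholder`."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        if span["entity_group"] == label:
            text = text[:span["start"]] + placeholder + text[span["end"]:]
    return text


sentence = "Lyra followed Will through the window."
# Hypothetical NER output for the sentence above (character offsets).
detected = [
    {"entity_group": "PER", "start": 0, "end": 4},
    {"entity_group": "PER", "start": 14, "end": 18},
]
masked = mask_entities(sentence, detected)
print(masked)  # [PER] followed [PER] through the window.
```

Masking rather than deleting the names keeps sentence length and word order intact, which may matter if later features depend on token positions.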
The authors whose texts were examined are:
- David Almond
- Anne Fine
- Neil Gaiman
- Philip Pullman
- J.K. Rowling
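The requirement that every cross-validation fold withholds at least one complete book by each of these authors can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual splitting code: `held_out_book_folds` is a hypothetical helper, and the author and book labels are placeholders instead of the real corpus.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# Hypothetical segment metadata: (author, book) for each text segment.
segments: List[Tuple[str, str]] = [
    ("Author A", "A1"), ("Author A", "A1"), ("Author A", "A2"), ("Author A", "A3"),
    ("Author B", "B1"), ("Author B", "B2"), ("Author B", "B2"),
]


def held_out_book_folds(meta: List[Tuple[str, str]],
                        n_folds: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx); each test set holds one whole book per author."""
    books: Dict[str, List[str]] = defaultdict(list)
    for author, book in meta:
        if book not in books[author]:
            books[author].append(book)
    for fold in range(n_folds):
        # Rotate through each author's books so every fold withholds one of them.
        held_out = {(a, titles[fold % len(titles)]) for a, titles in books.items()}
        test_idx = [i for i, m in enumerate(meta) if m in held_out]
        train_idx = [i for i, m in enumerate(meta) if m not in held_out]
        yield train_idx, test_idx


for train_idx, test_idx in held_out_book_folds(segments, n_folds=3):
    train_books = {segments[i] for i in train_idx}
    test_books = {segments[i] for i in test_idx}
    assert not train_books & test_books  # no book is split across train and test
    print(sorted(test_books))
```

scikit-learn's group-aware splitters such as `GroupKFold` also keep all segments of a book on one side of the split, but they do not by themselves guarantee that every author contributes a held-out book to every fold, which is why the sketch builds the folds by hand.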
The images are light and dark-mode aware! Check it out through your appearance settings.
Number of books per gender of authors:
Number of books per reader age group:
Number of segments per author and reader age group:
Distribution of total words:
Type-token ratio:
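For reference, the type-token ratio is the number of distinct word forms (types) divided by the total number of words (tokens). A minimal sketch, assuming lowercase matching and a simple regular-expression tokenizer; both are assumptions, and the tokenization behind the plot may differ:

```python
import re


def type_token_ratio(text: str) -> float:
    """Distinct word forms divided by total words; 0.0 for empty input."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


sample = "the cat sat on the mat and the dog sat too"
print(round(type_token_ratio(sample), 3))  # 0.727 (8 types / 11 tokens)
```

Because the ratio tends to fall as a text gets longer, it is most meaningful when compared across segments of similar length.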
Author | Pre-NER Acc. | Pre-NER F1 | Post-NER Acc. | Post-NER F1 |
---|---|---|---|---|
David Almond | .914 | .795 | .933 | .857 |
Anne Fine | .843 | .800 | .979 | .976 |
Neil Gaiman | .939 | .928 | .931 | .916 |
Philip Pullman | .918 | .771 | .962 | .906 |
J.K. Rowling | .991 | .991 | .999 | .999 |
All authors | .764 | .680 | .788 | .734 |
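The table reports both accuracy and macro-averaged F1. Macro F1 is the unweighted mean of the per-class F1 scores, so it falls below accuracy when the smaller class is predicted poorly. The sketch below computes both from scratch on made-up labels (not data from the study) to show the gap; the function names are illustrative, and in practice scikit-learn's `f1_score(..., average="macro")` computes the same quantity.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up, imbalanced labels: 8 "child" segments and 2 "adult" segments.
y_true = ["child"] * 8 + ["adult"] * 2
y_pred = ["child"] * 8 + ["child", "adult"]

print(round(accuracy(y_true, y_pred), 3))  # 0.9
print(round(macro_f1(y_true, y_pred), 3))  # 0.804
```

One misclassified segment from the minority class costs a single point of accuracy here but pulls that class's F1 down to 2/3, which is what lowers the macro average.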
@article{boumparis2022crosswriters,
title = {Identifying Crosswriters' Altering Style in Books for Children and Adults Using Supervised Machine Learning},
author = {{Dimitris Boumparis}},
organization = {{University of Antwerp}},
year = {2022},
url = {https://github.com/dimboump/crosswriters}
}