Code for final assignment for CLS course at the University of Antwerp (SoSe 2022)

dimboump/crosswriters

Identifying Crosswriters’ Altering Style in Books for Children and Adults Using Supervised Machine Learning

This repo contains the code (not the data!) written as part of the Computational Literary Studies (CLS) final project at the University of Antwerp.

The objective was to identify the differences (if any) in the writing style of authors who write books for both children and adults ("crosswriters"), focusing only on content words.

The paper is available in PDF format.

Development

Dependencies
  • Standard:
    • __future__
    • os
    • re
    • glob
    • typing
    • pprint
  • Third-party:
    • NumPy
    • pandas
    • Matplotlib
    • seaborn
    • scikit-learn
    • Transformers
Environment
  • Windows 11 + WSL
  • Python 3.9.12 (virtualenv)

Abstract

Stylometry is the quantitative study of literary style through computational distant reading methods. It is based on the observation that authors tend to write in relatively consistent, recognisable, and unique ways (Laramée, 2018). Identifying the similarities and differences in style, content, and genre between literature intended for children and literature intended for adults has long been on the radar of researchers in the field of Computational Literary Studies. However, only recently have the implications of cross-writing (i.e., writing works for different readership age groups) received attention. In this study, supervised machine learning methods were applied to gain a better understanding of whether and how such authors (“crosswriters”) alter their style when targeting a different age group, based entirely on content words. The study was conducted on 5 English authors; the SVM models reach a macro F1 score of .73 when predicting the age group using all texts and .93 on average when predicting it for each author individually. To achieve these results, it was essential to overcome overfitting on the characters of the stories, which was addressed by (a) adding a Named Entity Recognition (NER) step to the preprocessing pipeline; and (b) leaving at least one book by each author out of the training set entirely in each fold during cross-validation.
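Point (b) of the abstract can be sketched as a custom fold generator. This is a minimal illustration, not the project's actual code: the function name `leave_books_out_folds`, the round-robin assignment of books to folds, and the toy labels are all assumptions. Each segment is described by its author and its book, and every fold's test set receives all segments of one or more whole books per author, so no held-out book leaks into training.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Set, Tuple


def leave_books_out_folds(
    authors: Sequence[str],
    books: Sequence[str],
    n_folds: int,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) so that each fold's test set holds all
    segments of at least one whole book by every author (hypothetical helper).
    """
    # Collect each author's distinct books in order of first appearance.
    books_by_author: Dict[str, List[str]] = defaultdict(list)
    for author, book in zip(authors, books):
        if book not in books_by_author[author]:
            books_by_author[author].append(book)

    segment_keys = list(zip(authors, books))
    for fold in range(n_folds):
        held_out: Set[Tuple[str, str]] = set()
        for author, titles in books_by_author.items():
            # Round-robin: fold k takes books k, k + n_folds, ... of this author.
            chosen = titles[fold::n_folds]
            if not chosen:
                # Fewer books than folds: reuse one so the author is still tested.
                chosen = [titles[fold % len(titles)]]
            held_out.update((author, title) for title in chosen)
        test_idx = [i for i, key in enumerate(segment_keys) if key in held_out]
        train_idx = [i for i, key in enumerate(segment_keys) if key not in held_out]
        yield train_idx, test_idx


# Toy corpus: seven segments from two authors with two books each.
authors = ["A", "A", "A", "A", "B", "B", "B"]
books = ["a1", "a1", "a2", "a2", "b1", "b2", "b2"]
folds = list(leave_books_out_folds(authors, books, n_folds=2))
```

The resulting index pairs can be passed as the `cv` argument of scikit-learn's `cross_val_score`, which accepts an iterable of (train, test) index splits. scikit-learn's built-in `GroupKFold` also keeps each book's segments together, but it does not by itself guarantee that every author appears in every test fold.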

Exploratory Data Analysis

The authors whose texts were examined are:

  • David Almond
  • Anne Fine
  • Neil Gaiman
  • Philip Pullman
  • J.K. Rowling

The images are light and dark-mode aware! Check it out through your appearance settings.

Corpus

The corpus is summarised in the following figures:

  • Number of books per gender of authors
  • Number of books per reader age group
  • Number of segments per author and reader age group
  • Distribution of total words
  • Type-token ratio
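The type-token ratio (TTR) is the number of distinct word forms (types) divided by the total number of words (tokens). A minimal sketch follows; the regex tokenizer and the function name are assumptions for illustration, not necessarily how the figure was produced.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct tokens / total tokens for a text (0.0 if empty)."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


sample = "The cat sat on the mat"
ttr = type_token_ratio(sample)  # 5 types / 6 tokens
```

Because TTR tends to fall as texts get longer, it is usually compared on samples of similar length, such as fixed-size segments.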

Authors

Publications per author over time (one figure each): Almond, Fine, Gaiman, Pullman, Rowling

Results

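Accuracy (Acc.) and macro F1 per author, before (Pre-NER) and after (Post-NER) the NER preprocessing step:

Author           Pre-NER Acc.   Pre-NER F1   Post-NER Acc.   Post-NER F1
David Almond     .914           .795         .933            .857
Anne Fine        .843           .800         .979            .976
Neil Gaiman      .939           .928         .931            .916
Philip Pullman   .918           .771         .962            .906
J.K. Rowling     .991           .991         .999            .999
All authors      .764           .680         .788            .734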
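The Pre-NER and Post-NER results differ in whether named entities were handled during preprocessing. One way such a step could work is sketched below, assuming entity spans in the format returned by the Transformers token-classification pipeline with `aggregation_strategy="simple"` (dictionaries with `entity_group`, `start`, and `end`). The spans here are written by hand; the model, the choice of labels, and the `[PER]` placeholder are assumptions, not details taken from the project.

```python
from typing import Dict, List, Sequence


def mask_entities(
    text: str,
    entities: List[Dict],
    labels: Sequence[str] = ("PER",),
    placeholder: str = "[PER]",
) -> str:
    """Replace the character spans of selected entity types with a placeholder."""
    # Go right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["entity_group"] in labels:
            text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written spans standing in for real NER output (hypothetical example).
sentence = "Anna met Tom in Oxford."
spans = [
    {"entity_group": "PER", "start": 0, "end": 4},
    {"entity_group": "PER", "start": 9, "end": 12},
    {"entity_group": "LOC", "start": 16, "end": 22},
]
masked = mask_entities(sentence, spans)
```

Masking character names in this way keeps a classifier from learning that a particular name signals a particular age group, which is the overfitting problem the abstract describes.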

Citation

@article{boumparis2022crosswriters,
    title = {Identifying Crosswriters' Altering Style in Books for Children and Adults Using Supervised Machine Learning},
    author = {{Dimitris Boumparis}},
    organization = {{University of Antwerp}},
    year = {2022},
    url = {https://github.com/dimboump/crosswriters}
}