Identifying Crosswriters’ Altering Style in Books for Children and Adults Using Supervised Machine Learning

This repo contains the code (not the data!) written as part of the Computational Literary Studies (CLS) final project at the University of Antwerp.

The objective was to identify differences (if any) in the writing style of authors who write books for both children and adults ("crosswriters"), focusing only on content words.

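The paper is available in PDF format in this repository (Identifying_Crosswriters_Altering_Style_in_Books_for_Children_and_Adults_Using_Supervised_Machine_Learning.pdf).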

Development

Dependencies

Standard:

__future__
os
re
glob
pprint
typing

Third-party:

NumPy
pandas
Matplotlib
seaborn
scikit-learn
Transformers

Environment

Windows 11 + WSL
Python 3.9.12 (virtualenv)

Abstract

Stylometry is the quantitative study of literary style through computational distant-reading methods. It is based on the observation that authors tend to write in relatively consistent, recognisable, and unique ways (Laramée, 2018). The similarities and differences in style, content, and genre between literature intended for children and literature intended for adults have long been on the radar of researchers in Computational Literary Studies. Only recently, however, have the implications of cross-writing (i.e., writing works for different readership age groups) received attention. In this study, supervised machine learning methods were applied to better understand whether and how such authors (“crosswriters”) alter their style when targeting a different age group, based entirely on content words. The study was conducted on five English authors. The SVM models reach a macro F1 score of .73 when predicting the age group using all texts, and .93 on average for each of the authors individually. To achieve these results, it was essential to overcome overfitting on the characters of the stories. This was addressed by (a) adding a Named Entity Recognition (NER) step to the preprocessing pipeline; and (b) leaving at least one book by each author out of the training set entirely in each fold during cross-validation.
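
The two safeguards against overfitting on character names can be sketched roughly as follows. This is a minimal, standard-library illustration and not the code from the notebooks: the function names, the `ENTITY` placeholder, and the idea of masking from a fixed list of names are all assumptions made for the example (the actual pipeline detects entities with an NER model), and the round-robin assignment of books to folds is just one simple way to keep every book entirely on one side of the split.

```python
import re
from collections import defaultdict


def mask_entities(text, entities, placeholder="ENTITY"):
    """Replace every listed entity name with a neutral placeholder.

    `entities` stands in for the output of an NER step (hypothetical here).
    """
    if not entities:
        return text
    # Longest names first, so "Harry Potter" is preferred over "Harry".
    ordered = sorted(entities, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b"
    )
    return pattern.sub(placeholder, text)


def book_level_folds(segments, n_folds):
    """Yield (train_idx, test_idx) pairs in which no book is split.

    `segments` is a list of (author, book) tuples, one per text segment.
    Each author's books are dealt round-robin over the folds, so every fold
    holds out at least one book per author as long as that author has at
    least `n_folds` books.
    """
    books_by_author = defaultdict(list)
    for author, book in segments:
        if book not in books_by_author[author]:
            books_by_author[author].append(book)

    fold_of_book = {}
    for author, books in books_by_author.items():
        for i, book in enumerate(books):
            fold_of_book[(author, book)] = i % n_folds

    for fold in range(n_folds):
        test_idx = [i for i, key in enumerate(segments) if fold_of_book[key] == fold]
        train_idx = [i for i, key in enumerate(segments) if fold_of_book[key] != fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    names = ["Harry", "Harry Potter", "Hermione"]  # hypothetical entity list
    print(mask_entities("Harry Potter met Hermione. Harry smiled.", names))
    # prints: ENTITY met ENTITY. ENTITY smiled.

    toy = [("A", "a1"), ("A", "a1"), ("A", "a2"), ("B", "b1"), ("B", "b2")]
    for train_idx, test_idx in book_level_folds(toy, n_folds=2):
        print(train_idx, test_idx)
```

Splitting at the book level matters because segments from the same book share names and plot vocabulary; if they land on both sides of the split, a classifier can score well by recognising the book rather than the target age group.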

Exploratory Data Analysis

The authors whose texts were examined are:

David Almond
Anne Fine
Neil Gaiman
Philip Pullman
J.K. Rowling

The images are light and dark-mode aware! Check it out through your appearance settings.

Corpus

Number of books per gender of authors:

Number of books per reader age group:

Number of segments per author and reader age group:

Distribution of total words:

Type-token ratio:
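
For reference, the type-token ratio is the number of distinct word forms (types) divided by the total number of words (tokens). The snippet below is a minimal sketch of that definition; the regex tokenizer and the function name are assumptions for illustration and may differ from the tokenization used to produce the plot.

```python
import re


def type_token_ratio(text):
    """Return the number of distinct word forms divided by the word count."""
    # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    # 5 tokens, 4 types ("the" occurs twice)
    print(type_token_ratio("The cat saw the dog"))  # prints: 0.8
```

The ratio tends to fall as a text gets longer, so it is most meaningful when compared across samples of similar length.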

Authors

Publications per author over time

Results

Author	Pre-NER Acc.	Pre-NER F1	Post-NER Acc.	Post-NER F1
David Almond	.914	.795	.933	.857
Anne Fine	.843	.800	.979	.976
Neil Gaiman	.939	.928	.931	.916
Philip Pullman	.918	.771	.962	.906
J.K. Rowling	.991	.991	.999	.999
All authors	.764	.680	.788	.734
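
To help read the table, here is a small sketch of how a macro-averaged F1 score is computed and why it can sit well below accuracy. The `macro_f1` helper, the "adult"/"child" labels, and the toy predictions are illustrative assumptions, not the project's evaluation code or data. Only the last lines use real numbers: they average the per-author post-NER F1 scores from the table, which gives the .93 quoted in the abstract.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (hypothetical labels): always predicting the majority class
# gives 80% accuracy but a much lower macro F1.
y_true = ["adult"] * 8 + ["child"] * 2
y_pred = ["adult"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
toy_macro_f1 = macro_f1(y_true, y_pred)

# Per-author post-NER F1 scores, copied from the table above.
post_ner_f1 = [0.857, 0.976, 0.916, 0.906, 0.999]
mean_post_ner_f1 = sum(post_ner_f1) / len(post_ner_f1)

if __name__ == "__main__":
    print(accuracy, round(toy_macro_f1, 3))  # prints: 0.8 0.444
    print(round(mean_post_ner_f1, 2))        # prints: 0.93
```

Because macro F1 weights both age groups equally, it penalises a model that neglects the smaller class, which is one possible reason the F1 columns are lower than the accuracy columns for several authors.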

Citation

@article{boumparis2022crosswriters,
    title = {Identifying Crosswriters' Altering Style in Books for Children and Adults Using Supervised Machine Learning},
    author = {{Dimitris Boumparis}},
    organization = {{University of Antwerp}},
    year = {2022},
    url = {https://github.com/dimboump/crosswriters}
}

