- Definition
- Authorial fingerprint
- Requirements
- Hands-on example with stylo in R
Stylometry is the quantitative, computational study of (literary) style.
- Discipline: Digital Literary Stylistics, Computational Literary Studies, Computational Stylistics
- Text analysis of authors and writing: authorship attribution; intertextuality; text reuse; sentiment analysis; topic modelling; author profiling (gender, age range, native language); character networks; NLP with literary text.
- Distant analysis: textual genres; literary canon and movements, etc. (John Burrows, Franco Moretti, Matthew Jockers, Ted Underwood)
- Leon Battista Alberti (1404–1472)
Determining the authorship of an anonymous text, or resolving a disputed attribution.
- closed-set context: the real author is one of the candidates within the sample of documents.
- open-set context: the real author may be one of the candidates or an unknown author outside the sample.
- verification: samples from a single candidate are compared against the test document.
- Detection of fabricated stories.
- Collaborative writing (e.g. theater plays).
- Lorenzo Valla (15th c.), Donation of Constantine
- Wincenty Lutosławski (1897), the stylometric method (metoda stylometryczna)
- Thomas C. Mendenhall (1901), Shakespeare authorship argument
- Mosteller and Wallace (1964), The Federalist Papers
- Stylistic fingerprint: "Le style c’est l’homme" ("style is the man")
- Function words (prepositions, pronouns, determiners, ...)
- stylo package in R (Eder, Rybicki, Kestemont 2016)
- Python libraries (see, e.g., Karsdorp, Kestemont and Riddell, "Stylometry and the Voice of Hildegard", Humanities Data Analysis: Case Studies with Python, Princeton University Press, 2021)
- JGAAP (Java Graphical Authorship Attribution Program) (Juola 2005)
- file formats: txt, xml
- extracted features: e.g. words
- number of features: 500 MFW
- author-based corpus
- text size: 2000-5000 words
- text noise: not (so) relevant
Texts as similar as possible, in period, in genre, etc.
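The feature requirements above (words as features, a fixed number of most frequent words) can be sketched in Python. This is a minimal illustration, not stylo's implementation: the tokenizer is deliberately naive, the two toy texts are invented, and the function names `tokenize`, `most_frequent_words` and `relative_frequencies` are hypothetical. A real run would use 500 MFW rather than 3.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens (naive tokenizer, an assumption)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def most_frequent_words(corpus, n):
    """Return the n most frequent words counted over every text in the corpus."""
    totals = Counter()
    for text in corpus.values():
        totals.update(tokenize(text))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]

def relative_frequencies(text, features):
    """Return the relative frequency of each feature word in one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total if total else 0.0 for word in features]

# Toy corpus (invented for illustration); keys follow an author_title pattern.
corpus = {
    "author1_text1": "the cat sat on the mat and the dog sat by the door",
    "author2_text1": "a cat and a dog and a bird met in a garden",
}

mfw = most_frequent_words(corpus, n=3)
table = {name: relative_frequencies(text, mfw) for name, text in corpus.items()}
```

Each row of `table` is one text expressed as a vector of relative frequencies, which is the kind of input a distance measure works on.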
“the minimal sample size can be lowered substantially, from ca. 5,000 running words as suggested previously (Eder, 2015), to less than 2,000 words. However, this is true only for those texts that exhibit a clear authorial signal; otherwise the risk of severe misclassification appears.” (Eder, 2017: 3)
The Ambassadors by Henry James compared against a corpus of 100 English novels (Eder 2017)
Bleak House by Charles Dickens (Eder 2017)
- Text produced by OCR or HTR does not need to be perfectly clean for authorship attribution (Franzini et al. 2018)
- Paratexts (usually removed before analysis): dedications, acknowledgements, front matter, stage directions and character names in theater plays
<stage>(Sale MOSCATEL.)</stage>
<sp who="#moscatel">
<speaker rend="caps">Moscatel</speaker>
<lg>
<l>¿Que no? Luego</l>
<l>si yo a tener amor llego</l>
<l>noble será mi pasión.</l>
</lg>
</sp>
<sp who="#don-alonso">
<speaker rend="caps">Don Alonso</speaker>
<lg>
<l>¿Tú amor?</l>
</lg>
</sp>
<sp who="#moscatel">
<speaker rend="caps">Moscatel</speaker>
<lg>
<l>Yo amor.</l>
</lg>
</sp>
calderon_NoHayBurlasConELAmor.txt
¿Que no? Luego
si yo a tener amor llego
noble será mi pasión.
¿Tú amor?
Yo amor.
...
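The conversion from the marked-up play to the plain text file can be sketched in Python with the standard library. This is a minimal sketch under stated assumptions: the fragment carries no XML namespace (complete TEI files normally declare `http://www.tei-c.org/ns/1.0`, in which case the tag to look for becomes `{http://www.tei-c.org/ns/1.0}l`), and the function name `verse_lines` is hypothetical.

```python
import xml.etree.ElementTree as ET

def verse_lines(tei_fragment):
    """Return the text of every <l> element, ignoring <stage> and <speaker>."""
    # The fragment has several top-level elements, so wrap it in a dummy root.
    root = ET.fromstring(f"<body>{tei_fragment}</body>")
    # Assumption: no namespace on the elements (see the note above the code).
    return ["".join(line.itertext()).strip() for line in root.iter("l")]

fragment = """
<stage>(Sale MOSCATEL.)</stage>
<sp who="#moscatel">
<speaker rend="caps">Moscatel</speaker>
<lg>
<l>¿Que no? Luego</l>
<l>si yo a tener amor llego</l>
<l>noble será mi pasión.</l>
</lg>
</sp>
<sp who="#don-alonso">
<speaker rend="caps">Don Alonso</speaker>
<lg>
<l>¿Tú amor?</l>
</lg>
</sp>
<sp who="#moscatel">
<speaker rend="caps">Moscatel</speaker>
<lg>
<l>Yo amor.</l>
</lg>
</sp>
"""

plain_text = "\n".join(verse_lines(fragment))
```

Only the spoken verse survives in `plain_text`; the stage direction and the speaker labels, which would otherwise inflate the frequency of character names, are left out.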
Distance measures the difference in word usage between texts: the smaller the distance, the greater the similarity.
Text A
The boy could not resist the temptation of sweets and took marshmallows and liquorice and lollipops and candied fruit and chocolates.
Text B
The sun was setting, and the birds were singing their evening song. The gentle breeze rustled the leaves, and a distant sound of children playing could be heard.
Modified image from Büttner et al. 2017, CC-BY
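To make the idea concrete for Text A and Text B, here is a minimal Python sketch. It assumes a Manhattan (city-block) distance over the relative frequencies of a few hand-picked function words; the word list and the function names are illustrative assumptions. It is a simplification: Burrows's Delta, the measure discussed by Büttner et al. 2017, additionally standardizes the frequencies (z-scores) across a whole corpus, which two short texts cannot support.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens (naive tokenizer, an assumption)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def profile(text, features):
    """Return the relative frequency of each feature word in the text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {word: counts[word] / len(tokens) for word in features}

def manhattan_distance(profile_a, profile_b):
    """Sum the absolute frequency differences over all feature words."""
    return sum(abs(profile_a[word] - profile_b[word]) for word in profile_a)

text_a = ("The boy could not resist the temptation of sweets and took "
          "marshmallows and liquorice and lollipops and candied fruit "
          "and chocolates.")
text_b = ("The sun was setting, and the birds were singing their evening "
          "song. The gentle breeze rustled the leaves, and a distant sound "
          "of children playing could be heard.")

# Hand-picked function words (an assumption made for this illustration).
features = ["the", "and", "of", "a"]

profile_a = profile(text_a, features)
profile_b = profile(text_b, features)
distance = manhattan_distance(profile_a, profile_b)
```

Text A repeats "and" far more often than Text B, so that single word accounts for a large share of `distance`; a text compared with itself would give a distance of zero.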
stylo computes distances (differences) between texts and plots graphs of those distances.
- Computational Stylistics Group, based in Kraków.
- Eder, M., Rybicki, J. and Kestemont, M. (2016). Stylometry with R: a package for computational text analysis. R Journal 8(1): 107-121. https://journal.r-project.org/archive/2016/RJ-2016-007/index.html
- Install R and RStudio
- Download this repository (Code > Download ZIP) to your computer and note the path to the folder
- Run RStudio
- Install stylo
- Using stylo with a GUI
- General overview
- No GUI
- unattributed text
- Consensus Tree
- Distance Table
- Closed-set or open-set
- Networks on big corpora
A Short Collection of British Fiction
Büttner, Andreas, Friedrich Michael Dimpel, Stefan Evert, Fotis Jannidis, Steffen Pielström, Thomas Proisl, Isabella Reger, Christof Schöch, and Thorsten Vitt. 2017. “»Delta« in der stilometrischen Autorschaftsattribution.” ZfdG - Zeitschrift für digitale Geisteswissenschaften 2. https://doi.org/10.17175/2017_006.
Eder, Maciej. 2015. “Does Size Matter? Authorship Attribution, Small Samples, Big Problem.” Digital Scholarship in the Humanities 30 (2): 167–82. https://doi.org/10.1093/llc/fqt066.
Eder, Maciej, Jan Rybicki, and Mike Kestemont. 2016. “Stylometry with R: A Package for Computational Text Analysis.” R Journal 8 (1): 107–21. https://journal.r-project.org/archive/2016/RJ-2016-007/index.html.
Eder, Maciej. 2017. “Short Samples in Authorship Attribution: A New Approach.” In Digital Humanities 2017: Conference Abstracts, 221–24. Montreal: McGill University. https://dh2017.adho.org/abstracts/341/341.pdf.
Eder, Maciej. 2017. “Visualization in Stylometry: Cluster Analysis Using Networks.” Digital Scholarship in the Humanities 32 (1): 50–64. https://doi.org/10.1093/llc/fqv061.
Franzini, Greta, Mike Kestemont, Gabriela Rotari, Melina Jander, Jeremi K. Ochab, Emily Franzini, Joanna Byszuk, and Jan Rybicki. 2018. “Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm.” Frontiers in Digital Humanities 5 (April): 4. https://doi.org/10.3389/fdigh.2018.00004.
Savoy, Jacques. 2020. Machine Learning Methods for Stylometry: Authorship Attribution and Author Profiling. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-53360-1.