Home
Welcome to the ShakespearesWorld-Data wiki!
Victoria Van Hyning
2022–2023 Folger Fellowship Proposal
Project title: “Preparing and publishing Shakespeare’s World data for further use and reuse”
Project Description: This project will focus on cleaning, describing, and publishing three datasets derived from Shakespeare's World (SW), a crowdsourced text transcription project created in collaboration among Folger, Zooniverse.org (Oxford University), and the Oxford English Dictionary (OED). SW ran from 2015 to 2019 and attracted 3,926 registered volunteers, plus an unknown number of anonymous participants, who transcribed 11,490 digital images of manuscript pages (single and double page spreads) of recipes, Newdigate Newsletters, and sixteenth- and seventeenth-century letters from Folger’s holdings.1 The recipes and letters were chosen because of the large number of women writers in the corpus, and the hope (borne out by the project) that we would find examples of words and antedatings for inclusion in the OED, thereby increasing the representation of women and manuscript materials in the dictionary.
A four-week fellowship in the summer of 2022 would afford me the crucial space and time needed to prepare these data for a wide audience. Project deliverables will include three open-source, downloadable, and well-described bulk datasets (detailed below), and a “short data paper” for the peer-reviewed, open-access Journal of Open Humanities Data (JOHD), which will describe the datasets and their potential for reuse.2 In future, this will support a longer “research paper” for JOHD, to supplement the short data paper. As an early modernist, the former Humanities PI of Zooniverse, and the co-Investigator of SW with Heather Wolfe, I am uniquely positioned to undertake this work, and believe it will promote the use of Folger collections in a variety of disciplines such as linguistics, etymology, literary, historical, and theological studies, scholarly editing, social network analysis (for the historical material as well as for the modern Zooniverse discussion board), machine learning, Handwritten Text Recognition (HTR), and crowdsourcing.
The datasets:
- Dataset 1 will consist of a bulk dataset of transcriptions created by SW volunteers and refined by SW and Folger staff. This will include the raw data (meaning the individual transcriptions made by each volunteer), and any refined and cleaned transcriptions that are available by summer 2023.
- Dataset 2 consists of Zooniverse platform metrics of SW volunteers’ activity, such as number of volunteers, number of pages transcribed, and time spent on the website. These data have never been released.
- Dataset 3 consists of the messages and related metadata from the project’s discussion board, “Talk.” This includes public conversations between volunteers, guest researchers, and project staff. Talk data is scrapeable through the Zooniverse API, but is not packaged and described anywhere.
SW was designed to be fun and inclusive: a place where students, volunteers, researchers, and others could learn about the early modern world through manuscripts they might never have sought out or been able to access, whether because of distance from DC or insufficient training in handling rare materials. One way we did this was by letting people transcribe as little as a word or a line on a page, rather than the whole page, and by making it clear that they should contribute only what they felt confident reading. We hoped this would let people build their skill and confidence over time. Volunteers clicked at the start and end of each string of text they wanted to transcribe, typed it into a box, and used a clickable keyboard to add deletions, superscripts, brevigraphs, etc. to the transcription.6 Through this method, each word on each page was independently transcribed by three or more people, and their transcriptions were compared and combined in real time to create a majority-rules reading. This was in keeping with broader Zooniverse methods of promoting nonspecialist engagement, and met Folger’s goals of democratizing access to collections while attempting to collect quality data.
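The majority-rules step can be illustrated with a minimal sketch. This is not the project's actual aggregation code: the function name and sample readings below are hypothetical, and real aggregation must first cluster and align the volunteers' overlapping selections, which is the hard part.

```python
from collections import Counter

def majority_reading(transcriptions):
    """Combine independent transcriptions of the same line by a
    majority vote on each whitespace-separated token position.
    A simplified sketch: it assumes the readings are already
    aligned token for token."""
    token_lists = [t.split() for t in transcriptions]
    length = max(len(tokens) for tokens in token_lists)
    consensus = []
    for i in range(length):
        votes = Counter(tokens[i] for tokens in token_lists if i < len(tokens))
        consensus.append(votes.most_common(1)[0][0])
    return " ".join(consensus)

# Three invented volunteer readings of the same manuscript line:
readings = [
    "my verie good Lord",
    "my verie good Lorde",
    "my verie goode Lord",
]
print(majority_reading(readings))  # → "my verie good Lord"
```

Each token position is decided independently, so no single volunteer's slip survives as long as the other two agree.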
We initially used a clustering algorithm to group the dots, and a genetic sequencing algorithm to compare the text strings.7 These dots provide positional data for each line on each page: invaluable information for several machine learning protocols, including Handwritten Text Recognition. This is an area of research that Folger has already supported using SW data. Both Google and Adam Matthew Digital (Quartex) have used SW data to train HTR systems.8 However, there is plenty more that can be done in this regard, particularly in building open-source, rather than paywalled, HTR systems.
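The string-comparison step can be approximated with Python's standard-library `difflib`, which, like the genetic sequencing algorithm the project actually used, aligns two sequences and reports where they diverge. This is a simplified stand-in, not the SW pipeline, and the sample readings are invented.

```python
import difflib

# Two hypothetical volunteer readings of the same text string.
a = "Item a pinte of rose water"
b = "Item a pint of rosewater"

# SequenceMatcher aligns the two strings character by character.
matcher = difflib.SequenceMatcher(None, a, b)
print(f"similarity: {matcher.ratio():.2f}")

# Report only the regions where the readings disagree.
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        print(op, repr(a[i1:i2]), "->", repr(b[j1:j2]))
```

High similarity with localized disagreements is exactly the pattern a consensus pipeline exploits: agreement anchors the alignment, and the divergent spans are where voting is needed.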
The early aggregation results of SW were promising, showing approximately 98 percent agreement between transcribers. But as the project went on and the heterogeneity of both the participants and the documents increased, clustering accuracy dipped, which negatively affected text string comparison. From 2015 to 2020 the Zooniverse and Folger attempted to improve aggregation, with mixed success. In 2021, Wolfe, Van Hyning, several Folger colleagues, and five capstone students from UMD’s iSchool collaborated on a SW data cleaning project.9 The students used Python to strip out duplicate lines of text, one of the most common aggregation errors. These transcriptions are being uploaded to the FromThePage crowdsourcing platform and edited by Folger docents, volunteers, and staff before they are published in Luna. The pages uploaded to EMMO go through a different pathway: they are extensively refined into diplomatic, semi-diplomatic, and normalized spellings, with additional tags and TEI P4 markup. The pages in Luna are largely semi-diplomatic.

All of these forms of transcription data have significant, and distinct, research value. The clean transcriptions in EMMO and Luna provide multiple search and discovery pathways; in bulk they can support scholarly editions, sentiment analysis, Natural Language Processing, HTR, and more. The uncleaned raw data from each volunteer has another purpose: it reveals common misreadings and slips such as eyeskip, as well as information about individual transcribers (whose identities will be anonymized in the published data). The raw data are akin to dirty OCR, which the Viral Texts project team has used to reverse-engineer common OCR errors, which in turn can be used to clean up dirty OCR automatically.10 The raw SW data could likewise be used to train computers to identify common transcription errors, findings that would have both scholarly and commercial applications.
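The duplicate-line cleanup can be sketched as follows. This is a simplified illustration rather than the capstone students' actual scripts, and the sample page text is invented; it handles consecutive repeats, one common form of the aggregation artifact.

```python
def strip_duplicate_lines(transcription: str) -> str:
    """Remove consecutive repeated lines, a common aggregation
    artifact produced when overlapping volunteer selections of
    the same text string fail to cluster together."""
    cleaned = []
    for line in transcription.splitlines():
        # Keep a line only if it differs from the previous kept line.
        if not cleaned or line.strip() != cleaned[-1].strip():
            cleaned.append(line)
    return "\n".join(cleaned)

page = "To my good freind\nTo my good freind\nMr John Evans"
print(strip_duplicate_lines(page))
# prints:
# To my good freind
# Mr John Evans
```

Comparing stripped lines (`line.strip()`) rather than raw lines means that duplicates differing only in leading or trailing whitespace are also caught.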
Datasets 2 and 3 will support research into volunteer engagement and online communities. For example, our preliminary investigations reveal that patterns of participation on SW differ markedly from engagement on Transcribe Bentham, another well-known crowdsourcing project. Rather than being driven by a small handful of dedicated volunteers, SW drew more sustained participation from a wider cohort, likely because we encouraged people to transcribe only the lines they felt confident about, rather than whole pages.