Wikipedia is widely used as a training corpus for many natural language processing (NLP) models. Yet according to [a survey] conducted by the Wikimedia Foundation, fewer than 18% of Wikipedia biographies are about women. To understand how women are represented in English Wikipedia biographies, this project uses text analysis in R to answer the following questions:
- To what extent do biographies about female subjects differ from those about male subjects?
- What terms are more strongly associated with each gender?

The main findings are:
- Classification models trained to predict subject gender from biography text outperform a no-skill, random baseline. In other words, Wikipedia editors tend to use different words and expressions when describing female and male subjects.
- The most predictive terms for each gender show that Wikipedia editors are more likely to use terms relating to gender (e.g. "female", "woman") and family (e.g. "husband", "mother") in biographies about women, suggesting that male is often treated as the norm.
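
To make the idea of "terms associated with each gender" concrete, here is a minimal sketch that scores terms with a smoothed log-odds ratio. This is a simpler stand-in for illustration, not the classification models used in the project, and it is written in Python rather than the R used for the analysis. The function names, the smoothing constant `alpha`, and the two toy sentences are all assumptions made up for this example; they are not drawn from the dataset.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def log_odds(docs_a, docs_b, alpha=1.0):
    """Smoothed log-odds ratio of each term in docs_a relative to docs_b.

    Positive scores mean the term is relatively more frequent in docs_a,
    negative scores mean it is relatively more frequent in docs_b.
    `alpha` is an add-alpha smoothing constant (hypothetical default).
    """
    counts_a = Counter(t for d in docs_a for t in tokenize(d))
    counts_b = Counter(t for d in docs_b for t in tokenize(d))
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)
    scores = {}
    for term in vocab:
        p_a = (counts_a[term] + alpha) / total_a
        p_b = (counts_b[term] + alpha) / total_b
        scores[term] = math.log(p_a / (1 - p_a)) - math.log(p_b / (1 - p_b))
    return scores


# Toy sentences invented for illustration only (not from the dataset).
female_docs = ["She was the first woman physicist; her husband was a chemist."]
male_docs = ["He was a physicist and a professor of chemistry."]

scores = log_odds(female_docs, male_docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print("More associated with the first group:", ranked[:3])
print("More associated with the second group:", ranked[-3:])
```

On real data, the two lists of documents would hold the full biography texts for each gender, and very rare terms would usually be filtered out before ranking.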

The dataset includes 19.4K English Wikipedia biographies from the four categories (artist, athlete, scientist, and politician) with the most biographical articles on Wikipedia. The data was created by:
- Randomly sampling 5K entry names from each category using Wikipedia biography metadata
- Retrieving article text for the sampled entries using the `wikipedia` package
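
The two steps above can be sketched roughly as follows. This is a hedged illustration, not the project's actual collection script: the function names, the shape of the `metadata` dictionary (category name mapped to a list of entry names), the fixed seed, and the choice to skip entries whose retrieval fails are all assumptions. The only third-party calls used are `wikipedia.page(...)`, its `.content` attribute, and the package's `DisambiguationError` and `PageError` exceptions.

```python
import random
from typing import Callable, Dict, List, Optional


def sample_entries(metadata: Dict[str, List[str]],
                   n_per_category: int,
                   seed: int = 0) -> Dict[str, List[str]]:
    """Randomly sample up to n_per_category entry names from each category."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return {
        category: rng.sample(names, min(n_per_category, len(names)))
        for category, names in metadata.items()
    }


def fetch_with_wikipedia(title: str) -> Optional[str]:
    """Fetch plain article text via the `wikipedia` package; None on failure."""
    import wikipedia  # third-party: pip install wikipedia

    try:
        # auto_suggest=False asks for the exact title rather than a guess.
        return wikipedia.page(title, auto_suggest=False).content
    except (wikipedia.exceptions.DisambiguationError,
            wikipedia.exceptions.PageError):
        return None


def build_corpus(sampled: Dict[str, List[str]],
                 fetch: Callable[[str], Optional[str]] = fetch_with_wikipedia
                 ) -> List[dict]:
    """Retrieve text for every sampled entry, skipping failed retrievals."""
    rows = []
    for category, names in sampled.items():
        for name in names:
            text = fetch(name)
            if text:  # assumption: entries without text are dropped
                rows.append({"category": category, "name": name, "text": text})
    return rows
```

With real metadata, a call such as `build_corpus(sample_entries(metadata, 5000))` would issue one request per sampled entry. Passing a different `fetch` function makes it possible to try the pipeline offline or to add caching and rate limiting.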