This is my Master of Science in Business Analytics (MSBA) capstone project from spring 2023. The primary dataset consists of large-scale text data transcribed from 194 hours of Democratic National Convention (DNC) and Republican National Convention (RNC) speeches from 2004 to 2020. The text data were loaded into a SQLite database with 3470 rows and 9 columns: year, party, day, speaker, speaker count, time, text, text length, and the source of the text.
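To make the database layout concrete, here is a minimal sketch of how a table with those nine fields could be created and queried with Python's built-in `sqlite3` module. The table name `speeches`, the column names, and the column types are assumptions for illustration and may not match the project's actual schema. The inserted rows are made-up placeholders, not data from the project.

```python
import sqlite3

# Hypothetical column names and types; the README lists only the nine fields.
COLUMNS = [
    ("year", "INTEGER"),
    ("party", "TEXT"),
    ("day", "INTEGER"),
    ("speaker", "TEXT"),
    ("speaker_count", "INTEGER"),
    ("time", "TEXT"),
    ("text", "TEXT"),
    ("text_length", "INTEGER"),
    ("source", "TEXT"),
]


def create_speech_table(conn):
    """Create the (hypothetical) speeches table if it does not exist yet."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in COLUMNS)
    conn.execute(f"CREATE TABLE IF NOT EXISTS speeches ({cols})")


def rows_per_party_year(conn):
    """Count rows per (year, party), ordered by year and then party."""
    query = (
        "SELECT year, party, COUNT(*) FROM speeches "
        "GROUP BY year, party ORDER BY year, party"
    )
    return conn.execute(query).fetchall()


# In-memory database with placeholder rows (not real project data).
conn = sqlite3.connect(":memory:")
create_speech_table(conn)
sample = [
    (2004, "Democratic", 1, "Speaker A", 1, "00:00:00", "placeholder text one", "example-source"),
    (2004, "Democratic", 1, "Speaker B", 2, "00:05:00", "placeholder text two", "example-source"),
    (2004, "Republican", 1, "Speaker C", 1, "00:00:00", "placeholder text three", "example-source"),
]
conn.executemany(
    "INSERT INTO speeches VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    # text_length is derived from the text itself
    [(y, p, d, s, n, t, txt, len(txt), src) for (y, p, d, s, n, t, txt, src) in sample],
)
conn.commit()

print(rows_per_party_year(conn))
```

A query like `rows_per_party_year` is the kind of sanity check one might run before modeling, for example to confirm that every convention year has rows for both parties.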
An extended dataset, used for permutation testing, contains 1038 presidential speeches from 1789 to 2021, from George Washington to Joe Biden. These speeches were delivered by 45 U.S. Presidents; 445 of the speeches were from 19 Republican Presidents and 513 were from 16 Democratic Presidents.
We used two research approaches in this project: topic modeling and permutation tests. The Python code for topic modeling was written in a Jupyter Notebook. The R code for permutation tests was written in R Markdown and knitted to HTML.
- Topic modeling: to track the evolution of topics from 2004 to 2020.
- Permutation tests: to compare speech features at a fine-grained linguistic level.
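For readers unfamiliar with permutation tests, the following is a minimal sketch of a two-sided, two-sample test on a difference in means. It is written in Python for illustration and is not the project's R code. The function name `permutation_test`, the choice of test statistic, and the example numbers (imagined per-speech rates of some linguistic feature) are all assumptions.

```python
import random


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns (observed_difference, p_value).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a, n_b = len(group_a), len(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b

    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up feature rates for two groups of speeches (not project data).
party_a = [0.30, 0.32, 0.31, 0.29, 0.33, 0.30]
party_b = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10]

observed, p_value = permutation_test(party_a, party_b)
print(f"observed difference: {observed:.3f}, p-value: {p_value:.4f}")
```

The test makes no distributional assumption: it asks only how often a random relabeling of the speeches produces a difference at least as large as the observed one.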
Our topic modeling identified topics that gained or lost favor over time, as well as topics that consistently reflected the core values of the two parties. Our permutation test analysis showed statistically significant differences between the two parties in past-tense usage in both corpora (convention speeches and presidential speeches), and in first-person singular and plural pronoun usage in the convention speeches.
- Permutation tests with R