Created by Eva Bacas.
This is Eva's term project for LING 1340: Data Science for Linguists. This project is an analysis of the sentiments and word frequencies in blogs, including how they vary across author demographic groups.
This project uses the Blog Authorship Corpus.
J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006). Effects of Age and Gender on Blogging. In Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs. URL: http://www.cs.biu.ac.il/~schlerj/schler_springsymp06.pdf
README.md
- 📍 You are here! This document contains a brief description of the project, link to the dataset, and a repository directory.
project_plan.md
- Description of the dataset and my initial project plan
progress_report.md
- Project updates for each progress report
Project_presentation.pdf
- Click here to view on Google Slides. PDF version of my final project presentation
final_report.md
- In-depth explanation of the project, with background information and interpretation of the analysis
progress_report1.ipynb
- Click here to view on nbviewer. In my first progress report, I read the CSV file into a data frame and explored the data. I calculated basic stats about blogger demographics, but I forgot to consider that there are multiple blogs per blogger.
progress_report_part2.ipynb
- Click here to view on nbviewer. In my second progress report, I continued exploring the data frame and corrected the issues from my first progress report. I began my analysis by looking at word frequencies and topic modeling.
progress_report_part3.ipynb
- Click here to view on nbviewer. In my third progress report, I explored sentiment analysis using VADER (Valence Aware Dictionary and sEntiment Reasoner). I categorized blogs as positive, negative, or neutral, and then investigated variation across blogger groups and the most frequent words per sentiment category.
progress_report_part3b.ipynb
- Click here to view on nbviewer. This is a continuation of the third progress report. I switched to a new Jupyter notebook so I could use R. I attempted to create a mixed effects regression model using the demographic info as predictors and polarity score as the outcome. The model did not work as intended, and none of the predictors were significant.
/data_samples
- Contains a 100-blog sample of the dataset
/images
- PNG versions of all graphs and additional images in my project presentation and final report
LICENSE.md
- GNU General Public License v3.0
.gitignore
- Prevents all my data from being uploaded
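As a rough illustration of two steps mentioned in the notebook descriptions above (counting demographics per blogger rather than per blog, and sorting blogs into positive, negative, or neutral), here is a minimal sketch. It is not the code from the notebooks: the field names `blogger_id`, `gender`, and `compound`, the sample rows, and the helper functions are all hypothetical, and the ±0.05 cutoffs are the thresholds conventionally used with VADER's compound score, assumed here rather than taken from the notebooks.

```python
from collections import Counter

# Hypothetical rows: one per blog, so the same blogger can appear many times.
# "compound" stands in for a precomputed VADER compound score in [-1, 1].
posts = [
    {"blogger_id": 1, "gender": "female", "compound": 0.62},
    {"blogger_id": 1, "gender": "female", "compound": -0.30},
    {"blogger_id": 2, "gender": "male", "compound": 0.01},
    {"blogger_id": 1, "gender": "female", "compound": 0.10},
]


def label_sentiment(compound, threshold=0.05):
    """Map a compound score to a category (assumed +/-0.05 cutoffs)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def count_by_post(rows, key):
    """Count values of `key` over every row (prolific bloggers count repeatedly)."""
    return Counter(row[key] for row in rows)


def count_by_blogger(rows, key, id_key="blogger_id"):
    """Count values of `key` once per blogger, using each blogger's first row."""
    first_seen = {}
    for row in rows:
        first_seen.setdefault(row[id_key], row[key])
    return Counter(first_seen.values())


post_counts = count_by_post(posts, "gender")
blogger_counts = count_by_blogger(posts, "gender")
labels = [label_sentiment(p["compound"]) for p in posts]

print(post_counts)
print(blogger_counts)
print(labels)
```

With a pandas data frame, the per-blogger count could instead be taken after `drop_duplicates(subset=...)` on whichever column identifies the blogger.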
This code is licensed under the GNU General Public License v3.0.
Please leave comments, suggestions, and questions in my guestbook.