This project walks through the entire data science pipeline to better understand book review data from the Goodreads website. The main objective was to examine the relationship between the sentiment of user reviews and book ratings across numerous genres. The approach was to train a model on a large amount of labeled data so that it generalizes well enough to classify the relatively small, unlabeled Goodreads review data set.
This work treats these relationships as an NLP problem, namely document-level sentiment classification. Sentiment classifications are made first, and data visualization techniques are then used to gain insight into the review-rating-genre relationship.
Three machine learning techniques were used in this project to obtain classifications. The first uses a pretrained RNN with long short-term memory (LSTM) units and pretrained GloVe word embeddings; both were pretrained by Adit Deshpande and may be found here. The word embedding matrix contains 400,000 word vectors, each of dimensionality 50. The RNN was trained on the IMDb movie review dataset containing 12,500 positive and 12,500 negative reviews.
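To make the embedding lookup concrete, here is a minimal sketch of how a review might be turned into row indices of the 400,000-row embedding matrix before it is fed to the LSTM. The function name, the cleanup regex, the maximum length of 250, and the use of the last row (399999) for unknown words are illustrative assumptions, not details confirmed by this project.

```python
import re


def review_to_ids(text, word_index, max_len=250, unk_id=399999):
    """Map a review to a fixed-length list of embedding-matrix row indices."""
    # Lowercase and drop everything except letters, digits, and spaces
    # (an assumed cleanup step).
    tokens = re.sub(r"[^a-z0-9 ]+", "", text.lower()).split()
    # Truncate long reviews; unknown words fall back to unk_id.
    ids = [word_index.get(tok, unk_id) for tok in tokens[:max_len]]
    # Pad short reviews with zeros so every input has the same length.
    # Note that 0 is also a real word index in this toy vocabulary.
    return ids + [0] * (max_len - len(ids))


# Toy vocabulary standing in for the 400,000-word GloVe word list.
word_index = {"the": 0, "book": 1, "was": 2, "great": 3}
ids = review_to_ids("The book was GREAT!", word_index, max_len=6)
print(ids)  # [0, 1, 2, 3, 0, 0]
```

Each index selects one 50-dimensional row of the embedding matrix, so a review of length 6 becomes a 6 x 50 input to the network.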
The second classification method trains a bidirectional LSTM network using pretrained fastText embeddings from here.
The third classification method used a Naive Bayes model trained on a feature matrix built from the TF-IDF weights of the words in each sentence. This was done with Apache Spark ML.
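The TF-IDF plus Naive Bayes idea can be illustrated without Spark. The following is a pure-Python sketch, not the project's Spark ML code: all function names and the toy data are hypothetical, and the smoothed IDF formula log((n + 1) / (df + 1)) is chosen to match the one documented for Spark ML's IDF.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for each term."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((n + 1) / (c + 1)) for t, c in df.items()}


def tfidf(doc, idf):
    """Weight raw term counts by idf; terms unseen during fitting are ignored."""
    return {t: tf * idf[t] for t, tf in Counter(doc).items() if t in idf}


def train_nb(vectors, labels, alpha=1.0):
    """Multinomial Naive Bayes using TF-IDF weights in place of raw counts."""
    vocab = {t for vec in vectors for t in vec}
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            sums[label][term] += weight
    model = {}
    for label, count in Counter(labels).items():
        # Laplace smoothing with parameter alpha over the whole vocabulary.
        denom = sum(sums[label].values()) + alpha * len(vocab)
        log_like = {t: math.log((sums[label][t] + alpha) / denom) for t in vocab}
        model[label] = (math.log(count / len(labels)), log_like)
    return model


def predict(model, vec):
    """Return the label with the highest log posterior score."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(w * log_like[t] for t, w in vec.items())
    return max(model, key=score)


# Toy labeled data standing in for the training reviews.
train_docs = [["great", "book", "loved", "it"], ["wonderful", "story"],
              ["boring", "book", "hated", "it"], ["terrible", "story"]]
train_labels = ["pos", "pos", "neg", "neg"]

idf = fit_idf(train_docs)
model = train_nb([tfidf(d, idf) for d in train_docs], train_labels)
print(predict(model, tfidf(["loved", "story"], idf)))  # pos
```

The IDF table is fitted on the training documents only and reused at prediction time, which mirrors the usual fit/transform split.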
- Further analysis and visualization are needed to reach conclusions.
- Port XIA-NB classifier to run on GPU.
The bar chart was adapted from Brice Pierre de la Briere's article. The red bars represent average book ratings for titles where the LSTM network predicted more negative reviews than positive ones. The larger number of blue bars suggests that the Goodreads rating system is broadly representative of user sentiment.
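The coloring rule for the bars can be sketched as follows. This is an illustration of the rule as described, not the chart's actual code; the function name and input shape are hypothetical, and treating ties as blue is an assumption.

```python
from collections import defaultdict


def bar_colors(predictions):
    """predictions: iterable of (title, sentiment) pairs, sentiment is +1 or -1."""
    balance = defaultdict(int)
    for title, sentiment in predictions:
        balance[title] += sentiment
    # Red when negative predictions outnumber positive ones;
    # blue otherwise (ties counted as blue, an assumption).
    return {title: "red" if total < 0 else "blue"
            for title, total in balance.items()}


colors = bar_colors([("A", 1), ("A", -1), ("A", -1), ("B", 1)])
print(colors)  # {'A': 'red', 'B': 'blue'}
```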
These graphs were generated with code adapted from Martin Chorley's article. The nodes are colored by genre, and their radii vary with the average rating of the title. Each node's position in the y-direction is given by the rating multiplied by the sentiment (+1 or -1).
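As a rough sketch of the node data behind such a graph, the snippet below builds one record per title. The field names, the helper function, and the linear radius scale are assumptions made for illustration; only the y rule (rating times sentiment) comes from the description above.

```python
import json


def build_node(title, genre, avg_rating, sentiment, radius_scale=3.0):
    """Build one node record; sentiment must be +1 or -1."""
    if sentiment not in (1, -1):
        raise ValueError("sentiment must be +1 or -1")
    return {
        "title": title,
        "genre": genre,                       # used to pick the node color
        "radius": radius_scale * avg_rating,  # assumed linear scale
        "y": avg_rating * sentiment,          # positive above, negative below
    }


nodes = [
    build_node("Book A", "fantasy", 4.0, 1),
    build_node("Book B", "horror", 3.5, -1),
]
print(json.dumps(nodes, indent=2))
```

Serializing the list as JSON gives a structure that a D3.js script could load directly.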
This force-directed graph was generated with code adapted from Martin Chorley's article and from Mike Bostock's code, found here.
- Web scraping: Scrapy (1.4.0), Selenium (3.8.0), PyMySQL (0.8.0).
- ML and computation: Pandas (0.22.0), NumPy (1.14.2), SQLAlchemy (1.2.7).
- Dataviz: D3.js (version 4), seaborn (0.9.0).
- Install dependencies:
$ python -m virtualenv goodreads
$ source goodreads/bin/activate
$ pip install -r requirements.txt
- Create SQL table to store Goodreads review data:
CREATE TABLE `reviews` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`title` varchar(128) NOT NULL,
`genre` varchar(255) NOT NULL,
`link_url` varchar(255) NOT NULL,
`book_url` varchar(255) NOT NULL,
`user` varchar(32) NOT NULL,
`reviewDate` varchar(32) NOT NULL,
`review` text NOT NULL,
`rating` varchar(24) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=502 DEFAULT CHARSET=latin1;
- Run Scrapy web crawler:
$ cd utils
$ scrapy crawl goodreads
In pipelines.py, you may add certain words to the words_to_filter array in the RequiredFieldsPipeline class to filter the reviews.
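For orientation, a Scrapy item pipeline with such a filter list might look roughly like the sketch below. Only the class name and the words_to_filter attribute come from this README; the example words, the required field names, and the choice to drop reviews that contain a filtered word are assumptions about how the real pipelines.py behaves.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch also runs without Scrapy installed.
    class DropItem(Exception):
        pass


class RequiredFieldsPipeline:
    # Hypothetical entries; add your own words here.
    words_to_filter = ["spoiler", "giveaway"]
    # Assumed required fields, taken from the reviews table columns.
    required_fields = ("title", "review", "rating")

    def process_item(self, item, spider):
        # Drop items that are missing any required field.
        for field in self.required_fields:
            if not item.get(field):
                raise DropItem("missing field: %s" % field)
        # Drop reviews that contain a filtered word (case-insensitive).
        text = item["review"].lower()
        for word in self.words_to_filter:
            if word in text:
                raise DropItem("review contains filtered word: %s" % word)
        return item
```

Scrapy calls process_item once per scraped item; raising DropItem stops the item from reaching later pipelines such as the one that writes to MySQL.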
- Choose a classification algorithm to run: change to the goodreads/models directory and run one of the following.
- LSTM network:
python train_eval_pipeline.py --train
- Naive Bayes with Spark ML: open and run the SparkSentimentAnalysis.ipynb notebook.
- Visualize data:
  1. Start a PHP server in the goodreads/visualization directory:
$ php -S localhost:8000
     If you use python -m http.server instead, you will get the error "Failed to load http://localhost:8000/data.php: No 'Access-Control-Allow-Origin' header is present on the requested resource...".
  2. Open index.html in a browser.
- Adit Deshpande's article on oreilly.com.
- The Naive Bayes Classifier by the Text Mining Group, Nanjing University of Science & Technology.