GitHub - JohnSell620/sentiment-analysis-goodreads-reviews: Document-level sentiment analysis of book reviews scraped from the Goodreads website. Technologies used include TensorFlow, Spark, HDFS, Sqoop, Scrapy, and D3.js.

Overview

This project goes through the entire data science pipeline in an attempt to better understand book review data on the Goodreads website. The main objective was to examine the sentiments of user reviews and book ratings across numerous genres. The approach was to train a model on a large amount of labeled data so that it generalizes well enough to classify the relatively small, unlabeled Goodreads review data set.

This work examines these relationships as an NLP problem, namely a document-level sentiment classification problem. Sentiment classifications are made, and then data visualization techniques are used to gain insight into the review-rating-genre relationship.

Three machine learning techniques were used in this project to obtain classifications. The first uses a pretrained RNN with long short-term memory (LSTM) units and a pretrained GloVe model; both were pretrained by Adit Deshpande and may be found here. The embeddings were trained using the word vector generation model GloVe. The word embedding matrix contains 400,000 word vectors, each of dimensionality 50. The RNN was trained on the IMDb movie review dataset containing 12,500 positive and 12,500 negative reviews.
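To make the embedding lookup concrete, here is a minimal sketch of how a review could be turned into a fixed-length sequence of row indices into a word-vector matrix. The toy vocabulary, the 3-dimensional vectors, the max_len value and the function names are illustrative assumptions; the real matrix has 400,000 rows of 50 values, and this is not the pretrained model's actual preprocessing code.

```python
import re

# Toy stand-in for the GloVe vocabulary and matrix (assumption: the real
# matrix has 400,000 rows of 50 floats; 3 dimensions keep the example short).
VOCAB = ["<unk>", "the", "book", "was", "great", "boring"]
WORD_TO_ID = {word: i for i, word in enumerate(VOCAB)}
EMBEDDINGS = [
    [0.0, 0.0, 0.0],   # <unk>, also used as padding in this sketch
    [0.1, 0.0, 0.2],   # the
    [0.4, 0.3, 0.1],   # book
    [0.0, 0.2, 0.1],   # was
    [0.9, 0.8, 0.7],   # great
    [-0.8, -0.6, 0.2], # boring
]


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def review_to_ids(text, max_len=8):
    """Map each token to its row index, then truncate or pad to max_len."""
    ids = [WORD_TO_ID.get(tok, 0) for tok in tokenize(text)][:max_len]
    return ids + [0] * (max_len - len(ids))


def embed(ids):
    """Look up one vector per index; the result is max_len x dimension."""
    return [EMBEDDINGS[i] for i in ids]


ids = review_to_ids("The book was GREAT!")
vectors = embed(ids)
```

The resulting sequence of vectors is the kind of input an LSTM consumes one time step at a time.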

The second classification method trains a bidirectional LSTM network using pretrained fastText embeddings from here.

The third classification method uses a Naive Bayes model trained on a feature matrix built from the TF-IDF weights of the words in each sentence. This was done with Apache Spark ML.
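The idea behind the TF-IDF plus Naive Bayes approach can be sketched in plain Python. This is a standard-library stand-in for illustration only, not the project's Spark ML code; the whitespace tokenizer, the smoothed IDF formula, the alpha smoothing value and the four training sentences are all assumptions.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency: log((n + 1) / (df + 1))."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((n + 1) / (count + 1)) for term, count in df.items()}


def tfidf(doc, idf):
    """TF-IDF weights for one document; unseen terms are dropped."""
    tf = Counter(tokenize(doc))
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def train_nb(docs, labels, idf, alpha=1.0):
    """Multinomial Naive Bayes over TF-IDF weights with additive smoothing."""
    class_counts = Counter(labels)
    weight = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        for term, w in tfidf(doc, idf).items():
            weight[label][term] += w
    vocab = list(idf)
    model = {}
    for label, count in class_counts.items():
        total = sum(weight[label].values()) + alpha * len(vocab)
        log_prior = math.log(count / len(docs))
        log_like = {t: math.log((weight[label][t] + alpha) / total) for t in vocab}
        model[label] = (log_prior, log_like)
    return model


def predict(doc, model, idf):
    """Return the label with the highest log posterior score."""
    feats = tfidf(doc, idf)
    scores = {
        label: log_prior + sum(w * log_like[t] for t, w in feats.items())
        for label, (log_prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)


# Made-up training sentences, for illustration only.
docs = [
    "loved this wonderful book",
    "great story loved it",
    "boring and terrible plot",
    "terrible writing boring book",
]
labels = ["pos", "pos", "neg", "neg"]
idf = fit_idf(docs)
model = train_nb(docs, labels, idf)
```

Calling predict("a wonderful story", model, idf) then scores the sentence under each class and returns the more likely label.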

TODO

Further analysis and visualization are needed to reach conclusions.
Port XIA-NB classifier to run on GPU.

Latest Results

The bar chart was adapted from Brice Pierre de la Briere's article. The red bars represent average book ratings where the LSTM network predicted more negative reviews than positive ones. The larger number of blue bars suggests that the Goodreads rating system is representative of user sentiments.
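The red/blue rule described above can be sketched as a small aggregation step. The grouping key, the function name, and the choice to color ties blue are assumptions made for illustration; the project's visualization code may group or break ties differently.

```python
from collections import defaultdict


def bar_colors(predictions):
    """predictions: iterable of (group, sentiment), sentiment being +1 or -1.

    A group is red when negative predictions outnumber positive ones,
    otherwise blue (ties are treated as blue in this sketch).
    """
    counts = defaultdict(lambda: {"pos": 0, "neg": 0})
    for group, sentiment in predictions:
        counts[group]["pos" if sentiment > 0 else "neg"] += 1
    return {
        group: "red" if c["neg"] > c["pos"] else "blue"
        for group, c in counts.items()
    }


# Placeholder groups and predictions, not project data.
predictions = [
    ("A", 1), ("A", -1), ("A", -1),
    ("B", 1), ("B", 1),
    ("C", 1), ("C", -1),
]
colors = bar_colors(predictions)
```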

These graphs were generated with code adapted from Martin Chorley's article. The nodes are colored by genre, and their radii vary by the average rating of the title. Positions in the y-direction are given by the rating multiplied by the sentiment (+1 or -1).
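As a rough illustration of the node layout rule, the sketch below derives the attributes a node would need before being handed to D3. The function name, the dictionary keys and the linear radius_scale factor are hypothetical; only the y = rating * sentiment rule comes from the description above.

```python
def node_attributes(title, genre, avg_rating, sentiment, radius_scale=4.0):
    """Build the plotting attributes for one title.

    sentiment must be +1 (positive) or -1 (negative). The radius scaling
    is an assumption; the text only says radii vary with average rating.
    """
    if sentiment not in (1, -1):
        raise ValueError("sentiment must be +1 or -1")
    return {
        "title": title,
        "genre": genre,                       # used to pick the node color
        "radius": radius_scale * avg_rating,  # larger rating, larger node
        "y": avg_rating * sentiment,          # rating multiplied by sentiment
    }


# Placeholder values, not project data.
node = node_attributes("Example Title", "fantasy", 4.25, -1)
```

With this rule, negatively classified titles land below the axis and positively classified ones above it, at a distance given by their rating.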

This force-directed graph was generated with code adapted from Martin Chorley's article and Mike Bostock's here.

Dependencies

Web scraping: Scrapy (1.4.0), Selenium (3.8.0), PyMySQL (0.8.0).
ML and computation: Pandas (0.22.0), NumPy (1.14.2), SQLAlchemy (1.2.7).
Dataviz: D3.js (version 4), seaborn (0.9.0).

Usage

Install dependencies:

$ python -m virtualenv goodreads
$ source goodreads/bin/activate
$ pip install -r requirements.txt

Create SQL table to store Goodreads review data:

CREATE TABLE `reviews` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `title` varchar(128) NOT NULL,
 `genre` varchar(255) NOT NULL,
 `link_url` varchar(255) NOT NULL,
 `book_url` varchar(255) NOT NULL,
 `user` varchar(32) NOT NULL,
 `reviewDate` varchar(32) NOT NULL,
 `review` text NOT NULL,
 `rating` varchar(24) NOT NULL,
 PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=502 DEFAULT CHARSET=latin1;
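For orientation, a row could be written to this table from Python roughly as follows. This is a sketch under assumptions: the helper names and the sample values are hypothetical, and it is not taken from the project's scraper. The connection argument is expected to be a PyMySQL connection (for example, one returned by pymysql.connect), which uses %s placeholders for parameters.

```python
# Column names copied from the CREATE TABLE statement (id is auto-generated).
COLUMNS = (
    "title", "genre", "link_url", "book_url",
    "user", "reviewDate", "review", "rating",
)


def build_insert(review):
    """Return a parameterized INSERT statement and its parameter tuple."""
    column_list = ", ".join(f"`{c}`" for c in COLUMNS)
    placeholders = ", ".join(["%s"] * len(COLUMNS))
    sql = f"INSERT INTO `reviews` ({column_list}) VALUES ({placeholders})"
    params = tuple(review[c] for c in COLUMNS)
    return sql, params


def save_review(connection, review):
    """Insert one review using an open PyMySQL connection and commit."""
    sql, params = build_insert(review)
    with connection.cursor() as cursor:
        cursor.execute(sql, params)
    connection.commit()


# Hypothetical sample row, for illustration only.
sample = {
    "title": "Example Title",
    "genre": "fantasy",
    "link_url": "https://example.com/list",
    "book_url": "https://example.com/book",
    "user": "reader1",
    "reviewDate": "Jan 01, 2018",
    "review": "An example review.",
    "rating": "really liked it",
}
sql, params = build_insert(sample)
```

Passing values as parameters rather than formatting them into the SQL string lets the driver handle quoting of review text.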

Run Scrapy web crawler:

$ cd utils
$ scrapy crawl goodreads

In pipelines.py, you may add certain words to the words_to_filter array in the RequiredFieldsPipeline class to filter the reviews.
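A word filter of this kind might look like the sketch below. It is an assumption about the behavior, not the project's actual pipelines.py: here a review is dropped when it contains any listed word, the "spoiler" entry is a made-up example, and the item is assumed to expose a "review" field. The fallback DropItem class exists only so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so this sketch can run without Scrapy installed.
    class DropItem(Exception):
        pass


class RequiredFieldsPipeline:
    # Hypothetical entry; add the words you want to filter on.
    words_to_filter = ["spoiler"]

    def process_item(self, item, spider):
        """Drop the item if its review text contains a filtered word."""
        review = item.get("review", "").lower()
        for word in self.words_to_filter:
            # Substring match, so "spoiler" also matches "spoilers".
            if word.lower() in review:
                raise DropItem(f"Review contains filtered word: {word}")
        return item


pipeline = RequiredFieldsPipeline()
kept = pipeline.process_item({"review": "Lovely prose throughout."}, None)
```

Scrapy calls process_item for every scraped item, and raising DropItem stops that item from reaching later pipelines such as the database writer.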

Choose a classification algorithm to run: change to the goodreads/models directory and run one of the following.

LSTM network: python train_eval_pipeline.py --train
Spark ML (Naive Bayes): SparkSentimentAnalysis.ipynb

Visualize data:

1. Start a PHP server in the goodreads/visualization directory: php -S localhost:8000. If you use python -m http.server, you will get the error "Failed to load http://localhost:8000/data.php: No 'Access-Control-Allow-Origin' header is present on the requested resource..."
2. Open index.html in a browser.

Acknowledgements

Adit Deshpande's article on oreilly.com.
The Naive Bayes Classifier by the Text Mining Group, Nanjing University of Science & Technology.
