- The dataset is taken from Kaggle.
- It consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Each review includes product and user information, a rating, and a plain-text review. The dataset also includes reviews from all other Amazon categories.
The objective is to determine whether a review is positive or negative and to build a machine learning model that predicts this.
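
One way to turn the 1 to 5 star rating into a positive/negative target is sketched below. The thresholds are an assumption, not something stated here: ratings of 4 and 5 are treated as positive, 1 and 2 as negative, and 3 as neutral and dropped. The function name `score_to_label` is hypothetical.

```python
from typing import Optional


def score_to_label(score: int) -> Optional[int]:
    """Map a 1-5 star rating to a binary sentiment label.

    Assumed convention (not confirmed by the source):
    4-5 -> 1 (positive), 1-2 -> 0 (negative), 3 -> None (neutral, dropped).
    """
    if score not in (1, 2, 3, 4, 5):
        raise ValueError(f"Score must be between 1 and 5, got {score}")
    if score == 3:
        return None
    return 1 if score > 3 else 0


# Toy ratings, made up for illustration.
scores = [5, 1, 3, 4, 2]
labels = [score_to_label(s) for s in scores]

# Drop the neutral reviews before training.
kept = [label for label in labels if label is not None]
print(labels)
print(kept)
```

Dropping the neutral reviews keeps the task strictly binary; a different cut-off (for example, counting 3 as negative) would change the class balance.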
- Reviews from Oct 1999 - Oct 2012
- 568,454 reviews
- 256,059 users
- 74,258 products
- 260 users with > 50 reviews
- Id
- ProductId - Unique identifier for the product
- UserId - Unique identifier for the user
- ProfileName - Profile name of user
- HelpfulnessNumerator - Number of users who found the review helpful
- HelpfulnessDenominator - Number of users who indicated whether they found the review helpful or not
- Score - Rating between 1 and 5
- Time - Timestamp for the review
- Summary - Brief summary of the review
- Text - Text of the review
- Convert everything to lowercase
- Remove HTML tags
- Remove URLs from the text
- Expand contractions using a contraction mapping (e.g. "can't" to "cannot")
- Remove punctuation and special characters
- Remove stopwords
- Remove short words
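
The cleaning steps above can be sketched with the standard library alone, applied in the order listed. This is a minimal illustration, not the project's actual code: the contraction map and stopword set are tiny hypothetical subsets, and the minimum word length of 3 is an assumed threshold.

```python
import re

# Tiny illustrative subsets (hypothetical); a real run would use fuller lists.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
}
STOPWORDS = {"a", "an", "the", "is", "it", "this", "and", "or",
             "of", "to", "i", "at", "in", "for", "am", "was"}


def clean_review(text: str, min_len: int = 3) -> str:
    """Apply the listed cleaning steps, in order, to one review."""
    text = text.lower()                                  # lowercase
    text = re.sub(r"<[^>]+>", " ", text)                 # remove HTML tags
    text = re.sub(r"(https?://|www\.)\S+", " ", text)    # remove URLs
    for short, full in CONTRACTIONS.items():             # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)                # drop punctuation, digits, symbols
    tokens = [
        tok for tok in text.split()
        if tok not in STOPWORDS and len(tok) >= min_len  # stopwords, short words
    ]
    return " ".join(tokens)


raw = "I can't believe it's SO good!<br />Visit http://example.com for more."
print(clean_review(raw))
```

Contractions are expanded before punctuation is stripped so that the apostrophes are still present to match on. Note that negations such as "not" are deliberately left out of the stopword set here, since they carry sentiment.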
- Confusion Matrix
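
For a binary classifier, the confusion matrix counts true negatives, false positives, false negatives, and true positives. The sketch below computes it in plain Python on made-up labels (1 = positive, 0 = negative); the helper name `confusion_counts` is hypothetical, and the numbers are not results from this project.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels (0 = negative, 1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = Counter(zip(y_true, y_pred))  # keys are (actual, predicted)
    return {
        "tn": pairs[(0, 0)],
        "fp": pairs[(0, 1)],
        "fn": pairs[(1, 0)],
        "tp": pairs[(1, 1)],
    }


# Toy labels, made up for illustration.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

cm = confusion_counts(y_true, y_pred)
accuracy = (cm["tp"] + cm["tn"]) / len(y_true)
precision = cm["tp"] / (cm["tp"] + cm["fp"])
recall = cm["tp"] / (cm["tp"] + cm["fn"])
print(cm, accuracy, precision, recall)
```

Precision and recall are worth reading alongside accuracy, because accuracy alone can look good on imbalanced review data.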
- Logistic regression with TF-IDF and with BOW (bag-of-words) features gives more accurate results than the other models.
- Random Forest with BOW features also gives better results than Naive Bayes.
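
To make the two feature representations concrete, the sketch below builds bag-of-words (BOW) counts and TF-IDF weights in plain Python on three toy documents. It is a dependency-free illustration, not the project's code: the function names are hypothetical, and the IDF formula `log(N / df) + 1` is only one of several common variants, so the weights will not necessarily match those of any particular library.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix); one row per document."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def tf_idf(docs):
    """Weight raw counts by an inverse document frequency term."""
    vocab, bow = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
    # Assumed IDF variant: log(N / df) + 1.
    idf = [math.log(n_docs / d) + 1.0 for d in df]
    weights = [[row[j] * idf[j] for j in range(len(vocab))] for row in bow]
    return vocab, weights


# Toy documents, already cleaned; made up for illustration.
docs = ["great taste great price", "bad taste", "great coffee"]

vocab, bow = bag_of_words(docs)
_, weights = tf_idf(docs)
print(vocab)
print(bow)
print(weights)
```

BOW keeps raw counts, so frequent words dominate. TF-IDF scales those counts down for terms that appear in many documents, so in the second document the rarer "bad" ends up with a higher weight than "taste".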