- Homepage: Amazon Fine Food Reviews
- Repository: MLOPS-SentiBites
- Paper: J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. WWW, 2013.
This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain text review in English. It also includes reviews from all other Amazon categories.
The objective is binary sentiment classification: given a review, determine whether it is positive (rating of 4 or 5) or negative (rating of 1 or 2).
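The labeling rule above can be sketched as a small helper. This is only an illustration, not the project's actual code: the name `score_to_label` is hypothetical, and returning `None` for a score of 3 is an assumption, since the card does not say how neutral ratings are handled.

```python
from typing import Optional


def score_to_label(score: int) -> Optional[str]:
    """Map a 1-5 star Score to a sentiment label.

    Scores of 4 or 5 are positive and scores of 1 or 2 are negative,
    as described in the card. A score of 3 returns None (assumption:
    the card does not define a label for neutral ratings).
    """
    if score not in (1, 2, 3, 4, 5):
        raise ValueError(f"Score must be an integer from 1 to 5, got {score!r}")
    if score >= 4:
        return "positive"
    if score <= 2:
        return "negative"
    return None


print(score_to_label(5))  # positive
print(score_to_label(1))  # negative
```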
English
{
'id': 1,
'ProductId': "B001E4KFG0",
'UserId': "A3SGXH7AUHU8GW",
'ProfileName': "delmartian",
'HelpfulnessNumerator': 1,
'HelpfulnessDenominator': 1,
'Score': 5,
'Time': 1303862400,
'Summary': "Good Quality Dog Food",
'Text': "I have bought several of the Vitality canned dog food products and have found them all to be of good..."
}
- id: unique identifier for the review (integer)
- ProductId: unique identifier for the product (string)
- UserId: unique identifier for the user (string)
- ProfileName: profile name of the user (string)
- HelpfulnessNumerator: number of users who found the review helpful (integer)
- HelpfulnessDenominator: number of users who indicated whether they found the review helpful or not (integer)
- Score: rating between 1 and 5 (integer, output)
- Time: timestamp for the review (timestamp)
- Summary: brief summary of the review (string)
- Text: text of the review (string)
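As a small illustration of how two of these fields might be interpreted: the sample `Time` value looks like a Unix timestamp in seconds (an assumption based on its magnitude), and the two helpfulness counts can be combined into a ratio. The helper names `review_date` and `helpfulness_ratio` are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def review_date(unix_seconds: int) -> str:
    """Convert a Time value to an ISO date, assuming Unix seconds in UTC."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).date().isoformat()


def helpfulness_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Share of voters who found the review helpful, or None if nobody voted."""
    if denominator == 0:
        return None
    return numerator / denominator


# Values taken from the sample instance above.
print(review_date(1303862400))  # 2011-04-27
print(helpfulness_ratio(1, 1))  # 1.0
```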
Tag Set
annotations_creators: []
language:
- en
language_creators:
- found
license:
- cc0-1.0
multilinguality:
- monolingual
pretty_name: Food Reviews
size_categories:
- 100K<n<1M
source_datasets: []
tags: []
task_categories:
- text-classification
task_ids:
- sentiment-analysis
We will validate the model with a standard train/test split: 70% of the dataset will be used to train the model, and the remaining 30% will be used to test it.
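A 70/30 split of this kind can be sketched with the standard library alone. The function name `split_train_test`, the shuffling step, and the fixed seed are assumptions made for illustration; the card does not say how the split is actually drawn, and libraries such as scikit-learn provide an equivalent `train_test_split` utility.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    rows: Sequence[T], test_fraction: float = 0.3, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle rows and split them into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for 50,000 review rows; real rows would be dicts like the sample above.
train, test = split_train_test(range(50_000))
print(len(train), len(test))  # 35000 15000
```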
The dataset was created mainly for learning and academic reasons.
The text reviews are from Amazon.
The data was collected from a pre-existing dataset on Kaggle.
Given the size of the dataset, we have only used a subset of the data for this project. Duplicate reviews were removed and then we took a random sample of 50,000 reviews.
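The subsetting step could look roughly like the sketch below. The card does not state what counts as a duplicate, so matching on `UserId` and `Text` is an assumption, as are the hypothetical function name `dedupe_and_sample` and the seed.

```python
import random
from typing import Dict, Iterable, List, Sequence


def dedupe_and_sample(
    reviews: Iterable[Dict],
    sample_size: int = 50_000,
    key_fields: Sequence[str] = ("UserId", "Text"),  # assumption: duplicate key
    seed: int = 0,
) -> List[Dict]:
    """Drop duplicate reviews (keeping the first), then draw a random sample."""
    seen = set()
    unique = []
    for review in reviews:
        key = tuple(review[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(review)
    # Sample without replacement; return everything if fewer rows remain.
    if len(unique) <= sample_size:
        return unique
    return random.Random(seed).sample(unique, sample_size)


# Tiny made-up rows, only to show the behavior.
demo = [
    {"UserId": "u1", "Text": "Great"},
    {"UserId": "u1", "Text": "Great"},  # duplicate of the first row
    {"UserId": "u2", "Text": "Stale"},
    {"UserId": "u3", "Text": "Fine"},
]
subset = dedupe_and_sample(demo, sample_size=2)
print(len(subset))  # 2
```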
Not specified.
The dataset contains information about the users, such as their profile name, and about the products, such as the product ID. However, it does not contain other personal information about the users, such as their real name or email address.
This dataset is intended to be used in a learning context, so it's unlikely to have a significant social impact.
There is a risk of bias in the dataset: all reviews come from Amazon, so they are likely to be biased towards products sold on Amazon. The reviews are also in English, so they are likely to be biased towards products sold in English-speaking countries.
There are no other known limitations.
Unknown.
@inproceedings{mcauley2013amateurs,
author = {Julian McAuley and Jure Leskovec},
title = {From Amateurs to Connoisseurs: Modeling the Evolution of User Expertise through Online Reviews},
booktitle = {WWW},
year = {2013}
}