YelpDataExploration

Tow datasets (review_small.json and business_small.json) are samples extracted from the Yelp dataset (https://www.yelp.com/dataset).

Sample review data:

{'review_id': '-I5umRTkhw15RqpKMl_o1Q', 'user_id': '-mA3-1mN4JIEkqOtdbNXCQ', 'business_id': 'mRUVMJkUGxrByzMQ2MuOpA', 'stars': 1.0, 'text': "Walked in around 4 on a Friday afternoon, we sat at a table just off the bar and walked out after 5 min or so. Don't even think they realized we walked in. However everyone at the bar noticed we walked in!!! Service was non existent at best. Not a good way for a new business to start out. Oh well, the location they are at has been about 5 different things over the past several years, so they will just be added to the list. SMDH!!!", 'date': '2017-12-15 23:27:08'}

Sample business data:

{'business_id': '1SWheh84yJXfytovILXOAQ', 'name': 'Arizona Biltmore Golf Club', 'address': '2818 E Camino Acequia Drive', 'city': 'Phoenix', 'state': 'AZ', 'postal_code': '85016', 'latitude': 33.5221425, 'longitude': -112.0184807, 'stars': 3.0, 'review_count': 5, 'is_open': 0, 'attributes': {'GoodForKids': 'False'}, 'categories': 'Golf, Active Life', 'hours': None}

TASK 1

Explore the review datasets:

A. The total number of reviews

B. The number of reviews in a given year, y

C. The number of distinct users who have written the reviews

D. Top m users who have the largest number of reviews and its count

E. Top n frequent words (lowercase) in the review text. The following punctuations i.e., “(”, “[”, “,”, “.”, “!”, “?”, “:”, “;”, “]”, “)”, and the given stopwords are excluded

Execution commands: spark-submit task1.py <input_file> <output_file>

input_file – the input file (the review dataset)

output_file – the output file contains answers

stopwords – the file contains the stopwords that will be removed for E

TASK 2

Compute the average stars for each business category and output top n categories with the highest average stars (with an without spark).

Execution commands: spark-submit task2.py <review_file> <business_file > <output_file> <if_spark>

review _file – the input file (the review dataset)

business_file – the input file (the business dataset)

output_file – the output file contains answers

if_spark – either “spark” or “no_spark”

n – top n categories with highest average stars

TASK 3

Compute the businesses that have more than n reviews in the review file. At the same time, show the number of partitions for the RDD and the number of items per partition with either default or customized partition function. The customized one will have higher computational efficiency.

Execution commands: spark-submit task3.py <input_file> <output_file> <partition_type> <n_partitions>

input_file – the input file (the review dataset)

output_file – the output file contains answers

partition_type – the partition function, either “default” or “customized”

n_partitions – the number of partitions (only effective for the customized partition function)

n – the threshold of the number of reviews

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

YelpDataExploration

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
README.md		README.md
business_small.json		business_small.json
review_small.json		review_small.json
task1.py		task1.py
task2.py		task2.py
task3.py		task3.py

mandychumt/YelpReviews

Folders and files

Latest commit

History

Repository files navigation

YelpDataExploration

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages