visnunathan8/Analysis-of-Yelp-dataset-using-Spark-MPI-Pandas

Analysing the Yelp Dataset Using Spark, with a Comparative Study of Distributed Processing Approaches

Dataset (~9 GB)

https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset

Tables used:

Yelp_academic_dataset_business.json

Yelp_academic_dataset_checkin.json

Yelp_academic_dataset_review.json

Yelp_academic_dataset_tip.json

Yelp_academic_dataset_user.json
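
As a minimal sketch of how these five tables might be loaded, the snippet below assumes the files sit together in one directory and that an existing SparkSession is passed in. The helper names `table_paths` and `load_tables`, the short table labels, and the directory are hypothetical and not taken from the repository. The Yelp files store one JSON object per line, which is the layout `spark.read.json` reads by default.

```python
import os

# File names as listed above; the short keys are illustrative labels only.
YELP_TABLES = {
    "business": "Yelp_academic_dataset_business.json",
    "checkin": "Yelp_academic_dataset_checkin.json",
    "review": "Yelp_academic_dataset_review.json",
    "tip": "Yelp_academic_dataset_tip.json",
    "user": "Yelp_academic_dataset_user.json",
}


def table_paths(base_dir):
    """Map each table label to its full path under base_dir."""
    return {name: os.path.join(base_dir, fname)
            for name, fname in YELP_TABLES.items()}


def load_tables(spark, base_dir):
    """Read every table into a DataFrame using an existing SparkSession."""
    return {name: spark.read.json(path)
            for name, path in table_paths(base_dir).items()}
```

With a session in hand, `load_tables(spark, "/path/to/yelp")["review"]` would give the review DataFrame; the path here is a placeholder.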


Architecture Diagram

image


Spark Features Implementation

  1. Persistence
  2. Lazy evaluation
  3. Fault tolerance
  4. Parallelism
  5. Data Partitioning
  6. Transparency

1. Persistence

image
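
A minimal sketch of persistence in PySpark, assuming a business DataFrame with `stars` and `city` columns (as in the Yelp business table). The function name `reuse_filtered` and the 4-star threshold are illustrative assumptions, not the repository's code. The filtered DataFrame is persisted so that the second action reuses the cached rows instead of re-reading the JSON source.

```python
def reuse_filtered(business_df, min_stars=4.0):
    """Run two actions on one filtered DataFrame, caching it in between."""
    # persist() only marks the DataFrame for caching; nothing runs yet.
    top = business_df.filter(f"stars >= {min_stars}").persist()
    try:
        n_rows = top.count()                              # computes and caches
        n_cities = top.select("city").distinct().count()  # served from cache
    finally:
        top.unpersist()                                   # free cached blocks
    return n_rows, n_cities
```

Calling `persist()` without arguments uses the default storage level for DataFrames; a specific level can be passed via `pyspark.StorageLevel` if memory is tight.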

2. Lazy evaluation

image
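
Lazy evaluation can be illustrated with a short sketch, assuming a review DataFrame with `stars` and `business_id` columns. The function name `count_five_star` is hypothetical. The `filter` and `select` calls only extend the logical plan; Spark reads data and runs a job only when the `count()` action is called.

```python
def count_five_star(review_df):
    """Build a plan lazily, then trigger it with a single action."""
    # Transformations: return immediately and only describe the computation.
    five_star = review_df.filter("stars = 5").select("business_id")
    # Action: this is the point where Spark actually schedules and runs a job.
    return five_star.count()
```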

Running Spark on a Distributed System of Multiple Computers (1 master and 2 workers)

image image image image
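
A hedged sketch of how a PySpark application might connect to such a standalone cluster. The host name and application name are placeholders, and the helper names are hypothetical; port 7077 is the default for a Spark standalone master. `pyspark` is imported inside `build_session` so the URL helper can be used without it.

```python
def standalone_master_url(host, port=7077):
    """Build the spark:// URL that workers and drivers use to reach the master."""
    return f"spark://{host}:{port}"


def build_session(host, app_name="yelp-analysis"):
    """Create (or reuse) a SparkSession attached to the standalone master."""
    from pyspark.sql import SparkSession  # requires pyspark to be installed

    return (
        SparkSession.builder
        .master(standalone_master_url(host))
        .appName(app_name)
        .getOrCreate()
    )
```

On the cluster side, the master is typically started with `sbin/start-master.sh` and each worker with `sbin/start-worker.sh` followed by the same `spark://` URL.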

3. Fault tolerance

image

4. Parallelism

image

image
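
As a small sketch of how the available parallelism can be inspected, assuming an existing SparkSession and DataFrame are passed in. The function name `parallelism_info` is hypothetical. `defaultParallelism` reflects the cores Spark can use (for example 8 with a `local[8]` master), and the partition count bounds how many tasks a stage can run at once.

```python
def parallelism_info(spark, df):
    """Report the cluster's default parallelism and the DataFrame's partitions."""
    return {
        "default_parallelism": spark.sparkContext.defaultParallelism,
        "partitions": df.rdd.getNumPartitions(),
    }
```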

5. Data Partitioning

image

image
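
A minimal sketch of explicit repartitioning, assuming a DataFrame and a column to hash on (for example `city` in the business table). The function name `partition_by_column` and the choice of column are illustrative assumptions. Rows with the same column value land in the same partition, which can reduce shuffling in later group-by or join steps on that column.

```python
def partition_by_column(df, column, num_partitions):
    """Hash-partition df on one column and report partition counts."""
    before = df.rdd.getNumPartitions()
    # repartition(n, col) performs a full shuffle into n hash partitions.
    repartitioned = df.repartition(num_partitions, column)
    after = repartitioned.rdd.getNumPartitions()
    return repartitioned, before, after
```

When the goal is only to reduce the number of partitions, `coalesce` avoids the full shuffle that `repartition` triggers.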

6. Transparency - Data Lineage

image

image

image

Ensuring transparency in Spark data processing with the explain() method

image
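
A short sketch of using `explain()` on a query, assuming a business DataFrame with `stars` and `city` columns. The function name `explain_city_counts` and the query itself are hypothetical examples. Passing `True` prints the parsed, analyzed, and optimized logical plans along with the physical plan, without executing the query.

```python
def explain_city_counts(business_df):
    """Build a small aggregation and print its query plans."""
    plan_df = business_df.filter("stars >= 4").groupBy("city").count()
    # explain(True) prints the extended plan; no job is run by this call.
    plan_df.explain(True)
    return plan_df
```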

PySpark execution of the use case on the 9 GB dataset on a local system (8-core macOS)

image

Yelp Dataset Analysis & Comparative Analysis of Distributed Programming Techniques

image

https://public.tableau.com/app/profile/san.vinoth/viz/YelpDatasetComparativeAnalysis/YelpAnalysis?publish=yes

Additional Work (Hadoop Cluster)

image

Project Components:

image

Learning

Spark

Databricks

Azure Blob

Tableau

Setting up standalone clusters

Hadoop environment setup

Teamwork
