Yelp-Data-Challenge

This project showcases building a recommendation engine using the Yelp dataset. The project covers the following phases:

Extracting data from JSON files and storing it as Parquet files using PySpark.
Creating the necessary training dataset from the stored Parquet files using PySpark.
Training, cross-validating, and testing a recommendation model using the alternating least squares (ALS, collaborative filtering) algorithm available in PySpark.
Deploying the trained model as a Flask API.

Obtaining the Docker Image

The Docker images can be found in Nandeesh's repository on Docker Hub.

To get the Docker image, use the following pull command.

docker pull nandee/jupyter-pyspark

Note: Hardware requirements for running the Spark jobs in the container

Since the Spark jobs run locally, some jobs in the container need around 16 GB of RAM, so the minimum requirement is to allocate more than 16 GB of RAM to Docker. If possible, also allocate more than 4 CPUs.

To make these changes, follow the instructions in this Stack Overflow answer.

OR

If you have a YARN cluster or any other suitable cluster available to run the jobs, pass it to spark-submit using --master <master-url>, and specify --deploy-mode <deploy-mode> depending on whether you want to deploy the driver on the worker nodes (cluster) or locally as an external client (client).
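
As a rough illustration of the two flags described above, the following sketch assembles a spark-submit command line in Python. The script name train_model.py is hypothetical (the project itself runs its jobs from notebooks), and the master URL would be whatever your cluster uses.

```python
import shlex
from typing import List


def build_spark_submit(script: str, master: str, deploy_mode: str = "client") -> List[str]:
    """Assemble a spark-submit argument list for the given master and deploy mode."""
    # spark-submit accepts exactly these two deploy modes.
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"deploy_mode must be 'client' or 'cluster', got {deploy_mode!r}")
    return ["spark-submit", "--master", master, "--deploy-mode", deploy_mode, script]


# Hypothetical job script; replace with your own application.
cmd = build_spark_submit("train_model.py", master="yarn", deploy_mode="cluster")
print(shlex.join(cmd))
# prints: spark-submit --master yarn --deploy-mode cluster train_model.py
```

The resulting list can be handed to subprocess.run as-is, which avoids shell quoting problems.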

Running the Image

To run in detached mode

docker run --rm -d -p 8888:8888 -p 9001:9001 -p 4040-4042:4040-4042 \
                   -e GRANT_SUDO="yes" \
                   --user root \
                   --name jupyter \
                   nandee/yelp-data-challenge

Enter the following command to get the URL to access the notebook.

docker logs jupyter

Alternatively, run the following command in interactive mode and copy the URL shown in the terminal into a browser to access the notebook.

docker run -it --rm -p 8888:8888 -p 9001:9001 -p 4040-4042:4040-4042 \
                    -e GRANT_SUDO="yes" \
                    --user root \
                    --name jupyter \
                    nandee/yelp-data-challenge
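
For the detached-mode container, the notebook URL can also be recovered from the docker logs output with a small script. This is a minimal sketch, assuming the Jupyter startup banner contains a ?token=... query string and that port 8888 is published on localhost as in the run commands above; the function names are illustrative only.

```python
import re
import subprocess
from typing import Optional

# Matches the access token in URLs such as http://127.0.0.1:8888/?token=abc123
TOKEN_PATTERN = re.compile(r"[?&]token=([0-9a-fA-F]+)")


def notebook_url_from_logs(log_text: str, host: str = "localhost", port: int = 8888) -> Optional[str]:
    """Build a notebook URL from the last token found in the log text, or return None."""
    tokens = TOKEN_PATTERN.findall(log_text)
    if not tokens:
        return None
    return f"http://{host}:{port}/?token={tokens[-1]}"


def read_container_logs(container: str = "jupyter") -> str:
    """Return the combined stdout and stderr of `docker logs <container>`."""
    # The startup banner may be written to stderr, so capture both streams.
    result = subprocess.run(
        ["docker", "logs", container], capture_output=True, text=True, check=True
    )
    return result.stdout + result.stderr
```

Calling notebook_url_from_logs(read_container_logs()) returns the URL to open in a browser, or None if no token has been logged yet.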

Download data, extract and convert JSON to Parquet files for storage

After opening the Jupyter Notebook link in a browser, go to the yelp-data-challenge directory and open the Extracting Yelp Dataset.ipynb notebook. Run the commands and Spark jobs in the notebook to download, extract, and convert the data.
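
The core of the conversion step can be sketched as follows. This is not the notebook's actual code: it assumes the Yelp files are newline-delimited JSON (one object per line, which is what spark.read.json expects by default), and the function names and output layout are hypothetical.

```python
from pathlib import PurePosixPath


def parquet_path_for(json_path: str, output_dir: str) -> str:
    """Map e.g. data/review.json to <output_dir>/review.parquet."""
    return str(PurePosixPath(output_dir) / (PurePosixPath(json_path).stem + ".parquet"))


def json_to_parquet(json_path: str, output_dir: str) -> str:
    """Read one JSON file with Spark and store it as Parquet; return the output path."""
    # Imported here so the path helper above works without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("yelp-json-to-parquet")
        .getOrCreate()
    )
    df = spark.read.json(json_path)  # schema is inferred from the data
    target = parquet_path_for(json_path, output_dir)
    df.write.mode("overwrite").parquet(target)
    return target
```

Parquet keeps the inferred schema and is columnar, so later jobs that only need a few fields (for example user, business, and star rating) read far less data than they would from the raw JSON.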

Generate training dataset and train a model

Open the Recommender Model.ipynb notebook and follow the commands and Spark jobs to generate the dataset, train a model, and launch a Flask API to serve the model.
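
A train, cross-validate, and test loop with PySpark's ALS might look roughly like the sketch below. It is an illustration, not the notebook's code: the column names user_idx, business_idx, and stars are assumptions (ALS needs integer user and item IDs, so the string IDs in the Yelp data would first have to be indexed, for example with StringIndexer), and the parameter grid is arbitrary.

```python
from typing import Dict

# Hypothetical search space; the values used in the notebook may differ.
PARAM_GRID: Dict[str, list] = {"rank": [10, 20], "regParam": [0.05, 0.1]}


def count_candidates(grid: Dict[str, list]) -> int:
    """Number of parameter combinations that cross-validation will fit per fold."""
    total = 1
    for values in grid.values():
        total *= len(values)
    return total


def train_als(ratings, num_folds: int = 3):
    """Cross-validate ALS on 80% of `ratings`; return the best model and its test RMSE."""
    # Imported here so the grid helpers above work without Spark installed.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    train, test = ratings.randomSplit([0.8, 0.2], seed=42)
    # coldStartStrategy="drop" removes NaN predictions for unseen users or items,
    # which would otherwise make the RMSE undefined.
    als = ALS(userCol="user_idx", itemCol="business_idx", ratingCol="stars",
              coldStartStrategy="drop")
    grid = (
        ParamGridBuilder()
        .addGrid(als.rank, PARAM_GRID["rank"])
        .addGrid(als.regParam, PARAM_GRID["regParam"])
        .build()
    )
    evaluator = RegressionEvaluator(metricName="rmse", labelCol="stars",
                                    predictionCol="prediction")
    cv = CrossValidator(estimator=als, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=num_folds)
    cv_model = cv.fit(train)
    test_rmse = evaluator.evaluate(cv_model.transform(test))
    return cv_model.bestModel, test_rmse
```

With this grid and three folds, cross-validation fits 4 x 3 = 12 models before refitting the best configuration, which is one reason the memory and CPU allocation matters.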

Using the API to give recommendations to users

Open the Recommender System.ipynb notebook, which shows a few examples of how to use the model.
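
Outside the notebook, the API could be queried with a small client such as the one sketched here. The route /recommend/<user_id>, the count parameter, and the use of port 9001 for the Flask API are all assumptions made for illustration (9001 is merely one of the ports published by the run commands); check the notebook for the actual endpoint.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def recommendation_url(user_id: str, count: int = 5,
                       base_url: str = "http://localhost:9001") -> str:
    """Build the request URL for a hypothetical /recommend/<user_id> route."""
    query = urlencode({"count": count})
    return f"{base_url}/recommend/{quote(user_id, safe='')}?{query}"


def fetch_recommendations(user_id: str, count: int = 5):
    """Call the API and decode its JSON response (assumes the API returns JSON)."""
    with urlopen(recommendation_url(user_id, count), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```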
