This project showcases building a recommendation engine using the Yelp dataset. The project covers the following phases:
- Extracting data from JSON files and storing it as Parquet files using PySpark.
- Creating the necessary training dataset from the stored Parquet files using PySpark.
- Training, cross-validating, and testing a recommendation model using the alternating least squares (ALS) collaborative filtering algorithm available in PySpark.
- Deploying the trained model as a Flask API.
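As a rough illustration of the first three phases, the sketch below stores line-delimited JSON as Parquet and then cross-validates an ALS model on it. It is not the notebooks' actual code: the column names (user_id, business_id, stars) are assumed from the public Yelp review file, and the function names, parameter grid, split ratio, and fold count are hypothetical.

```python
"""Sketch of the phases above: JSON -> Parquet -> ALS with cross-validation."""

# Hypothetical hyperparameter grid; the values are illustrative, not tuned.
PARAM_GRID = {"rank": [10, 20], "regParam": [0.05, 0.1]}


def extract_to_parquet(spark, json_path, parquet_path):
    """Read line-delimited JSON and store it as Parquet."""
    spark.read.json(json_path).write.mode("overwrite").parquet(parquet_path)


def train_als(spark, parquet_path, num_folds=3, seed=42):
    """Cross-validate ALS on a training split; return (best model, test RMSE)."""
    # Imported here so this module can be loaded without PySpark installed.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import StringIndexer
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Assumed columns of the Yelp review data.
    reviews = spark.read.parquet(parquet_path).select(
        "user_id", "business_id", "stars"
    )

    # ALS needs numeric ids, so the string ids are indexed first.
    for col in ("user_id", "business_id"):
        indexer = StringIndexer(inputCol=col, outputCol=col + "_idx")
        reviews = indexer.fit(reviews).transform(reviews)

    train, test = reviews.randomSplit([0.8, 0.2], seed=seed)

    als = ALS(
        userCol="user_id_idx",
        itemCol="business_id_idx",
        ratingCol="stars",
        coldStartStrategy="drop",  # skip users/items unseen during fitting
        seed=seed,
    )
    grid = (
        ParamGridBuilder()
        .addGrid(als.rank, PARAM_GRID["rank"])
        .addGrid(als.regParam, PARAM_GRID["regParam"])
        .build()
    )
    evaluator = RegressionEvaluator(
        metricName="rmse", labelCol="stars", predictionCol="prediction"
    )
    cv = CrossValidator(
        estimator=als,
        estimatorParamMaps=grid,
        evaluator=evaluator,
        numFolds=num_folds,
    )
    cv_model = cv.fit(train)
    rmse = evaluator.evaluate(cv_model.transform(test))
    return cv_model.bestModel, rmse
```

Given an existing SparkSession named spark, calling extract_to_parquet(spark, "review.json", "review.parquet") and then train_als(spark, "review.parquet") would return the best model together with its RMSE on the held-out split; both paths are placeholders.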
The Docker images can be found in Nandeesh's repository on Docker Hub.
To get the Docker image, use the following pull command:
docker pull nandee/jupyter-pyspark
Because the Spark jobs run locally, some jobs in the container need around 16 GB of RAM, so allocate more than 16 GB of RAM. If possible, also allocate more than 4 CPUs.
To make these changes, follow the instructions in this Stack Overflow answer.
Alternatively, if you have a YARN cluster or any other appropriate cluster available to run the jobs, pass it to spark-submit using --master <master-url>. Also specify --deploy-mode <deploy-mode>, depending on whether you want to deploy your driver on the worker nodes (cluster) or locally as an external client (client).
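For scripted runs, the same choice can be expressed as a small helper that assembles the spark-submit arguments. This is only a minimal sketch: the helper name, the default master of local[*], and the script name train_model.py are assumptions made for illustration.

```python
import shlex

# The two deploy modes described above.
VALID_DEPLOY_MODES = ("client", "cluster")


def build_spark_submit(app, master="local[*]", deploy_mode=None, app_args=()):
    """Return the spark-submit argument list for a master URL and deploy mode."""
    if deploy_mode is not None and deploy_mode not in VALID_DEPLOY_MODES:
        raise ValueError(f"deploy_mode must be one of {VALID_DEPLOY_MODES}")
    cmd = ["spark-submit", "--master", master]
    if deploy_mode is not None:
        cmd += ["--deploy-mode", deploy_mode]
    cmd.append(app)
    cmd.extend(app_args)
    return cmd


if __name__ == "__main__":
    # "train_model.py" is a hypothetical script name.
    args = build_spark_submit("train_model.py", master="yarn", deploy_mode="cluster")
    print(shlex.join(args))
```

Returning a list rather than a single string keeps the arguments safe to pass to subprocess.run without shell quoting issues.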
docker run --rm -d -p 8888:8888 -p 9001:9001 -p 4040-4042:4040-4042 \
-e GRANT_SUDO="yes" \
--user root \
--name jupyter \
nandee/yelp-data-challenge
Enter the following command to get the URL to access the notebook.
docker logs jupyter
Alternatively, run the following command and copy the URL shown in the terminal into a browser to access the notebook:
docker run -it --rm -p 8888:8888 -p 9001:9001 -p 4040-4042:4040-4042 \
-e GRANT_SUDO="yes" \
--user root \
--name jupyter \
nandee/yelp-data-challenge
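When the container runs in detached mode, the notebook URL can also be pulled out of the logs programmatically. The sketch below assumes that Jupyter prints its access URL with a ?token=<hex> query string, which may differ between Jupyter versions; the function names are hypothetical.

```python
import re
import subprocess

# Assumption: the access URL ends in "?token=<hex digits>". The exact log
# layout depends on the Jupyter version inside the image.
URL_PATTERN = re.compile(r"https?://[^\s]+\?token=[0-9a-f]+")


def find_notebook_urls(log_text):
    """Return the distinct tokenised URLs found in the log text, in order."""
    urls = []
    for url in URL_PATTERN.findall(log_text):
        if url not in urls:
            urls.append(url)
    return urls


def notebook_urls(container="jupyter"):
    """Run `docker logs <container>` and extract the notebook URLs."""
    result = subprocess.run(
        ["docker", "logs", container],
        capture_output=True,
        text=True,
        check=True,
    )
    # Jupyter writes its log messages to stderr, so both streams are searched.
    return find_notebook_urls(result.stdout + result.stderr)


if __name__ == "__main__":
    for url in notebook_urls():
        print(url)
```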
After opening the Jupyter Notebook link in a browser, go to the yelp-data-challenge directory and open the Extracting Yelp Dataset.ipynb notebook.
Run the commands and Spark jobs in the notebook to extract the data from the JSON files and store it as Parquet files.
Open the Recommender Model.ipynb notebook and follow the commands and Spark jobs to generate the dataset, train a model, and launch a Flask API to serve the model.
Open the Recommender System.ipynb notebook, which shows a few examples of how to use the model.
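Outside the notebook, a served model could be queried over HTTP along the following lines. Everything specific here is a placeholder rather than the project's real interface: the /recommend route, the user_id and n parameters, and the JSON response are hypothetical, and port 9001 is assumed only because the docker run commands publish it. Check the notebook for the actual route and parameters.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical values; replace them with the route and port the notebook uses.
BASE_URL = "http://localhost:9001"
ROUTE = "/recommend"


def build_url(user_id, n=5, base_url=BASE_URL):
    """Build the query URL for a user's top-n recommendations."""
    query = urlencode({"user_id": user_id, "n": n})
    return f"{base_url}{ROUTE}?{query}"


def fetch_recommendations(user_id, n=5, base_url=BASE_URL, timeout=10):
    """Call the (assumed) endpoint and decode its JSON response."""
    with urlopen(build_url(user_id, n, base_url), timeout=timeout) as response:
        return json.load(response)
```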