Skip to content
This repository has been archived by the owner on Apr 11, 2019. It is now read-only.

Latest commit

 

History

History
430 lines (284 loc) · 24.5 KB

README.md

File metadata and controls

430 lines (284 loc) · 24.5 KB

WARNING: This repository is no longer maintained ⚠️

This repository will not be updated. The repository will be kept available in read-only mode. Refer to https://github.com/IBM/sms-spam-filter-using-hortonworks for a similar example.

Building a Recommender using Data Science Experience Local and Hortonworks Data Platform

Recommendation engines are one of the most well known, widely used and highest value use cases for applying machine learning. Despite this, while there are many resources available for the basics of training a recommendation model, there are relatively few that explain how to actually deploy these models to create a large-scale recommender system.

This code pattern demonstrates the key elements of creating such a system by using Apache Spark and HDP Search (Solr). Note that this pattern is a port of Nick Pentreath's Recommender built with Elasticsearch and Apache Spark, but with a focus on using Hortonworks Data Platform (HDP) and IBM's Data Science Experience Local (DSX Local).

What is HDP and HDP Search? Hortonworks Data Platform (HDP) is a massively scalable platform for storing, processing and analyzing large volumes of data. HDP consists of the essential set of Apache Hadoop projects including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, Zookeeper and Ambari. HDP Search provides applications and tools for indexing content from your HDP cluster to Solr (an open source enterprise search platform).

Hortonworks Data Platform by Hortonworks

What is DSX Local? DSX Local is an on premises solution for data scientists and data engineers. It offers a suite of data science tools that integrate with RStudio, Spark, Jupyter, and Zeppelin notebook technologies. And yes, it can be configured to use HDP, too.

This repo contains a Jupyter notebook illustrating how to use Spark for training a collaborative filtering recommendation model from ratings data stored in Solr, saving the model factors to Solr, and then using Solr to serve real-time recommendations using the model. The data you will use comes from MovieLens and is a common benchmark dataset in the recommendations community. The data consists of a set of ratings given by users of the MovieLens movie rating system, to various movies. It also contains metadata (title and genres) for each movie.

When you have completed this code pattern, you will understand how to:

  • Ingest and index user event data into Solr using the Solr Spark connector
  • Load event data into Spark DataFrames and use Spark's machine learning library (MLlib) to train a collaborative filtering recommender model
  • Export the trained model into Solr
  • Using a custom Solr plugin, compute personalized user and similar item recommendations and combine recommendations with search and content filtering

Flow

  1. Load the movie dataset into Apache Hadoop HDFS.
  2. Use Spark DataFrame operations to clean the dataset and use Spark MLlib to train a collaborative filtering recommendation model.
  3. Save the resulting model into Apache Solr.
  4. The user can run the provided notebook in IBM's Data Science Experience Local.
  5. As the notebook runs, Apache Livy will be called to interact with the Spark service in HDP.
  6. Using Solr queries and a custom vector scoring plugin, generate example recommendations.
  7. When necessary, retrieve information about movies, such as poster images, using The Movie Database APIs.

Included components

  • IBM Data Science Experience Local: An out-of-the-box on premises solution for data scientists and data engineers. It offers a suite of data science tools that integrate with RStudio, Spark, Jupyter, and Zeppelin notebook technologies.
  • Apache Spark: An open-source, fast and general-purpose cluster computing system.
  • Hortonworks Data Platform (HDP): HDP is a massively scalable platform for storing, processing and analyzing large volumes of data. HDP consists of the essential set of Apache Hadoop projects including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, Zookeeper and Ambari.
  • HDP Search: HDP Search provides applications and tools for indexing content from your HDP cluster to Solr.
  • Apache Livy: Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface.
  • Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Featured technologies

  • Artificial Intelligence: Artificial intelligence can be applied to disparate solution spaces to deliver disruptive technologies.
  • Python: Python is a programming language that lets you work more quickly and integrate your systems more effectively.

Prerequisites

Access to HDP Platform

The core of this code pattern is integrating Hortonworks Data Platform (HDP) and IBM DSX Local. If you do not already have an HDP cluster available for use, you will need to install one before attempting to complete the code pattern.

To install HDP v2.6.4, please follow the installation guide provided by Hortonworks. It first requires the installation of the Apache Ambari management platform which is then used to faciliate the HDP cluster installation. The Ambari Server is also required to complete a number of steps described in the following sections.

Note: Ensure that your Ambari Server is configured to use Python v2.7.

Install HDP Cluster services

Once your HDP cluster is deployed, at a minimum, install the following services as listed in this Ambari Server UI screenshot:

Note:

  1. This code pattern requires that version 2.2.0 of the Spark2 service be installed.
  2. Details on how to install and configure the Solr service are discussed in the Setup HDP Search section below.

Disable Apache Livy CSRF protection

From your Ambari Server UI, disable the Apache Livy server's CSRF proctection by changing the livy.server.csrf_protection.enable variable to false. To navigate to the panel, click on the Spark2 service, then the Configs tab, and then open the Advanced livy2-conf section.

Steps

Follow these steps to setup the proper environment to run our Recommender notebook locally.

  1. Setup HDP Search
  2. Download and install the Solr Spark connector
  3. Setup DSX Desktop
  4. Clone the repo
  5. Download and move the data to HDFS
  6. Setup python plugins
  7. Launch the notebook
  8. Run the notebook

1. Set up HDP Search

This code pattern was tested with HDP Search (Solr) v6.6.2 and assumes that it is started in cloud mode. Minimally, this can be accomplished by performing the following steps:

  1. Install the HDP Solr Management Pack.

  2. Install Solr using the Ambari Server UI. Solr should be listed as one of the available services after clicking the Action -> Add Service button.

  3. Verify that the Solr server is running. To accomplish this, either:

    1. Access the Solr admin dashboard:

      http://solr_host:8983/solr
      

      OR

    2. Run the solr status command:

      cd <solr_installation_dir>/bin
      ./solr status
      
  4. Install the Solr vector scoring plugin.

    The installation instructions and plugin jar file can be found in this repository.

    To help simplify and consolidate the instructions, the important steps are repeated below:

    Note: The following instructions assume the installation directory for Solr is /opt/lucidworks-hdpsearch/solr.

    1. Download the VectorPlugin.jar file.

    2. Create directory plugins in the dist directory of the Solr installation location.

    3. Copy VectorPlugin.jar to /opt/lucideworkds-hdpsearch/solr/dist/plugins/.

    4. Add the following lines to the solrconfig.xml file located in /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf:

    <lib dir="${solr.install.dir:../../../..}/dist/plugins/" regex=".*\.jar" />
    
    <queryParser name="vp" class="com.github.saaay71.solr.VectorQParserPlugin" />
    
    1. Add the following lines to the managed-schema file located in /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf:
    <fieldType name="VectorField" class="solr.TextField" indexed="true" termOffsets="true" stored="true" termPayloads="true" termPositions="true" termVectors="true" storeOffsetsWithPositions="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
      </analyzer>
    </fieldType>
    
    <field name="vector" type="VectorField" indexed="true" termOffsets="true" stored="true" termPositions="true" termVectors="true" multiValued="true"/>
    

    Note: When making changes to the solrconfig.xml and managed_schema files, take care to not copy in any extraneous spaces or characters.

    To test out your changes, try out the samples shown in the Examples section of the plugin's README. This will require you to create test collections - see the Running Solr Docs for instructions on how to do that.

    1. Re-start the Ambari server:
    ambari-server restart
    

2. Download and install the Solr Spark connector

This code pattern reads the movie data set and computes the model vectors using Spark. The Spark dataframes representing the movie data set and model vectors are then written to Solr using the Solr Spark connector. The Solr connector for Spark can be downloaded from the Lucidworks' Spark-Solr repository.

This code pattern was tested with version 3.5.1 of the connector. The Spark configurations need to be changed to add the connector jars in the Spark driver and executor classpath. This can be done by following the steps below.

Special consideration: The Solr connector jar is available in two flavors. The first one (non-shaded) only carries classes specific to the connector. The second one (shaded) carries the connector specific classes along with all its dependencies. Ideally, we would want the shaded version that contains the dependency classes. However, it contains the full set of Spark and Scala classes which will conflict with the Spark runtime when it is specified in the Spark classpath. To work around this problem, we will need to remove all of the Spark and Scala classes from this jar.

To download and install the connector:

  1. Download the jar to a backup directory
  2. Create a temporary work directory and cd to it
  3. Extract the jar using jar -xvf <location of jar file>
  4. Run rm -rf scala
  5. Run rm -rf org/apache/spark
  6. Rebuild the jar by using jar -cvf spark-solr-3.5.1-shaded-modified.jar *
  7. Log into the Ambari Server UI
  8. Select the Spark2 service and select the Configs tab
  9. Open the Custom spark2-defaults properties section
  10. Add the following property keys:
    • spark.driver.extraClassPath
    • spark.executor.extraClassPath
    • spark.jars
  11. Set the key values to the path of the Spark Solr connector jar (e.g. /home/user1/spark-solr-3.5.1-shaded-modified.jar).

Note: In the example above, the path is a local system path. In case of a multi-node HDP cluster, please make sure that this jar is available under the same path on all of the nodes. Another option is to put the jar in HDFS and specify the HDFS location.

3. Setup DSX Desktop

This code pattern was tested using DSX Desktop, a lightweight version of DSX Local intended for standalone use and optimized for local development. For information on how to set it up, see the DSX Desktop Install Docs.

For production deployments it is recommended to use DSX Local with a three node configuration. For more information, refer to the DSX Local Installation Guide.

When installing DSX Desktop, choose Jupyter Notebooks with Anaconda (Python 2.7).

4. Clone the repo

Now that our HDP and DSX Local are installed and configured, we can proceed with making the two communicate with each other. We start by cloning the hdp-search-spark-recommender repository locally. In a terminal, run:

git clone https://github.com/IBM/hdp-search-spark-recommender.git
cd hdp-search-spark-recommender

5. Download and move the data to HDFS

In this code pattern we will be using the Movielens dataset, it contains ratings given by a set of users for movies, as well as movie metadata. For the purpose of this code pattern, we can download the "latest small" version of the data set.

This code pattern is targeted to run against a multi-node HDP cluster. Therefore, the dataset needs to be moved to HDFS. Move the data to HDFS by issuing the following commands from the Ambari Server:

mkdir /tmp/data
cd /tmp/data
wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
unzip ml-latest-small.zip
hadoop fs -mkdir /movies
hadoop fs -put *.csv /movies

6. Setup python plugins

This code pattern relies upon a few python plugins for everything to work together. Some plugins are required to be installed in the node where DSX Local or DSX Desktop is installed and the others need to be installed in all the HDP compute nodes. The following table lists which plugins are required and where they need to be installed:

Library Install Location
tmdbsimple DSX node and all nodes of HDP cluster
IPython DSX node
paramiko All nodes of HDP cluster
numpy All nodes of HDP cluster
simplejson DSX node and all nodes of HDP cluster
urllib2 DSX node and all nodes of HDP cluster
solrcloudpy DSX node and all nodes of HDP cluster

Note regarding DSX node plugins: The Jupyter notebook running in DSX Desktop will attempt to install each of the required plugins, so no futher action should be needed. If, however, there are issues and a plugin needs to be manually installed, see the Troubleshooting section below for instructions.

The plugins can be installed using the pip command.

If the pip command is not available on your platform, please follow the platform specific instructions to install pip on your machine.

For example:

pip install numpy

Note: Some of the plugins may already be installed/present in your environment. In that case please skip that plugin and move on to the next plugin in the table. Similarly, installing the plugins may require installing some prerequsite plugins on your platform. In this case, please follow the platform specific instuctions to install the prerequsite plugins first.

7. Launch the notebook

The notebook provided in this repo should work with Python 2.7.x or 3.x (and has been tested on 2.7.11 and 3.6.1)

To run the notebook you will need to start DSX Local. Below are the steps that need to be performed before running the notebook the very first time.

  1. Create a new project.

    • From the DSK Desktop dashboard, click on Create project.

    • Give the project a name and click the Create button.

  2. Create the notebook.

    • From the project dashboard view, click on Create notebook.

    • Select the From URL tab to specify the URL to the notebook in this repository.

    • Enter this URL:

      https://raw.githubusercontent.com/IBM/hdp-search-spark-recommender/master/notebooks/solr-hdp-spark-recommender.ipynb
      
    • Click the Create button.

  3. Load the notebook.

    Once created, you can click on the notebook to load and run it. Once running, you will see a number of variables that need to be replaced with values that match your installed environment. These variables are:

     * `LIVY_URL`
     * `HDFS_URL_FOR_DATA`
     * `SOLR_INSTALL_DIR`
     * `SOLR_HOST`
     * `SOLR_PORT`
     * `ZKHOST`
     * `SSHUSER`
     * `SSHPASSWORD`
     * `tmdb.API_KEY`
    

    Some variables such as tmdb.API_KEY, SOLR_HOST, etc. are defined twice in the notebook as they are needed in both the DSX node and the HDP cluster nodes.

    Note that in the notebook, the HDP cluster nodes are accessed via the Livy interface. Cells that access the HDP cluster are distinguishable by the following tag located at the top of the cell:

    %%spark
    

8. Run the notebook

When a notebook is executed, what is actually happening is that each code cell in the notebook is executed, in order, from top to bottom.

Each code cell is selectable and is preceded by a tag in the left margin. The tag format is In [x]:. Depending on the state of the notebook, the x can be:

  • A blank, this indicates that the cell has never been executed.
  • A number, this number represents the relative order this code step was executed.
  • A *, this indicates that the cell is currently executing.

There are several ways to execute the code cells in your notebook:

  • One cell at a time.
    • Select the cell, and then press the Play button in the toolbar. You can also hit Shift+Enter to execute the cell and move to the next cell.
  • Batch mode, in sequential order.
    • From the Cell menu bar, there are several options available. For example, you can Run All cells in your notebook, or you can Run All Below, that will start executing from the first cell under the currently selected cell, and then continue executing all cells that follow.

Sample output

After classifying the data by movies, ratings and genres, we train our recommender model.

After loading our model into Solr, we can predict which movies people will like based on ratings of similar movies in the same genre.

The example output in the data/examples folder shows the output of the notebook after running it in full. View it here.

Troubleshooting

  • Error: Should DSX Desktop fail to respond at any time.

    Solution: Perform the following steps:

      * Close IBM DSX desktop
      * Kill all the processes related to DSX Local from terminal:
    
    	```
      docker ps |grep dsx
      docker rm -f <container id>
      ```
    
    • Remove DSX Desktop metadata

       cd ~/Library/Application \Support/
       rm -rf ibm-dsx-desktop
      
    • Restart DSX Desktop

    • Load the jupyter notebook again

  • Error: Missing Required Header for CSRF protection.

    If you see this error when trying to create spark session through livy (the cell below), it means that you need to disable the CSRF in the livy property.

    livyURL="http://host name:8999"
    %spark add -s spark -l python -u $livyURL
    

    Solution: Go to the Disable Apache Livy CSRF protection section for instructions.

  • Error: 404 Client Error: Not Found for url: http://your hostname:8983/solr/ratings/schema

    If you see this error when running this cell

     %%spark
     setupSchema()
     conn = SolrConnection(SOLR_HOST_PORT, version="6.6.0")
    

    go to Solr's UI, and click the Logging on the left side, and check the latest logs, if showing

    Error: org.apache.solr.common.SolrException: Error CREATEing SolrCore 'movies_vector_shard1_replica1': Unable to create core [movies_vector_shard1_replica1] Caused by: /solr/movies_vector/core_node 1/data/index/write.lock for client 172.17.0.2 already exists

    Solution: Find your solr data/index directory which is specified in conf/solrconfig.xml, delete only the write.log in data/index folder.

  • Error: HTTPError: 401 Client Error: Unauthorized for url: https://api.themoviedb.org/3/movie/1893?api_key=...

    If you see this error in your notebook while testing your TMDb API access, or generating recommendations, it means you have installed tmdbsimple Python package, but have not set up your API key.

    Solution: You will need to access The Movie Database API and follow theses instructions to get an API key. Then copy the key into the tmdb.API_KEY = 'YOUR_API_KEY' line in the notebook. Once you have done that, execute that cell to test your access to TMDb API.

  • Error: Notebook running in DSX Desktop is unable to install python packages or plugins

    Occasionally, this can happen due to the presence of multiple python environments in the docker container that runs the notebook. If this occurs, we can install the plugins manually by performing the following steps:

    1. Find out the python environment that is used by the notebook by inserting a cell at the top of the notebook. Enter and run the following commands:
    !which python
    !which pip
    
    1. In a terminal, type docker ps to list all the docker containers.
    2. Find out the container id of your notebook by looking for the image name dsx-desktop:ana27.
    3. Start a interactive bash shell on the container by running the following command:
      docker exec -it <container_id> bash
      
    4. At the prompt, enter the following command to install the plugin, for example solrcloudpy (this example assumes that the pip command location determined by Step 1 is /opt/conda/bin/pip):
    /opt/conda/bin/pip install --ignore-installed -U solrcloudpy
    

Links

Learn more

  • Data Analytics Code Patterns: Enjoyed this Code Pattern? Check out our other Data Analytics Code Patterns
  • AI and Data Code Pattern Playlist: Bookmark our playlist with all of our Code Pattern videos
  • Watson Studio: Master the art of data science with IBM's Watson Studio
  • Spark on IBM Cloud: Need a Spark cluster? Create up to 30 Spark executors on IBM Cloud with our Spark service

License

Apache 2.0