Installing H2O Sparkling Water on EMR
This guide assumes Spark 2.4 is already provisioned with the EMR cluster. All tests have been performed using the following software (SW) versions:
- EMR 5.23.0
- Spark 2.4
- H2O Sparkling Water 2.4.11
- Anaconda 5.2.0 (Python 3.6.5)
You should be able to use the scripts and procedure unmodified as long as your EMR cluster has Spark 2.4. You will need to make changes to the deployment script if the SW versions above change.
This guide provides a set of instructions to set up a Python-based environment to run H2O Sparkling Water on an AWS EMR (with Spark) cluster.
The Python run-time environment setup is automated and is based on the Anaconda Python distribution. Once complete, the steps will provide the following configuration:
- Latest Anaconda Python 3.6 distribution.
- Jupyter Notebook/Lab set up and running under a separate OS user
- H2O Sparkling Water 2.4.11
- An example starter notebook to demonstrate creating:
  - a Spark session connected to the YARN cluster
  - an H2O context running side by side with the Spark session
This guide doesn't cover the steps to provision an EMR cluster; you can find many guides helping you provision your first EMR cluster. When provisioning the EMR cluster, make sure you read the "Prerequisites" section first.
All steps listed below are fully automated and are part of the `aws_python_tools_install.py` installation file.
In this step we download and install the Anaconda Python distribution. The distribution provides a convenient bundle of the majority of common Python libraries, and many large organizations certify the Anaconda distribution for use on their company servers.
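The installation itself is performed by `aws_python_tools_install.py`. As a rough illustration of what a non-interactive Anaconda install involves, the sketch below builds the shell commands for a batch-mode install into the `/mnt/opt/anaconda520` prefix used throughout this guide; `build_install_commands` is a hypothetical helper, not part of the actual script.

```python
# Sketch of a silent Anaconda install, assuming the prefix used in this guide.
# build_install_commands() is a hypothetical helper; the real installation is
# done by aws_python_tools_install.py.

def build_install_commands(version="5.2.0", prefix="/mnt/opt/anaconda520"):
    """Return the shell commands for a batch-mode Anaconda install."""
    installer = f"Anaconda3-{version}-Linux-x86_64.sh"
    url = f"https://repo.anaconda.com/archive/{installer}"
    return [
        f"wget {url} -O /tmp/{installer}",
        # -b = batch mode (no prompts), -p = install prefix
        f"bash /tmp/{installer} -b -p {prefix}",
    ]

for cmd in build_install_commands():
    print(cmd)
```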
The goal of this step is to perform all necessary steps to create Jupyter notebook environment.
- Create OS user `jupyterlab`.
- The password for the user is created using a random password generator. If you need to know the password, you will need to reset it using `sudo` access from the `hadoop` OS user.
- Define environment variables and place them into `.bash_profile`:

```bash
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# User specific environment and startup programs
PATH=$PATH:$HOME/.local/bin:$HOME/bin
export PATH

export PATH=/mnt/opt/anaconda520:/mnt/opt/anaconda520/bin:$PATH
export PYSPARK_PYTHON=/mnt/opt/anaconda520/bin/python
export SPARK_HOME=/usr/lib/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
export MASTER="yarn"
```
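The random password mentioned above can be produced with Python's standard library. A minimal sketch follows; the exact generator used by `aws_python_tools_install.py` may differ.

```python
# Minimal random password generator using the stdlib `secrets` module.
# Illustrative only -- the actual script's generator may differ.
import secrets
import string

def random_password(length=16):
    """Generate a random alphanumeric password of the given length."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

print(random_password())
```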
- The initial token is generated when the notebook starts for the first time. The token can be found by running the following command:

```sh
sudo cat /home/jupyterlab/notebooks/jupyter_notebook.log | grep token
```
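Equivalently, the token can be extracted programmatically. The sketch below assumes the standard Jupyter log format, where the access URL carries a `token=` query parameter (as in the example URL shown later in this guide):

```python
# Pull the access token out of Jupyter's log output.
# Assumes the default log format with a hex token in the URL.
import re

def extract_token(log_text):
    """Return the first token=... value found in the log text, or None."""
    match = re.search(r"token=([0-9a-f]+)", log_text)
    return match.group(1) if match else None

# Example line in the style of the URL shown later in this guide.
sample = "http://ip-172-31-46-145:28888/tree?token=eede425006d001d6bdeb9d8afb2c910e87f3a8137ab38516"
print(extract_token(sample))
```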
- The notebook runs on port `28888`. You can modify the port and restart the notebook. The following command is used to start the notebook:

```sh
sudo su - jupyterlab -c 'nohup jupyter notebook --no-browser --port 28888 >> /home/jupyterlab/notebooks/jupyter_notebook.log 2>&1 &'
```
The following steps are automated and are part of the script.
```sh
mkdir /mnt/opt
cd /mnt/opt
wget https://s3.amazonaws.com/h2o-release/sparkling-water/rel-2.4/11/sparkling-water-2.4.11.zip
unzip sparkling-water-2.4.11.zip
```
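The release URL follows a predictable pattern, so a deployment script can derive it from the version string alone. The sketch below splits `2.4.11` into the `rel-2.4/11` path segments seen in the URL above; the helper name is ours.

```python
# Derive the Sparkling Water S3 release URL from a version string.
# sparkling_water_url() is a hypothetical helper, shown for illustration.

def sparkling_water_url(version="2.4.11"):
    """Build the S3 release URL for a given Sparkling Water version."""
    major_minor, _, build = version.rpartition(".")  # "2.4.11" -> ("2.4", "11")
    return (
        f"https://s3.amazonaws.com/h2o-release/sparkling-water/"
        f"rel-{major_minor}/{build}/sparkling-water-{version}.zip"
    )

print(sparkling_water_url())
```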
The following steps are performed to validate the working environment and the ability to create a Spark session and H2O context from a Jupyter notebook.
- This step assumes you have already opened port 28888 or are using a browser plugin to route web traffic to/from EMR.
- Connect to the Jupyter notebook using the EMR host name and port 28888. Optionally you can specify the token all in one command. To find the info:

```sh
sudo cat /home/jupyterlab/notebooks/jupyter_notebook.log | grep token
```

Example URL: `http://ip-172-31-46-145:28888/tree?token=eede425006d001d6bdeb9d8afb2c910e87f3a8137ab38516`
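A one-shot URL like the example above can be assembled from the host, port, and token. A small sketch, using the values from this guide:

```python
# Assemble the Jupyter "tree" URL, optionally embedding the access token.
# Illustrative helper; any host/port/token combination works the same way.

def notebook_url(host, port=28888, token=None):
    """Build the Jupyter tree URL for the given host, port, and token."""
    url = f"http://{host}:{port}/tree"
    if token:
        url += f"?token={token}"
    return url

print(notebook_url("ip-172-31-46-145",
                   token="eede425006d001d6bdeb9d8afb2c910e87f3a8137ab38516"))
```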
- Upload the provided starter notebook `H2O Sparkling Water EMR starter.ipynb` and repeat the steps. The notebook can be found under the `notebooks` folder of this GitHub repo.
You can validate the H2O Sparkling Water installation by running the commands below in the Scala shell.
```sh
export SPARK_HOME=/usr/lib/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
export MASTER="yarn"

bin/sparkling-shell \
  --master yarn \
  --conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
  --conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --conf spark.locality.wait=30000 \
  --conf spark.scheduler.minRegisteredResourcesRatio=1 \
  --conf spark.executor.instances=4 \
  --conf spark.executor.memory=2g \
  --conf spark.driver.memory=2g
```
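If you launch from Python (e.g. in the notebook) rather than from `sparkling-shell`, the same settings can be kept as a plain dict and applied one by one to a `SparkSession` builder. The dict below simply mirrors the `--conf` flags above; how you apply it is up to you.

```python
# The same Spark settings as the sparkling-shell flags above, as a Python
# dict (e.g. for SparkSession.builder.config(key, value) in the notebook).
SPARK_CONF = {
    "spark.scheduler.maxRegisteredResourcesWaitingTime": "1000000",
    "spark.ext.h2o.fail.on.unsupported.spark.param": "false",
    "spark.dynamicAllocation.enabled": "false",
    "spark.sql.autoBroadcastJoinThreshold": "-1",
    "spark.locality.wait": "30000",
    "spark.scheduler.minRegisteredResourcesRatio": "1",
    "spark.executor.instances": "4",
    "spark.executor.memory": "2g",
    "spark.driver.memory": "2g",
}

for key, value in SPARK_CONF.items():
    print(f"{key}={value}")
```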
The parameter `spark.ext.h2o.fail.on.unsupported.spark.param` is set because otherwise the following error is raised on EMR:

```
java.lang.IllegalArgumentException: Unsupported argument: (spark.dynamicAllocation.enabled,true)
```
Start an H2O session in the Scala shell:
```scala
import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(spark)
import h2oContext._
```
If successful, the following message is displayed:
```
scala> h2oContext
res0: org.apache.spark.h2o.H2OContext =

Sparkling Water Context:
 * H2O name: sparkling-water-hadoop_application_1558311857271_0004
 * cluster size: 3
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (2,ip-172-31-12-99.ec2.internal,54321)
  (1,ip-172-31-10-203.ec2.internal,54321)
  (3,ip-172-31-10-203.ec2.internal,54323)
  ------------------------

  Open H2O Flow in browser: http://ip-172-31-14-136.ec2.internal:54321 (CMD + click in Mac OSX)

 * Yarn App ID of Spark application: application_1558311857271_0004
```
Create a sample DataFrame and validate it from the H2O UI:
```scala
val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")

someDF.show()
```

```
+------+-----+
|number| word|
+------+-----+
|     8|  bat|
|    64|mouse|
|   -27|horse|
+------+-----+
```

```
scala> val hfNamed: H2OFrame = h2oContext.asH2OFrame(someDF, Some("h2oframe"))
hfNamed: org.apache.spark.h2o.H2OFrame =
Frame key: h2oframe
   cols: 2
   rows: 3
 chunks: 3
   size: 487
```