Installing H2O Sparkling Water on EMR
This guide assumes Spark 2.4 is already provisioned with the EMR cluster. All tests have been performed using the following software (SW) versions:
- EMR 5.23.0
- Spark 2.4
- H2O Sparkling Water 2.4.11
- Anaconda 5.2.0 (Python 3.6.5)
You should be able to use the scripts and procedure unmodified as long as your EMR cluster has Spark 2.4. You will need to make changes to the deployment script if the SW versions above change.
This guide provides a set of instructions to set up a Python-based environment to run H2O Sparkling Water on an AWS EMR (with Spark) cluster.
The Python run-time environment setup is automated and is based on the Anaconda Python distribution. Once complete, the steps will provide the following configuration:
- Latest Anaconda Python 3.6 distribution.
- Jupyter Notebook/Lab set up and running under a separate OS user
- H2O Sparkling Water 2.4.11
- An example starter notebook to demonstrate creating:
  - a Spark session connected to the YARN cluster
  - an H2O context running side by side with the Spark session
This guide doesn't cover the steps to provision an EMR cluster; you can find many guides helping you provision your first EMR cluster. When provisioning the EMR cluster, make sure you read the "Prerequisites" section first.
All steps listed below are fully automated and are part of the `aws_python_tools_install.py` installation file.
In this step we download and install the Anaconda Python distribution. The distribution provides a convenient bundle of the majority of common Python libraries, and many large organizations certify the Anaconda distribution for use on their company servers.
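The installation itself is performed by `aws_python_tools_install.py`. As a rough illustration of what a non-interactive Anaconda install involves, the sketch below builds the shell commands for a batch-mode install into the `/mnt/opt/anaconda520` prefix used throughout this guide; `build_install_commands` is a hypothetical helper, not part of the actual script.

```python
# Sketch of a silent Anaconda install, assuming the prefix used in this guide.
# build_install_commands() is a hypothetical helper; the real installation is
# done by aws_python_tools_install.py.

def build_install_commands(version="5.2.0", prefix="/mnt/opt/anaconda520"):
    """Return the shell commands for a batch-mode Anaconda install."""
    installer = f"Anaconda3-{version}-Linux-x86_64.sh"
    url = f"https://repo.anaconda.com/archive/{installer}"
    return [
        f"wget {url} -O /tmp/{installer}",
        # -b = batch mode (no prompts), -p = install prefix
        f"bash /tmp/{installer} -b -p {prefix}",
    ]

for cmd in build_install_commands():
    print(cmd)
```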
The goal of this step is to perform all necessary steps to create Jupyter notebook environment.
- Create OS user `jupyterlab`.
- The password for the user is created using a random password generator. If you need to know the password, you will need to reset it using `sudo` access from the `hadoop` OS user.
- Define environment variables and place them into `.bash_profile`:

```bash
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# User specific environment and startup programs
PATH=$PATH:$HOME/.local/bin:$HOME/bin
export PATH

export PATH=/mnt/opt/anaconda520:/mnt/opt/anaconda520/bin:$PATH
export PYSPARK_PYTHON=/mnt/opt/anaconda520/bin/python
export SPARK_HOME=/usr/lib/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
export MASTER="yarn"
```
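The random password mentioned above can be produced with Python's standard library. A minimal sketch follows; the exact generator used by `aws_python_tools_install.py` may differ.

```python
# Minimal random password generator using the stdlib `secrets` module.
# Illustrative only -- the actual script's generator may differ.
import secrets
import string

def random_password(length=16):
    """Generate a random alphanumeric password of the given length."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

print(random_password())
```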
- The initial token is generated when the notebook starts for the first time. The token can be found by running the following command:

```sh
sudo cat /home/jupyterlab/notebooks/jupyter_notebook.log | grep token
```
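Equivalently, the token can be extracted programmatically. The sketch below assumes the standard Jupyter log format, where the access URL carries a `token=` query parameter (as in the example URL shown later in this guide):

```python
# Pull the access token out of Jupyter's log output.
# Assumes the default log format with a hex token in the URL.
import re

def extract_token(log_text):
    """Return the first token=... value found in the log text, or None."""
    match = re.search(r"token=([0-9a-f]+)", log_text)
    return match.group(1) if match else None

# Example line in the style of the URL shown later in this guide.
sample = "http://ip-172-31-46-145:28888/tree?token=eede425006d001d6bdeb9d8afb2c910e87f3a8137ab38516"
print(extract_token(sample))
```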
- The notebook runs on port `28888`. You can modify the port and restart the notebook. The following command is used to start the notebook:

```sh
sudo su - jupyterlab -c 'nohup jupyter notebook --no-browser --port 28888 >> /home/jupyterlab/notebooks/jupyter_notebook.log 2>&1 &'
```
The following steps are automated and are part of the script.
```sh
mkdir /mnt/opt
cd /mnt/opt
wget https://s3.amazonaws.com/h2o-release/sparkling-water/rel-2.4/11/sparkling-water-2.4.11.zip
unzip sparkling-water-2.4.11.zip
```
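The release URL follows a predictable pattern, so a deployment script can derive it from the version string alone. The sketch below splits `2.4.11` into the `rel-2.4/11` path segments seen in the URL above; the helper name is ours.

```python
# Derive the Sparkling Water S3 release URL from a version string.
# sparkling_water_url() is a hypothetical helper, shown for illustration.

def sparkling_water_url(version="2.4.11"):
    """Build the S3 release URL for a given Sparkling Water version."""
    major_minor, _, build = version.rpartition(".")  # "2.4.11" -> ("2.4", "11")
    return (
        f"https://s3.amazonaws.com/h2o-release/sparkling-water/"
        f"rel-{major_minor}/{build}/sparkling-water-{version}.zip"
    )

print(sparkling_water_url())
```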
The following steps are performed to validate the working environment and the ability to create a Spark session and H2O context from a Jupyter notebook.
- This step assumes you have already opened port 28888 or are using a browser plugin to route web traffic to/from EMR.
- Connect to the Jupyter notebook using the EMR host name and port 28888. Optionally you can specify the token all in one command. To find the info:

```sh
sudo cat /home/jupyterlab/notebooks/jupyter_notebook.log | grep token
```

Example URL: `http://ip-172-31-46-145:28888/tree?token=eede425006d001d6bdeb9d8afb2c910e87f3a8137ab38516`
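A one-shot URL like the example above can be assembled from the host, port, and token. A small sketch, using the values from this guide:

```python
# Assemble the Jupyter "tree" URL, optionally embedding the access token.
# Illustrative helper; any host/port/token combination works the same way.

def notebook_url(host, port=28888, token=None):
    """Build the Jupyter tree URL for the given host, port, and token."""
    url = f"http://{host}:{port}/tree"
    if token:
        url += f"?token={token}"
    return url

print(notebook_url("ip-172-31-46-145",
                   token="eede425006d001d6bdeb9d8afb2c910e87f3a8137ab38516"))
```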
- Upload the provided starter notebook `H2O Sparkling Water EMR starter.ipynb` and repeat the steps. The notebook can be found under the `notebooks` folder of this GitHub repo.
You can validate the H2O Sparkling Water installation by running the commands below in the Scala shell.
```sh
export SPARK_HOME=/usr/lib/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
export MASTER="yarn"

bin/sparkling-shell \
  --master yarn \
  --conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
  --conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --conf spark.locality.wait=30000 \
  --conf spark.scheduler.minRegisteredResourcesRatio=1 \
  --conf spark.executor.instances=4 \
  --conf spark.executor.memory=2g \
  --conf spark.driver.memory=2g
```
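If you launch from Python (e.g. in the notebook) rather than from `sparkling-shell`, the same settings can be kept as a plain dict and applied one by one to a `SparkSession` builder. The dict below simply mirrors the `--conf` flags above; how you apply it is up to you.

```python
# The same Spark settings as the sparkling-shell flags above, as a Python
# dict (e.g. for SparkSession.builder.config(key, value) in the notebook).
SPARK_CONF = {
    "spark.scheduler.maxRegisteredResourcesWaitingTime": "1000000",
    "spark.ext.h2o.fail.on.unsupported.spark.param": "false",
    "spark.dynamicAllocation.enabled": "false",
    "spark.sql.autoBroadcastJoinThreshold": "-1",
    "spark.locality.wait": "30000",
    "spark.scheduler.minRegisteredResourcesRatio": "1",
    "spark.executor.instances": "4",
    "spark.executor.memory": "2g",
    "spark.driver.memory": "2g",
}

for key, value in SPARK_CONF.items():
    print(f"{key}={value}")
```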
The parameter `spark.ext.h2o.fail.on.unsupported.spark.param` is set because otherwise the following error is raised on EMR:

```
java.lang.IllegalArgumentException: Unsupported argument: (spark.dynamicAllocation.enabled,true)
```
Start an H2O session in the Scala shell:
```scala
import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(spark)
import h2oContext._
```
If successful, the following message is displayed:
```
scala> h2oContext
res0: org.apache.spark.h2o.H2OContext =

Sparkling Water Context:
 * H2O name: sparkling-water-hadoop_application_1558311857271_0004
 * cluster size: 3
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (2,ip-172-31-12-99.ec2.internal,54321)
  (1,ip-172-31-10-203.ec2.internal,54321)
  (3,ip-172-31-10-203.ec2.internal,54323)
  ------------------------

  Open H2O Flow in browser: http://ip-172-31-14-136.ec2.internal:54321 (CMD + click in Mac OSX)

 * Yarn App ID of Spark application: application_1558311857271_0004
```
Create a sample DataFrame and validate it from the H2O UI:
```scala
val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")

someDF.show()
```

```
+------+-----+
|number| word|
+------+-----+
|     8|  bat|
|    64|mouse|
|   -27|horse|
+------+-----+
```

```
scala> val hfNamed: H2OFrame = h2oContext.asH2OFrame(someDF, Some("h2oframe"))
hfNamed: org.apache.spark.h2o.H2OFrame =
Frame key: h2oframe
   cols: 2
   rows: 3
 chunks: 3
   size: 487
```