
Spark Notebook


The Spark Notebook project is a fast way of getting a Spark cluster up and running on AWS with a friendly IPython interface.

Before you start

You'll need

  1. Docker installed (recommended), or the no-Docker setup
  2. AWS access keys
  3. An AWS key pair

Setup

  1. git clone https://github.com/eleflow/sparknotebook.git
  2. cd sparknotebook
  3. Create an aws.deploy.env file with these:
AWS_ACCESS_KEY_ID=<YOUR AWS ACCESS KEY>
AWS_SECRET_ACCESS_KEY=<YOUR AWS SECRET ACCESS KEY>
AWS_KEY_PAIR=<YOUR AWS KEY PAIR NAME>
  4. Run

$ docker build --rm -f=aws.deploy.Dockerfile -t=aws.deploy .

Running the Notebook on AWS

  1. Run sudo docker run -it --env-file ./aws.deploy.env --volume $PWD:/sparknotebook --volume $HOME/.ssh:/.ssh aws.deploy and, if all goes well, you will see the IP address of your sparknotebook server in a line like this:
...

PLAY RECAP ******************************************************************** 
52.10.183.42               : ok=21   changed=3    unreachable=0    failed=0   
  2. where 52.10.183.42 will be replaced with your server's IP address. Open that IP address in your browser to access the notebook.

Spark Notebook

The Spark Notebook kernel is deployed to your server, and you can access it on port 80 with a web browser. The initial notebook state is shown in the picture below:

[Screenshot: initial notebook state]

To start a new notebook, click the New Notebook button and you will be redirected to a new tab containing an empty notebook. The notebook is a code container made up of multiple TextArea components, in which you can enter any kind of Scala code, including multi-line scripts. To execute code, put the focus on the TextArea and hit Shift + ENTER, or click the play button in the notebook header. Each time you submit code to the notebook it is compiled and, if compilation succeeds, executed.
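For example, a cell can hold plain Scala like the snippet below (a minimal illustration, not specific to this project), which you can run with Shift + ENTER:

  // Plain Scala in a notebook cell: define a helper and use it.
  def square(x: Int): Int = x * x
  val squares = (1 to 5).map(square)
  println(squares.mkString(", "))   // prints: 1, 4, 9, 16, 25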

Cluster Settings

One of the cluster settings you are likely to change is the number of slaves. To change it to 30, you can run this code in the Spark Notebook:

  ClusterSettings.coreInstanceCount = 30 // Number of workers available in your cluster - defaults to 3

For other settings, see ClusterSettings.
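Because sparkContext (described in the next section) provisions the cluster the first time it runs, it is safest to apply cluster-size settings before that first call; treat the ordering below as an assumed workflow rather than a documented requirement:

  // Assumed workflow: adjust the size first, then create the context,
  // so the cluster is provisioned with the desired number of workers.
  ClusterSettings.coreInstanceCount = 30
  sparkContext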

SparkContext

A SparkContext can be accessed with:

  sparkContext

This is a method of SparkNotebookContext, and it provisions the machines and sets up the cluster the first time it runs. An example of the output of this method is shown below:

[Screenshot: example output of the first sparkContext call]
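Once the context is up, you can use it like any other SparkContext. The snippet below is a minimal smoke test using only standard Spark API (parallelize and count) to confirm the workers respond; it is an illustration, not part of the project itself:

  // Distribute a small local collection across the cluster and count it back.
  val total = sparkContext.parallelize(1 to 1000).count()
  println(s"counted $total elements")   // expected output: counted 1000 elements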

Shutdown

To shut down the cluster and terminate the master and slaves, run:

    terminate

Monitoring

Ganglia

The master instance of your cluster also has a monitoring tool named Ganglia installed, and its address is displayed when you create the SparkContext. Ganglia is a useful tool that helps you monitor CPU, memory, and disk usage, displaying graphs of these components, as well as JVM data such as GC executions. It is very useful for choosing the correct cluster size for your tasks. The Ganglia address is printed on screen during cluster instantiation, and it is always deployed at masterhost:5080/ganglia. Note that the information shown in Ganglia has a small delay.

Local build

To build and run locally, go here.

License

This project is distributed under the Apache License, Version 2.0.