The Spark Notebook project is a fast way to get a Spark cluster up and running on AWS with a friendly IPython interface.
You'll need
- Docker installed (recommended) or a no-Docker setup
- AWS access keys
- An AWS key pair
- git clone https://github.com/eleflow/sparknotebook.git
- cd sparknotebook
- create an aws.deploy.env file with these lines:
AWS_ACCESS_KEY_ID=<YOUR AWS ACCESS KEY>
AWS_SECRET_ACCESS_KEY=<YOUR AWS SECRET ACCESS KEY>
AWS_KEY_PAIR=<YOUR AWS KEY PAIR NAME>
- Run
$ docker build --rm -f=aws.deploy.Dockerfile -t=aws.deploy .
- Run
sudo docker run -it --env-file ./aws.deploy.env --volume $PWD:/sparknotebook --volume $HOME/.ssh:/.ssh aws.deploy
and if all goes well you will see the IP address of your sparknotebook server in a line like this:
...
PLAY RECAP ********************************************************************
52.10.183.42 : ok=21 changed=3 unreachable=0 failed=0
- Where 52.10.183.42 will be replaced with your server's IP address. Enter that IP address in your browser to access the notebook.
The Spark Notebook kernel is deployed to your server, and you can access it on port 80 with a web browser. The initial notebook state is shown in the picture below:
To start a new notebook, click the New Notebook button and you will be redirected to a new tab containing an empty notebook. The notebook is a code container made up of multiple TextArea components, where you can enter any Scala code, including multi-line scripts. To execute the code, put the focus on the code TextArea and hit Shift + ENTER, or click the play button in the notebook header. Each time you submit code to the notebook it is compiled and, if compilation succeeds, executed.
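For example, a minimal first cell (purely illustrative) could be:
val greeting = "Hello from the Spark Notebook" // any Scala expression can go in a cell
println(greeting) // hit Shift + ENTER to compile and run it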
One of the cluster settings you are likely to change is the number of slaves. To change it to 30, you can run this code in the Spark Notebook:
ClusterSettings.coreInstanceCount = 30 // Number of workers available in your cluster - defaults to 3
For other settings, see ClusterSettings.
A SparkContext can be accessed with:
sparkContext
This is a method of SparkNotebookContext; it provisions the machines and sets up the cluster the first time it runs. An example of the output of this method is shown below:
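As a minimal sketch (assuming the standard Spark RDD API, and assuming ClusterSettings is adjusted before the first call to sparkContext), a cell that provisions the cluster and runs a small job might look like:
ClusterSettings.coreInstanceCount = 10 // illustrative value
val rdd = sparkContext.parallelize(1 to 1000) // first use of sparkContext provisions the cluster, so it can take a few minutes
rdd.map(_ * 2).reduce(_ + _) // a simple distributed computation to check that the cluster is up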
To shut down the cluster and terminate the master and slaves, run:
terminate
The master instance of your cluster also has a monitoring tool named Ganglia installed, and its address is displayed when you create the SparkContext. Ganglia is a useful tool that helps you monitor CPU, memory and disk usage, displaying graphs of these components as well as JVM data such as GC executions. It is very helpful for choosing the correct cluster size for your tasks. The Ganglia address is printed on screen during cluster instantiation; it is always deployed at masterhost:5080/ganglia. Note that the information shown in Ganglia has a slight delay.
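For example, with the master IP address from the deploy output above, the Ganglia dashboard would be at:
http://52.10.183.42:5080/ganglia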
To build and run locally, go here.
This project is distributed under the Apache License, Version 2.0.