Hadoop Spark Cluster

This project is based on the work of avp38.

Its purpose is to create a Hadoop/Spark cluster in a VirtualBox environment that can be accessed through JupyterLab.

Cluster Architecture

| VM      | HDFS     | YARN                                           | Spark  | JupyterLab    |
|---------|----------|------------------------------------------------|--------|---------------|
| head    | NameNode |                                                | Master |               |
| body    |          | ResourceManager, JobHistoryServer, ProxyServer |        |               |
| slave1  | DataNode | NodeManager                                    | Slave  |               |
| slave2  | DataNode | NodeManager                                    | Slave  |               |
| jupyter |          |                                                |        | JupyterServer |

Web Interfaces

| Service               | URL                                   |
|-----------------------|---------------------------------------|
| NameNode              | http://localhost:50070/dfshealth.html |
| ResourceManager       | http://localhost:18088/cluster        |
| JobHistory            | http://localhost:19888/jobhistory     |
| Spark                 | http://localhost:8080                 |
| JupyterLab (optional) | http://localhost:54321                |
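Once the cluster is up (see the setup steps below), these forwarded ports can be probed from the host, for example with curl. This is just a quick sanity check and assumes the port forwardings set up by the Vagrantfile are in place:

$ curl -s http://localhost:50070/dfshealth.html | head
$ curl -sI http://localhost:8080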

Setup Host

Download

Install

Ubuntu

Create a bootable USB stick with Rufus (Windows) or the Ubuntu Startup Disk Creator tool, then install Ubuntu Desktop.

Git

$ sudo apt install git-all

Docker

$ sudo apt update
$ sudo apt install apt-transport-https ca-certificates curl software-properties-common
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"
$ apt-cache policy docker-ce
$ sudo apt install docker-ce
$ sudo systemctl status docker

Allow Docker to be run by a non-root user:

$ sudo usermod -aG docker ${USER}
$ su - ${USER}
$ groups
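To verify that Docker now works for the non-root user (a quick smoke test, not part of the original instructions):

$ docker run --rm hello-world
$ docker info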

Docker Compose

$ sudo curl -L "https://github.com/docker/compose/releases/download/v2.0.1/docker-compose-linux-x86_64" -o /usr/local/bin/docker-compose
$ sudo chmod +x /usr/local/bin/docker-compose
$ docker-compose --version

VirtualBox

Install Package
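One way to install it is from the Ubuntu repositories; the packaged version may differ from the Oracle download, so treat this as a sketch:

$ sudo apt update
$ sudo apt install virtualbox
$ VBoxManage --version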

Vagrant

$ curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
$ sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
$ sudo apt-get update && sudo apt-get install vagrant
$ vagrant plugin install vagrant-vbguest
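To confirm the Vagrant installation and the plugin:

$ vagrant --version
$ vagrant plugin list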

Setup Cluster

$ git clone https://github.com/datainsightat/hadoop_spark_cluster.git
$ cd Hadoop-Spark-Environment/resources
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
$ sudo vagrant up
$ vagrant vbguest --do install
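After provisioning finishes, the state of the VMs defined in the Vagrantfile can be checked from the host, e.g.:

$ vagrant status
$ vagrant ssh head -c "hostname"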

Initialize Hadoop

[Apache Hadoop](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html), [Apache Spark](https://spark.apache.org/docs/latest/running-on-yarn.html)

$ vagrant ssh head
head $ hdfs namenode -format hadoop_cluster

Start Hadoop

$ vagrant ssh head

head $ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
head $ sshpass -p vagrant $HADOOP_PREFIX/sbin/hadoop-daemons.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
head $ jps
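Besides jps, the HDFS state can be verified from the head node with the standard hdfs tools (assuming hdfs is on the PATH, as in the format step above):

head $ hdfs dfsadmin -report
head $ hdfs dfs -ls /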

$ vagrant ssh body

body $ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
body $ sshpass -p vagrant $HADOOP_YARN_HOME/sbin/yarn-daemons.sh --config $HADOOP_CONF_DIR start nodemanager
body $ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh start proxyserver --config $HADOOP_CONF_DIR
body $ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR
body $ jps

Test YARN

body $ yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 100
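While the example job runs (or after it finishes), its status can be inspected with the standard YARN CLI, for example:

body $ yarn application -list
body $ yarn application -list -appStates FINISHED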

Start Spark

$ vagrant ssh head
head $ sshpass -p vagrant $SPARK_HOME/sbin/start-all.sh

Test Spark

head $ $SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --num-executors 10 --executor-cores 2 $SPARK_HOME/examples/jars/spark-examples*.jar 100

head $ $SPARK_HOME/bin/spark-shell --master spark://head:7077
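As a further quick check against the standalone master, the bundled examples can also be launched via run-example; the MASTER variable mirrors the master URL used above (a sketch, not part of the original walkthrough):

head $ MASTER=spark://head:7077 $SPARK_HOME/bin/run-example SparkPi 10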

Operate Cluster

Start
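To start the cluster again after a halt, bring the VMs back up and then repeat the Start Hadoop and Start Spark steps above:

$ cd PATH_TO_CLUSTER
$ sudo vagrant up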

Stop

$ cd PATH_TO_CLUSTER
$ vagrant ssh head
head $ sshpass -p vagrant $HADOOP_PREFIX/sbin/stop-dfs.sh
$ vagrant ssh body
body $ sshpass -p vagrant $HADOOP_YARN_HOME/sbin/stop-yarn.sh
$ vagrant ssh head
head $ sshpass -p vagrant $SPARK_HOME/sbin/stop-all.sh
$ cd PATH_TO_CLUSTER
$ sudo vagrant halt

Update Hadoop, Spark

If you update the cluster to newer versions of Hadoop and Spark, use these commands to start the cluster (recent Hadoop releases replace the *-daemon.sh scripts with the --daemon flag):

 body $ yarn --daemon start resourcemanager --config $HADOOP_CONF_DIR 
 body $ yarn --daemon start nodemanager --config $HADOOP_CONF_DIR 
 body $ yarn --daemon start proxyserver --config $HADOOP_CONF_DIR
 body $ yarn --daemon start timelineserver --config $HADOOP_CONF_DIR
 
 head $ hdfs --daemon start namenode --config $HADOOP_CONF_DIR
 head $ hdfs --daemon start datanode --config $HADOOP_CONF_DIR
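The matching stop commands use the same new-style syntax, e.g.:

body $ yarn --daemon stop nodemanager
body $ yarn --daemon stop resourcemanager
head $ hdfs --daemon stop datanode
head $ hdfs --daemon stop namenode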

Setup Jupyter Lab (optional)

Download

VirtualBox (for the VM running JupyterLab): Settings > Network > Advanced > Port Forwarding > Host Port: 54321, Guest Port: 8888
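If you prefer the command line over the GUI, the same NAT rule can be added with VBoxManage while the VM is powered off; replace the placeholder below with whatever VBoxManage list vms reports for the Jupyter machine (a sketch):

$ VBoxManage list vms
$ VBoxManage modifyvm "<jupyter-vm-name>" --natpf1 "jupyterlab,tcp,,54321,,8888"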

Install

$ wget https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh
$ sudo apt-get install libgl1-mesa-glx libegl1-mesa libxrandr2 libxss1 libxcursor1 libxcomposite1 libasound2 libxi6 libxtst6
$ bash ~/Downloads/Anaconda3-2021.05-Linux-x86_64.sh
$ vim .bashrc
  export PATH=~/anaconda3/bin:$PATH
$ source .bashrc
$ conda create --name jupyterlab
$ conda activate jupyterlab
$ conda install -c r r r-essentials r-irkernel
$ jupyter lab --generate-config
$ vim ~/.jupyter/jupyter_lab_config.py
  c.ServerApp.token = ''
  c.ServerApp.ip = '*'
  c.ServerApp.open_browser = False
$ jupyter lab
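With the config above, JupyterLab listens on the guest's port 8888 and is reachable on the host through the forwarded port; the same settings can also be passed as command-line flags (a sketch):

$ jupyter lab --ip 0.0.0.0 --port 8888 --no-browser

Then open http://localhost:54321 in a browser on the host.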
