spark-jupyter


Purpose

This Docker container is meant for learning PySpark programming. It includes the following components (a quick way to verify them inside the container is shown after the list).

  • Hadoop v3.2.1
  • Spark v2.4.4
  • Conda 3 with Python v3.7
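As a quick sanity check, you can confirm the installed versions from a shell inside the container (a sketch; it assumes hadoop and python are on the PATH, as the commands further below suggest):

hadoop version                        # should report Hadoop 3.2.1
$SPARK_HOME/bin/spark-submit --version  # should report Spark 2.4.4
python --version                      # should report Python 3.7.x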

After running the container, you may visit the following pages (the standard web UIs behind the mapped ports; port 9000 is the HDFS RPC endpoint and has no web page):

  • Jupyter Lab at http://localhost:8888
  • HDFS NameNode UI at http://localhost:9870
  • HDFS DataNode UI at http://localhost:9864
  • YARN ResourceManager UI at http://localhost:8088
  • Spark Master UI at http://localhost:8080
  • Spark History Server UI at http://localhost:18080

As can be seen, Jupyter Lab is running on port 8888. An example notebook is mounted at /root/ipynb. To get the PySpark code to run, you have to upload the data.csv file to HDFS first; a sketch of the upload follows. View the example notebook.
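One way to do the upload from a shell inside the container (a sketch; the target directory /data is an assumption, so adjust it to whatever path the notebook reads from):

hdfs dfs -mkdir -p /data        # create a target directory in HDFS (/data is an assumption)
hdfs dfs -put data.csv /data/   # upload the local data.csv
hdfs dfs -ls /data              # verify the file landed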

Docker

To run the container:

docker run -it \
    -p 9870:9870 \
    -p 8088:8088 \
    -p 8080:8080 \
    -p 18080:18080 \
    -p 9000:9000 \
    -p 8888:8888 \
    -p 9864:9864 \
    -v $HOME/git/docker-containers/spark-jupyter/ubuntu/root/ipynb:/root/ipynb \
    -e PYSPARK_MASTER=spark://localhost:7077 \
    spark-jupyter:local
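Once the container is up, a quick way to confirm the services are reachable from the host (a sketch; the URLs follow from the port mappings above):

docker ps                                    # the container should be listed
curl -sI http://localhost:8888 | head -n 1   # Jupyter Lab should answer
curl -sI http://localhost:9870 | head -n 1   # HDFS NameNode UI should answer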

To run the container with a password:

docker run -it \
    -p 9870:9870 \
    -p 8088:8088 \
    -p 8080:8080 \
    -p 18080:18080 \
    -p 9000:9000 \
    -p 8888:8888 \
    -p 9864:9864 \
    -v $HOME/git/docker-containers/spark-jupyter/ubuntu/root/ipynb:/root/ipynb \
    -e PYSPARK_MASTER=spark://localhost:7077 \
    -e NOTEBOOK_PASSWORD=sha1:6676da7235c8:9c7d402c01e330b9368fa9e1637233748be11cc5 \
    spark-jupyter:local
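The value of NOTEBOOK_PASSWORD is a hashed Jupyter password, not the plain text. One way to generate such a hash (assuming the classic notebook package, whose older releases emit the sha1: format shown above):

python -c 'from notebook.auth import passwd; print(passwd())'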

Commands to try after entering the container, e.g. docker exec -it <id> /bin/bash:

# test yarn
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 1 50
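# optionally, confirm the job with the resource manager CLI
# (a sketch; yarn application is part of the stock Hadoop CLI)
yarn application -list -appStates FINISHED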

# test spark against yarn
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    $SPARK_HOME/examples/jars/spark-examples*.jar \
    100

# test spark standalone
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master spark://localhost:7077 \
    $SPARK_HOME/examples/jars/spark-examples*.jar \
    100

# start a scala spark shell
$SPARK_HOME/bin/spark-shell --master spark://localhost:7077

# start a python spark shell
pyspark --master spark://localhost:7077 > /tmp/jupyter.log 2>&1 &

# start a python spark shell against yarn
pyspark \
    --driver-memory 2g \
    --executor-memory 2g \
    --num-executors 1 \
    --executor-cores 1 \
    --conf spark.driver.maxResultSize=8g \
    --conf spark.network.timeout=2000 \
    --queue default \
    --master yarn > /tmp/jupyter.log 2>&1 &
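# follow the log of a backgrounded pyspark shell, then stop it when done
# (a sketch; /tmp/jupyter.log is the redirect target used above)
tail -f /tmp/jupyter.log
jobs        # list background jobs in this shell
kill %1     # stop background job 1 (the pyspark shell)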

Docker Hub

The pre-built image is published on Docker Hub.
