spark-jupyter


Purpose

This Docker container is meant for learning PySpark programming. It includes the following components (a quick way to verify them inside the container is shown after the list).

  • Hadoop v3.2.1
  • Spark v2.4.4
  • Conda 3 with Python v3.7
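As a quick sanity check, you can confirm the installed versions from a shell inside the container (a sketch; it assumes hadoop and python are on the PATH, as the commands further below suggest):

hadoop version                        # should report Hadoop 3.2.1
$SPARK_HOME/bin/spark-submit --version  # should report Spark 2.4.4
python --version                      # should report Python 3.7.x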

After running the container, you may visit the following pages (the standard web UIs behind the mapped ports; port 9000 is the HDFS RPC endpoint and has no web page):

  • Jupyter Lab at http://localhost:8888
  • HDFS NameNode UI at http://localhost:9870
  • HDFS DataNode UI at http://localhost:9864
  • YARN ResourceManager UI at http://localhost:8088
  • Spark Master UI at http://localhost:8080
  • Spark History Server UI at http://localhost:18080

As can be seen, Jupyter Lab is running on port 8888. An example notebook is mounted at /root/ipynb. To get the PySpark code to run, you have to upload the data.csv file to HDFS first; a sketch of the upload follows. View the example notebook.
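One way to do the upload from a shell inside the container (a sketch; the target directory /data is an assumption, so adjust it to whatever path the notebook reads from):

hdfs dfs -mkdir -p /data        # create a target directory in HDFS (/data is an assumption)
hdfs dfs -put data.csv /data/   # upload the local data.csv
hdfs dfs -ls /data              # verify the file landed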

Docker

To run the container:

docker run -it \
    -p 9870:9870 \
    -p 8088:8088 \
    -p 8080:8080 \
    -p 18080:18080 \
    -p 9000:9000 \
    -p 8888:8888 \
    -p 9864:9864 \
    -v $HOME/git/docker-containers/spark-jupyter/ubuntu/root/ipynb:/root/ipynb \
    -e PYSPARK_MASTER=spark://localhost:7077 \
    spark-jupyter:local
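Once the container is up, a quick way to confirm the services are reachable from the host (a sketch; the URLs follow from the port mappings above):

docker ps                                    # the container should be listed
curl -sI http://localhost:8888 | head -n 1   # Jupyter Lab should answer
curl -sI http://localhost:9870 | head -n 1   # HDFS NameNode UI should answer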

To run the container with a password:

docker run -it \
    -p 9870:9870 \
    -p 8088:8088 \
    -p 8080:8080 \
    -p 18080:18080 \
    -p 9000:9000 \
    -p 8888:8888 \
    -p 9864:9864 \
    -v $HOME/git/docker-containers/spark-jupyter/ubuntu/root/ipynb:/root/ipynb \
    -e PYSPARK_MASTER=spark://localhost:7077 \
    -e NOTEBOOK_PASSWORD=sha1:6676da7235c8:9c7d402c01e330b9368fa9e1637233748be11cc5 \
    spark-jupyter:local
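The value of NOTEBOOK_PASSWORD is a hashed Jupyter password, not the plain text. One way to generate such a hash (assuming the classic notebook package, whose older releases emit the sha1: format shown above):

python -c 'from notebook.auth import passwd; print(passwd())'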

Commands to try after entering the container, e.g. docker exec -it <id> /bin/bash:

# test yarn
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 1 50
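# optionally, confirm the job with the resource manager CLI
# (a sketch; yarn application is part of the stock Hadoop CLI)
yarn application -list -appStates FINISHED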

# test spark against yarn
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    $SPARK_HOME/examples/jars/spark-examples*.jar \
    100

# test spark standalone
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master spark://localhost:7077 \
    $SPARK_HOME/examples/jars/spark-examples*.jar \
    100

# start a scala spark shell
$SPARK_HOME/bin/spark-shell --master spark://localhost:7077

# start a python spark shell
pyspark --master spark://localhost:7077 > /tmp/jupyter.log 2>&1 &

# start a python spark shell against yarn
pyspark \
    --driver-memory 2g \
    --executor-memory 2g \
    --num-executors 1 \
    --executor-cores 1 \
    --conf spark.driver.maxResultSize=8g \
    --conf spark.network.timeout=2000 \
    --queue default \
    --master yarn > /tmp/jupyter.log 2>&1 &
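# follow the log of a backgrounded pyspark shell, then stop it when done
# (a sketch; /tmp/jupyter.log is the redirect target used above)
tail -f /tmp/jupyter.log
jobs        # list background jobs in this shell
kill %1     # stop background job 1 (the pyspark shell)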

Docker Hub

The pre-built image is published on Docker Hub.
