Refer to this repo: venv-template, and follow the guide mentioned in the repo.
Refer to this repo: spark-examples, and follow the guide.
For the purposes of this documentation, some modifications have been made to the spark-examples repo and some of its documentation has been summarized.
- Prepare the virtual environment to install the required Python dependencies:
  - cd spark-examples, which contains a simple Spark application that calculates the value of Pi using Spark.
  - cd spark_env inside the spark-examples folder.
  - Edit requirements.txt (a minimal example is sketched right after these steps).
  - Create the virtual environment by calling ./setup.sh
  - To recreate the virtual environment, simply delete the folder ./venv and call ./setup.sh again.
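What goes into requirements.txt depends on the example you want to run; a minimal sketch, assuming pyspark is the only dependency needed for the Pi example:
pyspark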
- Pick the number of nodes by adjusting the line
#SBATCH --nodes=2
in start_spark_cluster.sh or start-develbooster-spark.sh. The latter is already configured to run on a development node, where one can get resources without waiting long, but compute time is limited to 2 hours (a sketch of the relevant batch header follows below).
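For orientation, the header of start-develbooster-spark.sh looks roughly like the following. The job name, partition, account placeholder and time limit are assumptions based on the squeue output and the 2-hour limit mentioned above; check the script in the repo for the exact values.
#!/bin/bash
#SBATCH --job-name=spark-cluster     # assumption: shows up truncated as "spark-cl" in squeue
#SBATCH --nodes=2                    # adjust to the desired number of nodes
#SBATCH --partition=develbooster     # development partition with short waiting times
#SBATCH --time=02:00:00              # development jobs are limited to 2 hours
#SBATCH --account=<your-project>     # placeholder: your compute project
#SBATCH --mail-user="your@email.com"
#SBATCH --mail-type=ALL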
To start your Spark cluster on JUWELS, simply run:
sbatch start-develbooster-spark.sh
Note: start-develbooster-spark.sh is a Slurm batch script and contains all the configuration you need to start a Spark job on the JUWELS cluster. More info about how to write a batch script for Slurm can be found here: Juwels slurm batch system
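sbatch acknowledges the submission by printing the job id, for example:
Submitted batch job 6525353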
Once you have submitted the batch job, the job id is displayed in the console (as in the example above). One has to wait a while until resources are allocated. start-develbooster-spark.sh has these variables defined:
#SBATCH --mail-user="your@email.com"
#SBATCH --mail-type=ALL
You should get an email notification as soon as your job is running. You can also open another terminal window, ssh into the JUWELS login node and watch for the resource allocation by executing: watch -n 5 squeue --me
which updates every 5 seconds.
Once resources have been allocated, you should see something like this:
[username@jwlogin23 spark-examples]$ watch -n 5 squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6525353 develboos spark-cl username R 00:01 2 jwb[0129,0149]
In the above output, jwb0129 is the master node and jwb0149 is the worker node. In a case with more than two nodes, the first node in the NODELIST is always the master node and the rest are the worker nodes.
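This convention is easy to reproduce inside a batch script. A minimal sketch (not necessarily how start-develbooster-spark.sh does it) that resolves the first host of the allocation as the Spark master:
# Expand the compressed node list, e.g. jwb[0129,0149] -> jwb0129 and jwb0149,
# and treat the first host as the Spark master.
MASTER_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
echo "Spark master runs on $MASTER_NODE"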
In the terminal where you sshed into JUWELS and submitted the Slurm batch script, activate the Python venv and run this command: export MASTER_URL=spark://jwb0129i.juwels:4124
Next, just run the Python script: python pyspark_pi.py
This job should take a few seconds and an output with the value of Pi will be shown.
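The actual script ships with the spark-examples repo; for orientation, a minimal sketch of a Monte-Carlo Pi estimator that connects to the master URL exported above could look like this (the real pyspark_pi.py may differ):
import os
import random

from pyspark.sql import SparkSession

# MASTER_URL is assumed to be exported in the shell, as shown above.
spark = (
    SparkSession.builder
    .master(os.environ["MASTER_URL"])
    .appName("pyspark-pi")
    .getOrCreate()
)

def inside(_):
    # Sample a random point in the unit square and check whether
    # it falls inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return x * x + y * y <= 1.0

num_samples = 1000000
count = (
    spark.sparkContext
    .parallelize(range(num_samples))
    .filter(inside)
    .count()
)

print("Pi is roughly %f" % (4.0 * count / num_samples))
spark.stop()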
The Spark History Server is a user interface that is used to monitor the metrics and performance of completed Spark applications. A few configuration changes are needed to ensure that the history server runs smoothly.
start-develbooster-spark.sh contains the configuration export SPARK_CONF_DIR=$SLURM_SUBMIT_DIR, where $SLURM_SUBMIT_DIR contains a spark-defaults.conf with the following contents:
spark.driver.memory 2g
spark.executor.memory 2g
spark.executor.cores 2
spark.yarn.executor.memoryOverhead 512
spark.default.parallelism 4
spark.history.ui.port 18080
spark.history.acls.enable true
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.retainedApplications 100
spark.history.fs.update.interval 10s
spark.eventLog.enabled true
spark.eventLog.dir file:///p/home/jusers/kamathbola1/juwels/project/kamathbola1/repos/spark-examples/spark-history
spark.history.fs.logDirectory file:///p/project/opengptx-elm/kamathbola1/repos/spark-examples/spark-history
spark.eventLog.enabled, spark.eventLog.dir and spark.history.fs.logDirectory are the important parameters that have to be defined in order to get the Spark History Server working.
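Note that Spark does not create the event log directory automatically; if it is missing, applications typically fail at startup complaining about the log directory. Create it once at the location referenced by spark.eventLog.dir before submitting the job, e.g. inside the repository checkout:
mkdir -p spark-history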
To actually view the history server, one has to port-forward and SSH-jump via the login node to the master node where the Spark History Server runs. E.g., in this case the command would be something like this: ssh -L 18080:localhost:18080 -L 8080:localhost:8080 USERNAME@jwb0129i.juwels -i /path/to/private/sshkey -J USERNAME@juwels-booster.fz-juelich.de
Note: The history server is started when the Slurm batch job is submitted.
Then, in your browser, just go to localhost:18080 to view the Spark History Server. In case of redirects when viewing other logs, please note the URL in your browser. If it is of the format jwb0129i:18080, change it to localhost:18080.
Note: The history server only runs while your Slurm job allocation exists, i.e. since it runs on the master node, the history server stops once your access to the master node ends. To get the history server back up and running on your local machine, use scp or rsync to copy the spark-history folder to your local machine and note down the path (an example follows below). Please refer to the official Spark documentation. Link the conf file when starting the history server from your local Spark installation: ./sbin/start-history-server.sh --properties-file /path/to/history-server.conf
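For example, to copy the event logs (a sketch; the remote path is a placeholder for wherever your spark-examples checkout lives on JUWELS):
rsync -avz USERNAME@juwels-booster.fz-juelich.de:/path/to/spark-examples/spark-history /path/to/local/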
The history-server.conf should look like this:
spark.history.ui.port 18080
spark.history.acls.enable true
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.retainedApplications 100
spark.history.fs.update.interval 10s
spark.eventLog.enabled true
spark.eventLog.dir file:///path/to/local/spark-history
spark.history.fs.logDirectory file:///path/to/local/spark-history
#spark.io.compression.codec snappy