This repository contains the configuration and build files necessary to produce the `quay.io/geodocker/jupyter-geopyspark` Docker image.
The Docker image allows easy use of GeoPySpark in a web browser via Jupyter and GeoNotebook without having to modify or configure the host computer (beyond what is needed to run Docker).
The process of using a pre-built container is discussed in the next section, and instructions for building the image and modifying it are also discussed.
You will be prompted for a username and a password when you direct your web browser to the container: the username and password are both `hadoop`.
One can use the image with or without making a clone of this repository.
To use the image without (or from outside of) a clone of this repository, first make sure that you are in possession of the image. The command

```
docker pull quay.io/geodocker/jupyter-geopyspark
```

will pull the latest version of the image.
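To pull a specific version instead, the same command can be given an image tag; the tag `a1b78b9` below is only an example (it is the tag used for illustration later in this README):

```
docker pull quay.io/geodocker/jupyter-geopyspark:a1b78b9
```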
The container can then be started by typing

```
docker run -it --rm --name geopyspark \
   -p 8000:8000 -p 4040:4040 \
   quay.io/geodocker/jupyter-geopyspark
```

or perhaps

```
docker run -it --rm --name geopyspark \
   -p 8000:8000 -p 4040:4040 \
   -v $HOME/.aws:/home/hadoop/.aws:ro \
   quay.io/geodocker/jupyter-geopyspark
```

if you wish to have your AWS credentials available in the container (e.g. for pulling data from S3).
To use the image from within a clone of this repository, there are two useful targets in the Makefile: `run` and `run-editable`.

To use the `run` target, type something like

```
TAG=latest make run
```

To use the `run` target with some image other than the latest one, type something like

```
TAG=a1b78b9 make run
```

which will launch a container using the image `quay.io/geodocker/jupyter-geopyspark:a1b78b9`.

The `run-editable` target also exists; it attempts to map one's local clone of GeoPySpark into the container so that that code can be edited and iterated on in a fairly convenient fashion. By default, it is assumed that the GeoPySpark code is present in `../geopyspark/geopyspark`, but that assumption can be changed by passing in an alternate location through the `GEOPYSPARK_DIR` environment variable. For example:

```
TAG=latest GEOPYSPARK_DIR=/tmp/geopyspark/geopyspark make run-editable
```

Both of those targets also pay attention to the `EXTRA_FLAGS` environment variable, which can be used to pass additional flags to `docker`.
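For example, assuming that the Makefile passes `EXTRA_FLAGS` straight through to the underlying `docker run` invocation, extra flags such as a volume mount (the host path here is purely illustrative) could be supplied like this:

```
TAG=latest EXTRA_FLAGS="-v $HOME/notebooks:/home/hadoop/notebooks" make run
```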
To build the image, type `make all`, `make image`, or simply `make`. Type

```
make run
```

to run the newly-built image. If the `TAG` environment variable is not set, the `run` target will use the tag of the newly-built image by default.
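Putting those pieces together, a minimal local build-and-run loop might look like the following (with no `TAG` set, `make run` falls back to the tag of the image that was just built):

```
make image   # or `make all`, or simply `make`
make run     # launches a container from the freshly-built image
```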
In this section we describe the structure of the repository and document how the various pieces interact as part of the build process.
- `archives` is an initially-empty directory that is populated with source code and built artifacts as part of the build process.
- `blobs` is an initially-empty directory that is populated with built artifacts from the `archives` directory. This directory exists because `archives` is listed in the `.dockerignore` file (which was done to reduce the size of the build context of the final image). Please see the README in that directory for more information.
- `config` contains the GeoNotebook configuration file and a list of python dependencies that GeoNotebook requires.
- `emr-docker` contains files useful for running the image on Amazon EMR (please see the README in that directory for more information).
- `terraform-docker` contains files useful for running the image on Amazon EMR using Terraform. Its remit is similar to that of the directory mentioned in the previous bullet-point, but it uses Terraform instead of shell scripts.
- `kernels` contains Jupyter kernel configuration files. The one most likely to be of interest is the one that enables GeoNotebook and GeoPySpark; the other two kernels are mostly vestigial/ceremonial.
- `notebooks` contains various sample notebooks.
- `scratch` is a scratch directory used during the build process. The files that are added under this directory during the build can be harmlessly deleted after the build is complete, but keeping them will accelerate subsequent builds.
- `scripts` contains various scripts used for building and installing artifacts.
   - `netcdf.sh` builds a jar from a particular branch of the Thredds project that provides support for reading NetCDF files.
   - `build-python-blob1.sh` runs in the context of the AWS build container; its purpose is to acquire most of the python dependencies needed by GeoPySpark and GeoNotebook and package them together into a tarball for later installation.
   - `build-python-blob2.sh` runs in the context of the AWS build container; its purpose is to package GeoPySpark and `GeoPySpark-NetCDF` into a tarball for later installation.
   - `install-blob1.sh` runs in the context of the final image build. Its purpose is to install the artifacts created earlier by `build-python-blob1.sh`.
   - `install-blob2.sh` runs in the context of the final image build. Its purpose is to install the artifacts created earlier by `build-python-blob2.sh`.
- `Dockerfile` specifies the final image, the output of the build process.
- `Makefile` coordinates the build process.
- `README.md` is this file.
The build process can be divided into three stages: the bootstrap image creation stage, the EMR-compatible artifact creation stage, and the final image build stage.
When the `all` makefile target is invoked, the last two stages of the three-stage build process are done. The first of the three stages is done using the contents of the `rpms/build` directory. Its results have already been pushed to the `quay.io/geodocker` docker repository, so unless the reader wishes to modify the bootstrap images, this stage can be considered complete. To rebuild the bootstrap images, the reader should navigate into the `rpms/build` directory and run the `./build.sh` script.
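In shell form, assuming the commands are run from the root of this repository, that is:

```
cd rpms/build
./build.sh
```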
The purpose of the second of the three stages is to build python artifacts that need to be linked against binary dependencies which have been built in a context that resembles EMR (because we want the image to be usable on EMR).
First, a tarball containing python code linked against the binary dependencies mentioned above is created. Then, another python tarball containing GeoPySpark is created. The reason that there are two python tarballs instead of one is simply that the contents of the two tarballs change at different rates; over repeated builds, the first tarball is rebuilt less frequently than the second one.
In the third of the three stages, the artifacts which were created earlier are brought together and installed into the final docker image.
As an example of how to make a meaningful modification to the image, in this section we will describe the process of adding new binary dependencies to the image.
Currently, all binary dependencies are located in the file `gdal-and-friends.tar.gz`, which comes in via the `quay.io/geodocker/jupyter-geopyspark:base-2` image on which the final image is based.
If we want to add an additional binary dependency inside of that file,
then we only need to download or otherwise acquire the source code
and update the build script to build and package the additional code.
If we wish to add a binary dependency outside of the `gdal-and-friends.tar.gz` file, then the process is slightly more involved, but potentially faster because it is not necessary to rebuild the bootstrap images.
The strategy for adding a new binary dependency, hypothetically `libHelloWorld` packaged in a file called `helloworld-and-friends.tar.gz`, will be to mirror the process for `gdal-and-friends.tar.gz` to the extent that we can.
The difference is that this time we will add the binary to the final image rather than to a bootstrap image.
- First, augment the `Makefile` to download or otherwise ensure the existence of the `libHelloWorld` source code.
- Next, we want to build and package `libHelloWorld` in the context of the AWS build image, so that it will be usable on EMR. This would probably be done by first creating a script analogous to the one for GDAL that builds, links, and archives the dependency.
- That script should run in the context of the AWS build container so that the created binaries are compiled and linked in an environment that resembles EMR.
- The resulting archived binary blob should then be added to the final image so that it can be distributed to the Spark executors. That should probably be done by adding a `COPY` command to the Dockerfile to copy the new blob to the `/blobs` directory of the image (see the sketch after this list).
- Finally, the image environment and the kernel should both be modified to make use of the new dependency. The former will probably involve the addition of an `ENV` command to the Dockerfile to augment the `LD_LIBRARY_PATH` environment variable so that any new shared libraries can be found; the latter is described below.
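As a rough illustration, the Dockerfile additions for the hypothetical `helloworld-and-friends.tar.gz` blob might look something like the following. This is an untested sketch: the `blobs/` source path and the `/usr/local/helloworld` unpack location are assumptions, and only the `/blobs` destination follows from the convention described above.

```dockerfile
# Hypothetical, untested sketch.  Copy the new blob into the image so that it
# can later be shipped to the Spark executors via --archives ...
COPY blobs/helloworld-and-friends.tar.gz /blobs/helloworld-and-friends.tar.gz

# ... and make any new shared libraries visible to driver-side code inside the
# container.  The /usr/local/helloworld prefix is an assumption; use whatever
# location the libraries are actually unpacked to in the image.
ENV LD_LIBRARY_PATH /usr/local/helloworld/lib:${LD_LIBRARY_PATH}
```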
The changes to the kernel described in the last bullet-point would probably look something like this:

```diff
@@ -14,6 +14,6 @@
     "PYTHONPATH": "/usr/local/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip:/usr/local/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip",
     "GEOPYSPARK_JARS_PATH": "/opt/jars",
     "YARN_CONF_DIR": "/etc/hadoop/conf",
-    "PYSPARK_SUBMIT_ARGS": "--archives /blobs/gdal-and-friends.tar.gz,/blobs/friends-of-geopyspark.tar.gz,/blobs/geopyspark-sans-friends.tar.gz --conf spark.executorEnv.LD_LIBRARY_PATH=gdal-and-friends.tar.gz/lib --conf spark.executorEnv.PYTHONPATH=friends-of-geopyspark.tar.gz/:geopyspark-sans-friends.tar.gz/ --conf hadoop.yarn.timeline-service.enabled=false pyspark-shell"
+    "PYSPARK_SUBMIT_ARGS": "--archives /blobs/helloworld-and-friends.tar.gz,/blobs/gdal-and-friends.tar.gz,/blobs/friends-of-geopyspark.tar.gz,/blobs/geopyspark-sans-friends.tar.gz --conf spark.executorEnv.LD_LIBRARY_PATH=helloworld-and-friends.tar.gz/lib:gdal-and-friends.tar.gz/lib --conf spark.executorEnv.PYTHONPATH=friends-of-geopyspark.tar.gz/:geopyspark-sans-friends.tar.gz/ --conf hadoop.yarn.timeline-service.enabled=false pyspark-shell"
   }
 }
```
(The changes represented by the diff above have not been tested.)
The process for adding new distributed python dependencies is analogous to the one above, except that changes to the `LD_LIBRARY_PATH` variable on the executors might not be required, and additions will most probably need to be made to the `--conf spark.executorEnv.PYTHONPATH` configuration passed in via `PYSPARK_SUBMIT_ARGS` in the kernel.
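For example, a hypothetical python blob named `my-python-deps.tar.gz` (not something produced by this repository) would need to be added both to the `--archives` list and to the executor `PYTHONPATH`, roughly along these lines (illustrative and untested, following the kernel fragment above):

```
--archives /blobs/my-python-deps.tar.gz,/blobs/gdal-and-friends.tar.gz,/blobs/friends-of-geopyspark.tar.gz,/blobs/geopyspark-sans-friends.tar.gz \
--conf spark.executorEnv.PYTHONPATH=my-python-deps.tar.gz/:friends-of-geopyspark.tar.gz/:geopyspark-sans-friends.tar.gz/
```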
To build the RPMs, navigate into the `rpms/build` directory and type `./build.sh`.
To use the RPM-based deployment, navigate into the `terraform-nodocker` directory. The configuration in that directory requires Terraform version 0.10.6 or greater. If you want to use Google OAuth, GitHub OAuth, or some supported generic type of OAuth, then type

```
terraform init
terraform apply
```

and respond appropriately to the prompts. Doing that will upload (or sync) the RPMs to the S3 location that you specify, and will also upload the `terraform-nodocker/bootstrap.sh` bootstrap script. If you do not wish to use OAuth, then some modifications to the bootstrap script will be required.
In order to use OAuth for login, two things are necessary:
it is necessary to set three environment variables inside of the container before the JupyterHub process is launched, and
it is necessary to use a `jupyterhub_config.py` file that enables the desired OAuth setup.
The three environment variables that must be set are `OAUTH_CALLBACK_URL`, `OAUTH_CLIENT_ID`, and `OAUTH_CLIENT_SECRET`.
The first of those three variables should be set to `http://localhost:8000/hub/oauth_callback` for local testing and to something like `http://$(hostname -f):8000/hub/oauth_callback` for deployment.
The second and third are dependent on the OAuth provider.
There are three such `jupyterhub_config.py` files already included in the image: one for Google and related services, one for GitHub, and a generic one. There is some variability in the precise details of how OAuth providers work (e.g. some require variables to be passed in the URL of a POST request, whereas others require variables to be passed in the body of a POST request). For that reason, the generic configuration should be considered a starting point rather than something that is guaranteed to work in its unmodified state.
There are only two user accounts in the image: `root` and `hadoop`.
All three of the configurations discussed above map all valid OAuth users to the `hadoop` account.
That is done because -- without additional configuration -- Spark jobs on EMR must come from a user named "`hadoop`".
(The users inside of the container are separate and distinct from those on the host instance,
but the username is evidently part of a Spark job submission, so it must match that of the user that EMR is expecting submissions from.)
To use OAuth, launch a container with the three variables supplied and with the appropriate `jupyterhub_config.py` used.

```
docker run -it --rm --name geopyspark \
   -p 8000:8000 \
   -e OAUTH_CALLBACK_URL=http://localhost:8000/hub/oauth_callback \
   -e OAUTH_CLIENT_ID=xyz \
   -e OAUTH_CLIENT_SECRET=abc \
   quay.io/geodocker/jupyter-geopyspark:latest \
   jupyterhub \
   -f /etc/jupterhub/jupyterhub_config_github.py \
   --no-ssl --Spawner.notebook_dir=/home/hadoop/notebooks
```
This was discussed earlier.