Docker image for starting Apache Zeppelin.
Note: The binary build step requires Java 8 to be installed.
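A quick way to confirm the active JVM before building (a sketch; assumes `java` is on the PATH):

# Should report a 1.8.x version string, e.g. java version "1.8.0_212"
java -version 2>&1 | head -n 1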
# Download dependencies
./build download --zeppelin_version=v0.8.2 --spark_version=2.4.3 --hadoop_version=2.7
# Build binaries
./build binary --zeppelin_version=v0.8.2 --spark_version=2.4.3 --hadoop_version=2.7
# Build docker container
./build docker --repo=${DOCKER_NAMESPACE:-datascienceplatform}/zeppelind --commit=$(git rev-parse --short HEAD)
# Or run all of the above steps in one go
./build all --zeppelin_version=v0.8.2 --spark_version=2.4.3 --hadoop_version=2.7
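After a successful build, the image should show up locally, tagged with the short commit hash and the component versions (a quick sanity check):

# Lists images in the target repository, e.g. tag f9d604cf-zv0.8.2-s2.4.3-h2.7
docker images "${DOCKER_NAMESPACE:-datascienceplatform}/zeppelind"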
You can either start the image directly with Docker, or use the Nomad-Docker-Wrapper if you are running your containers on Nomad.
There are two modes for running Zeppelin: multi-user and single-user. If you wish to test the SSSD integration, you will need to run the sssd container, which is explained in the sssd project documentation. Mount the volume /var/sssd/{{id}}/var/lib/sss/pipes:/var/lib/sss/pipes:rw into both the zeppelin container and the sssd container, and start the sssd container before the zeppelin container, as sketched below.
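A minimal sketch of that startup order; the sssd image name is a placeholder, see the sssd project documentation for the real invocation:

# Start the sssd container first, sharing the pipes volume (image name is an assumption)
docker run -d --name sssd \
-v /var/sssd/{{id}}/var/lib/sss/pipes:/var/lib/sss/pipes:rw \
<sssd-image>
# Then add the same -v mount to the zeppelin run commands below.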
# Multi-user login
docker run -p 8085:8085 \
-e ZEPPELIN_PROCESS_USER_NAME="zeppelin" \
-e ZEPPELIN_MEM="-Xmx1024m" \
-e ZEPPELIN_PROCESS_USER_ID=12345 \
-e ZEPPELIN_SERVER_PORT=8085 \
-e ZEPPELIN_SPARK_DRIVER_MEMORY="512M" \
-e ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.GitNotebookRepo \
-e ZEPPELIN_PROCESS_GROUP_NAME="DSP1_USERS" \
-e ZEPPELIN_PYSPARK_PYTHON=/usr/bin/python \
-e ZEPPELIN_SPARK_UI_PORT=4045 \
-e ZEPPELIN_PROCESS_GROUP_ID=12340 \
-e ZEPPELIN_SPARK_MASTER="local[*]" \
-e ZEPPELIN_PASSWORD="secret" \
-e ZEPPELIN_USER_TYPE=multiuser \
-v $(pwd)/notebooks:/usr/local/zeppelin/notebooks \
-v $(pwd)/conf:/usr/local/zeppelin/conf \
-v $(pwd)/hive:/hive \
-t pactosystems/zeppelind:f9d604cf-zv0.8.2-s2.4.3-h2.7
# Single-user login
docker run -p 8080:8080 \
-e ZEPPELIN_PROCESS_USER_NAME="zeppelin" \
-e ZEPPELIN_MEM="-Xmx1024m" \
-e ZEPPELIN_PROCESS_USER_ID=12345 \
-e ZEPPELIN_SERVER_PORT=8080 \
-e ZEPPELIN_SPARK_DRIVER_MEMORY="512M" \
-e ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.GitNotebookRepo \
-e ZEPPELIN_PROCESS_GROUP_NAME="DSP1_USERS" \
-e ZEPPELIN_PYSPARK_PYTHON=/usr/bin/python \
-e ZEPPELIN_SPARK_UI_PORT=4040 \
-e ZEPPELIN_PROCESS_GROUP_ID=12340 \
-e ZEPPELIN_SPARK_MASTER="local[*]" \
-e ZEPPELIN_PASSWORD="secret" \
-e ZEPPELIN_USER_TYPE=singleuser \
-v $(pwd)/notebooks:/usr/local/zeppelin/notebooks \
-v $(pwd)/conf:/usr/local/zeppelin/conf \
-v $(pwd)/hive:/hive \
-t pactosystems/zeppelind:f9d604cf-zv0.8.2-s2.4.3-h2.7
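Once a container is up, a quick smoke test against the Zeppelin REST API (adjust the host port if you changed the -p mapping or ZEPPELIN_SERVER_PORT):

# Returns a small JSON document with the running Zeppelin version
curl -fsS http://localhost:8080/api/version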
The Docker image requires several environment variables to be set; they configure your Zeppelin instance.
Variable | Description |
---|---|
ZEPPELIN_SPARK_MASTER | URL of the Spark master that Zeppelin should use. |
ZEPPELIN_PASSWORD | Password to use for authenticating as the zeppelin user on the UI. |
ZEPPELIN_NOTEBOOK_STORAGE | Notebook storage to use. |
ZEPPELIN_PROCESS_USER_NAME | User name to execute the Zeppelin process as. |
ZEPPELIN_PROCESS_USER_ID | User ID to execute the Zeppelin process as. |
ZEPPELIN_PROCESS_GROUP_NAME | Group name to assign to the Zeppelin user. |
ZEPPELIN_PROCESS_GROUP_ID | Group ID to assign to the Zeppelin user. |
ZEPPELIN_SERVER_PORT | Port to bind the Zeppelin server to. |
ZEPPELIN_SPARK_UI_PORT | Port to use for the Spark UI. |
ZEPPELIN_SPARK_DRIVER_MEMORY | Amount of memory to allocate to the Spark driver process (e.g. 512M). |
ZEPPELIN_PYSPARK_PYTHON | Path to the Python executable for the Spark worker nodes. |
ZEPPELIN_MEM | JVM options for the Zeppelin process (e.g. -Xmx1024m). |
These environment variables should be set in your Travis CI settings:
https://travis-ci.org/<your travis account name>/docker-zeppelin/settings
Variable | Description |
---|---|
DOCKER_USER | Your Docker ID. |
DOCKER_PASSWORD | Password for your Docker ID (make sure "display value in build log" is disabled for this variable). |
DOCKER_NAMESPACE | Docker namespace the image(s) should be pushed to; usually equal to DOCKER_USER. If empty, "datascienceplatform" is used, which can lead to permission issues while pushing. |
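If you prefer the command line over the web UI, the same variables can be set with the Travis CLI (a sketch, assuming the travis gem is installed and you are logged in):

travis env set DOCKER_USER <your docker ID> --public
travis env set DOCKER_PASSWORD <your password> --private
travis env set DOCKER_NAMESPACE <your docker ID> --public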
SQL databases are supported through SQLAlchemy (a Python package with a comprehensive set of tools for working with databases). This container includes the following dialects:
Dialect | Target DB |
---|---|
pymssql | Microsoft SQL Server |
psycopg2 | PostgreSQL |
cx_Oracle | Oracle |
Support for additional databases can be added by installing additional dialects into an Anaconda environment and setting ZEPPELIN_PYSPARK_PYTHON to that environment's Python interpreter, as sketched below.
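For example, MySQL support could be added roughly like this; the conda path, environment name, and pymysql package are assumptions, not something this image ships with:

# Install an extra dialect into a separate Anaconda environment (paths are examples)
conda create -y -n extra-dialects python=3.6 sqlalchemy
/opt/conda/envs/extra-dialects/bin/pip install pymysql
# Then point Zeppelin at that interpreter:
# -e ZEPPELIN_PYSPARK_PYTHON=/opt/conda/envs/extra-dialects/bin/python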
%spark.pyspark
from sqlalchemy import create_engine

# Connect to Microsoft SQL Server via the pymssql dialect
conn_str = "mssql+pymssql://<User>:<Password>@<Host>:<Port>"
engine = create_engine(conn_str)

# List all databases
res = engine.execute('SELECT * FROM master.sys.databases')
for row in res:
    print(row)
%spark.pyspark
from sqlalchemy import create_engine

# Connect to PostgreSQL via the psycopg2 dialect
conn_str = "postgresql+psycopg2://<User>:<Password>@<Host>:<Port>/<Database>"
engine = create_engine(conn_str)

# List all databases
res = engine.execute('SELECT * FROM pg_database')
for row in res:
    print(row)
%spark.pyspark
from sqlalchemy import create_engine

# Connect to Oracle via the cx_Oracle dialect
conn_str = "oracle+cx_oracle://<User>:<Password>@<Host>:<Port>/<db>"
engine = create_engine(conn_str)

# Query an example table
res = engine.execute('SELECT * FROM my_table')
for row in res:
    print(row)