feat(build): spark 3.5.2 #42

Merged: 15 commits, Sep 11, 2024
Changes from 14 commits
74 changes: 73 additions & 1 deletion .github/workflows/ci.yml
@@ -135,10 +135,82 @@ jobs:
scala: "2.13"
with_hive: "true"
with_pyspark: "true"
- spark: "3.5.1"
java: "8"
hadoop: "3.3.4"
scala: "2.12"
with_hive: "true"
with_pyspark: "true"
- spark: "3.5.1"
java: "8"
hadoop: "3.3.4"
scala: "2.13"
with_hive: "true"
with_pyspark: "true"
- spark: "3.5.1"
java: "8"
hadoop: "3.3.6"
scala: "2.12"
with_hive: "true"
with_pyspark: "true"
- spark: "3.5.1"
java: "8"
hadoop: "3.3.6"
scala: "2.13"
with_hive: "true"
with_pyspark: "true"
- spark: "3.5.1"
java: "11"
hadoop: "3.3.4"
scala: "2.12"
with_hive: "true"
with_pyspark: "true"
- spark: "3.5.1"
java: "11"
hadoop: "3.3.4"
scala: "2.13"
with_hive: "true"
with_pyspark: "true"
- spark: "3.5.1"
java: "11"
hadoop: "3.3.6"
scala: "2.12"
with_hive: "true"
with_pyspark: "true"
- spark: "3.5.1"
java: "11"
hadoop: "3.3.6"
scala: "2.13"
with_hive: "true"
with_pyspark: "true"
- spark: "3.5.1"
java: "17"
hadoop: "3.3.4"
scala: "2.12"
with_hive: "true"
with_pyspark: "true"
- spark: "3.5.1"
java: "17"
hadoop: "3.3.4"
scala: "2.13"
with_hive: "true"
with_pyspark: "true"
- spark: "3.5.1"
java: "17"
hadoop: "3.3.6"
scala: "2.12"
with_hive: "true"
with_pyspark: "true"
- spark: "3.5.1"
java: "17"
hadoop: "3.3.6"
scala: "2.13"
with_hive: "true"
with_pyspark: "true"
runs-on: ubuntu-20.04
env:
IMAGE_NAME: "spark-k8s"
SELF_VERSION: "v3"
SELF_VERSION: "v4"
SPARK_VERSION: "${{ matrix.version.spark }}"
HADOOP_VERSION: "${{ matrix.version.hadoop }}"
SCALA_VERSION: "${{ matrix.version.scala }}"
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,13 @@
# CHANGELOG

## v4
- Drop support for all Spark 2.y.z versions
- Add Spark 3.5.1
- Add Hadoop 3.3.6
- Add support for Java 17 for Spark 3.5.1
- Fix Ubuntu-based images to use the `jre-focal` variant instead of `jre`, which was recently upgraded to Ubuntu Jammy (22.04) and broke system-level Python package installation due to [PEP 668](https://issues.apache.org/jira/browse/SPARK-49068)
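As a side note, the PEP 668 failure mode can be guarded against explicitly. A minimal sketch (not part of this PR; the `pip_install_flags` helper is hypothetical), assuming the Debian/Ubuntu-style `EXTERNALLY-MANAGED` marker file that Jammy-era images place in the Python stdlib directory:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: decide whether pip needs the PEP 668 escape hatch by
# checking for the EXTERNALLY-MANAGED marker file that Debian/Ubuntu ship
# inside the Python stdlib directory.
pip_install_flags() {
  local stdlib_dir="$1"   # e.g. /usr/lib/python3.12
  if [ -f "${stdlib_dir}/EXTERNALLY-MANAGED" ]; then
    # Global installs are blocked on Jammy-era images unless overridden.
    echo "--break-system-packages"
  fi
}

# Demo against a throwaway directory standing in for the stdlib path.
demo_dir="$(mktemp -d)"
touch "${demo_dir}/EXTERNALLY-MANAGED"
pip_install_flags "${demo_dir}"   # emits: --break-system-packages
rm -rf "${demo_dir}"
```

`--break-system-packages` is pip's own opt-out (pip >= 23.0); pinning the base image to `jre-focal`, as this PR does, avoids the problem entirely instead of overriding it.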
@tyng94 commented (Sep 10, 2024):
lol I'm pretty sure this guy (Chao Sun) who reported the issue was a Singaporean colleague at my previous company

Author replied:
Time to drop him a message to share how to fix it

- Remove docker tags without self version

## v3

- (Temporarily drop support for R due to keyserver issues)
Expand Down
17 changes: 9 additions & 8 deletions README.md
@@ -13,20 +13,21 @@ Debian:
- `3.2.2`
- `3.1.3`
- `3.4.1`
- `3.5.1`

## Note
## Notes

(R builds are temporarily suspended due to keyserver issues at this time.)

Build image for Spark 3.4.1 is Ubuntu based because openjdk is deprecated and
going forward the official Spark repository uses `eclipse-temurin:<java>-jre`
Build image for Spark 3.4.1/3.5.1 is Ubuntu based because openjdk is deprecated and
going forward the official Spark repository uses `eclipse-temurin:<java>-jre-focal`
where slim variants of jre images are not available at the moment.

All the build images with Spark before v3.4.0 are Debian based as the official
Spark repository now uses `openjdk:<java>-jre-slim-buster` as the base image
for Kubernetes build. Because currently the official Dockerfiles do not pin
the Debian distribution, they are incorrectly using the latest Debian `bullseye`,
which does not have support for Python 2, and its Python 3.9 does not work well
with PySpark.

Hence some Dockerfile overrides are in-place to make sure that Spark 2 builds
@@ -48,11 +49,11 @@ For quick testing of local build, you should do the following commands:

```bash
export IMAGE_NAME=spark-k8s
export SELF_VERSION="v3"
export SELF_VERSION="v4"
export SCALA_VERSION="2.12"
export SPARK_VERSION="3.3.0"
export HADOOP_VERSION="3.3.2"
export JAVA_VERSION="11"
export SPARK_VERSION="3.5.1"
export HADOOP_VERSION="3.3.6"
export JAVA_VERSION="17"
export WITH_HIVE="true"
export WITH_PYSPARK="true"
bash make-distribution.sh
8 changes: 4 additions & 4 deletions build.sh
@@ -1,9 +1,9 @@
export IMAGE_NAME=spark-k8s
export SELF_VERSION="v3"
export SELF_VERSION="v4"
export SCALA_VERSION="2.12"
export SPARK_VERSION="3.3.0"
export HADOOP_VERSION="3.3.2"
export JAVA_VERSION="11"
export SPARK_VERSION="3.5.2"
export HADOOP_VERSION="3.3.6"
export JAVA_VERSION="8"
export WITH_HIVE="true"
export WITH_PYSPARK="true"
bash make-distribution.sh
41 changes: 4 additions & 37 deletions make-distribution.sh
@@ -40,57 +40,25 @@ TERM=xterm-color ./dev/make-distribution.sh \
${HIVE_INSTALL_FLAG:+"-Phive"} \
-DskipTests

SPARK_MAJOR_VERSION="$(echo "${SPARK_VERSION}" | cut -d '.' -f1)"
Author commented: not needed anymore, as we no longer use any Spark version that is 2.y.z

HADOOP_MAJOR_VERSION="$(echo "${HADOOP_VERSION}" | cut -d '.' -f1)"
HIVE_HADOOP3_HIVE_EXEC_URL=${HIVE_HADOOP3_HIVE_EXEC_URL:-https://github.com/guangie88/hive-exec-jar/releases/download/1.2.1.spark2-hadoop3/hive-exec-1.2.1.spark2.jar}

# Replace Hive for Hadoop 3 since Hive 1.2.1 does not officially support Hadoop 3 when using Spark 2.y.z
# Note docker-image-tool.sh takes the jars from assembly/target/scala-2.*/jars
if [[ "${WITH_HIVE}" = "true" ]] && [[ "${SPARK_MAJOR_VERSION}" -eq 2 ]] && [[ "${HADOOP_MAJOR_VERSION}" -eq 3 ]]; then
HIVE_EXEC_JAR_NAME="hive-exec-1.2.1.spark2.jar"
TARGET_JAR_PATH="$(find assembly -type f -name "${HIVE_EXEC_JAR_NAME}")"
curl -LO "${HIVE_HADOOP3_HIVE_EXEC_URL}" && mv "${HIVE_EXEC_JAR_NAME}" "${TARGET_JAR_PATH}"
# Spark <= 2.4 uses ${TARGET_JAR_PATH} for Docker COPY, but Spark >= 3 uses dist/jars/
cp "${TARGET_JAR_PATH}" "dist/jars/"
fi

SPARK_MAJOR_VERSION="$(echo "${SPARK_VERSION}" | cut -d '.' -f1)"
SPARK_MINOR_VERSION="$(echo "${SPARK_VERSION}" | cut -d '.' -f2)"
HADOOP_MAJOR_VERSION="$(echo "${HADOOP_VERSION}" | cut -d '.' -f1)"

if [[ ${SPARK_MAJOR_VERSION} -eq 2 && ${SPARK_MINOR_VERSION} -eq 4 ]]; then # 2.4.z
Author commented: not needed anymore, as we no longer use any Spark version that is 2.y.z

# Same Dockerfiles as Spark v2.4.8, but allow override of base image to use Debian Buster
# and not using PYTHONENV and instead copies pyspark out like Spark 3.y.z
DOCKERFILE_BASE="../overrides/base/2.4.z/Dockerfile"
DOCKERFILE_PY="../overrides/python/2.4.z/Dockerfile"
else
DOCKERFILE_BASE="./resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile"
DOCKERFILE_PY="./resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile"
fi
DOCKERFILE_BASE="./resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile"
DOCKERFILE_PY="./resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile"

if [[ ${SPARK_MAJOR_VERSION} -eq 3 && ${SPARK_MINOR_VERSION} -ge 4 ]]; then # >=3.4
# From Spark v3.4.0 onwards, openjdk is not the preferred base image source, as it is
# deprecated and taken over by eclipse-temurin. slim-buster variants are not available
# on eclipse-temurin at the moment.
IMAGE_VARIANT="jre"
Author commented: jre-8 no longer works properly with the current setup: it was upgraded to Ubuntu 22, and the Python in that version raises an error when the user installs Python packages globally. However, the Dockerfile packaged in Spark still installs packages globally (see the previous GitHub Actions logs). [screenshot]

IMAGE_VARIANT="jre-focal"
else
IMAGE_VARIANT="jre-slim-buster"
fi
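The branch above can be captured as a standalone helper; a sketch (the `pick_image_variant` name is hypothetical, not from this repo), assuming the same `major.minor.patch` version scheme as the script:

```shell
#!/usr/bin/env bash
# Sketch of the base-image-variant decision: Spark >= 3.4 moved from openjdk
# (jre-slim-buster) to eclipse-temurin, where the focal-based jre tags are
# the ones that still allow global Python package installation.
pick_image_variant() {
  local spark_version="$1"
  local major minor
  major="$(echo "${spark_version}" | cut -d '.' -f1)"
  minor="$(echo "${spark_version}" | cut -d '.' -f2)"
  if [[ ${major} -eq 3 && ${minor} -ge 4 ]]; then
    echo "jre-focal"
  else
    echo "jre-slim-buster"
  fi
}

pick_image_variant "3.5.1"   # -> jre-focal
pick_image_variant "3.3.0"   # -> jre-slim-buster
```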

# Temporarily remove R build due to keyserver issue
# DOCKERFILE_R="./resource-managers/kubernetes/docker/src/main/dockerfiles/R/Dockerfile"

SPARK_LABEL="${SPARK_VERSION}"
TAG_NAME="${SELF_VERSION}_${SPARK_LABEL}_hadoop-${HADOOP_VERSION}_scala-${SCALA_VERSION}_java-${JAVA_VERSION}"
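For concreteness, the tag template above expands like this (a sketch using example values from this PR's build matrix):

```shell
#!/usr/bin/env bash
# Sketch of the image tag layout; the values are one combination from the
# Spark 3.5.1 matrix added in this PR.
SELF_VERSION="v4"
SPARK_LABEL="3.5.1"
HADOOP_VERSION="3.3.6"
SCALA_VERSION="2.13"
JAVA_VERSION="17"
TAG_NAME="${SELF_VERSION}_${SPARK_LABEL}_hadoop-${HADOOP_VERSION}_scala-${SCALA_VERSION}_java-${JAVA_VERSION}"
echo "${TAG_NAME}"   # -> v4_3.5.1_hadoop-3.3.6_scala-2.13_java-17
```

Leading with `SELF_VERSION` is what lets the push script drop the unversioned alias: every rebuild against a new base JRE gets a distinct tag.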

# ./bin/docker-image-tool.sh \
Author commented: R is not used anymore.

# -b java_image_tag=${JAVA_VERSION}-jre-slim-buster \
# -r "${IMAGE_NAME}" \
# -t "${TAG_NAME}" \
# -f "${DOCKERFILE_BASE}" \
# -p "${DOCKERFILE_PY}" \
# -R "${DOCKERFILE_R}" \
# build

./bin/docker-image-tool.sh \
-b java_image_tag=${JAVA_VERSION}-${IMAGE_VARIANT} \
-r "${IMAGE_NAME}" \
@@ -101,6 +69,5 @@ TAG_NAME="${SELF_VERSION}_${SPARK_LABEL}_hadoop-${HADOOP_VERSION}_scala-${SCALA_

docker tag "${IMAGE_NAME}/spark:${TAG_NAME}" "${IMAGE_NAME}:${TAG_NAME}"
docker tag "${IMAGE_NAME}/spark-py:${TAG_NAME}" "${IMAGE_NAME}-py:${TAG_NAME}"
# docker tag "${IMAGE_NAME}/spark-r:${TAG_NAME}" "${IMAGE_NAME}-r:${TAG_NAME}"

popd >/dev/null
15 changes: 2 additions & 13 deletions push-images.sh
@@ -11,21 +11,10 @@ else
fi

TAG_NAME="${SELF_VERSION}_${SPARK_LABEL}_hadoop-${HADOOP_VERSION}_scala-${SCALA_VERSION}_java-${JAVA_VERSION}"
ALT_TAG_NAME="${SPARK_LABEL}_hadoop-${HADOOP_VERSION}_scala-${SCALA_VERSION}_java-${JAVA_VERSION}"

docker tag "${IMAGE_NAME}:${TAG_NAME}" "${IMAGE_ORG}/${IMAGE_NAME}:${TAG_NAME}"
Author commented: only use image tags that include the self version, to prevent overwriting previously working images when rebuilding with new base JRE images.

docker tag "${IMAGE_NAME}:${TAG_NAME}" "test_${IMAGE_ORG}/${IMAGE_NAME}:${TAG_NAME}"
Reviewer commented: do you need to revert the test_ prefix?

Author replied: ah, thanks for the catch 6c00c0e

docker push "${IMAGE_ORG}/${IMAGE_NAME}:${TAG_NAME}"
docker tag "${IMAGE_NAME}:${TAG_NAME}" "${IMAGE_ORG}/${IMAGE_NAME}:${ALT_TAG_NAME}"
docker push "${IMAGE_ORG}/${IMAGE_NAME}:${ALT_TAG_NAME}"

# Python image push
docker tag "${IMAGE_NAME}-py:${TAG_NAME}" "${IMAGE_ORG}/${IMAGE_NAME}-py:${TAG_NAME}"
docker tag "${IMAGE_NAME}-py:${TAG_NAME}" "test_${IMAGE_ORG}/${IMAGE_NAME}-py:${TAG_NAME}"
docker push "${IMAGE_ORG}/${IMAGE_NAME}-py:${TAG_NAME}"
docker tag "${IMAGE_NAME}-py:${TAG_NAME}" "${IMAGE_ORG}/${IMAGE_NAME}-py:${ALT_TAG_NAME}"
docker push "${IMAGE_ORG}/${IMAGE_NAME}-py:${ALT_TAG_NAME}"

# R image push
# docker tag "${IMAGE_NAME}-r:${TAG_NAME}" "${IMAGE_ORG}/${IMAGE_NAME}-r:${TAG_NAME}"
# docker push "${IMAGE_ORG}/${IMAGE_NAME}-r:${TAG_NAME}"
# docker tag "${IMAGE_NAME}-r:${TAG_NAME}" "${IMAGE_ORG}/${IMAGE_NAME}-r:${ALT_TAG_NAME}"
# docker push "${IMAGE_ORG}/${IMAGE_NAME}-r:${ALT_TAG_NAME}"
7 changes: 6 additions & 1 deletion templates/vars.yml
@@ -1,4 +1,4 @@
self_version: 'v3'
self_version: 'v4'

versions:
- spark: ['3.1.3']
@@ -19,4 +19,9 @@ versions:
- spark: ['3.4.1']
java: ['8', '11']
hadoop: ['3.3.4']
scala: ['2.12', '2.13']

- spark: ['3.5.1']
java: ['8', '11', '17']
hadoop: ['3.3.4', '3.3.6']
scala: ['2.12', '2.13']
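As a sanity check, the new entry is a cross product over `java`, `hadoop`, and `scala`, so it should expand to 3 × 2 × 2 = 12 matrix entries — the same count as the Spark 3.5.1 blocks added to `ci.yml` in this PR. A minimal sketch of the expansion:

```shell
#!/usr/bin/env bash
# Sketch: enumerate the combinations the vars.yml entry above expands into,
# mirroring how the templated ci.yml matrix is generated.
javas=(8 11 17)
hadoops=(3.3.4 3.3.6)
scalas=(2.12 2.13)
count=0
for j in "${javas[@]}"; do
  for h in "${hadoops[@]}"; do
    for s in "${scalas[@]}"; do
      echo "spark=3.5.1 java=${j} hadoop=${h} scala=${s}"
      count=$((count + 1))
    done
  done
done
echo "${count}"   # -> 12
```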