
[SPARK-48664] Add Apache Spark 4.0.0-preview1 Dockerfiles #61

Merged · 4 commits · Jun 19, 2024

Conversation

@cloud-fan (Contributor)

What changes were proposed in this pull request?

Add Dockerfiles for Apache Spark 4.0.0-preview1.

Why are the changes needed?

This is needed for the new 4.0.0-preview1 release.

Does this PR introduce any user-facing change?

Yes, it adds new Docker images for the 4.0.0-preview1 release.

How was this patch tested?

New tests.

@cloud-fan (Contributor Author)

cc @Yikun @yaooqinn

@yaooqinn (Member)

Thank you @cloud-fan for publishing the Docker image. Can we make the CI pass?

@cloud-fan (Contributor Author)

27.02 gpg: Can't check signature: No public key

I have no idea why it fails...

@cloud-fan (Contributor Author)

I just followed #33

wget -nv -O spark.tgz "$SPARK_TGZ_URL"; \
wget -nv -O spark.tgz.asc "$SPARK_TGZ_ASC_URL"; \
export GNUPGHOME="$(mktemp -d)"; \
wget -nv -O KEYS https://downloads.apache.org/spark/KEYS; \
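
For context, the verification step that follows these lines in the Dockerfile looks roughly like this (a sketch; the exact flags in the generated Dockerfile may differ):

# Import the Spark release signing keys and check the tarball signature.
gpg --import KEYS; \
gpg --batch --verify spark.tgz.asc spark.tgz; \
# Clean up the temporary GnuPG home and the downloaded key file.
gpgconf --kill all; \
rm -rf "$GNUPGHOME" spark.tgz.asc KEYS; \
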
@cloud-fan (Contributor Author)

@yaooqinn do you know why other Spark versions' dockerfiles are fine without this change?

(Member)

I'm not sure (it might be because your key is new and theirs are cached).

(Member)

See related: #54 (comment)

We need to upload the key to https://keys.openpgp.org/upload
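
As a rough sketch (the key ID below is a placeholder), the key can also be published from the command line instead of the web form:

# Publish the release manager's signing key so gpg --verify can resolve it.
# <KEYID> is a placeholder for the actual key fingerprint.
gpg --keyserver hkps://keys.openpgp.org --send-keys <KEYID>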

@cloud-fan (Contributor Author)

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "/opt/entrypoint.sh": permission denied: unknown.

Let me check the file permissions.
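
If the script lost its execute bit when copied into the image, the usual fix in the Dockerfile looks roughly like this (a sketch, not necessarily the exact change made here):

# Make sure the entrypoint script is executable inside the image.
COPY entrypoint.sh /opt/entrypoint.sh
RUN chmod a+x /opt/entrypoint.sh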

@yaooqinn (Member)

It seems that we need to bump up minikube to fix the integration tests, like apache/spark#44813.
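
A rough sketch of what bumping the pinned minikube in the CI workflow could look like (the version number is illustrative, not taken from this PR):

# Install a pinned minikube release instead of "latest" (v1.33.1 is illustrative).
curl -LO https://storage.googleapis.com/minikube/releases/v1.33.1/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube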

Comment on lines 263 to 264
# curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
# TODO(SPARK-44495): Resume to use the latest minikube for k8s-integration-tests.
(Member)

We can remove these comments

@yaooqinn (Member) left a comment

LGTM, pending CIs

@yaooqinn changed the title from "Add Apache Spark 4.0.0-preview1 Dockerfiles" to "[SPARK-48664] Add Apache Spark 4.0.0-preview1 Dockerfiles" on Jun 19, 2024
@cloud-fan (Contributor Author)

The merge script didn't work for me, so I'll just use the merge PR button in GitHub.

@cloud-fan merged commit 300ae79 into apache:master on Jun 19, 2024
24 checks passed
@cloud-fan (Contributor Author)

@Yikun I think we need to update add-dockerfiles.sh. Spark 4.x uses Scala 2.13 now.
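
Once the script supports the preview release, regenerating the Dockerfiles is roughly:

# Regenerate the 4.0.0-preview1 Dockerfiles from the templates.
./add-dockerfiles.sh 4.0.0-preview1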

"4.0.0-preview1-scala2.13-java17-python3-ubuntu",
"4.0.0-preview1-java17-python3",
"4.0.0-preview1-java17",
"python3-java17"
@cloud-fan (Contributor Author)

In hindsight, I think we should have a stricter naming rule for tags. For example, python3-java17 is a bad name, as we can't even tell which Spark version it is.

I'd propose: spark_version-scala_version-java_version-[python support]-[r support]-OS

If the Spark version only supports one Scala or Java version, we can omit the Scala/Java version from the name.

What do you think? @Yikun @yaooqinn
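
For illustration, applied to this preview release the proposed rule would give tags like (hypothetical names; the OS suffix is kept here for completeness):

    4.0.0-preview1-scala2.13-java17-python3-r-ubuntu   # all-in-one
    4.0.0-preview1-scala2.13-java17-python3-ubuntu     # Python, no R
    4.0.0-preview1-scala2.13-java17-ubuntu             # Scala/Java only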

(Member)

+1 for the name.

Is -OS necessary for a JVM app or not? If it is, do the OS versions or codenames like Focal Fossa also matter?

@cloud-fan (Contributor Author)

Ah, we can omit the OS name as well. If one day we publish Docker images with a different OS, we can add the OS back to the tag name.

@Yikun (Member), Jun 20, 2024

There are some new cases in 4.0.0, so we can't just take the 3.5 java17 version as a reference:

  • java17 as the default version
  • scala2.13 as the default version

Per the version rule in the original design, there are 4 types of special versions:

latest: the latest version of the current major.minor version
3.3.0: the 3.3.0 version with default scala/python3
python3/r/scala: the latest python3/r/scala version
3.3.0-r/3.3.0-scala/3.3.0-python3: the 3.3.0 version with r/scala only/python3

If we publish 4.0.0 someday, it should be:

- 4.0.0-scala2.13-java17-python3-ubuntu:

    "4.0.0-scala2.13-java17-python3-ubuntu"
    "4.0.0-python3",
    "4.0.0",                 // for user don't care spark pyspark/java/scala version
    "python3",           // for pyspark user but don't care spark java/scala version
    "latest"                // for user but don't care spark pyspark/java/scala version

- 4.0.0-scala2.13-java17-r-ubuntu

    "4.0.0-scala2.13-java17-r-ubuntu"
    "4.0.0-r",              // for sparkr 4.0.0 user but don't care spark java/scala version
    "r",                       // for sparkr user but don't care spark java/scala version

- 4.0.0-scala2.13-java17-ubuntu

    "4.0.0-scala2.13-java17-r-ubuntu"
    "4.0.0-scala",        // for sparkr 4.0.0 scala user but don't care spark java/scala version
    "scala",                 // for sparkr scala user but don't care spark java/scala version

- 4.0.0-scala2.13-java17-python3-r-ubuntu

    "4.0.0-scala2.13-java17-python3-r-ubuntu"

So for 4.0.0-preview1 at this point:

    {
      "path": "4.0.0-preview1/scala2.13-java17-python3-ubuntu",
      "tags": [
        "4.0.0-preview1-scala2.13-java17-python3-ubuntu",
        "4.0.0-preview1-python3",
      ]
    },
    {
      "path": "4.0.0-preview1/scala2.13-java17-r-ubuntu",
      "tags": [
        "4.0.0-preview1-scala2.13-java17-r-ubuntu",
        "4.0.0-preview1-r"
      ]
    },
    {
      "path": "4.0.0-preview1/scala2.13-java17-ubuntu",
      "tags": [
        "4.0.0-preview1-scala2.13-java17-ubuntu",
        "4.0.0-preview1-scala"
      ]
    },
    {
      "path": "4.0.0-preview1/scala2.13-java17-python3-r-ubuntu",
      "tags": [
        "4.0.0-preview1-scala2.13-java17-python3-r-ubuntu"
      ]
    },

@cloud-fan (Contributor Author)

It's hard to believe people don't care about spark versions...

@Yikun (Member), Jun 20, 2024

So the idea is: should we remove the python3, scala, and r special tags?

original:

latest: the latest version of the current major.minor version
3.3.0: the 3.3.0 version with default scala/python3
python3/r/scala: the latest python3/r/scala version
3.3.0-r/3.3.0-scala/3.3.0-python3: the 3.3.0 version with r/scala only/python3

to-be:

latest: the latest version of the current major.minor version
3.3.0: the 3.3.0 version with default scala/python3
3.3.0-r/3.3.0-scala/3.3.0-python3: the 3.3.0 version with r/scala only/python3

It's hard to believe people don't care about spark versions...

  • IIRC, the background of python3/r/scala is that there were 3 images before (spark, spark-py, spark-r); then spark --> scala, spark-py --> python3, spark-r --> r.
  • BTW, spark:python3 and spark:r are also used by the "Run now" snippet on https://spark.apache.org/ (see the sketch below).
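
A minimal sketch of that kind of quick-start usage (the exact command shown on the website may differ):

    # Start a PySpark shell from the Python-enabled image (illustrative).
    docker run -it --rm apache/spark:python3 /opt/spark/bin/pyspark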

also cc @HyukjinKwon

@cloud-fan (Contributor Author)

I'm wondering if we should provide a single image with all the supported features (Python and R). I don't think image size is a big concern here. We can also provide variants with different Scala and Java versions.

@Yikun (Member), Jun 20, 2024

I'm wondering if we should provide a single image with all the supported features (Python and R). I don't think image size is a big concern here.

There was some discussion on it [1] (only about 150 MB more by adding r); 4.0.0-scala2.13-java17-python3-r-ubuntu is the all-in-one image.

[1] https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o/edit?disco=AAAAf2TyFr0

@cloud-fan (Contributor Author)

Yeah, the all-in-one image isn't that big. Why do we provide other variants that remove Python or R support?

(Member)

There are 2 reasons, I think:

  1. To follow the Kubernetes best practice of keeping image sizes as small as possible.
  2. As I said before, there were spark-py / spark-r / spark images before, so this aligns with the old-version images.

@yaooqinn (Member)

I have seen these tags present at https://hub.docker.com/r/apache/spark/tags but absent at https://hub.docker.com/_/spark

It seems that we need a PR like docker-library/official-images#16325 to publish the Docker Official Images.

@Yikun (Member) left a comment

So sorry for the late reply


Yikun added a commit that referenced this pull request Jun 24, 2024
### What changes were proposed in this pull request?
- Remove `wget -nv -O KEYS https://downloads.apache.org/spark/KEYS` and `gpg --import KEYS;`; it works but does not meet the [security concern](https://github.com/docker-library/official-images?tab=readme-ov-file#security) from DOI.
- Add `add-dockerfiles.sh 4.0.0-preview1` support to address #61 (comment).
- Fix `versions.json`: treat java17 as the default version, so we can remove the java17 tag.

### Why are the changes needed?
update 4.0.0-preview1

### Does this PR introduce _any_ user-facing change?
new release

### How was this patch tested?
`./add-dockerfiles.sh 4.0.0-preview1`, no diff.

Closes #63 from Yikun/4.0.0-preview1.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>