
feat(spark): integrate Spark operator in Kubeflow manifests #2889

Merged
merged 21 commits, Oct 15, 2024

Conversation

GezimSejdiu
Contributor

This PR introduces the integration of the Spark operator into the Kubeflow manifests, enabling users to run distributed Spark workloads directly within their Kubeflow environment. With this addition, Kubeflow users can leverage Spark for large-scale data processing as part of their machine learning (ML) pipelines, improving scalability and efficiency.

Key Changes:

  • Added Spark operator manifests to support Spark job execution.
  • Ensured integration of the Spark operator with Kubeflow pipelines, allowing users to submit and monitor Spark jobs within the Kubeflow UI.
  • Updated the documentation to include instructions on how to deploy and use the Spark operator within Kubeflow.
  • Configured role-based access control (RBAC) to secure Spark job submissions and resource usage.

Benefits:

  • Enables distributed data processing using Spark directly within Kubeflow.
  • Simplifies the management and orchestration of Spark jobs as part of ML workflows.
  • Provides an easy-to-use interface to monitor and manage Spark workloads.

How to Test:

  1. Deploy the updated Kubeflow manifests, including the Spark operator.
  2. Submit a sample Spark job via the Kubeflow pipeline UI.
  3. Verify the job execution, logs, and results within the Spark UI and Kubeflow pipeline dashboard.
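For step 2, a minimal SparkApplication manifest could look like the sketch below; the image tag, jar path, and service account name are assumptions based on the upstream spark-operator examples and should be adjusted to the deployed version:

```yaml
# Hypothetical sample job for smoke-testing the operator.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: kubeflow
spec:
  type: Scala
  mode: cluster
  image: spark:3.5.0                       # assumed image tag
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark-operator-spark   # assumed service account name
  executor:
    instances: 1
    cores: 1
    memory: 512m
```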

- Add Spark operator manifests for distributed Spark workloads.
- Ensure integration with Kubeflow pipelines for seamless Spark job execution.

Signed-off-by: Gezim Sejdiu <g.sejdiu@gmail.com>
Member

@andreyvelich andreyvelich left a comment


Thank you so much for doing this @GezimSejdiu!
Given that the Spark Operator is actually part of the Kubeflow core components, we should install it like the other Kubeflow apps: https://github.com/kubeflow/manifests/tree/master/apps.

/assign @ChenYi015 @yuchaoran2011 @vara-bonthu @ImpSy @jacobsalway

@andreyvelich
Member

/cc @kubeflow/wg-manifests-leads

@google-oss-prow bot requested a review from a team, October 11, 2024 15:09
@juliusvonkohout
Member

@andreyvelich @GezimSejdiu is a colleague of mine ;-) Yes, we can add it to applications directly, since you seem to be fine with it as an owner of spark-operator: https://github.com/kubeflow/spark-operator/blob/3acd0f1a900a933e8612c1b4af55d29b1112cbf1/OWNERS#L2

@juliusvonkohout
Member

/ok-to-test

@GezimSejdiu
Contributor Author

> Thank you so much for doing this @GezimSejdiu! Given that the Spark Operator is actually part of the Kubeflow core components, we should install it like the other Kubeflow apps: https://github.com/kubeflow/manifests/tree/master/apps.
>
> /assign @ChenYi015 @yuchaoran2011 @vara-bonthu @ImpSy @jacobsalway

Hey @andreyvelich ,

thank you. Good point. Thinking about it now, I still believe we can keep it in contrib in case we decide at some point to merge efforts with the Apache Spark Operator, but let us discuss that in another thread. @juliusvonkohout, what do you think? Shall we move it into apps, or keep it (for now) under contrib?

Signed-off-by: Gezim Sejdiu <g.sejdiu@gmail.com>
@juliusvonkohout
Member

juliusvonkohout commented Oct 14, 2024

I am fine with starting in /contrib. If it works well, we can move to /apps. Especially if someone from the spark team wants to maintain it long-term here.

Fix some issues with tests.

Signed-off-by: Gezim Sejdiu <g.sejdiu@gmail.com>
@GezimSejdiu
Contributor Author

I'm currently testing it locally and will keep working on it to make it work, but it seems the webhook isn't working properly; I get:

2024-10-14T11:17:20.496Z	ERROR	webhook/start.go:254	Failed to sync webhook secret	{"error": "secrets \"spark-operator-webhook-certs\" is forbidden: User \"system:serviceaccount:kubeflow:spark-operator-webhook\" cannot get resource \"secrets\" in API group \"\" in the namespace \"default\""}
github.com/kubeflow/spark-operator/cmd/operator/webhook.start
	/workspace/cmd/operator/webhook/start.go:254
github.com/kubeflow/spark-operator/cmd/operator/webhook.NewStartCommand.func2
	/workspace/cmd/operator/webhook/start.go:128
github.com/spf13/cobra.(*Command).execute
	/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:989
github.com/spf13/cobra.(*Command).ExecuteC
	/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1117
github.com/spf13/cobra.(*Command).Execute
	/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1041
main.main
	/workspace/cmd/main.go:27
runtime.main
	/usr/local/go/src/runtime/proc.go:272

I think we will have to play with webhook configurations to make this work.

@andreyvelich feel free to suggest any solution for this issue.

@andreyvelich
Member

> Thank you so much for doing this @GezimSejdiu! Given that the Spark Operator is actually part of the Kubeflow core components, we should install it like the other Kubeflow apps: https://github.com/kubeflow/manifests/tree/master/apps.
> /assign @ChenYi015 @yuchaoran2011 @vara-bonthu @ImpSy @jacobsalway

> Hey @andreyvelich,
>
> thank you. Good point. Thinking about it now, I still believe we can keep it in contrib in case we decide at some point to merge efforts with the Apache Spark Operator, but let us discuss that in another thread. @juliusvonkohout, what do you think? Shall we move it into apps, or keep it (for now) under contrib?

Given that Kubeflow community currently maintain this operator, I think we should just move it to apps since it is part of Kubeflow core components as described here: https://www.kubeflow.org/docs/started/architecture/#kubeflow-ecosystem
Furthermore, we have plans to release Kubeflow Spark Operator as part of Kubeflow 1.10 release.

Any concerns with that @ChenYi015 @yuchaoran2011 @vara-bonthu @ImpSy @jacobsalway?

@juliusvonkohout
Member

juliusvonkohout commented Oct 14, 2024

> I'm currently testing it locally and will keep working on it to make it work, but it seems the webhook isn't working properly; I get:
>
> 2024-10-14T11:17:20.496Z	ERROR	webhook/start.go:254	Failed to sync webhook secret	{"error": "secrets \"spark-operator-webhook-certs\" is forbidden: User \"system:serviceaccount:kubeflow:spark-operator-webhook\" cannot get resource \"secrets\" in API group \"\" in the namespace \"default\""}
>
> I think we will have to play with webhook configurations to make this work.
>
> @andreyvelich feel free to suggest any solution for this issue.

You might need a NetworkPolicy, as in https://github.com/kubeflow/manifests/blob/master/common/networkpolicies/base/training-operator-webhook.yaml, for port 9443 (https://github.com/kubeflow/manifests/blob/a6f16d41230b43a375a2f95478519303ae02eef8/contrib/spark/spark-operator/base/resources.yaml#L23912), and I am wondering where the namespace "default" comes from. It should not be used in any way.
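A NetworkPolicy along those lines might look like the following sketch; the pod selector label is an assumption and must match the labels the spark-operator deployment actually sets:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: spark-operator-webhook
  namespace: kubeflow
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: spark-operator   # assumed label; verify against the deployment
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - port: 9443   # the webhook port referenced above
          protocol: TCP
```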

@juliusvonkohout
Member

juliusvonkohout commented Oct 14, 2024

I see, for example, https://github.com/kubeflow/manifests/blob/a6f16d41230b43a375a2f95478519303ae02eef8/contrib/spark/spark-operator/base/resources.yaml#L23191 and many other places with the namespace "default". It needs to be adjusted to run in the namespace "kubeflow". Maybe you can just do so in the main kustomization.yaml with a single line.
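The single-line fix hinted at here is kustomize's namespace transformer, which rewrites the namespace of every rendered resource; a sketch (the resource filename follows this PR's layout):

```yaml
# kustomization.yaml (sketch): `namespace` overrides the namespace
# on all resources emitted by this kustomization.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
resources:
  - resources.yaml
```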

- op: add
  path: /spec/template/spec/containers/0/securityContext
  value:
    runAsUser: 1000

The operator is based on the Spark image, which uses 185 as the non-root user.

Contributor Author


Good catch. Let me then remove this as we have the non-root user by default on Spark.

Member


Just `runAsNonRoot: true` is enough then.
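Following that suggestion, the JSON patch would reduce to something like this sketch:

```yaml
- op: add
  path: /spec/template/spec/containers/0/securityContext
  value:
    runAsNonRoot: true   # no fixed UID; the image's non-root user (185) is kept
```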

Signed-off-by: Gezim Sejdiu <g.sejdiu@gmail.com>
@GezimSejdiu
Contributor Author

> I see, for example, https://github.com/kubeflow/manifests/blob/a6f16d41230b43a375a2f95478519303ae02eef8/contrib/spark/spark-operator/base/resources.yaml#L23191 and many other places with the namespace "default". It needs to be adjusted to run in the namespace "kubeflow". Maybe you can just do so in the main kustomization.yaml with a single line.

Indeed, that was the issue. I made it work by specifying the namespace while generating the resources.yaml, and also by removing the default namespace from spark.jobNamespaces (which is the default in the Helm chart) to restrict the operator to only that namespace.

It works now, but only without the webhook. When I enable the webhook I get this:

Error from server (InternalError): error when creating "sparkapplication_example.yaml": Internal error occurred: failed calling webhook "mutate-sparkoperator-k8s-io-v1beta2-sparkapplication.sparkoperator.k8s.io": failed to call webhook: Post "https://spark-operator-webhook-svc.kubeflow.svc:9443/mutate-sparkoperator-k8s-io-v1beta2-sparkapplication?timeout=10s": dial tcp 10.96.184.97:9443: i/o timeout

and when I set webhook.port=443 it gives me:

Error from server (InternalError): error when creating "sparkapplication_example.yaml": Internal error occurred: failed calling webhook "mutate-sparkoperator-k8s-io-v1beta2-sparkapplication.sparkoperator.k8s.io": failed to call webhook: Post "https://spark-operator-webhook-svc.kubeflow.svc:443/mutate-sparkoperator-k8s-io-v1beta2-sparkapplication?timeout=10s": dial tcp 10.96.125.227:443: connect: connection refused

I saw that you override the service, e.g. https://github.com/kubeflow/manifests/blob/master/apps/tensorboard/tensorboard-controller/upstream/webhook/service.yaml, and I tried to do the same, but it doesn't seem to work.

Shall we consider disabling the webhook, or is it a must so that we have to enforce such validation? If we can go ahead, I think this version is already working; I tested it locally. Let us also watch the CI/CD tests.
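For reference, the two changes described above can be expressed as Helm values when regenerating resources.yaml. This is a sketch: the value names follow the kubeflow/spark-operator chart and should be verified against the chart version in use.

```yaml
# values fragment (sketch) for `helm template` when generating resources.yaml
spark:
  jobNamespaces:
    - kubeflow        # watch only the kubeflow namespace (drop the "default" entry)
webhook:
  enable: false       # disabled for now, pending the discussion above
```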

Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>
@juliusvonkohout
Member

/lgtm
/approve

We might continue in follow-up PRs.


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juliusvonkohout

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow bot merged commit c2ad9e6 into kubeflow:master on Oct 15, 2024
9 checks passed
@andreyvelich
Member

@GezimSejdiu @juliusvonkohout Can you create a tracking issue to move the Spark Operator from the contrib folder into Kubeflow apps?

@juliusvonkohout
Member

We can move it to /apps and add /contrib/apache-spark-operator for the non-Kubeflow Apache operator, similar to KServe. We want to evaluate whether it has advantages over the Kubeflow one.

@andreyvelich
Member

What is the motivation to install apache-spark-operator, if right now the Kubeflow community maintains the Kubeflow Spark Operator and contributes features there?

@andreyvelich
Member

I guess right now we suggest that Kubeflow users use the Kubeflow Spark Operator, given the community support.

@andreyvelich
Member

Additionally, the Kubeflow Spark Operator will be included in the Kubeflow 1.10 release, which means it should be part of the Kubeflow apps in the manifests repo.
cc @rimolive

@juliusvonkohout
Member

> What is the motivation to install apache-spark-operator, if right now the Kubeflow community maintains the Kubeflow Spark Operator and contributes features there?

We are just not sure whether it also supports interactive Spark sessions live from a JupyterLab, and persistent clusters, in addition to just running packaged Spark applications.

@andreyvelich
Member

> We are just not sure whether it also supports interactive Spark sessions live from a JupyterLab, and persistent clusters, in addition to just running packaged Spark applications.

Yes, that is possible with the Kubeflow Spark Operator. @yuchaoran2011 and @vara-bonthu can share how that could be done.
We also have an open issue from an end user about it: kubeflow/spark-operator#2180

Also, @vikas-saxena02 has proposal to integrate it in Kubeflow Notebooks: https://docs.google.com/document/d/1Uvg3ykF7kIySVTY68xPKMEfy-tK-p_gaeXgidQaLCD4/edit?tab=t.0#heading=h.9jiz1e25qlob

@vikas-saxena02

@GezimSejdiu are you implementing the support with Kubeflow Pipelines v1 or v2 or both?

@vikas-saxena02

> What is the motivation to install apache-spark-operator, if right now the Kubeflow community maintains the Kubeflow Spark Operator and contributes features there?

> We are just not sure whether it also supports interactive Spark sessions live from a JupyterLab, and persistent clusters, in addition to just running packaged Spark applications.

@juliusvonkohout I have been working on implementing the interactive support, but it seems like this will be a big one and will need Spark Connect support to be implemented in the Spark Operator first. This may take some time but is definitely a work in progress. Also, interactive Spark jobs work well with Jupyter Enterprise Gateway or Apache Livy; the proposal doc has been updated with this.
