
feat(spark): integrate Spark operator in Kubeflow manifests #2889

Merged
merged 21 commits, Oct 15, 2024

Conversation

GezimSejdiu
Contributor

This PR introduces the integration of the Spark operator into the Kubeflow manifests, enabling users to run distributed Spark workloads directly within their Kubeflow environment. With this addition, Kubeflow users can leverage Spark for large-scale data processing as part of their machine learning (ML) pipelines, improving scalability and efficiency.

Key Changes:

  • Added Spark operator manifests to support Spark job execution.
  • Ensured integration of the Spark operator with Kubeflow pipelines, allowing users to submit and monitor Spark jobs within the Kubeflow UI.
  • Updated the documentation to include instructions on how to deploy and use the Spark operator within Kubeflow.
  • Configured role-based access control (RBAC) to secure Spark job submissions and resource usage.

Benefits:

  • Enables distributed data processing using Spark directly within Kubeflow.
  • Simplifies the management and orchestration of Spark jobs as part of ML workflows.
  • Provides an easy-to-use interface to monitor and manage Spark workloads.

How to Test:

  1. Deploy the updated Kubeflow manifests, including the Spark operator.
  2. Submit a sample Spark job via the Kubeflow pipeline UI.
  3. Verify the job execution, logs, and results within the Spark UI and Kubeflow pipeline dashboard.
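For step 2, a minimal SparkApplication manifest could look like the sketch below; the image tag, jar path, and service account name are assumptions based on the upstream spark-operator examples and should be adjusted to the deployed version:

```yaml
# Hypothetical sample job for smoke-testing the operator.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: kubeflow
spec:
  type: Scala
  mode: cluster
  image: spark:3.5.0                       # assumed image tag
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark-operator-spark   # assumed service account name
  executor:
    instances: 1
    cores: 1
    memory: 512m
```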

- Add Spark operator manifests for distributed Spark workloads.
- Ensure integration with Kubeflow pipelines for seamless Spark job execution.

Signed-off-by: Gezim Sejdiu <g.sejdiu@gmail.com>
Member

@andreyvelich andreyvelich left a comment


Thank you so much for doing this @GezimSejdiu!
Given that the Spark Operator is actually part of the Kubeflow core components, we should install it like the other Kubeflow apps: https://github.com/kubeflow/manifests/tree/master/apps.

/assign @ChenYi015 @yuchaoran2011 @vara-bonthu @ImpSy @jacobsalway

@andreyvelich
Member

/cc @kubeflow/wg-manifests-leads

@google-oss-prow bot requested a review from a team, October 11, 2024 15:09
@juliusvonkohout
Member

@andreyvelich @GezimSejdiu is a colleague of mine ;-) Yes, we can add it to applications directly, since you seem to be fine with it as an owner of spark-operator: https://github.com/kubeflow/spark-operator/blob/3acd0f1a900a933e8612c1b4af55d29b1112cbf1/OWNERS#L2

@juliusvonkohout
Member

/ok-to-test

@GezimSejdiu
Contributor Author

> Thank you so much for doing this @GezimSejdiu! Given that the Spark Operator is actually part of the Kubeflow core components, we should install it like the other Kubeflow apps: https://github.com/kubeflow/manifests/tree/master/apps.
>
> /assign @ChenYi015 @yuchaoran2011 @vara-bonthu @ImpSy @jacobsalway

Hey @andreyvelich ,

thank you. Good point. Thinking about it now, I still believe we can keep it in contrib in case we decide at some point to merge efforts with the Apache Spark Operator, but let us discuss that in another thread. @juliusvonkohout, what do you think? Shall we move it into apps, or keep it (for now) under contrib?

Signed-off-by: Gezim Sejdiu <g.sejdiu@gmail.com>
@juliusvonkohout
Member

juliusvonkohout commented Oct 14, 2024

I am fine with starting in /contrib. If it works well, we can move to /apps. Especially if someone from the spark team wants to maintain it long-term here.

Fix some issues with tests.

Signed-off-by: Gezim Sejdiu <g.sejdiu@gmail.com>
@GezimSejdiu
Contributor Author

I'm currently testing it locally and will keep working on it to make it work, but it seems the webhook isn't working properly; I get:

2024-10-14T11:17:20.496Z	ERROR	webhook/start.go:254	Failed to sync webhook secret	{"error": "secrets \"spark-operator-webhook-certs\" is forbidden: User \"system:serviceaccount:kubeflow:spark-operator-webhook\" cannot get resource \"secrets\" in API group \"\" in the namespace \"default\""}
github.com/kubeflow/spark-operator/cmd/operator/webhook.start
	/workspace/cmd/operator/webhook/start.go:254
github.com/kubeflow/spark-operator/cmd/operator/webhook.NewStartCommand.func2
	/workspace/cmd/operator/webhook/start.go:128
github.com/spf13/cobra.(*Command).execute
	/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:989
github.com/spf13/cobra.(*Command).ExecuteC
	/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1117
github.com/spf13/cobra.(*Command).Execute
	/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1041
main.main
	/workspace/cmd/main.go:27
runtime.main
	/usr/local/go/src/runtime/proc.go:272

I think we will have to play with webhook configurations to make this work.

@andreyvelich feel free to suggest any solution for this issue.

@andreyvelich
Member

> Thank you so much for doing this @GezimSejdiu! Given that the Spark Operator is actually part of the Kubeflow core components, we should install it like the other Kubeflow apps: https://github.com/kubeflow/manifests/tree/master/apps.
> /assign @ChenYi015 @yuchaoran2011 @vara-bonthu @ImpSy @jacobsalway

> Hey @andreyvelich,
>
> thank you. Good point. Thinking about it now, I still believe we can keep it in contrib in case we decide at some point to merge efforts with the Apache Spark Operator, but let us discuss that in another thread. @juliusvonkohout, what do you think? Shall we move it into apps, or keep it (for now) under contrib?

Given that Kubeflow community currently maintain this operator, I think we should just move it to apps since it is part of Kubeflow core components as described here: https://www.kubeflow.org/docs/started/architecture/#kubeflow-ecosystem
Furthermore, we have plans to release Kubeflow Spark Operator as part of Kubeflow 1.10 release.

Any concerns with that @ChenYi015 @yuchaoran2011 @vara-bonthu @ImpSy @jacobsalway?

@juliusvonkohout
Member

juliusvonkohout commented Oct 14, 2024

> I'm currently testing it locally and will keep working on it to make it work, but it seems the webhook isn't working properly; I get:
>
> 2024-10-14T11:17:20.496Z	ERROR	webhook/start.go:254	Failed to sync webhook secret	{"error": "secrets \"spark-operator-webhook-certs\" is forbidden: User \"system:serviceaccount:kubeflow:spark-operator-webhook\" cannot get resource \"secrets\" in API group \"\" in the namespace \"default\""}
>
> I think we will have to play with webhook configurations to make this work.
>
> @andreyvelich feel free to suggest any solution for this issue.

You might need a NetworkPolicy, as in https://github.com/kubeflow/manifests/blob/master/common/networkpolicies/base/training-operator-webhook.yaml, for port 9443 (https://github.com/kubeflow/manifests/blob/a6f16d41230b43a375a2f95478519303ae02eef8/contrib/spark/spark-operator/base/resources.yaml#L23912), and I am wondering where the namespace "default" comes from. It should not be used in any way.
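A NetworkPolicy along those lines might look like the following sketch; the pod selector label is an assumption and must match the labels the spark-operator deployment actually sets:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: spark-operator-webhook
  namespace: kubeflow
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: spark-operator   # assumed label; verify against the deployment
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - port: 9443   # the webhook port referenced above
          protocol: TCP
```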

@juliusvonkohout
Member

juliusvonkohout commented Oct 14, 2024

I see, for example, https://github.com/kubeflow/manifests/blob/a6f16d41230b43a375a2f95478519303ae02eef8/contrib/spark/spark-operator/base/resources.yaml#L23191 and many other places with the namespace "default". It needs to be adjusted to run in the namespace "kubeflow". Maybe you can just do so in the main kustomization.yaml with a single line.
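The single-line fix hinted at here is kustomize's namespace transformer, which rewrites the namespace of every rendered resource; a sketch (the resource filename follows this PR's layout):

```yaml
# kustomization.yaml (sketch): `namespace` overrides the namespace
# on all resources emitted by this kustomization.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
resources:
  - resources.yaml
```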

- op: add
  path: /spec/template/spec/containers/0/securityContext
  value:
    runAsUser: 1000

The operator is based on the Spark image, which uses 185 as the non-root user.

Contributor Author


Good catch. Let me then remove this as we have the non-root user by default on Spark.

Member


Just `runAsNonRoot: true` is enough then.
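Following that suggestion, the JSON patch would reduce to something like this sketch:

```yaml
- op: add
  path: /spec/template/spec/containers/0/securityContext
  value:
    runAsNonRoot: true   # no fixed UID; the image's non-root user (185) is kept
```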

Signed-off-by: Gezim Sejdiu <g.sejdiu@gmail.com>
@GezimSejdiu
Contributor Author

> I see, for example, https://github.com/kubeflow/manifests/blob/a6f16d41230b43a375a2f95478519303ae02eef8/contrib/spark/spark-operator/base/resources.yaml#L23191 and many other places with the namespace "default". It needs to be adjusted to run in the namespace "kubeflow". Maybe you can just do so in the main kustomization.yaml with a single line.

Indeed, that was the issue. I made it work by specifying the namespace while generating the resources.yaml, and also by removing the default namespace from spark.jobNamespaces (which is the default in the Helm chart) to restrict the operator to only that namespace.

It works now, but only without the webhook. When I enable the webhook I get this:

Error from server (InternalError): error when creating "sparkapplication_example.yaml": Internal error occurred: failed calling webhook "mutate-sparkoperator-k8s-io-v1beta2-sparkapplication.sparkoperator.k8s.io": failed to call webhook: Post "https://spark-operator-webhook-svc.kubeflow.svc:9443/mutate-sparkoperator-k8s-io-v1beta2-sparkapplication?timeout=10s": dial tcp 10.96.184.97:9443: i/o timeout

and when I set webhook.port=443 it gives me:

Error from server (InternalError): error when creating "sparkapplication_example.yaml": Internal error occurred: failed calling webhook "mutate-sparkoperator-k8s-io-v1beta2-sparkapplication.sparkoperator.k8s.io": failed to call webhook: Post "https://spark-operator-webhook-svc.kubeflow.svc:443/mutate-sparkoperator-k8s-io-v1beta2-sparkapplication?timeout=10s": dial tcp 10.96.125.227:443: connect: connection refused

I saw that you override the service, e.g. https://github.com/kubeflow/manifests/blob/master/apps/tensorboard/tensorboard-controller/upstream/webhook/service.yaml, and I tried to do the same, but it doesn't seem to work.

Shall we consider disabling the webhook, or is it a must so that we have to enforce such validation? If we can go ahead, I think this version is already working; I tested it locally. Let us also watch the CI/CD tests.
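For reference, the two changes described above can be expressed as Helm values when regenerating resources.yaml. This is a sketch: the value names follow the kubeflow/spark-operator chart and should be verified against the chart version in use.

```yaml
# values fragment (sketch) for `helm template` when generating resources.yaml
spark:
  jobNamespaces:
    - kubeflow        # watch only the kubeflow namespace (drop the "default" entry)
webhook:
  enable: false       # disabled for now, pending the discussion above
```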

Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>
@juliusvonkohout
Member

/lgtm
/approve

We might continue in follow-up PRs.


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juliusvonkohout

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow bot merged commit c2ad9e6 into kubeflow:master on Oct 15, 2024
9 checks passed
@andreyvelich
Member

@GezimSejdiu @juliusvonkohout Can you create a tracking issue to move the Spark Operator from the contrib folder into Kubeflow apps?

@juliusvonkohout
Member

We can move it to /apps and add /contrib/apache-spark-operator for the non-Kubeflow Apache operator, similar to KServe. We want to evaluate whether it has advantages over the Kubeflow one.

@andreyvelich
Member

What is the motivation to install apache-spark-operator, if right now the Kubeflow community maintains the Kubeflow Spark Operator and contributes features there?

@andreyvelich
Member

I guess right now we suggest that Kubeflow users use the Kubeflow Spark Operator, given the community support.

@andreyvelich
Member

Additionally, the Kubeflow Spark Operator will be included in the Kubeflow 1.10 release, which means it should be part of the Kubeflow apps in the manifests repo.
cc @rimolive

@juliusvonkohout
Member

> What is the motivation to install apache-spark-operator, if right now the Kubeflow community maintains the Kubeflow Spark Operator and contributes features there?

We are just not sure whether it also supports interactive Spark sessions live from a JupyterLab, and persistent clusters, in addition to just running packaged Spark applications.

@andreyvelich
Member

> We are just not sure whether it also supports interactive Spark sessions live from a JupyterLab, and persistent clusters, in addition to just running packaged Spark applications.

Yes, that is possible with the Kubeflow Spark Operator. @yuchaoran2011 and @vara-bonthu can share how that could be done.
We also have an open issue from an end user about it: kubeflow/spark-operator#2180

Also, @vikas-saxena02 has proposal to integrate it in Kubeflow Notebooks: https://docs.google.com/document/d/1Uvg3ykF7kIySVTY68xPKMEfy-tK-p_gaeXgidQaLCD4/edit?tab=t.0#heading=h.9jiz1e25qlob

@vikas-saxena02

@GezimSejdiu are you implementing the support with Kubeflow Pipelines v1 or v2 or both?

@vikas-saxena02

> What is the motivation to install apache-spark-operator, if right now the Kubeflow community maintains the Kubeflow Spark Operator and contributes features there?

> We are just not sure whether it also supports interactive Spark sessions live from a JupyterLab, and persistent clusters, in addition to just running packaged Spark applications.

@juliusvonkohout I have been working on implementing the interactive support, but it seems like this will be a big one and will need Spark Connect support to be implemented in the Spark Operator first. This may take some time but is definitely a work in progress. Also, interactive Spark jobs work well with Jupyter Enterprise Gateway or Apache Livy; the proposal doc has been updated with this.
