feat(spark): integrate Spark operator in Kubeflow manifests #2889
Conversation
- Add Spark operator manifests for distributed Spark workloads. - Ensure integration with Kubeflow pipelines for seamless Spark job execution. Signed-off-by: Gezim Sejdiu <g.sejdiu@gmail.com>
Thank you so much for doing this @GezimSejdiu!
Given that Spark Operator is actually part of Kubeflow core components, we should install it as other Kubeflow apps: https://github.com/kubeflow/manifests/tree/master/apps.
/assign @ChenYi015 @yuchaoran2011 @vara-bonthu @ImpSy @jacobsalway
/cc @kubeflow/wg-manifests-leads |
@andreyvelich @GezimSejdiu is a colleague of mine ;-) Yes, we can add it to applications directly since you seem to be fine with it as owner of spark-operator https://github.com/kubeflow/spark-operator/blob/3acd0f1a900a933e8612c1b4af55d29b1112cbf1/OWNERS#L2 |
/ok-to-test |
Hey @andreyvelich , thank you. Good point. Thinking about it now, I still believe we can keep it in contrib in case we decide at some point to merge efforts with the Apache Spark Operator, but let us discuss that in another thread. @juliusvonkohout what do you think? Shall we move it into apps, or keep it (for now) under contrib? |
I am fine with starting in /contrib. If it works well, we can move to /apps. Especially if someone from the spark team wants to maintain it long-term here. |
Fix some issues with tests
I'm currently testing it locally and will keep working on it, but it seems the webhook isn't working properly; I get: 2024-10-14T11:17:20.496Z ERROR webhook/start.go:254 Failed to sync webhook secret {"error": "secrets \"spark-operator-webhook-certs\" is forbidden: User \"system:serviceaccount:kubeflow:spark-operator-webhook\" cannot get resource \"secrets\" in API group \"\" in the namespace \"default\""}
github.com/kubeflow/spark-operator/cmd/operator/webhook.start
/workspace/cmd/operator/webhook/start.go:254
github.com/kubeflow/spark-operator/cmd/operator/webhook.NewStartCommand.func2
/workspace/cmd/operator/webhook/start.go:128
github.com/spf13/cobra.(*Command).execute
/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:989
github.com/spf13/cobra.(*Command).ExecuteC
/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1117
github.com/spf13/cobra.(*Command).Execute
/go/pkg/mod/github.com/spf13/cobra@v1.8.1/command.go:1041
main.main
/workspace/cmd/main.go:27
runtime.main
/usr/local/go/src/runtime/proc.go:272
I think we will have to play with webhook configurations to make this work. @andreyvelich feel free to suggest any solution for this issue. |
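The log above shows the webhook ServiceAccount being denied `get` on its certs Secret in the `default` namespace. If the operator is meant to run entirely in the kubeflow namespace, a namespaced Role/RoleBinding there would look roughly like this (a sketch only; the resource and ServiceAccount names are taken from the log and not verified against the generated manifests):

```yaml
# Sketch: grant the webhook ServiceAccount access to its certs Secret
# in the kubeflow namespace. Verify names against resources.yaml.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-operator-webhook-secrets
  namespace: kubeflow
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-operator-webhook-secrets
  namespace: kubeflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: spark-operator-webhook-secrets
subjects:
  - kind: ServiceAccount
    name: spark-operator-webhook
    namespace: kubeflow
```

Note that if the operator is still configured to look for the Secret in `default`, RBAC alone will not fix it; the namespace configuration itself also has to point at kubeflow.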
Given that the Kubeflow community currently maintains this operator, I think we should just move it to apps, since it is part of the Kubeflow core components as described here: https://www.kubeflow.org/docs/started/architecture/#kubeflow-ecosystem Any concerns with that @ChenYi015 @yuchaoran2011 @vara-bonthu @ImpSy @jacobsalway? |
You might need a NetworkPolicy, as in https://github.com/kubeflow/manifests/blob/master/common/networkpolicies/base/training-operator-webhook.yaml, for port 9443 https://github.com/kubeflow/manifests/blob/a6f16d41230b43a375a2f95478519303ae02eef8/contrib/spark/spark-operator/base/resources.yaml#L23912 and I am wondering where the namespace "default" comes from. It should not be used in any way. |
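Modeled on the training-operator-webhook policy linked above, such a NetworkPolicy might look roughly like this (a sketch; the pod labels are assumptions and must match the actual operator deployment):

```yaml
# Sketch: allow ingress to the webhook pods on port 9443.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: spark-operator-webhook
  namespace: kubeflow
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: spark-operator  # assumed label; verify
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - port: 9443
          protocol: TCP
```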
I see for example https://github.com/kubeflow/manifests/blob/a6f16d41230b43a375a2f95478519303ae02eef8/contrib/spark/spark-operator/base/resources.yaml#L23191 and many other places with the namespace default. It needs to be adjusted to run in the namespace "kubeflow." Maybe you can just do so in the main kustomization.yaml with a single line. |
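The single-line fix suggested above could be sketched in the top-level kustomization.yaml like this; the `namespace` field rewrites the namespace of every namespaced resource kustomize emits (the resource file name is taken from the linked manifests):

```yaml
# Sketch: force all generated resources into the kubeflow namespace.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
resources:
  - resources.yaml
```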
- op: add
  path: /spec/template/spec/containers/0/securityContext
  value:
    runAsUser: 1000
The operator is based on the Spark image, which uses 185 as its non-root user.
Good catch. Let me then remove this as we have the non-root user by default on Spark.
Just runAsNonRoot: true is enough then.
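The reduced patch suggested here would then assert non-root without pinning a UID, letting the image's own user (185) take effect (a sketch of the same JSON patch shape as above):

```yaml
# Sketch: require non-root without overriding the image's UID.
- op: add
  path: /spec/template/spec/containers/0/securityContext
  value:
    runAsNonRoot: true
```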
Indeed, that was the issue. I made it work by specifying the namespace while generating the resources.yaml and also removing the default namespace from spark.jobNamespaces (which is the default in the Helm chart) to restrict it to only that namespace. It works now, but only without the webhook. When I enable the webhook I get this: Error from server (InternalError): error when creating "sparkapplication_example.yaml": Internal error occurred: failed calling webhook "mutate-sparkoperator-k8s-io-v1beta2-sparkapplication.sparkoperator.k8s.io": failed to call webhook: Post "https://spark-operator-webhook-svc.kubeflow.svc:9443/mutate-sparkoperator-k8s-io-v1beta2-sparkapplication?timeout=10s": dial tcp 10.96.184.97:9443: i/o timeout and when I set
I saw that you override the service, e.g. https://github.com/kubeflow/manifests/blob/master/apps/tensorboard/tensorboard-controller/upstream/webhook/service.yaml, and I tried to do the same, but it doesn't seem to work. Shall we consider disabling the webhook, or is it a must and we have to enforce such validation? If we can go ahead, I think this version is already working; I tested it locally. Let us also see / test the CI/CD. |
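Following the tensorboard-controller pattern linked above, a webhook Service override might be sketched like this (the selector labels are assumptions and must match the actual webhook pods; the Service name and port come from the error message in this thread):

```yaml
# Sketch: webhook Service override in the kubeflow namespace.
apiVersion: v1
kind: Service
metadata:
  name: spark-operator-webhook-svc
  namespace: kubeflow
spec:
  selector:
    app.kubernetes.io/name: spark-operator  # assumed label; verify
  ports:
    - port: 9443
      targetPort: 9443
      protocol: TCP
```

If the Service selector matches no pods, calls to it time out exactly as in the error above, so checking the labels is a reasonable first step.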
Signed-off-by: juliusvonkohout <45896133+juliusvonkohout@users.noreply.github.com>
/lgtm We might continue in follow up PRs. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: juliusvonkohout The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@GezimSejdiu @juliusvonkohout Can you create a tracking issue to move the Spark Operator from the contrib folder into Kubeflow apps? |
We can move it to /apps and add /contrib/apache-spark-operator for the non-Kubeflow Apache operator, similar to KServe. We want to evaluate whether it has advantages over the Kubeflow one. |
What is the motivation to install the apache-spark-operator, if the Kubeflow community currently maintains the Kubeflow Spark Operator and contributes features there? |
I guess right now we suggest that Kubeflow users use the Kubeflow Spark Operator, given the community support. |
Additionally, the Kubeflow Spark Operator will be included in the Kubeflow 1.10 release, which means it should be part of Kubeflow apps in the manifests repo. |
We are just not sure whether it also supports interactive Spark sessions live from JupyterLab, and persistent clusters, in addition to just running packaged Spark applications. |
Yes, that is possible with Kubeflow Spark Operator. @yuchaoran2011 and @vara-bonthu can share how that could be done. Also, @vikas-saxena02 has proposal to integrate it in Kubeflow Notebooks: https://docs.google.com/document/d/1Uvg3ykF7kIySVTY68xPKMEfy-tK-p_gaeXgidQaLCD4/edit?tab=t.0#heading=h.9jiz1e25qlob |
@GezimSejdiu are you implementing the support with Kubeflow Pipelines v1 or v2 or both? |
@juliusvonkohout I have been working on implementing the interactive support, but it seems like this will be a big one and will need support for Spark Connect to be implemented in the Spark Operator first. This may take some time but is definitely a work in progress. Also, interactive Spark jobs work well with Jupyter Enterprise Gateway or Apache Livy; the proposal doc has this updated. |
This PR introduces the integration of the Spark operator into the Kubeflow manifests, enabling users to run distributed Spark workloads directly within their Kubeflow environment. With this addition, Kubeflow users can leverage Spark for large-scale data processing as part of their machine learning (ML) pipelines, improving scalability and efficiency.
Key Changes:
Benefits:
How to Test:
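A minimal smoke test could submit the classic SparkPi example as a SparkApplication once the manifests are applied. The sketch below is an assumption-based example, not part of this PR: the image tag, Spark version, and driver ServiceAccount name must be adjusted to whatever these manifests actually pin.

```yaml
# Sketch: minimal SparkApplication smoke test (names/versions assumed).
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: kubeflow
spec:
  type: Scala
  mode: cluster
  image: spark:3.5.0  # assumed tag; match the operator's pinned version
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
  sparkVersion: 3.5.0
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark-operator-spark  # assumed SA name; verify
  executor:
    instances: 1
    cores: 1
    memory: 512m
```

After applying it, `kubectl get sparkapplication spark-pi -n kubeflow` should eventually show a COMPLETED state if the operator and (optionally) the webhook are healthy.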