Enable istio support for ray #2847

Merged
Changes from 22 commits
Commits
24 commits
d042a00
enable istio support for the ray
hansinikarunarathne Aug 17, 2024
2931232
change the yaml file name
hansinikarunarathne Aug 17, 2024
bfafe2a
made changes in raycluster test
hansinikarunarathne Aug 22, 2024
adc3aec
made a fix
hansinikarunarathne Aug 22, 2024
c16a65a
Changed the namespace default to kubeflow in the headless service of …
hansinikarunarathne Aug 22, 2024
08dee52
add additional step
hansinikarunarathne Aug 22, 2024
6d20065
enable istio in kubeflow namespace
hansinikarunarathne Aug 22, 2024
95b3728
Change namespace to install istio
hansinikarunarathne Aug 22, 2024
ca5e2e8
fix an issue
hansinikarunarathne Aug 22, 2024
eb75bab
fix an issue
hansinikarunarathne Aug 22, 2024
6f9c4fa
fix folder structure issue
hansinikarunarathne Aug 22, 2024
1be65fe
enable istio label for kubeflow
hansinikarunarathne Aug 22, 2024
75be977
create kubeflow namespace
hansinikarunarathne Aug 22, 2024
83b829d
add pod watch
hansinikarunarathne Aug 22, 2024
29ca6bd
remove watch
hansinikarunarathne Aug 22, 2024
d406028
Add istio authorization policy for ray_cluster
hansinikarunarathne Sep 8, 2024
aac18ed
fix uncommented changes
hansinikarunarathne Sep 8, 2024
32d02bc
made namespace change
hansinikarunarathne Sep 9, 2024
98a70d8
Added user namespace and multitenancy
hansinikarunarathne Sep 9, 2024
7b92d8d
fixed an issue with test.sh
hansinikarunarathne Sep 9, 2024
847d453
Did the requested changes
hansinikarunarathne Sep 14, 2024
6c99f5d
Changed the sa to default-editor
hansinikarunarathne Sep 15, 2024
270269a
add sa to deployment
hansinikarunarathne Sep 15, 2024
4f875a3
Update the raycluster Readme file
hansinikarunarathne Sep 20, 2024
16 changes: 16 additions & 0 deletions .github/workflows/ray_test.yaml
@@ -1,5 +1,5 @@
name: Build & Apply Ray manifest in KinD
on:

(GitHub Actions format_YAML_files warning on .github/workflows/ray_test.yaml, 2:1 [truthy]: truthy value should be one of [false, true])
pull_request:
paths:
- tests/gh-actions/install_KinD_create_KinD_cluster_install_kustomize.sh
@@ -16,7 +16,23 @@
- name: Install KinD, Create KinD cluster and Install kustomize
run: ./tests/gh-actions/install_KinD_create_KinD_cluster_install_kustomize.sh

- name: Install Istio with external authentication
run: ./tests/gh-actions/install_istio_with_ext_auth.sh

- name: Install cert-manager
run: ./tests/gh-actions/install_cert_manager.sh

- name: Create kubeflow namespace
run: kustomize build common/kubeflow-namespace/base | kubectl apply -f -

- name: Install KF Multi Tenancy
run: ./tests/gh-actions/install_multi_tenancy.sh

- name: Create KF Profile
run: kustomize build common/user-namespace/base | kubectl apply -f -

- name: Build & Apply manifests
run: |
cd contrib/ray/
export KF_PROFILE=kubeflow-user-example-com
make test
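
The same prerequisites can be applied by hand when debugging this workflow locally. A rough sketch, assuming a cluster is already running and the repository scripts referenced above exist at those paths:

# Istio with external auth, cert-manager, the kubeflow namespace,
# multi-tenancy, and the example profile (same order as the workflow steps)
./tests/gh-actions/install_istio_with_ext_auth.sh
./tests/gh-actions/install_cert_manager.sh
kustomize build common/kubeflow-namespace/base | kubectl apply -f -
./tests/gh-actions/install_multi_tenancy.sh
kustomize build common/user-namespace/base | kubectl apply -f -

# Then build and apply the Ray manifests against the example profile namespace
cd contrib/ray/
export KF_PROFILE=kubeflow-user-example-com
make test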
3 changes: 1 addition & 2 deletions contrib/ray/Makefile
@@ -8,5 +8,4 @@ kuberay-operator/base:

.PHONY: test
test:
./test.sh

./test.sh ${KF_PROFILE}
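
Since the Makefile now forwards ${KF_PROFILE} as the first positional argument to test.sh, the profile namespace has to be provided when invoking the target. A likely invocation, matching the profile the CI workflow creates:

# KF_PROFILE becomes $1 inside test.sh
make test KF_PROFILE=kubeflow-user-example-com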
@@ -1,4 +1,3 @@
namespace: kubeflow
resources:
- ../../base
- namespace.yaml
- ../../base

This file was deleted.

211 changes: 115 additions & 96 deletions contrib/ray/raycluster_example.yaml
@@ -1,60 +1,96 @@
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: allow-ray-workers-head
spec:
action: ALLOW
rules:
- from:
- source:
principals:
- "cluster.local/ns/kubeflow-user-example-com/sa/default-editor"
- to:
- operation:
ports:
- "6379"
- "6380"
- "6381"
- "6382"
- "6383"
- "52365"
- "8080"
- "10012"
---
apiVersion: v1
kind: Service
metadata:
labels:
ray.io/headless-worker-svc: raycluster-istio
name: raycluster-istio-headless-svc
spec:
clusterIP: None
selector:
ray.io/cluster: kubeflow-raycluster
publishNotReadyAddresses: true
ports:
- name: node-manager-port
port: 6380
appProtocol: grpc
- name: object-manager-port
port: 6381
appProtocol: grpc
- name: runtime-env-agent-port
port: 6382
appProtocol: grpc
- name: dashboard-agent-grpc-port
port: 6383
appProtocol: grpc
- name: dashboard-agent-listen-port
port: 52365
appProtocol: http
- name: metrics-export-port
port: 8080
appProtocol: http
- name: max-worker-port
port: 10012
appProtocol: grpc
---
apiVersion: ray.io/v1
kind: RayCluster
metadata:
name: kubeflow-raycluster
spec:
rayVersion: '2.23.0'
# If `enableInTreeAutoscaling` is true, the Autoscaler sidecar will be added to the Ray head pod.
enableInTreeAutoscaling: true
# `autoscalerOptions` is an OPTIONAL field specifying configuration overrides for the Ray Autoscaler.
# The example configuration shown below represents the DEFAULT values.
# (You may delete autoscalerOptions if the defaults are suitable.)
autoscalerOptions:
# Default: Upscaling is not rate-limited. This mode adds new worker pods to handle increased workload as quickly as possible.
upscalingMode: Default
# `idleTimeoutSeconds` is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
idleTimeoutSeconds: 60
# Ray head pod configuration
headGroupSpec:
# Kubernetes Service Type.
serviceType: ClusterIP
# The following params are used to complete the ray start: ray start --head --block --dashboard-host: '0.0.0.0' ...
rayStartParams:
# Setting "num-cpus: 0" to avoid any Ray actors or tasks being scheduled on the Ray head Pod.
num-cpus: "0"
dashboard-host: '0.0.0.0'
block: 'true'
# pod template
num-cpus: '1'
node-manager-port: '6380'
object-manager-port: '6381'
runtime-env-agent-port: '6382'
dashboard-agent-grpc-port: '6383'
dashboard-agent-listen-port: '52365'
metrics-export-port: '8080'
max-worker-port: '10012'
node-ip-address: $(hostname -I | tr -d ' ' | sed 's/\./-/g').raycluster-istio-headless-svc.kubeflow-user-example-com.svc.cluster.local
template:
metadata:
# Custom labels. NOTE: To avoid conflicts with KubeRay operator, do not define custom labels start with `raycluster`.
# Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
# The ray head must not have an Istio sidecar
# TODO add an authorizationpolicy in the future for the ray head
labels:
sidecar.istio.io/inject: "false"
sidecar.istio.io/inject: "true"
spec:
containers:
- name: ray-head
image: rayproject/ray:2.23.0-py311-cpu
ports:
- containerPort: 6379
name: gcs
- containerPort: 8265
name: dashboard
- containerPort: 10001
name: client
lifecycle:
preStop:
exec:
command: ["/bin/sh","-c","ray stop"]
volumeMounts:
- mountPath: /tmp/ray
name: ray-logs
# The resource requests and limits in this config are too small for production!
# It is better to use a few large Ray pods than many small ones.
# For production, it is ideal to size each Ray pod to take up the
# entire Kubernetes node on which it is scheduled.
resources:
limits:
cpu: "1"
@@ -73,68 +109,51 @@ spec:
- name: ray-logs
emptyDir: {}
workerGroupSpecs:
# the pod replicas in this group typed worker
- replicas: 1
minReplicas: 1
maxReplicas: 10
# logical group name, for this called small-group, also can be functional
groupName: small-group
rayStartParams:
block: 'true'
#pod template
template:
metadata:
labels:
# Disable the sidecars for the ray workers
# TODO add an authorizationpolicy in the future for the ray worker
sidecar.istio.io/inject: "false"
spec:
containers:
- name: ray-worker
image: rayproject/ray:2.23.0-py311-cpu
lifecycle:
preStop:
exec:
command: ["/bin/sh","-c","ray stop"]
# use volumeMounts.Optional.
# Refer to https://kubernetes.io/docs/concepts/storage/volumes/
volumeMounts:
- mountPath: /tmp/ray
name: ray-logs
# The resource requests and limits in this config are too small for production!
# It is better to use a few large Ray pods than many small ones.
# For production, it is ideal to size each Ray pod to take up the
# entire Kubernetes node on which it is scheduled.
resources:
limits:
cpu: "1"
memory: "1G"
requests:
cpu: "300m"
memory: "1G"
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
initContainers:
# the env var $RAY_IP is set by the operator if missing, with the value of the head service name
- name: init
image: busybox:1.36
# Change the cluster postfix if you don't have a default setting
command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for K8s Service $RAY_IP; sleep 2; done"]
securityContext:
runAsUser: 1000
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
# use volumes
# Refer to https://kubernetes.io/docs/concepts/storage/volumes/
volumes:
- name: ray-logs
emptyDir: {}
- replicas: 1
minReplicas: 1
maxReplicas: 1
groupName: small-group
rayStartParams:
num-cpus: '1'
node-manager-port: '6380'
object-manager-port: '6381'
runtime-env-agent-port: '6382'
dashboard-agent-grpc-port: '6383'
dashboard-agent-listen-port: '52365'
metrics-export-port: '8080'
max-worker-port: '10012'
node-ip-address: $(hostname -I | tr -d ' ' | sed 's/\./-/g').raycluster-istio-headless-svc.kubeflow-user-example-com.svc.cluster.local
template:
metadata:
labels:
sidecar.istio.io/inject: "true"
spec:
containers:
- name: ray-worker
image: rayproject/ray:2.23.0-py311-cpu
lifecycle:
preStop:
exec:
command: ["/bin/sh","-c","ray stop"]
# use volumeMounts.Optional.
# Refer to https://kubernetes.io/docs/concepts/storage/volumes/
volumeMounts:
- mountPath: /tmp/ray
name: ray-logs
resources:
limits:
cpu: "1"
memory: "1G"
requests:
cpu: "300m"
memory: "1G"
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
volumes:
- name: ray-logs
emptyDir: {}
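A quick sanity check of the new Istio-enabled layout, once the RayCluster is running, is to confirm that the head and worker pods received the istio-proxy sidecar and that the headless service publishes their addresses. This is an illustrative sketch rather than part of the PR, and it assumes the cluster is deployed into the kubeflow-user-example-com profile namespace, as in the CI test:

NS=kubeflow-user-example-com

# Each Ray pod should list istio-proxy next to ray-head / ray-worker
kubectl -n "$NS" get pods -l ray.io/cluster=kubeflow-raycluster \
  -o custom-columns=NAME:.metadata.name,CONTAINERS:.spec.containers[*].name

# The headless service should publish one address per Ray pod
kubectl -n "$NS" get endpoints raycluster-istio-headless-svc

# The node-ip-address rayStartParam turns the pod IP into a DNS label:
#   hostname -I          -> "10.244.0.15 "   (example IP)
#   tr -d ' ' | sed ...  -> "10-244-0-15"
# giving 10-244-0-15.raycluster-istio-headless-svc.kubeflow-user-example-com.svc.cluster.local
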
83 changes: 52 additions & 31 deletions contrib/ray/test.sh
@@ -2,46 +2,44 @@

set -euxo

NAMESPACE=kubeflow
NAMESPACE=$1
TIMEOUT=120 # timeout in seconds
SLEEP_INTERVAL=30 # interval between checks in seconds
RAY_VERSION=2.23.0

function trap_handler {
kill $PID
# Delete RayCluster
kubectl -n $NAMESPACE delete -f raycluster_example.yaml

# Wait for all Ray Pods to be deleted.
start_time=$(date +%s)
while true; do
pods=$(kubectl -n $NAMESPACE get pods -o json | jq '.items | length')
if [ "$pods" -eq 1 ]; then
break
fi
current_time=$(date +%s)
elapsed_time=$((current_time - start_time))
if [ "$elapsed_time" -ge "$TIMEOUT" ]; then
echo "Timeout exceeded. Exiting loop."
exit 1
fi
sleep $SLEEP_INTERVAL
done

# Delete KubeRay operator
kustomize build kuberay-operator/base | kubectl -n $NAMESPACE delete -f -
}

trap trap_handler EXIT
start_time=$(date +%s)
for ((i=0; i<TIMEOUT; i+=2)); do
if [[ $(kubectl get namespace $NAMESPACE --no-headers 2>/dev/null | wc -l) -eq 1 ]]; then
echo "Namespace $NAMESPACE created."
break
fi

current_time=$(date +%s)
elapsed_time=$((current_time - start_time))

if [ "$elapsed_time" -ge "$TIMEOUT" ]; then
echo "Timeout exceeded. Namespace $NAMESPACE not created."
exit 1
fi

echo "Waiting for namespace $NAMESPACE to be created..."
sleep 2
done

echo "Namespace $NAMESPACE has been created!"

kubectl label namespace $NAMESPACE istio-injection=enabled

kubectl get namespaces --selector=istio-injection=enabled

# Install KubeRay operator
kustomize build kuberay-operator/overlays/standalone | kubectl -n $NAMESPACE apply --server-side -f -
kustomize build kuberay-operator/overlays/standalone | kubectl -n kubeflow apply --server-side -f -

# Wait for the operator to be ready.
kubectl -n $NAMESPACE wait --for=condition=available --timeout=600s deploy/kuberay-operator
kubectl -n $NAMESPACE get pod -l app.kubernetes.io/component=kuberay-operator
kubectl -n kubeflow wait --for=condition=available --timeout=600s deploy/kuberay-operator
kubectl -n kubeflow get pod -l app.kubernetes.io/component=kuberay-operator

# Create a RayCluster custom resource.
# Install RayCluster components
kubectl -n $NAMESPACE apply -f raycluster_example.yaml

# Wait for the RayCluster to be ready.
@@ -67,3 +65,26 @@ else
echo "Test failed!"
exit 1
fi

# Delete RayCluster
kubectl -n $NAMESPACE delete -f raycluster_example.yaml

# Wait for all Ray Pods to be deleted.
start_time=$(date +%s)
for ((i=0; i<TIMEOUT; i+=SLEEP_INTERVAL)); do
pods=$(kubectl -n $NAMESPACE get pods -o json | jq '.items | length')
if [ "$pods" -eq 0 ]; then
kill $PID
break
fi
current_time=$(date +%s)
elapsed_time=$((current_time - start_time))
if [ "$elapsed_time" -ge "$TIMEOUT" ]; then
echo "Timeout exceeded. Exiting loop."
exit 1
fi
sleep $SLEEP_INTERVAL
done

# Delete KubeRay operator
kustomize build kuberay-operator/base | kubectl -n kubeflow delete -f -
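
If the script is interrupted before it reaches this cleanup section (for example on a wait timeout), the same commands can be run by hand to remove what the test created. A minimal sketch, assuming the default profile namespace:

# Manual cleanup, mirroring the script's own delete steps
kubectl -n kubeflow-user-example-com delete -f raycluster_example.yaml --ignore-not-found
kustomize build kuberay-operator/base | kubectl -n kubeflow delete -f -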