diff --git a/applications/ray/rayservice-examples/README.md b/applications/ray/rayservice-examples/README.md deleted file mode 100644 index 3b3134d29..000000000 --- a/applications/ray/rayservice-examples/README.md +++ /dev/null @@ -1,5 +0,0 @@ -# Serve meta llama2 7b, llama2 70b quantized, falcon 7b-instruct or falcon 40b-instruct quantized models - -The examples have moved to [kubernetes-engine-samples](https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/tree/main/ai-ml/gke-ray/rayserve) - -The documentation associated with the examples can be found on [Google Cloud Documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/serve-llm-l4-ray) \ No newline at end of file diff --git a/applications/ray/raytrain-examples/images/ray-cluster-on-gke.png b/applications/ray/raytrain-examples/images/ray-cluster-on-gke.png deleted file mode 100644 index bfc78b12c..000000000 Binary files a/applications/ray/raytrain-examples/images/ray-cluster-on-gke.png and /dev/null differ diff --git a/applications/ray/raytrain-examples/images/ray-head-resources.png b/applications/ray/raytrain-examples/images/ray-head-resources.png deleted file mode 100644 index 89ad3e1ee..000000000 Binary files a/applications/ray/raytrain-examples/images/ray-head-resources.png and /dev/null differ diff --git a/applications/ray/raytrain-examples/images/ray-worker-resources.png b/applications/ray/raytrain-examples/images/ray-worker-resources.png deleted file mode 100644 index 6fe4785bb..000000000 Binary files a/applications/ray/raytrain-examples/images/ray-worker-resources.png and /dev/null differ diff --git a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/README.md b/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/README.md deleted file mode 100644 index 7e2a8a9f5..000000000 --- a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/README.md +++ /dev/null @@ -1,102 +0,0 @@ -**Goal** - -In this example we will demonstrate how to setup a ray cluster on 
GKE and deploy a distributed training job to fine tuning a stable diffusion model following the example from https://docs.ray.io/en/latest/train/examples/pytorch/dreambooth_finetuning.html and artifacts in https://github.com/ray-project/ray/tree/master/doc/source/templates/05_dreambooth_finetuning - -We will deploy a jupyter pod and a ray cluster (using kuberay operator). The pods will mount to shared filesystem (GCS Fuse CSI in this specific example) where the model and the datasets live and readily accessible to ray worker pods during training and inference. Ray jobs will be triggered from the jupyter notebook running in the jupyter pod. The example showcases [ray data API](https://docs.ray.io/en/latest/data/api/api.html) usage with a [GKE GCS Fuse CSI](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver) mounted volumes - -![ray-cluster](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/ray-on-gke/raytrain-examples/images/ray-cluster-on-gke.png) - -**Setup Steps** - -1. Create a GKE cluster with GPU node pool of 4 nodes (1 GPU per GKE node. In this example we used the n1-standard-32 machine type with [T4 GPU](https://cloud.google.com/compute/docs/gpus#nvidia_t4_gpus)). Ensure that Workload Identity and GCS CSI driver is enabled for the cluster. See details [here](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver#authentication) and [here](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver#enable) -``` -$ gcloud container clusters create $CLUSTER_NAME --location us-central1-c --workload-pool $PROJECT_ID.svc.id.goog --cluster-version=1.27 --num-nodes=1 --machine-type=e2-standard-32 --addons GcsFuseCsiDriver --enable-ip-alias -$ gcloud container node-pools create gpu-pool --cluster $CLUSTER_NAME --machine-type n1-standard-32 --accelerator type=nvidia-tesla-t4,count=1 --num-nodes=4 -``` -2. 
Ensure that the nvidia driver plugins are installed as expected (If not follow the steps [here](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers)) -``` -$ kubectl get po -n kube-system | grep nvidia -$ k get po -n kube-system | grep nvidia -nvidia-gpu-device-plugin-medium-cos-c5j8b 1/1 Running 0 8m24s -nvidia-gpu-device-plugin-medium-cos-kpmlr 1/1 Running 0 7m54s -nvidia-gpu-device-plugin-medium-cos-q844w 1/1 Running 0 8m25s -nvidia-gpu-device-plugin-medium-cos-t4q2x 1/1 Running 0 7m17s -``` - -3. Create a namespace `example` ```kubectl create ns example``` -4. Change context to the current namespace -``` -kubectl config set-context --current --namespace example -``` -5. Install the kuberay operator and validate operator pod is Running in `example` namespace -``` - helm install kuberay-operator kuberay/kuberay-operator --version 1.0.0-rc.0 --values 10001/TCP,8265/TCP,8080/TCP,6379/TCP,8000/TCP 3m12s -service/kuberay-operator ClusterIP 10.8.14.245 8080/TCP 4m4s -service/tensorflow ClusterIP None 8888/TCP 16s -service/tensorflow-jupyter LoadBalancer 10.8.3.9 80:31891/TCP 16s - -NAME READY UP-TO-DATE AVAILABLE AGE -deployment.apps/kuberay-operator 1/1 1 1 4m4s - -NAME DESIRED CURRENT READY AGE -replicaset.apps/kuberay-operator-64b7b88759 1 1 1 4m4s - -NAME READY AGE -statefulset.apps/tensorflow 1/1 16s - -``` -9. Locate the service IP of the jupyter -``` -$ kubectl get service tensorflow-jupyter -NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE -tensorflow-jupyter LoadBalancer 10.8.14.182 35.188.214.7 80:31524/TCP 5m33s -``` -10. fetch the token for the login -``` -$ kubectl exec --tty -i tensorflow-0 -c tensorflow-container -n example -- jupyter server list -Currently running servers: -http://tensorflow-0:8888/?token= :: /home/jovyan -``` -11. 
Open a new notebook and import the notebook from the URL `https://raw.githubusercontent.com/GoogleCloudPlatform/ai-on-gke/main/ray-on-gke/example_notebooks/raytrain-stablediffusion.ipynb` ([notebook](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/ray-on-gke/example_notebooks/raytrain-stablediffusion.ipynb)) - -12. Follow the comments and execute the cells in the notebook to run a distributed training job and then inference on the tuned model -13. Port forward the ray service port to examine the ray dashboard for jobs progress details, The dashboard is reachable at localhost:8286 in the local browser -``` -kubectl port-forward -n example service/ray-cluster-kuberay-head-svc 8265:8265 -``` -14. During an ongoing traing, the pod resource usage of CPU, Memory, GPU, GPU Memory can be visualized with the GKE Cloud Console for the workloads -example ![Ray Head resources](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/ray-on-gke/raytrain-examples/images/ray-head-resources.png) and ![Ray Worker resources](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/ray-on-gke/raytrain-examples/images/ray-worker-resources.png) diff --git a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/jupyter-spec.yaml b/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/jupyter-spec.yaml deleted file mode 100644 index 5c3b858eb..000000000 --- a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/jupyter-spec.yaml +++ /dev/null @@ -1,94 +0,0 @@ -# Copyright 2023 Google LLC -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License. - -# Tensorflow/Jupyter StatefulSet -apiVersion: apps/v1 -kind: StatefulSet -metadata: - name: tensorflow - namespace: example -spec: - selector: - matchLabels: - pod: tensorflow-pod - serviceName: tensorflow - replicas: 1 - template: - metadata: - annotations: - gke-gcsfuse/volumes: "true" - gke-gcsfuse/cpu-limit: 500m - gke-gcsfuse/memory-limit: 10Gi - gke-gcsfuse/ephemeral-storage-limit: 30Gi - labels: - pod: tensorflow-pod - spec: - serviceAccountName: my-ksa - terminationGracePeriodSeconds: 30 - containers: - - name: tensorflow-container - securityContext: - privileged: true - image: jupyter/tensorflow-notebook:python-3.10 - volumeMounts: - - name: test-vol - mountPath: /persist-data - resources: - limits: - cpu: "10" - ephemeral-storage: 50Gi - memory: 50Gi - requests: - cpu: "2" - ephemeral-storage: 10Gi - memory: 10Gi - volumes: - - name: test-vol - csi: - driver: gcsfuse.csi.storage.gke.io - volumeAttributes: - bucketName: test-gcsfuse-1 - mountOptions: "implicit-dirs,uid=1000,gid=100" -## Optional: override and set your own token -# env: -# - name: JUPYTER_TOKEN -# value: "jupyter" ---- -# Headless service for the above StatefulSet -apiVersion: v1 -kind: Service -metadata: - name: tensorflow - namespace: example -spec: - ports: - - port: 8888 - clusterIP: None - selector: - pod: tensorflow-pod ---- -# External service -apiVersion: "v1" -kind: "Service" -metadata: - name: tensorflow-jupyter - namespace: example -spec: - ports: - - protocol: "TCP" - port: 80 - targetPort: 8888 - selector: - pod: tensorflow-pod - type: LoadBalancer diff --git a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberay-operator/values.yaml b/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberay-operator/values.yaml deleted file mode 100644 index ab79c2c57..000000000 --- 
a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberay-operator/values.yaml +++ /dev/null @@ -1,21 +0,0 @@ -# Copyright 2023 Google LLC -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# Environment variables -env: -# If not set or set to true, kuberay auto injects an init container waiting for ray GCS. -# If false, you will need to inject your own init container to ensure ray GCS is up before the ray workers start. -# NOTE: This has been explicitly the init container mounts all volumes from the ray spec. This would fail because GCS Fuse based volume mounts are not supported with init containers (https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/38). Improvements will de done to GCS CSI as part of k8s sidecar feature (https://github.com/kubernetes/kubernetes/pull/116429). - - name: ENABLE_INIT_CONTAINER_INJECTION - value: "false" diff --git a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/main.tf b/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/main.tf deleted file mode 100644 index b32c1df3b..000000000 --- a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/main.tf +++ /dev/null @@ -1,55 +0,0 @@ -# Copyright 2023 Google LLC -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -data "google_client_config" "provider" {} - -provider "kubernetes" { - config_path = pathexpand("~/.kube/config") -} - -provider "kubectl" { - config_path = pathexpand("~/.kube/config") -} - -provider "helm" { - kubernetes { - config_path = pathexpand("~/.kube/config") - } -} - -# module "kubernetes" { -# source = "./modules/kubernetes" - -# namespace = var.namespace -# } - -module "service_accounts" { - source = "./modules/service_accounts" - - # depends_on = [module.kubernetes] - project_id = var.project_id - namespace = var.namespace - service_account = var.service_account - gcs_bucket = var.gcs_bucket -} - -module "kuberay" { - source = "./modules/kuberay" - - depends_on = [ - # module.kubernetes, - module.service_accounts - ] - namespace = var.namespace -} \ No newline at end of file diff --git a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kuberay/kuberay-values.yaml b/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kuberay/kuberay-values.yaml deleted file mode 100644 index 9c6609694..000000000 --- a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kuberay/kuberay-values.yaml +++ /dev/null @@ -1,256 +0,0 @@ -# Copyright 2023 Google LLC -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# Default values for ray-cluster. -# This is a YAML-formatted file. -# Declare variables to be passed into your templates. - -# The KubeRay community welcomes PRs to expose additional configuration -# in this Helm chart. - -image: - # Replace this with your own image if needed. - repository: anyscale/ray - tag: 2.9.3-py310-cu118 - pullPolicy: IfNotPresent - -nameOverride: "kuberay" -fullnameOverride: "" - -imagePullSecrets: [] - # - name: an-existing-secret - -head: - groupName: headgroup - # If enableInTreeAutoscaling is true, the autoscaler sidecar will be added to the Ray head pod. - # Ray autoscaler integration is supported only for Ray versions >= 1.11.0 - # Ray autoscaler integration is Beta with KubeRay >= 0.3.0 and Ray >= 2.0.0. - # enableInTreeAutoscaling: true - # autoscalerOptions is an OPTIONAL field specifying configuration overrides for the Ray autoscaler. - # The example configuration shown below below represents the DEFAULT values. - # autoscalerOptions: - # upscalingMode: Default - # idleTimeoutSeconds: 60 - # securityContext: {} - # env: [] - # envFrom: [] - # resources specifies optional resource request and limit overrides for the autoscaler container. - # For large Ray clusters, we recommend monitoring container resource usage to determine if overriding the defaults is required. 
- # resources: - # limits: - # cpu: "500m" - # memory: "512Mi" - # requests: - # cpu: "500m" - # memory: "512Mi" - labels: - cloud.google.com/gke-ray-node-type: head - created-by: ray-on-gke - serviceAccountName: my-ksa - rayStartParams: - dashboard-host: '0.0.0.0' - block: 'true' - # containerEnv specifies environment variables for the Ray container, - # Follows standard K8s container env schema. - containerEnv: - # - name: EXAMPLE_ENV - # value: "1" - - name: RAY_memory_monitor_refresh_ms - value: "0" - envFrom: [] - # - secretRef: - # name: my-env-secret - # ports optionally allows specifying ports for the Ray container. - # ports: [] - # resource requests and limits for the Ray head container. - # Modify as needed for your application. - # Note that the resources in this example are much too small for production; - # we don't recommend allocating less than 8G memory for a Ray pod in production. - # Ray pods should be sized to take up entire K8s nodes when possible. - # Always set CPU and memory limits for Ray pods. - # It is usually best to set requests equal to limits. - # See https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#resources - # for further guidance. - resources: - limits: - cpu: "5" - nvidia.com/gpu: "1" - # To avoid out-of-memory issues, never allocate less than 2G memory for the Ray head. - memory: "50G" - ephemeral-storage: 40Gi - requests: - cpu: "2" - nvidia.com/gpu: "1" - memory: "20G" - ephemeral-storage: 20Gi - annotations: - gke-gcsfuse/volumes: "true" - gke-gcsfuse/cpu-limit: "2" - gke-gcsfuse/memory-limit: 20Gi - gke-gcsfuse/ephemeral-storage-limit: 20Gi - nodeSelector: - iam.gke.io/gke-metadata-server-enabled: "true" - cloud.google.com/gke-accelerator: "nvidia-tesla-t4" - tolerations: [] - affinity: {} - # Ray container security context. 
- securityContext: {} - volumes: - - name: ray-logs - emptyDir: {} - - name: gcs-fuse-csi-ephemeral - csi: - driver: gcsfuse.csi.storage.gke.io - #readOnly: true - volumeAttributes: - bucketName: test-gcsfuse-1 - mountOptions: "implicit-dirs,uid=1000,gid=100" - # Ray writes logs to /tmp/ray/session_latests/logs - volumeMounts: - - mountPath: /tmp/ray - name: ray-logs - - name: gcs-fuse-csi-ephemeral - mountPath: /data -worker: - # If you want to disable the default workergroup - # uncomment the line below - # disabled: true - groupName: workergroup - replicas: 3 - type: worker - labels: - cloud.google.com/gke-ray-node-type: worker - created-by: ray-on-gke - serviceAccountName: my-ksa - rayStartParams: - block: 'true' - # containerEnv specifies environment variables for the Ray container, - # Follows standard K8s container env schema. - containerEnv: - # - name: EXAMPLE_ENV - # value: "1" - envFrom: [] - # - secretRef: - # name: my-env-secret - # ports optionally allows specifying ports for the Ray container. - # ports: [] - # resource requests and limits for the Ray head container. - # Modify as needed for your application. - # Note that the resources in this example are much too small for production; - # we don't recommend allocating less than 8G memory for a Ray pod in production. - # Ray pods should be sized to take up entire K8s nodes when possible. - # Always set CPU and memory limits for Ray pods. - # It is usually best to set requests equal to limits. - # See https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#resources - # for further guidance. 
- resources: - limits: - cpu: "4" - nvidia.com/gpu: "1" - memory: "40G" - ephemeral-storage: 40Gi - requests: - cpu: "2" - nvidia.com/gpu: "1" - memory: "20G" - ephemeral-storage: 20Gi - annotations: - gke-gcsfuse/volumes: "true" - gke-gcsfuse/cpu-limit: "2" - gke-gcsfuse/memory-limit: 20Gi - gke-gcsfuse/ephemeral-storage-limit: 20Gi - nodeSelector: - iam.gke.io/gke-metadata-server-enabled: "true" - cloud.google.com/gke-accelerator: "nvidia-tesla-t4" - tolerations: [] - affinity: {} - # Ray container security context. - securityContext: {} - volumes: - - name: ray-logs - emptyDir: {} - - name: gcs-fuse-csi-ephemeral - csi: - driver: gcsfuse.csi.storage.gke.io - #readOnly: true - volumeAttributes: - bucketName: test-gcsfuse-1 - mountOptions: "implicit-dirs,uid=1000,gid=100" - # Ray writes logs to /tmp/ray/session_latests/logs - volumeMounts: - - mountPath: /tmp/ray - name: ray-logs - - name: gcs-fuse-csi-ephemeral - mountPath: /data -# The map's key is used as the groupName. -# For example, key:small-group in the map below -# will be used as the groupName -additionalWorkerGroups: - smallGroup: - # Disabled by default - disabled: true - replicas: 1 - minReplicas: 1 - maxReplicas: 3 - type: worker - labels: {} - rayStartParams: - block: 'true' - initContainerImage: 'busybox:1.28' # Enable users to specify the image for init container. Users can pull the busybox image from their private repositories. - # Security context for the init container. - initContainerSecurityContext: {} - # containerEnv specifies environment variables for the Ray container, - # Follows standard K8s container env schema. - containerEnv: [] - # - name: EXAMPLE_ENV - # value: "1" - envFrom: [] - # - secretRef: - # name: my-env-secret - # ports optionally allows specifying ports for the Ray container. - # ports: [] - # resource requests and limits for the Ray head container. - # Modify as needed for your application. 
- # Note that the resources in this example are much too small for production; - # we don't recommend allocating less than 8G memory for a Ray pod in production. - # Ray pods should be sized to take up entire K8s nodes when possible. - # Always set CPU and memory limits for Ray pods. - # It is usually best to set requests equal to limits. - # See https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#resources - # for further guidance. - resources: - limits: - cpu: 1 - memory: "1G" - requests: - cpu: 1 - memory: "1G" - annotations: - key: value - nodeSelector: {} - tolerations: [] - affinity: {} - # Ray container security context. - securityContext: {} - volumes: - - name: log-volume - emptyDir: {} - # Ray writes logs to /tmp/ray/session_latests/logs - volumeMounts: - - mountPath: /tmp/ray - name: log-volume - sidecarContainers: [] - -service: - type: ClusterIP diff --git a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kuberay/kuberay.tf b/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kuberay/kuberay.tf deleted file mode 100644 index dce76229d..000000000 --- a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kuberay/kuberay.tf +++ /dev/null @@ -1,23 +0,0 @@ -# Copyright 2023 Google LLC -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -resource "helm_release" "ray-cluster" { - name = "ray-cluster" - repository = "https://ray-project.github.io/kuberay-helm/" - chart = "ray-cluster" - namespace = var.namespace - values = [ - file("${path.module}/kuberay-values.yaml") - ] -} diff --git a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kuberay/variables.tf b/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kuberay/variables.tf deleted file mode 100644 index 048860d76..000000000 --- a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kuberay/variables.tf +++ /dev/null @@ -1,19 +0,0 @@ -# Copyright 2023 Google LLC -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -variable "namespace" { - type = string - description = "Kubernetes namespace where resources are deployed" - default = "example" -} diff --git a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kuberay/versions.tf b/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kuberay/versions.tf deleted file mode 100644 index 72eb5eeb0..000000000 --- a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kuberay/versions.tf +++ /dev/null @@ -1,33 +0,0 @@ -# Copyright 2023 Google LLC -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -terraform { - required_providers { - helm = { - source = "hashicorp/helm" - version = "~> 2.8.0" - } - kubernetes = { - source = "hashicorp/kubernetes" - version = "2.18.1" - } - kubectl = { - source = "alekc/kubectl" - version = "2.0.1" - } - } - provider_meta "google" { - module_name = "blueprints/terraform/terraform-google-kubernetes-engine:kuberay/v0.1.0" - } -} diff --git a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kubernetes/kubernetes.tf b/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kubernetes/kubernetes.tf deleted file mode 100644 index c9681bfc5..000000000 --- a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kubernetes/kubernetes.tf +++ /dev/null @@ -1,19 +0,0 @@ -# Copyright 2023 Google LLC -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -resource "kubernetes_namespace" "ml" { - metadata { - name = var.namespace - } -} diff --git a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kubernetes/variables.tf b/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kubernetes/variables.tf deleted file mode 100644 index 93941f48e..000000000 --- a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kubernetes/variables.tf +++ /dev/null @@ -1,19 +0,0 @@ -# Copyright 2023 Google LLC -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -variable "namespace" { - type = string - description = "Kubernetes namespace where resources are deployed" - default = "ml-system" -} diff --git a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kubernetes/versions.tf b/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kubernetes/versions.tf deleted file mode 100644 index 72eb5eeb0..000000000 --- a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kubernetes/versions.tf +++ /dev/null @@ -1,33 +0,0 @@ -# Copyright 2023 Google LLC -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -terraform { - required_providers { - helm = { - source = "hashicorp/helm" - version = "~> 2.8.0" - } - kubernetes = { - source = "hashicorp/kubernetes" - version = "2.18.1" - } - kubectl = { - source = "alekc/kubectl" - version = "2.0.1" - } - } - provider_meta "google" { - module_name = "blueprints/terraform/terraform-google-kubernetes-engine:kuberay/v0.1.0" - } -} diff --git a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/service_accounts/service_accounts.tf b/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/service_accounts/service_accounts.tf deleted file mode 100644 index 44edf7504..000000000 --- a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/service_accounts/service_accounts.tf +++ /dev/null @@ -1,61 +0,0 @@ -# Copyright 2023 Google LLC -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -resource "google_service_account" "sa" { - project = var.project_id - account_id = var.service_account - display_name = "GCP SA for Ray" -} - -resource "google_service_account_iam_binding" "workload-identity-user" { - service_account_id = google_service_account.sa.name - role = "roles/iam.workloadIdentityUser" - - members = [ - "serviceAccount:${var.project_id}.svc.id.goog[${var.namespace}/${var.k8s_service_account}]", - ] - depends_on = [google_service_account.sa] -} - -resource "google_storage_bucket_iam_binding" "gcs-bucket-iam" { - bucket = var.gcs_bucket - role = "roles/storage.objectAdmin" - members = [ - "serviceAccount:${google_service_account.sa.account_id}@${var.project_id}.iam.gserviceaccount.com", - ] -} - -resource "kubernetes_service_account" "ksa" { - metadata { - name = var.k8s_service_account - namespace = var.namespace - } - automount_service_account_token = true -} - -resource "kubernetes_annotations" "ray-sa-annotations" { - api_version = "v1" - kind = "ServiceAccount" - metadata { - name = var.k8s_service_account - namespace = var.namespace - } - annotations = { - "iam.gke.io/gcp-service-account" = "${google_service_account.sa.account_id}@${var.project_id}.iam.gserviceaccount.com" - } - depends_on = [ - kubernetes_service_account.ksa, - google_service_account.sa - ] -} diff --git a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/service_accounts/variables.tf b/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/service_accounts/variables.tf deleted file mode 100644 index d2580042d..000000000 --- a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/service_accounts/variables.tf +++ /dev/null @@ -1,42 +0,0 @@ -# Copyright 2023 Google LLC -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -variable "project_id" { - type = string - description = "GCP project id" -} - -variable "namespace" { - type = string - description = "Kubernetes namespace where resources are deployed" - default = "example" -} - -variable "service_account" { - type = string - description = "Google Cloud IAM service account for authenticating with GCP services" - default = "my-gcp-sa" -} - -variable "k8s_service_account" { - type = string - description = "k8s service account" - default = "my-ksa" -} - -variable "gcs_bucket" { - type = string - description = "GCS Bucket name" - default = "test-gcsfuse-1" -} \ No newline at end of file diff --git a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/service_accounts/versions.tf b/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/service_accounts/versions.tf deleted file mode 100644 index 53d5c8e95..000000000 --- a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/service_accounts/versions.tf +++ /dev/null @@ -1,28 +0,0 @@ -# Copyright 2023 Google LLC -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License. - -terraform { - required_providers { - google = { - source = "hashicorp/google" - } - kubernetes = { - source = "hashicorp/kubernetes" - version = "2.18.1" - } - } - provider_meta "google" { - module_name = "blueprints/terraform/terraform-google-kubernetes-engine:kuberay/v0.1.0" - } -} diff --git a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/variables.tf b/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/variables.tf deleted file mode 100644 index ffab5dc80..000000000 --- a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/variables.tf +++ /dev/null @@ -1,43 +0,0 @@ -# Copyright 2023 Google LLC -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -variable "project_id" { - type = string - description = "GCP project id" - default = "saikatroyc-stateful-joonix" -} - -variable "namespace" { - type = string - description = "Kubernetes namespace where resources are deployed" - default = "example" -} - -variable "service_account" { - type = string - description = "Google Cloud IAM service account for authenticating with GCP services" - default = "my-gcp-sa" -} - -variable "k8s_service_account" { - type = string - description = "k8s service account" - default = "my-ksa" -} - -variable "gcs_bucket" { - type = string - description = "GCS Bucket name" - default = "test-gcsfuse-1" -} \ No newline at end of file diff --git a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/versions.tf b/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/versions.tf deleted file mode 100644 index 2b93bd40f..000000000 --- a/applications/ray/raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/versions.tf +++ /dev/null @@ -1,36 +0,0 @@ -# Copyright 2023 Google LLC -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -terraform { - required_providers { - google = { - source = "hashicorp/google" - } - helm = { - source = "hashicorp/helm" - version = "~> 2.8.0" - } - kubernetes = { - source = "hashicorp/kubernetes" - version = "2.18.1" - } - kubectl = { - source = "alekc/kubectl" - version = "2.0.1" - } - } - provider_meta "google" { - module_name = "blueprints/terraform/terraform-google-kubernetes-engine:kuberay/v0.1.0" - } -} diff --git a/applications/ray/tfvars_examples/README.md b/applications/ray/tfvars_examples/README.md new file mode 100644 index 000000000..5397adfa8 --- /dev/null +++ b/applications/ray/tfvars_examples/README.md @@ -0,0 +1,13 @@ +# Example terraform variables for Ray clusters + +This folder contains example terraform variable files to use with the [Ray on GKE terraform templates](/applications/ray/). + +To try one of the examples, edit the tfvars file and configure the mandatory variables. Then run the following commands: +``` +# from root of ai-on-gke +cd applications/ray +terraform init +terraform apply --var-file=tfvars_examples/ +``` + +See [Getting Started](/ray-on-gke/README.md#getting-started) for more details on using the examples in this repo. \ No newline at end of file diff --git a/applications/ray/tfvars_examples/raycluster-with-gcsfuse-volumes.tfvars b/applications/ray/tfvars_examples/raycluster-with-gcsfuse-volumes.tfvars new file mode 100644 index 000000000..007f8f624 --- /dev/null +++ b/applications/ray/tfvars_examples/raycluster-with-gcsfuse-volumes.tfvars @@ -0,0 +1,41 @@ +# Copyright 2023 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +##common variables +## Need to pull these variables from tf output from previous platform stage +project_id = "" + +## this is required for terraform to connect to GKE master and deploy workloads +create_cluster = false # this flag will create a new standard public gke cluster in default network +cluster_name = "" +cluster_location = "us-central1" + +####################################################### +#### APPLICATIONS +####################################################### + +## GKE environment variables +kubernetes_namespace = "ray" + +# Creates a google service account & k8s service account & configures workload identity with appropriate permissions. +# Set to false & update the variable `workload_identity_service_account` to use an existing IAM service account. +create_service_account = true +workload_identity_service_account = "ray-service-account" + +# Bucket name should be globally unique. +create_gcs_bucket = true +gcs_bucket = "" +create_ray_cluster = true +ray_cluster_name = "ray-cluster" diff --git a/applications/ray/tfvars_examples/raycluster-with-grafana-dashboard.tfvars b/applications/ray/tfvars_examples/raycluster-with-grafana-dashboard.tfvars new file mode 100644 index 000000000..34953dce1 --- /dev/null +++ b/applications/ray/tfvars_examples/raycluster-with-grafana-dashboard.tfvars @@ -0,0 +1,38 @@ +# Copyright 2023 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +##common variables +## Need to pull these variables from tf output from previous platform stage +project_id = "" + +## this is required for terraform to connect to GKE master and deploy workloads +create_cluster = false # this flag will create a new standard public gke cluster in default network +cluster_name = "" +cluster_location = "us-central1" + +####################################################### +#### APPLICATIONS +####################################################### + +## GKE environment variables +kubernetes_namespace = "ray" + +# Creates a google service account & k8s service account & configures workload identity with appropriate permissions. +# Set to false & update the variable `workload_identity_service_account` to use an existing IAM service account. +create_service_account = false + +# Bucket name should be globally unique. +create_gcs_bucket = false +enable_grafana_on_ray_dashboard = true diff --git a/applications/ray/tfvars_examples/raycluster-with-iap-auth.tfvars b/applications/ray/tfvars_examples/raycluster-with-iap-auth.tfvars new file mode 100644 index 000000000..1b4ebb499 --- /dev/null +++ b/applications/ray/tfvars_examples/raycluster-with-iap-auth.tfvars @@ -0,0 +1,59 @@ +# Copyright 2023 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +##common variables +## Need to pull these variables from tf output from previous platform stage +project_id = "" + +## this is required for terraform to connect to GKE master and deploy workloads +create_cluster = false # this flag will create a new standard public gke cluster in default network +cluster_name = "" +cluster_location = "us-central1" + +####################################################### +#### APPLICATIONS +####################################################### + +## GKE environment variables +kubernetes_namespace = "ray" + +# Creates a google service account & k8s service account & configures workload identity with appropriate permissions. +# Set to false & update the variable `workload_identity_service_account` to use an existing IAM service account. +create_service_account = true +workload_identity_service_account = "ray-service-account" + +# Bucket name should be globally unique. +create_gcs_bucket = false +gcs_bucket = "ray-bucket-zydg" +create_ray_cluster = true +ray_cluster_name = "ray-cluster" +enable_grafana_on_ray_dashboard = false + +## IAP config - if you choose to disable IAP authenticated access for your endpoints, ignore everything below this line.
+create_brand = true + +## Ray Dashboard IAP Settings +ray_dashboard_add_auth = true # Set to true when using auth with IAP +ray_dashboard_support_email = "" +ray_dashboard_k8s_ingress_name = "ray-dashboard-ingress" +ray_dashboard_k8s_managed_cert_name = "ray-dashboard-managed-cert" +ray_dashboard_k8s_iap_secret_name = "ray-dashboard-iap-secret" +ray_dashboard_k8s_backend_config_name = "ray-dashboard-iap-config" +ray_dashboard_k8s_backend_service_port = 8265 + +ray_dashboard_domain = "" +ray_dashboard_client_id = "" +ray_dashboard_client_secret = "" +ray_dashboard_members_allowlist = "user:" diff --git a/applications/ray/tfvars_examples/raycluster-with-workload-identity.tfvars b/applications/ray/tfvars_examples/raycluster-with-workload-identity.tfvars new file mode 100644 index 000000000..ee9e2f34e --- /dev/null +++ b/applications/ray/tfvars_examples/raycluster-with-workload-identity.tfvars @@ -0,0 +1,41 @@ +# Copyright 2023 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +##common variables +## Need to pull these variables from tf output from previous platform stage +project_id = "" + +## this is required for terraform to connect to GKE master and deploy workloads +create_cluster = false # this flag will create a new standard public gke cluster in default network +cluster_name = "" +cluster_location = "us-central1" + +####################################################### +#### APPLICATIONS +####################################################### + +## GKE environment variables +kubernetes_namespace = "ray" + +# Creates a google service account & k8s service account & configures workload identity with appropriate permissions. +# Set to false & update the variable `workload_identity_service_account` to use an existing IAM service account. +create_service_account = true +workload_identity_service_account = "ray-service-account" + +# Bucket name should be globally unique. +create_gcs_bucket = false +create_ray_cluster = true +ray_cluster_name = "ray-cluster" +enable_grafana_on_ray_dashboard = false diff --git a/applications/ray/tfvars_examples/simple-raycluster-with-existing-gke-cluster.tfvars b/applications/ray/tfvars_examples/simple-raycluster-with-existing-gke-cluster.tfvars new file mode 100644 index 000000000..120c5a5a2 --- /dev/null +++ b/applications/ray/tfvars_examples/simple-raycluster-with-existing-gke-cluster.tfvars @@ -0,0 +1,38 @@ +# Copyright 2023 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
+ + +##common variables +## Need to pull these variables from tf output from previous platform stage +project_id = "" + +## this is required for terraform to connect to GKE master and deploy workloads +create_cluster = false # this flag will create a new standard public gke cluster in default network +cluster_name = "" +cluster_location = "us-central1" + +####################################################### +#### APPLICATIONS +####################################################### + +## GKE environment variables +kubernetes_namespace = "ray" + +# Creates a google service account & k8s service account & configures workload identity with appropriate permissions. +# Set to false & update the variable `workload_identity_service_account` to use an existing IAM service account. +create_service_account = false + +# Bucket name should be globally unique. +create_gcs_bucket = false +enable_grafana_on_ray_dashboard = false diff --git a/applications/ray/tfvars_examples/simple-raycluster-with-new-gke-cluster.tfvars b/applications/ray/tfvars_examples/simple-raycluster-with-new-gke-cluster.tfvars new file mode 100644 index 000000000..48ab1fef0 --- /dev/null +++ b/applications/ray/tfvars_examples/simple-raycluster-with-new-gke-cluster.tfvars @@ -0,0 +1,38 @@ +# Copyright 2023 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
+ + +##common variables +## Need to pull these variables from tf output from previous platform stage +project_id = "" + +## this is required for terraform to connect to GKE master and deploy workloads +create_cluster = true # this flag will create a new standard public gke cluster in default network +cluster_name = "ray-cluster" +cluster_location = "us-central1" + +####################################################### +#### APPLICATIONS +####################################################### + +## GKE environment variables +kubernetes_namespace = "ray" + +# Creates a google service account & k8s service account & configures workload identity with appropriate permissions. +# Set to false & update the variable `workload_identity_service_account` to use an existing IAM service account. +create_service_account = false + +# Bucket name should be globally unique. +create_gcs_bucket = false +enable_grafana_on_ray_dashboard = false diff --git a/ray-on-gke/README.md b/ray-on-gke/README.md index 9acd3b081..bb4f12eba 100644 --- a/ray-on-gke/README.md +++ b/ray-on-gke/README.md @@ -5,14 +5,18 @@ Most examples use the [`applications/ray`](/applications/ray) terraform module t ## Getting Started -### Create a RayCluster on an existing cluster +It is highly recommended to use the [infrastructure](/infrastructure/) terraform module to create your GKE cluster. + +### Create a RayCluster on a GKE cluster Edit `templates/workloads.tfvars` with your environment specific variables and configurations. -The following variables are required: +The following variables require configuration: * project_id * cluster_name * cluster_location +If you need a new cluster, you can specify `create_cluster: true`. + Run the following commands to install KubeRay and deploy a Ray cluster onto your existing cluster. ``` cd templates/ @@ -20,8 +24,6 @@ terraform init terraform apply --var-file=workloads.tfvars ``` -**NOTE**: you can also create a new GKE cluster by specifying `create_cluster: true`.
- Validate that the RayCluster is ready: ``` $ kubectl get raycluster @@ -29,6 +31,8 @@ NAME DESIRED WORKERS AVAILABLE WORKERS STATUS AGE ray-cluster-kuberay 1 1 ready 3m41s ``` +See [tfvars examples](./examples/tfvars/) to explore different configuration options for the Ray cluster using the [terraform templates](./templates). + ### Install Ray Ensure Ray is installed in your environment. See [Installing Ray](https://docs.ray.io/en/latest/ray-overview/installation.html) for more details. @@ -39,7 +43,7 @@ To submit a Ray job, first establish a connection to the Ray head. For this exam to connect to the Ray head via localhost. ```bash -$ kubectl -n ml port-forward service/ray-cluster-kuberay-head-svc 8265 & +$ kubectl -n ai-on-gke port-forward service/ray-cluster-kuberay-head-svc 8265 & ``` Submit a Ray job that prints resources available in your Ray cluster: @@ -79,7 +83,7 @@ To use the client, first establish a connection to the Ray head. For this exampl to connect to the Ray head Service via localhost. ```bash -$ kubectl -n ml port-forward service/ray-cluster-kuberay-head-svc 10001 & +$ kubectl -n ai-on-gke port-forward service/ray-cluster-kuberay-head-svc 10001 & ``` Next, define a Python script containing remote code you want to run on your Ray cluster. 
Similar to the previous example, @@ -113,10 +117,14 @@ See the following guides and tutorials for running Ray applications on GKE: * [Priority Scheduling with RayJob and Kueue](https://docs.ray.io/en/master/cluster/kubernetes/examples/rayjob-kueue-priority-scheduling.html) * [Gang Scheduling with RayJob and Kueue](https://docs.ray.io/en/master/cluster/kubernetes/examples/rayjob-kueue-gang-scheduling.html) * [RayTrain with GCSFuse CSI driver](./guides/raytrain-with-gcsfusecsi/) +* [Configuring KubeRay to use Google Cloud Storage Buckets in GKE](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/gke-gcs-bucket.html) +* [Example Notebooks with Ray](./examples/notebooks/) +* [Example templates for Ray clusters](./examples/tfvars/) ## Blogs & Best Practices * [Getting started with Ray on Google Kubernetes Engine](https://cloud.google.com/blog/products/containers-kubernetes/use-ray-on-kubernetes-with-kuberay) * [Why GKE for your Ray AI workloads?](https://cloud.google.com/blog/products/containers-kubernetes/the-benefits-of-using-gke-for-running-ray-ai-workloads) * [Advanced scheduling for AI/ML with Ray and Kueue](https://cloud.google.com/blog/products/containers-kubernetes/using-kuberay-and-kueue-to-orchestrate-ray-applications-in-gke) +* [How to secure Ray on Google Kubernetes Engine](https://cloud.google.com/blog/products/containers-kubernetes/securing-ray-to-run-on-google-kubernetes-engine) * [4 ways to reduce cold start latency on Google Kubernetes Engine](https://cloud.google.com/blog/products/containers-kubernetes/tips-and-tricks-to-reduce-cold-start-latency-on-gke) diff --git a/ray-on-gke/examples/tfvars b/ray-on-gke/examples/tfvars new file mode 120000 index 000000000..10a362169 --- /dev/null +++ b/ray-on-gke/examples/tfvars @@ -0,0 +1 @@ +../../applications/ray/tfvars_examples \ No newline at end of file