OpenShift AI has a single operator for the base installation. However, it is strongly recommended to install the OpenShift Pipelines Operator as well to facilitate pipelines in the Data Science workflow. To that end, the instructions below walk you through the installation of both operators. The artifacts directory of this repository contains example YAML files that can be used to install them.
In contrast to the OpenShift AI Operator, the Pipelines Operator only requires an appropriate Subscription to become active on the cluster. The below YAML will create a Subscription that tracks the latest version available from Red Hat.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
labels:
operators.coreos.com/openshift-pipelines-operator-rh.openshift-operators: ""
name: openshift-pipelines-operator-rh
namespace: openshift-operators
spec:
channel: latest
installPlanApproval: Automatic
name: openshift-pipelines-operator-rh
source: redhat-operators
sourceNamespace: openshift-marketplace
startingCSV: openshift-pipelines-operator-rh.v1.14.1
You can create this object by running oc apply -f openshift_pipelines_sub.
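If you want to confirm the installation completed before moving on, checks along these lines should work (the CSV name reported by your cluster may differ slightly):
oc get subscription openshift-pipelines-operator-rh -n openshift-operators -o jsonpath='{.status.installedCSV}{"\n"}'
oc get csv -n openshift-operators | grep pipelines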
The OpenShift AI Operator has two objects to create before it is fully operational in the cluster:
- The Operator Subscription
- The Data Science Cluster
The following can be used to install version 2.7 of the OpenShift AI Operator from the fast channel.
Note
For legacy reasons, the OpenShift AI Operator is actually called the rhods-operator (Red Hat OpenShift Data Science).
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
labels:
operators.coreos.com/rhods-operator.redhat-ods-operator: ""
name: rhods-operator
namespace: redhat-ods-operator
spec:
channel: fast
installPlanApproval: Automatic
name: rhods-operator
source: redhat-operators
sourceNamespace: openshift-marketplace
startingCSV: rhods-operator.2.7.0
You can create this object by running oc apply -f openshift_ai_subscription.yaml.
If you need a different channel or version, you will need to update the appropriate sections of the above YAML.
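As with the Pipelines Operator, you can optionally verify that the subscription resolved and the operator installed; something similar to the following should work (the namespace and CSV names assume the YAML above was used unmodified):
oc get subscription rhods-operator -n redhat-ods-operator -o jsonpath='{.status.installedCSV}{"\n"}'
oc get csv -n redhat-ods-operator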
After the subscription has completed, the next step is to create a Data Science Cluster. The below YAML is sufficient for the vast majority of users; advanced users are welcome to adjust the options as they see fit.
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
labels:
app.kubernetes.io/created-by: rhods-operator
app.kubernetes.io/instance: default-dsc
app.kubernetes.io/managed-by: kustomize
app.kubernetes.io/name: datasciencecluster
app.kubernetes.io/part-of: rhods-operator
name: default-dsc
spec:
components:
codeflare:
managementState: Removed
dashboard:
managementState: Managed
datasciencepipelines:
managementState: Managed
kserve:
managementState: Managed
serving:
ingressGateway:
certificate:
type: SelfSigned
managementState: Managed
name: knative-serving
modelmeshserving:
managementState: Managed
ray:
managementState: Removed
trustyai: {}
workbenches:
managementState: Managed
You can create this object by running oc apply -f openshift_ai_datasciencecluster.yaml.
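If you want to verify that the Data Science Cluster came up, a check similar to the following should work (the component pods typically land in the redhat-ods-applications namespace):
oc get datasciencecluster default-dsc -o jsonpath='{.status.phase}{"\n"}'
oc get pods -n redhat-ods-applications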
The Node Feature Discovery (NFD) Operator is a prerequisite for the NVIDIA GPU Operator.
The official documentation outlines the process to install the Node Feature Discovery (NFD) Operator. For convenience, the steps are summarized here; however, if these instructions do not work as expected, defer to the official documentation.
The first requirement for the NFD operator is to create the namespace to hold the subscription and the operator group used to deploy the requisite objects.
apiVersion: v1
kind: Namespace
metadata:
name: openshift-nfd
Create the namespace by running the following command:
oc create -f nfd-namespace.yaml
After creating the namespace, use the following YAML to create the required operator group:
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
generateName: openshift-nfd-
name: openshift-nfd
namespace: openshift-nfd
spec:
targetNamespaces:
- openshift-nfd
Create the OperatorGroup by running the following command:
oc create -f nfd-operatorgroup.yaml
Finally, subscribe to the appropriate channel to complete the operator installation:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: nfd
namespace: openshift-nfd
spec:
channel: "stable"
installPlanApproval: Automatic
name: nfd
source: redhat-operators
sourceNamespace: openshift-marketplace
Create the subscription object by running the following command:
oc create -f nfd-sub.yaml
After a short time, pods will begin to be created in the openshift-nfd namespace. You can watch their progress with the following command:
oc get pods -n openshift-nfd
Once the NFD operator has finished installing, you will need to create a NodeFeatureDiscovery instance. The below YAML is equivalent to accepting all of the default options if you were using the OpenShift web UI. While you can make alterations, none are required. The defaults are considered sane.
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
name: nfd-instance
namespace: openshift-nfd
spec:
instance: ''
operand:
image: >-
registry.redhat.io/openshift4/ose-node-feature-discovery:v<OCP VERSION>
servicePort: 12000
topologyUpdater: false
workerConfig:
configData: |
core:
sleepInterval: 60s
sources:
pci:
deviceClassWhitelist:
- "0200"
- "03"
- "12"
deviceLabelFields:
- "vendor"
Important
Replace <OCP VERSION> in the above YAML with the version of OpenShift you are currently running; otherwise you will get an ImagePull error.
You can create this object by running oc apply -f nodefeaturediscovery_cr.yaml.
Check the status by running oc get pods -n openshift-nfd. There should be at least one pod for each node in the cluster.
You can look for the label nvidia.com/gpu.present=true. Alternatively, you can run the following command to show any nodes that have an NVIDIA GPU present (10de is NVIDIA's PCI vendor ID):
oc get nodes -l "feature.node.kubernetes.io/pci-10de.present=true"
After installing the NFD Operator, you can move forward with installing the NVIDIA GPU operator.
See the official NVIDIA documentation for the most up-to-date version of these instructions.
Note
During installation, the Driver Toolkit daemon set checks for the existence of a build-dockercfg secret for the Driver Toolkit service account. When the secret does not exist, the installation stalls.
You can run the following command to determine if your cluster is affected.
oc get configs.imageregistry.operator.openshift.io cluster -o jsonpath='{.spec.storage}{"\n"}'
If the output from the preceding command is empty ({}), your cluster is affected and you must configure your registry to use storage before continuing.
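If your cluster is affected, one commonly documented non-production workaround is to back the registry with ephemeral storage; note that images are lost whenever the registry pod restarts, so production clusters should configure a persistent backend instead. A sketch of that workaround:
oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"managementState":"Managed","storage":{"emptyDir":{}}}}'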
Before proceeding with the install, ensure that the driver-toolkit ImageStream is available on your cluster:
oc get -n openshift is/driver-toolkit
NVIDIA recommends creating a dedicated nvidia-gpu-operator namespace to hold the requisite objects; when creating the Operator via the UI, the Red Hat defaults will create and use this project. Create the namespace with the following YAML:
apiVersion: v1
kind: Namespace
metadata:
name: nvidia-gpu-operator
Run the following command to create the namespace: oc apply -f nvidia-ns.yaml
Important
While it is possible to use a different namespace, namespace monitoring will not be enabled by default. You can enable monitoring with the following command:
oc label ns/$NAMESPACE_NAME openshift.io/cluster-monitoring=true
Once the namespace is created, create an OperatorGroup object:
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
name: nvidia-gpu-operator-group
namespace: nvidia-gpu-operator
spec:
targetNamespaces:
- nvidia-gpu-operator
Running the following command will create the OperatorGroup:
oc apply -f nvidia-og.yaml
NVIDIA currently recommends having a manual installPlan approval for the subscription. To determine the default channel (which is the currently recommended channel), run the following:
CHANNEL=$(oc get packagemanifest gpu-operator-certified -n openshift-marketplace -o jsonpath='{.status.defaultChannel}')
Once you have the channel, you can use the following command to get the CSV required for the subscription:
CSV=$(oc get packagemanifests/gpu-operator-certified -n openshift-marketplace -ojson | jq -r '.status.channels[] | select(.name == "'$CHANNEL'") | .currentCSV')
As long as you have set the previous environment variables you can then create your subscription YAML with the following command:
echo "apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: gpu-operator-certified
namespace: nvidia-gpu-operator
spec:
channel: ${CHANNEL}
installPlanApproval: Manual
name: gpu-operator-certified
source: certified-operators
sourceNamespace: openshift-marketplace
startingCSV: ${CSV}
" > nvidia-operator-sub.yaml
This should create the YAML with the correct values intact. You can create this object by running oc apply -f nvidia-operator-sub.yaml. Because the approval is set to Manual, you need to approve the install plan before the operator will be installed:
INSTALL_PLAN=$(oc get installplan -n nvidia-gpu-operator -oname)
oc patch $INSTALL_PLAN -n nvidia-gpu-operator --type merge --patch '{"spec":{"approved":true }}'
After the installation of the operator is completed, a ClusterPolicy needs to be created. If you are using the environment variables from this guide, you can generate a ClusterPolicy JSON file with the following command:
oc get csv -n nvidia-gpu-operator ${CSV} -ojsonpath={.metadata.annotations.alm-examples} |jq .[0] > clusterpolicy.json
Then simply apply the file:
oc apply -f clusterpolicy.json
Note
The clusterpolicy.json in the artifacts/NVIDIA_Operator directory of this repo is provided for your convenience. While it contains no version-specific information, NVIDIA may elect to change the contents of this file in the future. When in doubt, regenerate it with the commands above.
After a short period of time you should start to see pods and daemon sets being created in the nvidia-gpu-operator project. You can view them by running oc get pods,daemonset -n nvidia-gpu-operator:
NAME READY STATUS RESTARTS AGE
pod/gpu-feature-discovery-526k5 1/1 Running 0 9m2s
pod/gpu-feature-discovery-8jlmt 1/1 Running 0 9m4s
pod/gpu-operator-bf9f9fc64-29hlh 1/1 Running 0 104m
pod/nvidia-container-toolkit-daemonset-4dg2v 1/1 Running 0 9m2s
pod/nvidia-container-toolkit-daemonset-g565d 1/1 Running 0 9m4s
pod/nvidia-cuda-validator-mrhmf 0/1 Completed 0 5m59s
pod/nvidia-cuda-validator-v74dm 0/1 Completed 0 5m58s
pod/nvidia-dcgm-df7cz 1/1 Running 0 9m4s
pod/nvidia-dcgm-exporter-b95k6 1/1 Running 0 9m2s
pod/nvidia-dcgm-exporter-lsxq8 1/1 Running 0 9m4s
pod/nvidia-dcgm-lhwgf 1/1 Running 0 9m2s
pod/nvidia-device-plugin-daemonset-d6jzv 1/1 Running 0 9m4s
pod/nvidia-device-plugin-daemonset-hrv5z 1/1 Running 0 9m2s
pod/nvidia-driver-daemonset-412.86.202303241612-0-7hnsp 2/2 Running 0 9m16s
pod/nvidia-driver-daemonset-412.86.202303241612-0-qtvrw 2/2 Running 0 9m19s
pod/nvidia-node-status-exporter-9nh4v 1/1 Running 0 9m16s
pod/nvidia-node-status-exporter-tbtfx 1/1 Running 0 9m19s
pod/nvidia-operator-validator-7zm8r 1/1 Running 0 9m4s
pod/nvidia-operator-validator-qt9x2 1/1 Running 0 9m2s
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/gpu-feature-discovery 2 2 2 2 2 nvidia.com/gpu.deploy.gpu-feature-discovery=true 46m
daemonset.apps/nvidia-container-toolkit-daemonset 2 2 2 2 2 nvidia.com/gpu.deploy.container-toolkit=true 46m
daemonset.apps/nvidia-dcgm 2 2 2 2 2 nvidia.com/gpu.deploy.dcgm=true 46m
daemonset.apps/nvidia-dcgm-exporter 2 2 2 2 2 nvidia.com/gpu.deploy.dcgm-exporter=true 46m
daemonset.apps/nvidia-device-plugin-daemonset 2 2 2 2 2 nvidia.com/gpu.deploy.device-plugin=true 46m
daemonset.apps/nvidia-driver-daemonset-412.86.202303241612-0 2 2 2 2 2 feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=412.86.202303241612-0,nvidia.com/gpu.deploy.driver=true 46m
daemonset.apps/nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 46m
daemonset.apps/nvidia-node-status-exporter 2 2 2 2 2 nvidia.com/gpu.deploy.node-status-exporter=true 46m
daemonset.apps/nvidia-operator-validator 2 2 2 2 2 nvidia.com/gpu.deploy.operator-validator=true 46m
The GPU Operator generates GPU performance metrics (DCGM exporter), status metrics (node-status-exporter) and node-status alerts. For OpenShift Prometheus to collect these metrics, the namespace hosting the GPU Operator must have the label openshift.io/cluster-monitoring=true.
By default, monitoring is enabled on the nvidia-gpu-operator namespace, which is the recommended namespace for installing the NVIDIA objects.
Note
If you are using a non-default project name, run the following command to enable cluster monitoring:
oc label ns/$NAMESPACE openshift.io/cluster-monitoring=true
To enable the monitoring dashboard in the OpenShift UI, you first need to get the appropriate object definition. The JSON file is too large to include directly in the body of this text; however, it can be found in the artifacts/NVIDIA_Operator directory. That copy is provided for convenience and NVIDIA may choose to update the file, so grab the latest version when in doubt:
curl -LfO https://github.com/NVIDIA/dcgm-exporter/raw/main/grafana/dcgm-exporter-dashboard.json
After downloading the file, you need to create a configmap from the file:
oc create configmap nvidia-dcgm-exporter-dashboard -n openshift-config-managed --from-file=dcgm-exporter-dashboard.json
Important
The dashboard is not exposed in ANY view of the OpenShift UI by default. You can enable it in the Administrator view, the Developer view, or both.
To enable the dashboard in the Administrator view in the OpenShift Web UI run the following:
oc label configmap nvidia-dcgm-exporter-dashboard -n openshift-config-managed "console.openshift.io/dashboard=true"
To enable the dashboard for the Developer view as well, run the following command:
oc label configmap nvidia-dcgm-exporter-dashboard -n openshift-config-managed "console.openshift.io/odc-dashboard=true"
Below is a summary of the official NVIDIA documentation on how to enable the GPU Operator Dashboard. This surfaces GPU-related information in the Cluster Utilization section of the OpenShift Dashboard.
Note
Helm is required to install these components.
First, add the repo and refresh the local chart index:
helm repo add rh-ecosystem-edge https://rh-ecosystem-edge.github.io/console-plugin-nvidia-gpu
helm repo update
Next, install the helm chart in the nvidia-gpu-operator namespace:
helm install -n nvidia-gpu-operator console-plugin-nvidia-gpu rh-ecosystem-edge/console-plugin-nvidia-gpu
You should see output similar to the following:
NAME: console-plugin-nvidia-gpu
LAST DEPLOYED: Thu Mar 14 09:35:36 2024
NAMESPACE: nvidia-gpu-operator
STATUS: deployed
REVISION: 1
To verify whether the plugin has been installed, you can run the following command:
oc get consoles.operator.openshift.io cluster --output=jsonpath="{.spec.plugins}"
The output will look similar to this:
["kubevirt-plugin","monitoring-plugin","pipelines-console-plugin"]
If you don't see console-plugin-nvidia-gpu in the list, you can patch it in:
oc patch consoles.operator.openshift.io cluster --patch '[{"op": "add", "path": "/spec/plugins/-", "value": "console-plugin-nvidia-gpu" }]' --type=json
Note
The above patch command is NOT idempotent; running it more than once will append duplicate entries to the plugin list.
If the plugin now appears in the output, like this:
["kubevirt-plugin","monitoring-plugin","pipelines-console-plugin","console-plugin-nvidia-gpu"]
the plugin is enabled and no further action is needed. Alternatively, if the original output was empty (no plugins enabled at all), you can enable the plugin with the following merge patch:
oc patch consoles.operator.openshift.io cluster --patch '{ "spec": { "plugins": ["console-plugin-nvidia-gpu"] } }' --type=merge
Note
The above patch IS idempotent. If run multiple times it will return
console.operator.openshift.io/cluster patched (no change)
You can view the results by opening the OpenShift Web UI and going to Home --> Overview. You should now see GPU-related information.
Note
If the plugin does not show in the dashboard, this is usually related to the user's browser session. Try using a private browsing option. If this is successful, you can try (from a normal browsing session) logging out, clearing the cache and logging back in.
Another option is to use the driver pods to get information from the cards directly. In the nvidia-gpu-operator project you can oc exec into the driver pod:
oc exec -it $(oc get pod -owide -lopenshift.driver-toolkit=true -o name) -- nvidia-smi
This will get the driver pod name and run the nvidia-smi command, which will produce output similar to the following:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Quadro P6000 On | 00000000:05:00.0 Off | Off |
| 26% 25C P8 9W / 250W | 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
There are three options when it comes to sharing GPU resources amongst several tasks:
- Multi-Instance GPU (MIG), which is only available on A100 or A30 GPUs
- Multi-Process Service (MPS)
- Time Slicing
Note
MPS support for the k8s-device-plugin was only introduced in mid-March 2024 and, as such, may not be available in your version of OpenShift. As of this writing, no version of OpenShift supports MPS in the NVIDIA GPU Operator.
For most NVIDIA cards, time slicing is currently the only option on OpenShift (aside from MIG on the aforementioned A100 or A30 cards).
The device-plugin-config ConfigMap tells the cluster which GPU configurations are available in your cluster; as such, this file cannot be applied generically. In general, the keys in the data: section of the ConfigMap need to match the model of the GPU you wish to configure. NVIDIA provides the following example:
apiVersion: v1
kind: ConfigMap
metadata:
name: device-plugin-config
namespace: nvidia-gpu-operator
data:
A100-SXM4-40GB: |-
version: v1
sharing:
timeSlicing:
resources:
- name: nvidia.com/gpu
replicas: 8
A100-SXM4-80GB: |-
version: v1
sharing:
timeSlicing:
resources:
- name: nvidia.com/gpu
replicas: 8
Tesla-T4: |-
version: v1
sharing:
timeSlicing:
resources:
- name: nvidia.com/gpu
replicas: 8
Note
The failRequestsGreaterThanOne flag is meant to help users understand this subtlety by treating a request of 1 as an access request rather than an exclusive resource request. NVIDIA recommends setting failRequestsGreaterThanOne=true, but it is set to false by default. See the time-slicing-full-example.yaml in the artifacts/NVIDIA_Operator directory of this repo for proper placement; a minimal sketch of the placement follows.
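As a rough sketch of where the flag sits (using the Tesla-T4 key from NVIDIA's example above; the key and replica count are illustrative), the option lives under sharing.timeSlicing alongside resources:
Tesla-T4: |-
  version: v1
  sharing:
    timeSlicing:
      failRequestsGreaterThanOne: true
      resources:
        - name: nvidia.com/gpu
          replicas: 8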
For this simple example, we will use a single-card configuration:
kind: ConfigMap
apiVersion: v1
metadata:
name: time-slicing-config
namespace: nvidia-gpu-operator
data:
Tesla-V100-SXM2: |-
version: v1
sharing:
timeSlicing:
resources:
- name: nvidia.com/gpu
replicas: 4
Note
There is a difference between NVIDIA's official documentation and Red Hat's in terms of the name of this ConfigMap: NVIDIA uses device-plugin-config while Red Hat's documentation uses time-slicing-config. Whichever name you use, ensure it is referenced consistently throughout your configuration.
Important
There is no memory or fault-isolation between replicas for time-slicing! Under the hood, Compute Unified Device Architecture (CUDA) time-slicing is used to multiplex workloads from replicas of the same underlying GPU.
Once you are satisfied with the configuration for your cluster, create the ConfigMap object with the following command:
oc create -f device-plugin-config.yaml
The ClusterPolicy created earlier can be updated with an oc patch command:
oc patch clusterpolicy gpu-cluster-policy -n nvidia-gpu-operator --type merge -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default":"tesla-v100-sxm2"}}}}'
Note
The oc patch command above sets a default devicePlugin config. This will apply the configuration, in this case time slicing, to all nodes that have a tesla-v100-sxm2 installed. Working in combination with NFD, this applies time slicing to all nodes with a V100 instead of having to label nodes individually. If you wish to omit the default, simply remove "default": "tesla-v100-sxm2" from the patch command.
In order for the ClusterPolicy to apply to specific nodes, you will need to label them appropriately. You might choose to label individual nodes in the event that you have different GPUs available in the same cluster or you only want to apply time slicing to specific GPUs hosted by specific nodes.
You may want to override a node's label in order to apply a specific type of configuration. In this case, running the following command will apply a label denoting a Tesla V100 GPU (replace <NODE_NAME> with the node you wish to label):
oc label --overwrite node <NODE_NAME> nvidia.com/device-plugin.config=Tesla-V100-SXM2
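If you have several nodes with the same card, you can also label them in one pass by selecting on a label that GPU Feature Discovery has already applied (the product string below is an assumption; check the labels on your own nodes first):
oc label node --selector=nvidia.com/gpu.product=Tesla-V100-SXM2-16GB nvidia.com/device-plugin.config=Tesla-V100-SXM2 --overwrite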
After a minute or so you can verify that the nodes now have the appropriate status:
oc get node --selector=nvidia.com/gpu.product=Tesla-V100-SXM2-16GB-SHARED -o json | jq '.items[0].status.capacity'
Note
A -SHARED suffix has been applied to the nvidia.com/gpu.product node label to indicate that time slicing has been enabled. If you do not want the product label renamed, set renameByDefault: true under sharing.timeSlicing in the ConfigMap; the shared resources are then advertised as nvidia.com/gpu.shared instead.
The output may look similar to this:
{
"attachable-volumes-aws-ebs": "39",
"cpu": "8",
"ephemeral-storage": "104322028Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "62850120Ki",
"nvidia.com/gpu": "4",
"pods": "250"
}
The "nvidia.com/gpu": "4"
indicates that 4 replicas are available for CUDA allocation.
To validate that the configuration has been applied appropriately, you can look for the gfd labels:
oc get node --selector=nvidia.com/gpu.product=Tesla-V100-SXM2-16GB-SHARED -o json | jq '.items[0].metadata.labels' | grep gfd
If you are running on a virtual platform with the ability to attach GPUs, you can edit the MachineSet to ensure that the nvidia.com/device-plugin.config label is automatically applied to new machines. The following patch command will update your MachineSet:
oc patch machineset ${MACHINE_SET} \
-n openshift-machine-api --type merge \
--patch '{"spec": {"template": {"spec": {"metadata": {"labels": {"nvidia.com/device-plugin.config": "Tesla-V100-SXM2"}}}}}}'
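You can confirm the label landed in the template (newly created machines will then carry it automatically) with a quick check such as:
oc get machineset ${MACHINE_SET} -n openshift-machine-api -o jsonpath='{.spec.template.spec.metadata.labels}{"\n"}'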
In order to organize models, notebook images and other OpenShift AI artifacts you need to create a workbench. Workbenches are created in the context of distinct Data Science Projects.
If you are looking to create a Workbench and associated artifacts, there are 5 components that will need to be created:
- The Data Science Project
- The Persistent Volume Claim (cluster storage) for the Workbench
- The Data Connection
- The appropriate Role Binding
- The Workbench object itself (also known as a Notebook)
The required YAML is found below. It will have to be adjusted to suit your environment.
The following YAML is fairly straightforward. Ensure that you replace the placeholders with appropriate values:
apiVersion: v1
kind: Namespace
metadata:
annotations:
openshift.io/description: "data-science-project for <PROJECT>"
openshift.io/display-name: "<PROJECT>"
labels:
kubernetes.io/metadata.name: <PROJECT>
modelmesh-enabled: "true"
opendatahub.io/dashboard: "true"
name: <PROJECT>
After creating this file, simply run oc apply -f workbench_namespace.yaml.
Important
The following assumes that you have dynamic storage that will create the appropriate persistent volume once a claim is registered. If not, you will need to create a corresponding Persistent Volume to provide storage for the claim.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
annotations:
openshift.io/description: "Storage for <WORKBENCH_NAME>"
openshift.io/display-name: <WORKBENCH_NAME>
name: <WORKBENCH_NAME>
namespace: <PROJECT>
finalizers:
- kubernetes.io/pvc-protection
labels:
opendatahub.io/dashboard: 'true'
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gi
volumeMode: Filesystem
Again, use oc apply -f workbench_pvc.yaml to create the storage for the workbench.
Assuming you are using S3 compatible storage, the following secret will enable your workbench to use Object Storage:
kind: Secret
apiVersion: v1
metadata:
name: <DATACONNECTION_NAME>
namespace: <PROJECT>
labels:
opendatahub.io/dashboard: 'true'
opendatahub.io/managed: 'true'
annotations:
opendatahub.io/connection-type: s3
openshift.io/display-name: <DATACONNECTION_NAME>
stringData:
AWS_ACCESS_KEY_ID: <ACCESS_KEY>
AWS_DEFAULT_REGION: <REGION>
AWS_S3_BUCKET: <BUCKET_NAME>
AWS_S3_ENDPOINT: <HTTP_ENDPOINT>:<PORT>
AWS_SECRET_ACCESS_KEY: <SECRET_KEY>
type: Opaque
Important
The endpoint should be an http or https endpoint and include a port if appropriate. For example, AWS_S3_ENDPOINT: http://192.168.11.11:9000.
Again, use oc apply -f workbench_dataconnection.yaml to create the Object Storage connection for the workbench.
The following role binding gives the admin role to the specified user or group. When creating the workbench via the OpenShift AI user interface, the default is to create this role binding and apply it to the user who creates the workbench. It may be desirable to have the following YAML apply to a group instead of an individual:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
labels:
opendatahub.io/dashboard: "true"
opendatahub.io/project-sharing: "true"
name: rhods-rb-<WORKBENCH_NAME>
namespace: <PROJECT>
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: admin
subjects:
- apiGroup: rbac.authorization.k8s.io
kind: <USER/GROUP>
name: <GROUP/USERNAME>
Note
The above kind can be either User or Group. Choose wisely, according to your company's security policy.
This YAML is large. It sets things like the callback URLs, the OAuth configuration, notebook arguments, image type, secret references and so on. In an OpenShift cluster, running oc get workbench -A will not yield any results; the object that makes up the majority of the Workbench configuration is actually the Notebook object. Thus, running oc get notebook will return a list of the Workbenches available in the Data Science Project.
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
annotations:
notebooks.opendatahub.io/inject-oauth: "true"
notebooks.opendatahub.io/last-image-selection: s2i-generic-data-science-notebook:2023.2
notebooks.opendatahub.io/last-size-selection: Small
notebooks.opendatahub.io/oauth-logout-url: https://rhods-dashboard-redhat-ods-applications.<CLUSTER_URL>/projects/<PROJECT>?notebookLogout=<WORKBENCH_NAME>
opendatahub.io/accelerator-name: ""
opendatahub.io/image-display-name: Standard Data Science
openshift.io/description: <WORKBENCH_NAME>
openshift.io/display-name: <WORKBENCH_NAME>
backstage.io/kubernetes-id: <WORKBENCH_NAME>
generation: 1
labels:
app: <WORKBENCH_NAME>
opendatahub.io/dashboard: "true"
opendatahub.io/odh-managed: "true"
name: <WORKBENCH_NAME>
namespace: <PROJECT>
spec:
template:
spec:
affinity: {}
containers:
- env:
- name: NOTEBOOK_ARGS
value: |-
--ServerApp.port=8888
--ServerApp.token=''
--ServerApp.password=''
--ServerApp.base_url=/notebook/<PROJECT>/<WORKBENCH_NAME>
--ServerApp.quit_button=False
--ServerApp.tornado_settings={"user":"stratus","hub_host":"https://rhods-dashboard-redhat-ods-applications.<CLUSTER_URL>","hub_prefix":"/projects/<PROJECT>"}
- name: JUPYTER_IMAGE
value: <full url to image in OpenShift Registry>
envFrom:
- secretRef:
name: <data connection secret name>
image: <full url to image in OpenShift Registry>
imagePullPolicy: Always
livenessProbe:
failureThreshold: 3
httpGet:
path: /notebook/<PROJECT>/<WORKBENCH_NAME>/api
port: notebook-port
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
name: <WORKBENCH_NAME>
ports:
- containerPort: 8888
name: notebook-port
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /notebook/<PROJECT>/<WORKBENCH_NAME>/api
port: notebook-port
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: "2"
memory: 8Gi
requests:
cpu: "1"
memory: 8Gi
volumeMounts:
- mountPath: /opt/app-root/src
name: <WORKBENCH_NAME>
- mountPath: /dev/shm
name: shm
workingDir: /opt/app-root/src
- args:
- --provider=openshift
- --https-address=:8443
- --http-address=
- --openshift-service-account=<WORKBENCH_NAME>
- --cookie-secret-file=/etc/oauth/config/cookie_secret
- --cookie-expire=24h0m0s
- --tls-cert=/etc/tls/private/tls.crt
- --tls-key=/etc/tls/private/tls.key
- --upstream=http://localhost:8888
- --upstream-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
- --email-domain=*
- --skip-provider-button
- --openshift-sar={"verb":"get","resource":"notebooks","resourceAPIGroup":"kubeflow.org","resourceName":"<WORKBENCH_NAME>","namespace":"$(NAMESPACE)"}
- --logout-url=https://rhods-dashboard-redhat-ods-applications.<CLUSTER_URL>/projects/<PROJECT>?notebookLogout=<WORKBENCH_NAME>
env:
- name: NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
image: registry.redhat.io/openshift4/ose-oauth-proxy@sha256:4bef31eb993feb6f1096b51b4876c65a6fb1f4401fee97fa4f4542b6b7c9bc46
imagePullPolicy: Always
livenessProbe:
failureThreshold: 3
httpGet:
path: /oauth/healthz
port: oauth-proxy
scheme: HTTPS
initialDelaySeconds: 30
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
name: oauth-proxy
ports:
- containerPort: 8443
name: oauth-proxy
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /oauth/healthz
port: oauth-proxy
scheme: HTTPS
initialDelaySeconds: 5
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: 100m
memory: 64Mi
requests:
cpu: 100m
memory: 64Mi
volumeMounts:
- mountPath: /etc/oauth/config
name: oauth-config
- mountPath: /etc/tls/private
name: tls-certificates
enableServiceLinks: false
serviceAccountName: <WORKBENCH_NAME>
volumes:
- name: <WORKBENCH_NAME>
persistentVolumeClaim:
claimName: <WORKBENCH_NAME>
- emptyDir:
medium: Memory
name: shm
- name: oauth-config
secret:
defaultMode: 420
secretName: <WORKBENCH_NAME>-oauth-config
- name: tls-certificates
secret:
defaultMode: 420
secretName: <WORKBENCH_NAME>-tls
If there is a problem with the workbench starting or being created, you can always run oc describe notebook <notebook name> and review the status section for clues as to where the problem may lie.
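If the Notebook object looks healthy but the pod itself is misbehaving, checking the pod and its logs is the next step. The commands below assume the workbench pod carries the app: <WORKBENCH_NAME> label from the YAML above and that the notebook controller runs it as <WORKBENCH_NAME>-0; adjust the names if your pod differs:
oc get pods -n <PROJECT> -l app=<WORKBENCH_NAME>
oc logs -n <PROJECT> <WORKBENCH_NAME>-0 -c <WORKBENCH_NAME>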
OpenShift AI relies on most of the underlying mechanisms in OpenShift for controlling access to OpenShift AI. In vanilla OpenShift, you create projects which then have various security methods applied to them in order to ensure that administrators have fine-grained control over who can access which objects and with what permissions. A Data Science Project is just an OpenShift Project that has a specific label which enables it to be used with the OpenShift AI Dashboard.
Specifically, the label opendatahub.io/dashboard=true allows the project to be interacted with from the OpenShift AI Dashboard.
The below sample command will label some-proj for use with the Dashboard.
oc label namespace some-proj opendatahub.io/dashboard=true
Optionally, you can also add the label modelmesh-enabled='true' where applicable to further enhance a project.
Because Data Science Projects are simply OpenShift Projects with an extra label or two, the same rules apply. Namely, by default the self-provisioner role is assigned to the system:authenticated group. This means that in the default configuration, anyone who can log in to OpenShift can create a Data Science Project. In addition, the user that creates the project will automatically become the project administrator.
The process of granting access to a Data Science Project is the same as for a regular OpenShift project. In general there are view, edit and admin roles. You would simply issue the following command:
oc adm policy add-role-to-group <role> <group name> -n <data science project name>
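For example, to grant a hypothetical data-science-team group edit access to a project named my-data-science-project:
oc adm policy add-role-to-group edit data-science-team -n my-data-science-project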
In order to prevent users from creating their own Data Science Projects, you will need to patch the clusterrolebinding:
oc patch clusterrolebinding.rbac self-provisioners -p '{"subjects": null}'
Warning
This will prevent all users from creating any projects in the OpenShift cluster where OpenShift AI is running.
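If you want to confirm the patch took effect, the binding should now report no subjects; a check along these lines should work:
oc get clusterrolebinding self-provisioners -o jsonpath='{.subjects}{"\n"}'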
Access to the Dashboard can be controlled by editing the odh-dashboard-config object, which resides in the redhat-ods-applications project.
You will see the following:
apiVersion: opendatahub.io/v1alpha
kind: OdhDashboardConfig
metadata:
[...]
name: odh-dashboard-config
namespace: redhat-ods-applications
spec:
dashboardConfig:
[...]
groupsConfig:
adminGroups: rhods-admins
allowedGroups: system:authenticated
Editing the groupsConfig section will cause the operator to reconcile and update the Dashboard pods. It is also possible to create a YAML file that changes just the groupsConfig with the following contents:
apiVersion: opendatahub.io/v1alpha
kind: OdhDashboardConfig
metadata:
name: odh-dashboard-config
namespace: redhat-ods-applications
spec:
groupsConfig:
allowedGroups: "system:authenticated"
You can then apply the changes by using oc apply -f <myfile>.yaml.
Important
If you want to specify a list of groups, they need to be comma separated:
spec:
groupsConfig:
allowedGroups: "system:authenticated,rhods-users"
Alternatively you can use a command similar to the following to patch in your changes:
oc patch odhdashboardconfig/odh-dashboard-config -p '{"spec":{"groupsConfig": {"allowedGroups":"rhods-users"}}}' --type merge
You can follow the same process for editing the adminGroups instead of the allowedGroups. That group specifies which users will have admin access to the OpenShift AI Dashboard.
It is often desirable to adjust the default amount of storage provided to a Jupyter Notebook in order to accommodate the standard workloads your organization may wish to deploy. By default, OpenShift AI will allocate 20Gi of disk space to every Jupyter Notebook that is created.
In the redhat-ods-applications project, the odhdashboardconfig object contains some of the default options for the notebookController. Specifically:
notebookController:
enabled: true
notebookNamespace: rhods-notebooks
notebookTolerationSettings:
enabled: false
key: NotebooksOnly
pvcSize: 5Gi
You can edit this with an oc patch command similar to the following:
oc patch odhdashboardconfig/odh-dashboard-config -p '{"spec":{"notebookController": {"pvcSize":"6Gi"}}}' --type merge
Alternatively you can create a file with contents similar to this:
apiVersion: opendatahub.io/v1alpha
kind: OdhDashboardConfig
metadata:
name: odh-dashboard-config
namespace: redhat-ods-applications
spec:
notebookController:
pvcSize: "5Gi"
and then use oc apply -f <myfile>.yaml to apply the changes.
The Notebook sizes that appear as options in the OpenShift AI dashboard are controlled by the OdhDashboardConfig, which resides in the redhat-ods-applications namespace.
Under the spec section you will find notebookSizes, also represented by "{'spec': {'notebookSizes'}}". This object is a list of attributes. Red Hat recommends at least the following attributes (a filled-in example follows the template below):
- name: <name>
resources:
limits:
cpu: <# cpus>
memory: <ram in Mi or Gi>
requests:
cpu: <# cpus>
memory: <ram in Mi or Gi>
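As a filled-in illustration (the size names and values here are examples only; choose limits that match your hardware), the list might look like:
spec:
  notebookSizes:
    - name: Small
      resources:
        limits:
          cpu: "2"
          memory: 8Gi
        requests:
          cpu: "1"
          memory: 8Gi
    - name: Large
      resources:
        limits:
          cpu: "8"
          memory: 32Gi
        requests:
          cpu: "4"
          memory: 16Gi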
Once this object has been edited, the operator will eventually reconcile the rhods-dashboard pods. A refresh of the OpenShift AI web page may be required.
Note
If the operator is taking longer than you would like you can delete the appropriate pods with the following command:
oc delete pods -l app=rhods-dashboard -n redhat-ods-applications
You can reduce resource usage in your OpenShift AI deployment by stopping notebook servers that have had no user logged in for some time. By default, idle notebooks are not stopped automatically.
In the redhat-ods-applications project there is a ConfigMap called notebook-controller-config that tracks the preferences with regard to notebooks. You will see the appropriate headings already in the data section of the ConfigMap:
apiVersion: v1
kind: ConfigMap
data:
ADD_FSGROUP: "false"
CLUSTER_DOMAIN: cluster.local
CULL_IDLE_TIME: "1440"
ENABLE_CULLING: "false"
IDLENESS_CHECK_PERIOD: "1"
ISTIO_GATEWAY: kubeflow/kubeflow-gateway
USE_ISTIO: "false"
The CULL_IDLE_TIME is expressed in minutes, as is the IDLENESS_CHECK_PERIOD (which controls the polling frequency).
It is possible to update the ConfigMap via a command similar to the following:
oc patch configmap notebook-controller-config -p '{"data":{"ENABLE_CULLING":"true"}}' --type merge
You can also use a patch file similar to this:
apiVersion: v1
data:
CULL_IDLE_TIME: "144"
ENABLE_CULLING: "false"
IDLENESS_CHECK_PERIOD: "1"
kind: ConfigMap
metadata:
name: notebook-controller-config
namespace: redhat-ods-applications
And then use oc apply to merge the changes.
Before you can successfully create a pipeline in OpenShift AI, you must configure a pipeline server. This process is very straightforward. You need to have:
- A Workbench
- Cluster storage for the workbench
- A Data Connection
In order to create a Pipeline Server via the CLI, a new object called DataSciencePipelinesApplication needs to be created in the Data Science Project. The following file requires some adjusting for your specific environment:
apiVersion: datasciencepipelinesapplications.opendatahub.io/v1alpha1
kind: DataSciencePipelinesApplication
metadata:
name: pipelines-definition
spec:
apiServer:
applyTektonCustomResource: true
archiveLogs: false
autoUpdatePipelineDefaultVersion: true
collectMetrics: true
dbConfigConMaxLifetimeSec: 120
deploy: true
enableOauth: true
enableSamplePipeline: false
injectDefaultScript: true
stripEOF: true
terminateStatus: Cancelled
trackArtifacts: true
database:
disableHealthCheck: false
mariaDB:
deploy: true
pipelineDBName: mlpipeline
pvcSize: <CHANGE_ME>
username: mlpipeline
mlmd:
deploy: false
objectStorage:
disableHealthCheck: false
externalStorage:
bucket: <CHANGE_ME>
host: <CHANGE_ME>
port: ""
s3CredentialsSecret:
accessKey: AWS_ACCESS_KEY_ID
secretKey: AWS_SECRET_ACCESS_KEY
secretName: <CHANGE_ME>
scheme: http
persistenceAgent:
deploy: true
numWorkers: 2
scheduledWorkflow:
cronScheduleTimezone: UTC
deploy: true
In the above file, the pvcSize defaults to 10Gi, though this can be changed by altering that value. In addition, the Pipeline Server expects to have access to S3-compatible Object Storage. In the above example, the secretName is based on the Data Connection created earlier; OpenShift AI prefixes the name of the secret with the type of connection. In this case, the name of the storage is dev-storage, so the secret is called aws-connection-dev-storage. Thus you would put secretName: aws-connection-dev-storage in the above YAML.
Once the above object is created in the project, you can check the status by running oc describe dspa. After a few minutes, the pods will have finished deploying and the cluster is ready to import Pipelines.
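A couple of optional checks (the condition contents are set by the pipelines operator and may vary between versions; the pod names simply contain the DataSciencePipelinesApplication name):
oc get dspa pipelines-definition -o jsonpath='{.status.conditions}{"\n"}'
oc get pods | grep pipelines-definition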