NIM KServe Playground

This repository hosts example projects used for exploring KServe and Nvidia NIM with the goal of integrating Nvidia NIM into Red Hat OpenShift AI.

  • The pocs folder hosts the various POC scenarios designed with Kustomize.
  • The builds folder hosts built manifests from the above-mentioned pocs for accessibility.

All POC executions require Red Hat OpenShift AI.

POCs

Deployment Types

KServe supports three deployment types; we explored two of them: Serverless and Raw.

Serverless Deployment

Serverless Deployment is the default deployment type for KServe; it leverages Knative.

  • Model Used: kserve-sklearnserver
  • POC Instructions: Click here
  • Built Manifests: Click here

Key Takeaways

  • The storageUri field of the InferenceService triggers KServe's storage initializer container, which downloads the model before the runtime starts (see the sketch below).
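
A minimal sketch of such an InferenceService (the name and storageUri below are illustrative, not taken from the POC manifests):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-example                  # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      # storageUri triggers the storage initializer, which downloads the
      # model into the container before the runtime starts
      storageUri: s3://example-bucket/models/sklearn/model   # illustrative location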

Raw Deployment

With Raw Deployment, KServe leverages core Kubernetes resources.

  • Model Used: kserve-sklearnserver
  • POC Instructions: Click here
  • Built Manifests: Click here

Key Takeaways

  • The storageUri field of the InferenceService triggers KServe's storage initializer container, which downloads the model before the runtime starts.
  • Annotating the InferenceService with serving.kserve.io/deploymentMode: RawDeployment triggers a Raw Deployment (see the sketch below).
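
A minimal sketch showing the annotation (names and locations are illustrative):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-raw-example              # illustrative name
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://example-bucket/models/sklearn/model   # illustrative location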

Persistence and Caching

Prerequisites!

Before proceeding, grab your NGC API Key and create the following two secret data files (git-ignored):

The files are saved in the no-cache POC folder but are used by all scenarios in this context.

# the following will be used in an opaque secret mounted into the runtime
echo "NGC_API_KEY=ngcapikeygoeshere" > pocs/persistence-and-caching/no-cache/ngc.env
# the following will be used as the pull image secret for the underlying runtime deployment
echo "{
  \"auths\": {
    \"nvcr.io\": {
      \"username\": \"\$oauthtoken\",
      \"password\": \"ngcapikeygoeshere\"
    }
  }
}" > pocs/persistence-and-caching/no-cache/ngcdockerconfig.json

No Caching or Persistence

In this scenario, Nvidia NIM is in charge of downloading the required models; however, the target volume is not persistent, so the download repeats for every Pod created, which is reflected in scaling time.

  • Model Used: nvidia-nim-llama3-8b-instruct
  • POC Instructions: Click here
  • Built Manifests: Click here

Key Takeaways

  • The storageUri specification from the InferenceService is NOT required.
  • The NIM_CACHE_PATH environment variable is set to /mnt/models (an emptyDir volume; see the sketch below).
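
Conceptually, the relevant ServingRuntime fragment looks roughly like this (image and names are illustrative; see the built manifests for the exact values):

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: nvidia-nim-llama3-8b-instruct
spec:
  containers:
    - name: kserve-container
      image: nvcr.io/nim/meta/llama3-8b-instruct:latest    # illustrative tag
      env:
        - name: NIM_CACHE_PATH
          value: /mnt/models
      volumeMounts:
        - name: nim-cache
          mountPath: /mnt/models
  volumes:
    - name: nim-cache
      emptyDir: {}                                         # cache is lost when the Pod goes away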

Knative PVC Feature

In this scenario, Nvidia NIM is in charge of downloading the required models; the download target is a PVC. Using a writable PVC with the Serverless deployment requires enabling the following Knative feature flags:

kubernetes.podspec-persistent-volume-claim: "enabled"
kubernetes.podspec-persistent-volume-write: "enabled"
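
Upstream Knative reads these flags from the config-features ConfigMap in the knative-serving namespace (on OpenShift Serverless they are typically managed through the KnativeServing custom resource instead); a minimal sketch:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-features
  namespace: knative-serving
data:
  kubernetes.podspec-persistent-volume-claim: "enabled"
  kubernetes.podspec-persistent-volume-write: "enabled"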

  • Model Used: nvidia-nim-llama3-8b-instruct
  • POC Instructions: Click here
  • Built Manifests: Click here

Key Takeaways

  • The storageUri specification from the InferenceService is NOT required.
  • We added a PVC using OpenShift's default gp3-csi storage class.
  • We added a Volume to the ServingRuntime backed by that PVC.
  • We added a VolumeMount to the ServingRuntime mounting that Volume at /mnt/nim/models.
  • The NIM_CACHE_PATH environment variable is set to /mnt/nim/models (see the sketch below).
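
A minimal sketch of the PVC and the relevant ServingRuntime fragment (names and the requested storage size are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nim-model-cache                  # illustrative name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3-csi
  resources:
    requests:
      storage: 50Gi                      # illustrative size
---
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: nvidia-nim-llama3-8b-instruct
spec:
  containers:
    - name: kserve-container
      env:
        - name: NIM_CACHE_PATH
          value: /mnt/nim/models
      volumeMounts:
        - name: nim-model-cache
          mountPath: /mnt/nim/models
  volumes:
    - name: nim-model-cache
      persistentVolumeClaim:
        claimName: nim-model-cache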

KServe Raw NIM Deployment

In this scenario, Nvidia NIM is in charge of downloading the required models; the download target is a PVC. Writable PVCs are supported when using KServe's Raw Deployment.

  • Model Used: nvidia-nim-llama3-8b-instruct
  • POC Instructions: Click here
  • Built Manifests: Click here

Key Takeaways

  • The storageUri specification from the InferenceService is NOT required.
  • We added a PVC using OpenShift's default gp3-csi storage class.
  • We added a Volume to the ServingRuntime backed by that PVC.
  • We added a VolumeMount to the ServingRuntime mounting that Volume at /mnt/nim/models.
  • The NIM_CACHE_PATH environment variable is set to /mnt/nim/models.
  • Annotating the InferenceService with serving.kserve.io/deploymentMode: RawDeployment triggers a Raw Deployment.
  • We added maxReplicas for the Predictor, which is required for HPA-based autoscaling (see the sketch below).
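
A minimal sketch of the InferenceService Predictor for this scenario (names and replica counts are illustrative):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: nim-llama3-8b-instruct           # illustrative name
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 2                       # required for HPA-based scaling
    model:
      runtime: nvidia-nim-llama3-8b-instruct
      modelFormat:
        name: nvidia-nim-llama3-8b-instruct   # illustrative model format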