NIM KServe Playground

This repository hosts example projects used for exploring KServe and Nvidia NIM with the goal of integrating Nvidia NIM into Red Hat OpenShift AI.

  • The pocs folder hosts the various POC scenarios designed with Kustomize.
  • The builds folder hosts built manifests from the above-mentioned pocs for accessibility.

All POC executions require Red Hat OpenShift AI.

POCs

Deployment Types

KServe supports three deployment types; we explored two of them: Serverless and Raw.

Serverless Deployment

Serverless Deployment is the default deployment type for KServe; it leverages Knative.

  • Model Used: kserve-sklearnserver
  • POC Instructions: Click here
  • Built Manifests: Click here

Key Takeaways

  • The storageUri field of the InferenceService triggers KServe's storage initializer container, which downloads the model before the runtime starts (see the sketch below).
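
A minimal sketch of such an InferenceService (the name and storageUri below are illustrative, not taken from the POC manifests):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-example                  # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      # storageUri triggers the storage initializer, which downloads the
      # model into the container before the runtime starts
      storageUri: s3://example-bucket/models/sklearn/model   # illustrative location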

Raw Deployment

With Raw Deployment, KServe leverages core Kubernetes resources.

  • Model Used: kserve-sklearnserver
  • POC Instructions: Click here
  • Built Manifests: Click here

Key Takeaways

  • The storageUri field of the InferenceService triggers KServe's storage initializer container, which downloads the model before the runtime starts.
  • Annotating the InferenceService with serving.kserve.io/deploymentMode: RawDeployment triggers a Raw Deployment (see the sketch below).
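
A minimal sketch showing the annotation (names and locations are illustrative):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-raw-example              # illustrative name
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://example-bucket/models/sklearn/model   # illustrative location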

Persistence and Caching

Prerequisites!

Before proceeding, grab your NGC API Key and create the following two secret data files (git-ignored):

The files are saved in the no-cache POC folder but are used by all scenarios in this context.

# the following will be used in an opaque secret mounted into the runtime
echo "NGC_API_KEY=ngcapikeygoeshere" > pocs/persistence-and-caching/no-cache/ngc.env
# the following will be used as the pull image secret for the underlying runtime deployment
echo "{
  \"auths\": {
    \"nvcr.io\": {
      \"username\": \"\$oauthtoken\",
      \"password\": \"ngcapikeygoeshere\"
    }
  }
}" > pocs/persistence-and-caching/no-cache/ngcdockerconfig.json

No Caching or Persistence

In this scenario, Nvidia NIM is in charge of downloading the required models; however, the target volume is not persistent, so the download repeats for every Pod created, which is reflected in scaling time.

  • Model Used: nvidia-nim-llama3-8b-instruct
  • POC Instructions: Click here
  • Built Manifests: Click here

Key Takeaways

  • The storageUri specification from the InferenceService is NOT required.
  • The NIM_CACHE_PATH environment variable is set to /mnt/models (an emptyDir volume; see the sketch below).
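
Conceptually, the relevant ServingRuntime fragment looks roughly like this (image and names are illustrative; see the built manifests for the exact values):

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: nvidia-nim-llama3-8b-instruct
spec:
  containers:
    - name: kserve-container
      image: nvcr.io/nim/meta/llama3-8b-instruct:latest    # illustrative tag
      env:
        - name: NIM_CACHE_PATH
          value: /mnt/models
      volumeMounts:
        - name: nim-cache
          mountPath: /mnt/models
  volumes:
    - name: nim-cache
      emptyDir: {}                                         # cache is lost when the Pod goes away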

Knative PVC Feature

In this scenario, Nvidia NIM is in charge of downloading the required models; the download target is a PVC. Using a writable PVC with the Serverless deployment requires enabling the following Knative feature flags:

kubernetes.podspec-persistent-volume-claim: "enabled"
kubernetes.podspec-persistent-volume-write: "enabled"
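
Upstream Knative reads these flags from the config-features ConfigMap in the knative-serving namespace (on OpenShift Serverless they are typically managed through the KnativeServing custom resource instead); a minimal sketch:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-features
  namespace: knative-serving
data:
  kubernetes.podspec-persistent-volume-claim: "enabled"
  kubernetes.podspec-persistent-volume-write: "enabled"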

  • Model Used: nvidia-nim-llama3-8b-instruct
  • POC Instructions: Click here
  • Built Manifests: Click here

Key Takeaways

  • The storageUri specification from the InferenceService is NOT required.
  • We added a PVC using OpenShift's default gp3-csi storage class.
  • We added a Volume to the ServingRuntime backed by that PVC.
  • We added a VolumeMount to the ServingRuntime mounting that Volume at /mnt/nim/models.
  • The NIM_CACHE_PATH environment variable is set to /mnt/nim/models (see the sketch below).
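
A minimal sketch of the PVC and the relevant ServingRuntime fragment (names and the requested storage size are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nim-model-cache                  # illustrative name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3-csi
  resources:
    requests:
      storage: 50Gi                      # illustrative size
---
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: nvidia-nim-llama3-8b-instruct
spec:
  containers:
    - name: kserve-container
      env:
        - name: NIM_CACHE_PATH
          value: /mnt/nim/models
      volumeMounts:
        - name: nim-model-cache
          mountPath: /mnt/nim/models
  volumes:
    - name: nim-model-cache
      persistentVolumeClaim:
        claimName: nim-model-cache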

KServe Raw NIM Deployment

In this scenario, Nvidia NIM is in charge of downloading the required models; the download target is a PVC. Writable PVCs are supported when using KServe's Raw Deployment.

  • Model Used: nvidia-nim-llama3-8b-instruct
  • POC Instructions: Click here
  • Built Manifests: Click here

Key Takeaways

  • The storageUri specification from the InferenceService is NOT required.
  • We added a PVC using OpenShift's default gp3-csi storage class.
  • We added a Volume to the ServingRuntime backed by that PVC.
  • We added a VolumeMount to the ServingRuntime mounting that Volume at /mnt/nim/models.
  • The NIM_CACHE_PATH environment variable is set to /mnt/nim/models.
  • Annotating the InferenceService with serving.kserve.io/deploymentMode: RawDeployment triggers a Raw Deployment.
  • We added maxReplicas for the Predictor, which is required for HPA-based autoscaling (see the sketch below).
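
A minimal sketch of the InferenceService Predictor for this scenario (names and replica counts are illustrative):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: nim-llama3-8b-instruct           # illustrative name
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 2                       # required for HPA-based scaling
    model:
      runtime: nvidia-nim-llama3-8b-instruct
      modelFormat:
        name: nvidia-nim-llama3-8b-instruct   # illustrative model format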