updated gcs method as default and docker method as alternate
kaushikmitr committed Mar 7, 2024
1 parent 9108fba commit b9565ec
Showing 6 changed files with 81 additions and 77 deletions.
111 changes: 49 additions & 62 deletions benchmarks/inference-server/triton/README.md
@@ -1,25 +1,30 @@
# AI on GKE: Benchmark TensorRT LLM on Triton Server

This guide provides instructions for deploying and benchmarking a TensorRT Large Language Model (LLM) on Triton Inference Server within a Google Kubernetes Engine (GKE) environment. The process involves building a Docker container with the TensorRT LLM engine and deploying it to a GKE cluster.
This guide outlines the steps for deploying and benchmarking a TensorRT Large Language Model (LLM) on the Triton Inference Server within Google Kubernetes Engine (GKE). It includes the process of building a Docker container equipped with the TensorRT LLM engine and deploying this container to a GKE cluster.

## Prerequisites

- Docker
- Google Cloud SDK
- Kubernetes CLI (kubectl)
- Hugging Face account for model access
- NVIDIA GPU drivers and CUDA toolkit (to build the TensorRTLLM Engine)

Ensure you have the following prerequisites installed and set up:
- Docker for containerization
- Google Cloud SDK for interacting with Google Cloud services
- Kubernetes CLI (kubectl) for managing Kubernetes clusters
- A Hugging Face account to access models
- NVIDIA GPU drivers and CUDA toolkit for building the TensorRT LLM Engine
## Step 1: Build the TensorRT LLM Engine and Docker Image

1. **Build the TensorRT LLM Engine:** Follow the instructions provided in the [TensorRT LLM backend repository](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md) to build the TensorRT LLM Engine.

2. **Setup the Docker container:**

***Method 1: Add the Model repository and the relevant scripts to the image***
2. **Upload the Model to Google Cloud Storage (GCS)**

Inside the `tritonllm_backend` directory, create a Dockerfile with the following content ensuring the Triton Server is ready with the necessary models and scripts.
Transfer your model engine to GCS using:
```
gsutil cp -r your_model_folder gs://your_model_repo/all_models/
```
*Be sure to replace `your_model_repo` with your actual GCS repository path.*
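
To confirm the upload landed where the server expects it, you can list the bucket contents; the paths below mirror the placeholders above, so substitute your own:
```bash
# Recursively list the uploaded model repository to verify its layout.
gsutil ls -r gs://your_model_repo/all_models/

# Optionally report the total size of the uploaded engine files.
gsutil du -sh gs://your_model_repo/all_models/
```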

**Alternate method: Add the Model repository and the relevant scripts to the image**

Construct a new image from Nvidia's base image and integrate the model repository and necessary scripts directly into it, bypassing the need for GCS during runtime. In the `tritonllm_backend` directory, create a Dockerfile with the following content:
```Dockerfile
FROM nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3

@@ -35,7 +40,7 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large
CMD ["/opt/scripts/start_triton_servers.sh"]
```

The Shell script `/opt/scripts/start_triton_servers.sh` is like below:
For the initialization script located at `/opt/scripts/start_triton_servers.sh`, follow the structure below:

```start_triton_servers.sh
#!/bin/bash
@@ -48,33 +53,25 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large


# Launch the servers (modify this depending on the number of GPU used and the exact path to the model repo)
mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver \
/opt/tritonserver/bin/tritonserver \
--model-repository=/all_models/inflight_batcher_llm \
--disable-auto-complete-config \
--backend-config=python,shm-region-prefix-name=prefix0_ : \
-n 1 /opt/tritonserver/bin/tritonserver \
--backend-config=python,shm-region-prefix-name=prefix0_\
/opt/tritonserver/bin/tritonserver \
--model-repository=/all_models/inflight_batcher_llm \
--disable-auto-complete-config \
--backend-config=python,shm-region-prefix-name=prefix1_ :
--backend-config=python,shm-region-prefix-name=prefix1_
```
*Build and Push the Docker Image:*
*Build and Push the Docker Image:*

Build the Docker image and push it to your container registry:
Build the Docker image and push it to your container registry:

```
docker build -t your_registry/tritonserver_llm:latest .
docker push your_registry/tritonserver_llm:latest

```
Replace `your_registry` with your actual Docker registry path.
***Method 2: Upload Model repository to gcs***
In this method we can directly upload the model engine to gcs and use the base image provided by Nvidia and specify the command to launch the triton server via the deployment yaml file:
```
gsutil cp -r your_model_folder gs://your_model_repo/all_models/
```
Replace `your_model_repo` with your actual gcs repo path.
Substitute `your_registry` with your Docker registry's path.
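
As a concrete sketch, pushing to Google Artifact Registry might look like this; the region, project, and repository names are placeholders:
```bash
# Allow Docker to authenticate against Artifact Registry in the chosen region (placeholder region).
gcloud auth configure-docker us-central1-docker.pkg.dev

# Build and push with a fully qualified Artifact Registry image path (placeholders).
docker build -t us-central1-docker.pkg.dev/my-project/my-repo/tritonserver_llm:latest .
docker push us-central1-docker.pkg.dev/my-project/my-repo/tritonserver_llm:latest
```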
@@ -83,57 +80,42 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large
1. **Initialize Terraform Variables:**
Create a `terraform.tfvars` file by copying the provided example:
Start by creating a `terraform.tfvars` file from the provided template:
```bash
cp sample-terraform.tfvars terraform.tfvars
```

2. Define `template_path`, `image_path` and `gpu_count` variables in `terraform.tfvars`.

* If using method 1 In Step 1 above:
```bash
template_path = "path_to_manifest_template/triton-tensorrtllm-inference.tftpl"
image_path = "path_to_your_registry/tritonserver_llm:latest"
gpu_count = X
```
* If using method 2 In Step 1 above:
2. **Specify Essential Variables**

```bash
template_path = "path_to_manifest_template/triton-tensorrtllm-inference-gs.tftpl"
image_path = "nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3"
gpu_count = X
At a minimum, configure the `gcs_model_path` in `terraform.tfvars` to point to your Google Cloud Storage (GCS) repository. You may also need to adjust `image_path`, `gpu_count`, and `server_launch_command_string` according to your specific requirements.
```bash
gcs_model_path = "gs://path_to_model_repo/all_models"
```
and also update the `triton-tensorrtllm-inference-gs.tftpl` with the path to the gcs repo under InitContainer and the command to launch the container under Container section.

3. **Configure Your Deployment:**

Edit the `terraform.tfvars` file to include your specific configuration details with the variable `credentials_config`.
Example
If `gcs_model_path` is not defined, the direct image-integration method (the alternate method from Step 1) is assumed. In that case, update `image_path` and any other relevant variables accordingly:

```bash
credentials_config = {
kubeconfig = "path/to/your/gcloud/credentials.json"
}
```bash
image_path = "path_to_your_registry/tritonserver_llm:latest"
```
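
For reference, a minimal `terraform.tfvars` for the default GCS method might look like the sketch below; the values mirror the sample tfvars and module defaults, so adjust them to your environment, and note that `credentials_config` is covered in the next step:
```bash
gcs_model_path = "gs://your_model_repo/all_models"
image_path     = "nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3"
gpu_count      = 1
ksa            = "benchmark-ksa"
model_id       = "meta-llama/Llama-2-7b-chat-hf"
```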

#### [optional] set-up credentials config with kubeconfig
3. **Configure Your Deployment:**

Edit `terraform.tfvars` to tailor your deployment's configuration, particularly the `credentials_config`. Depending on your cluster's setup, this might involve specifying fleet host credentials or isolating your cluster's kubeconfig:

If you created your cluster with steps from `../../infra/` or with fleet management enabled, the existing `credentials_config` must use the fleet host credentials like this:
***For Fleet Management Enabled Clusters:***
```bash
credentials_config = {
fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/$CLUSTER_NAME"
}
```


If you created your own cluster without fleet management enabled, you can use your cluster's kubeconfig in the `credentials_config`. You must isolate your cluster's kubeconfig from other clusters in the default kube.config file. To do this, run the following command:

```
***For Clusters without Fleet Management:***
Separate your cluster's kubeconfig:
```bash
KUBECONFIG=~/.kube/${CLUSTER_NAME}-kube.config gcloud container clusters get-credentials $CLUSTER_NAME --location $CLUSTER_LOCATION
```

Then update your `terraform.tfvars` `credentials_config` to the following:
And then update your `terraform.tfvars` accordingly:

```bash
credentials_config = {
@@ -143,7 +125,7 @@ credentials_config = {
}
```

#### [optional] set up secret token in Secret Manager
#### [optional] Setting Up Secret Tokens

A model may require a security token to access it. For example, Llama2 from HuggingFace is a gated model that requires a [user access token](https://huggingface.co/docs/hub/en/security-tokens). If the model you want to run does not require this, skip this step.

@@ -152,7 +134,8 @@ If you followed steps from `../../infra/stage-2`, Secret Manager and the user ac
kubectl create secret generic huggingface-secret --from-literal=token='************'
```

This command creates a new Secret named huggingface-secret, which has a key token containing your Hugging Face CLI token.
This command creates a new Secret named `huggingface-secret` containing a key named `token` that stores your Hugging Face CLI token. Note that for production or shared environments, passing user access tokens directly as command-line literals is not advisable.
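
As a sketch of a slightly safer pattern, you can avoid pasting the token inline by reading it from an environment variable or from a local file kept out of version control (`HF_TOKEN` and `hf_token.txt` are placeholder names):
```bash
# Read the token from an environment variable rather than typing it inline.
kubectl create secret generic huggingface-secret --from-literal=token="$HF_TOKEN"

# Or load it from a local file; make sure the file contains no trailing newline.
kubectl create secret generic huggingface-secret --from-file=token=./hf_token.txt
```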


## Step 3: Log in to gcloud
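
Before running Terraform against the cluster, make sure the gcloud CLI is authenticated and pointed at the right project; a typical sequence (project ID is a placeholder) looks like:
```bash
# Authenticate your user account.
gcloud auth login

# Provide Application Default Credentials used by Terraform and client libraries.
gcloud auth application-default login

# Select the project that hosts your GKE cluster (placeholder project ID).
gcloud config set project my-project-id
```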

@@ -183,12 +166,16 @@ terraform apply
|----------------------|-----------------------------------------------------------------------------------------------|---------|-------------------------------------------|----------|
| `credentials_config` | Configure how Terraform authenticates to the cluster. | Object | | No |
| `namespace` | Namespace used for Nvidia DCGM resources. | String | `"default"` | No |
| `image_path` | Image Path stored in Artifact Registry | String | | No |
| `image_path` | Image Path stored in Artifact Registry | String | `"nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3"` | No |
| `model_id` | Model used for inference. | String | `"meta-llama/Llama-2-7b-chat-hf"` | No |
| `gpu_count` | Parallelism based on number of gpus. | Number | `1` | No |
| `ksa` | Kubernetes Service Account used for workload. | String | `"default"` | No |
| `huggingface-secret` | Name of the kubectl huggingface secret token | String | `"huggingface-secret"` | Yes |
| `templates_path` | Path where manifest templates will be read from. | String | | No |
| `gcs_model_path` | GCS path from which the model engine is read. | String | `null` | No |
| `server_launch_command_string` | Command to launch the Triton Inference Server | String | `"pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_"` | No |
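
Once `terraform apply` completes, you can sanity-check the server by port-forwarding Triton's HTTP port (8000, per the manifest) and calling the generate endpoint; the pod name and model name below are placeholders that depend on your deployment and model repository:
```bash
# Forward Triton's HTTP port from the serving pod (substitute your actual pod name).
kubectl port-forward pod/your-triton-pod 8000:8000 &

# Send a small generate request to the ensemble model (adjust the model name if yours differs).
curl -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is Kubernetes?", "max_tokens": 32}'
```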




## Notes

20 changes: 13 additions & 7 deletions benchmarks/inference-server/triton/main.tf
@@ -14,13 +14,19 @@
* limitations under the License.
*/

locals {
template_path = var.gcs_model_path == null ? "${path.module}/manifest-templates/triton-tensorrtllm-inference-docker.tftpl" : "${path.module}/manifest-templates/triton-tensorrtllm-inference-gs.tftpl"
}

resource "kubernetes_manifest" "default" {
manifest = yamldecode(templatefile(var.template_path, {
namespace = var.namespace
ksa = var.ksa
image_path = var.image_path
huggingface-secret = var.huggingface-secret
gpu_count = var.gpu_count
model_id = var.model_id
manifest = yamldecode(templatefile(local.template_path, {
namespace = var.namespace
ksa = var.ksa
image_path = var.image_path
huggingface-secret = var.huggingface-secret
gpu_count = var.gpu_count
model_id = var.model_id
gcs_model_path = var.gcs_model_path
server_launch_command_string = var.server_launch_command_string
}))
}
@@ -43,7 +43,7 @@ spec:
serviceAccountName: ${ksa}
command: ["/bin/sh", "-c"]
args:
- gsutil cp -r gs://your_gcs_repo/all_models ./;
- gsutil cp -r ${gcs_model_path} ./;
volumeMounts:
- name: all-models-volume
mountPath: /all_models
@@ -53,7 +53,7 @@ spec:
workingDir: /opt/tritonserver
#command: ["/bin/sleep", "3600"]
command: ["/bin/bash", "-c"]
args: ["pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_ :"]
args: ["${server_launch_command_string}"]
ports:
- containerPort: 8000
name: http-triton
@@ -73,4 +73,6 @@ spec:
serviceAccountName: ${ksa}
volumeMounts:
- name: all-models-volume
mountPath: /all_models
mountPath: /all_models
- mountPath: /dev/shm
name: dshm
2 changes: 1 addition & 1 deletion benchmarks/inference-server/triton/sample-terraform.tfvars
@@ -7,4 +7,4 @@ ksa = "benchmark-ksa"
model_id = "meta-llama/Llama-2-7b-chat-hf"
gpu_count = 1
image_path = ""
template_path = ""
gcs_model_path = ""
17 changes: 13 additions & 4 deletions benchmarks/inference-server/triton/variables.tf
@@ -34,7 +34,7 @@ variable "credentials_config" {
}

variable "namespace" {
description = "Namespace used for Nvidia DCGM resources."
description = "Namespace used for Nvidia Triton resources."
type = string
nullable = false
default = "default"
@@ -44,6 +44,7 @@ variable "image_path" {
description = "Image Path stored in Artifact Registry"
type = string
nullable = false
default = "nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3"
}

variable "model_id" {
@@ -74,9 +75,17 @@ variable "huggingface-secret" {
default = "huggingface-secret"
}

variable "template_path" {
description = "Path where manifest templates will be read from."
variable "gcs_model_path" {
description = "Path to the GCS repo where model is stored"
type = string
nullable = false
nullable = true
default = null
}

variable "server_launch_command_string" {
description = "command to launch the triton server"
type = string
nullable = true
default = "pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_"
}
