diff --git a/benchmarks/inference-server/triton/README.md b/benchmarks/inference-server/triton/README.md
index 64422b375..17e52fc28 100644
--- a/benchmarks/inference-server/triton/README.md
+++ b/benchmarks/inference-server/triton/README.md
@@ -1,25 +1,30 @@
 # AI on GKE: Benchmark TensorRT LLM on Triton Server
-This guide provides instructions for deploying and benchmarking a TensorRT Large Language Model (LLM) on Triton Inference Server within a Google Kubernetes Engine (GKE) environment. The process involves building a Docker container with the TensorRT LLM engine and deploying it to a GKE cluster.
+This guide outlines the steps for deploying and benchmarking a TensorRT Large Language Model (LLM) on the Triton Inference Server within Google Kubernetes Engine (GKE). It covers building a Docker container equipped with the TensorRT LLM engine and deploying that container to a GKE cluster.
 
 ## Prerequisites
 
-- Docker
-- Google Cloud SDK
-- Kubernetes CLI (kubectl)
-- Hugging Face account for model access
-- NVIDIA GPU drivers and CUDA toolkit (to build the TensorRTLLM Engine)
-
+Ensure you have the following prerequisites installed and set up:
+- Docker for containerization
+- Google Cloud SDK for interacting with Google Cloud services
+- Kubernetes CLI (kubectl) for managing Kubernetes clusters
+- A Hugging Face account to access models
+- NVIDIA GPU drivers and CUDA toolkit for building the TensorRT LLM Engine
 
 ## Step 1: Build the TensorRT LLM Engine and Docker Image
 
 1. **Build the TensorRT LLM Engine:**
 
    Follow the instructions provided in the [TensorRT LLM backend repository](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md) to build the TensorRT LLM Engine.
 
-2. **Setup the Docker container:**
-
-   ***Method 1: Add the Model repository and the relevant scripts to the image***
+2. **Upload the Model to Google Cloud Storage (GCS)**
 
-   Inside the `tritonllm_backend` directory, create a Dockerfile with the following content ensuring the Triton Server is ready with the necessary models and scripts.
+   Transfer your model engine to GCS using:
+   ```
+   gsutil cp -r your_model_folder gs://your_model_repo/all_models/
+   ```
+   *Be sure to replace `your_model_repo` with your actual GCS repository path.*
+
+   **Alternate method: Add the Model repository and the relevant scripts to the image**
+
+   Construct a new image from Nvidia's base image and integrate the model repository and the necessary scripts directly into it, bypassing the need for GCS at runtime.
    In the `tritonllm_backend` directory, create a Dockerfile with the following content:
    ```Dockerfile
    FROM nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3
@@ -35,7 +40,7 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large
    CMD ["/opt/scripts/start_triton_servers.sh"]
    ```
 
-   The Shell script `/opt/scripts/start_triton_servers.sh` is like below:
+   The initialization script at `/opt/scripts/start_triton_servers.sh` should look like the following:
    ```start_triton_servers.sh
    #!/bin/bash
@@ -48,33 +53,25 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large
 
    # Launch the servers (modify this depending on the number of GPU used and the exact path to the model repo)
-   mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver \
+   /opt/tritonserver/bin/tritonserver \
    --model-repository=/all_models/inflight_batcher_llm \
    --disable-auto-complete-config \
-   --backend-config=python,shm-region-prefix-name=prefix0_ : \
-   -n 1 /opt/tritonserver/bin/tritonserver \
-   --model-repository=/all_models/inflight_batcher_llm \
-   --disable-auto-complete-config \
-   --backend-config=python,shm-region-prefix-name=prefix1_ :```
+   --backend-config=python,shm-region-prefix-name=prefix0_
    ```
 
-  *Build and Push the Docker Image:*
+   *Build and Push the Docker Image:*
 
-  Build the Docker image and push it to your container registry:
+   Build the Docker image and push it to your container registry:
 
    ```
    docker build -t your_registry/tritonserver_llm:latest .
    docker push your_registry/tritonserver_llm:latest
    ```
-  Replace `your_registry` with your actual Docker registry path.
-
-   ***Method 2: Upload Model repository to gcs***
-
-   In this method we can directly upload the model engine to gcs and use the base image provided by Nvidia and specify the command to launch the triton server via the deployment yaml file:
-   ```
-   gsutil cp -r your_model_folder gs://your_model_repo/all_models/
-   ```
-   Replace `your_model_repo` with your actual gcs repo path.
+   Replace `your_registry` with the path to your Docker registry.
 
@@ -83,57 +80,42 @@ This guide provides instructions for deploying and benchmarking a TensorRT Large
 
 1. **Initialize Terraform Variables:**
 
-   Create a `terraform.tfvars` file by copying the provided example:
+   Start by creating a `terraform.tfvars` file from the provided template:
 
    ```bash
   cp sample-terraform.tfvars terraform.tfvars
   ```
 
-2. Define `template_path`, `image_path` and `gpu_count` variables in `terraform.tfvars`.
-
-   * If using method 1 In Step 1 above:
-   ```bash
-   template_path = "path_to_manifest_template/triton-tensorrtllm-inference.tftpl"
-   image_path = "path_to_your_registry/tritonserver_llm:latest"
-   gpu_count = X
-   ```
-   * If using method 2 In Step 1 above:
+2. **Specify Essential Variables**
 
-   ```bash
-   template_path = "path_to_manifest_template/triton-tensorrtllm-inference-gs.tftpl"
-   image_path = "nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3"
-   gpu_count = X
+   At a minimum, set `gcs_model_path` in `terraform.tfvars` to point to the Google Cloud Storage (GCS) path that holds your model engine. You may also need to adjust `image_path`, `gpu_count`, and `server_launch_command_string` to match your setup.
+   ```bash
+   gcs_model_path = "gs://path_to_model_repo/all_models"
   ```
-   and also update the `triton-tensorrtllm-inference-gs.tftpl` with the path to the gcs repo under InitContainer and the command to launch the container under Container section.
-
-3. **Configure Your Deployment:**
-   Edit the `terraform.tfvars` file to include your specific configuration details with the variable `credentials_config`.
-   Example
+   If `gcs_model_path` is not set, the deployment assumes the model repository is built into the image (the alternate method in Step 1). In that case, update `image_path` and any other relevant variables instead:
 
-   ```bash
-   credentials_config = {
-     kubeconfig = "path/to/your/gcloud/credentials.json"
-   }
+   ```bash
+   image_path = "path_to_your_registry/tritonserver_llm:latest"
   ```
 
-#### [optional] set-up credentials config with kubeconfig
+3. **Configure Your Deployment:**
+
+   Edit `terraform.tfvars` to tailor your deployment's configuration, particularly the `credentials_config`. Depending on how your cluster was created, this involves either fleet host credentials or your cluster's kubeconfig:
 
-If you created your cluster with steps from `../../infra/` or with fleet management enabled, the existing `credentials_config` must use the fleet host credentials like this:
+   ***For Fleet Management Enabled Clusters:***
 
 ```bash
 credentials_config = {
   fleet_host = "https://connectgateway.googleapis.com/v1/projects/$PROJECT_NUMBER/locations/global/gkeMemberships/$CLUSTER_NAME"
 }
-```
-
-
-If you created your own cluster without fleet management enabled, you can use your cluster's kubeconfig in the `credentials_config`. You must isolate your cluster's kubeconfig from other clusters in the default kube.config file. To do this, run the following command:
-
+```
+   ***For Clusters without Fleet Management:***
+
+Isolate your cluster's kubeconfig from other clusters in the default kube.config file:
 ```bash
 KUBECONFIG=~/.kube/${CLUSTER_NAME}-kube.config gcloud container clusters get-credentials $CLUSTER_NAME --location $CLUSTER_LOCATION
 ```
 
-Then update your `terraform.tfvars` `credentials_config` to the following:
+Then update the `credentials_config` in your `terraform.tfvars` to the following:
 
 ```bash
 credentials_config = {
@@ -143,7 +125,7 @@ credentials_config = {
 }
 ```
 
-#### [optional] set up secret token in Secret Manager
+#### [optional] Setting Up Secret Tokens
 
 A model may require a security token to access it. For example, Llama2 from HuggingFace is a gated model that requires a [user access token](https://huggingface.co/docs/hub/en/security-tokens). If the model you want to run does not require this, skip this step.
 
@@ -152,7 +134,8 @@ If you followed steps from `../../infra/stage-2`, Secret Manager and the user ac
 
 kubectl create secret generic huggingface-secret --from-literal=token='************'
 ```
 
-This command creates a new Secret named huggingface-secret, which has a key token containing your Hugging Face CLI token.
+This command creates a Secret named `huggingface-secret` with a key named `token` that stores your Hugging Face CLI token. Note that for production or shared environments, you should avoid passing access tokens directly as command-line literals.
+
 
 ## Step 3: login to gcloud
@@ -183,12 +166,16 @@ terraform apply
 
 |----------------------|-----------------------------------------------------------------------------------------------|---------|-------------------------------------------|----------|
 | `credentials_config` | Configure how Terraform authenticates to the cluster. | Object | | No |
 | `namespace` | Namespace used for Nvidia DCGM resources. | String | `"default"` | No |
-| `image_path` | Image Path stored in Artifact Registry | String | | No |
+| `image_path` | Image path for the Triton Inference Server container. | String | `"nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3"` | No |
 | `model_id` | Model used for inference. | String | `"meta-llama/Llama-2-7b-chat-hf"` | No |
 | `gpu_count` | Parallelism based on number of gpus. | Number | `1` | No |
 | `ksa` | Kubernetes Service Account used for workload. | String | `"default"` | No |
 | `huggingface-secret` | Name of the kubectl huggingface secret token | String | `"huggingface-secret"` | Yes |
-| `templates_path` | Path where manifest templates will be read from. | String | | No |
+| `gcs_model_path` | GCS path where the model engine is read from; leave unset to use the image-based method. | String | `null` | No |
+| `server_launch_command_string` | Command used to launch the Triton Inference Server. | String | `"pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_"` | No |
+
+
 
 ## Notes
diff --git a/benchmarks/inference-server/triton/main.tf b/benchmarks/inference-server/triton/main.tf
index 92f129270..21093c6ef 100644
--- a/benchmarks/inference-server/triton/main.tf
+++ b/benchmarks/inference-server/triton/main.tf
@@ -14,13 +14,19 @@
  * limitations under the License.
  */
 
+locals {
+  template_path = var.gcs_model_path == null ? "${path.module}/manifest-templates/triton-tensorrtllm-inference-docker.tftpl" : "${path.module}/manifest-templates/triton-tensorrtllm-inference-gs.tftpl"
+}
+
 resource "kubernetes_manifest" "default" {
-  manifest = yamldecode(templatefile(var.template_path, {
-    namespace          = var.namespace
-    ksa                = var.ksa
-    image_path         = var.image_path
-    huggingface-secret = var.huggingface-secret
-    gpu_count          = var.gpu_count
-    model_id           = var.model_id
+  manifest = yamldecode(templatefile(local.template_path, {
+    namespace                    = var.namespace
+    ksa                          = var.ksa
+    image_path                   = var.image_path
+    huggingface-secret           = var.huggingface-secret
+    gpu_count                    = var.gpu_count
+    model_id                     = var.model_id
+    gcs_model_path               = var.gcs_model_path
+    server_launch_command_string = var.server_launch_command_string
   }))
 }
diff --git a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference.tftpl b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-docker.tftpl
similarity index 100%
rename from benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference.tftpl
rename to benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-docker.tftpl
diff --git a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl
index 7f0c679f9..30940a4cd 100644
--- a/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl
+++ b/benchmarks/inference-server/triton/manifest-templates/triton-tensorrtllm-inference-gs.tftpl
@@ -43,7 +43,7 @@ spec:
           serviceAccountName: ${ksa}
           command: ["/bin/sh", "-c"]
           args:
-            - gsutil cp -r gs://your_gcs_repo/all_models ./;
+            - gsutil cp -r ${gcs_model_path} ./;
           volumeMounts:
             - name: all-models-volume
               mountPath: /all_models
@@ -53,7 +53,7 @@ spec:
           workingDir: /opt/tritonserver
           #command: ["/bin/sleep", "3600"]
          command: ["/bin/bash", "-c"]
-          args: ["pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_ :"]
+          args: ["${server_launch_command_string}"]
           ports:
             - containerPort: 8000
               name: http-triton
@@ -73,4 +73,6 @@ spec:
         serviceAccountName: ${ksa}
         volumeMounts:
           - name: all-models-volume
-            mountPath: /all_models
\ No newline at end of file
+            mountPath: /all_models
+          - mountPath: /dev/shm
+            name: dshm
\ No newline at end of file
diff --git a/benchmarks/inference-server/triton/sample-terraform.tfvars b/benchmarks/inference-server/triton/sample-terraform.tfvars
index 4b9a4029d..5372d8d7a 100644
--- a/benchmarks/inference-server/triton/sample-terraform.tfvars
+++ b/benchmarks/inference-server/triton/sample-terraform.tfvars
@@ -7,4 +7,4 @@ ksa        = "benchmark-ksa"
 model_id   = "meta-llama/Llama-2-7b-chat-hf"
 gpu_count  = 1
 image_path = ""
-template_path = ""
\ No newline at end of file
+gcs_model_path = "gs://your_model_repo/all_models"
\ No newline at end of file
diff --git a/benchmarks/inference-server/triton/variables.tf b/benchmarks/inference-server/triton/variables.tf
index c77732139..51d65eaeb 100644
--- a/benchmarks/inference-server/triton/variables.tf
+++ b/benchmarks/inference-server/triton/variables.tf
@@ -34,7 +34,7 @@ variable "credentials_config" {
 }
 
 variable "namespace" {
-  description = "Namespace used for Nvidia DCGM resources."
+  description = "Namespace used for Nvidia Triton resources."
   type        = string
   nullable    = false
   default     = "default"
@@ -44,6 +44,7 @@ variable "image_path" {
   description = "Image Path stored in Artifact Registry"
   type        = string
   nullable    = false
+  default     = "nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3"
 }
 
 variable "model_id" {
@@ -74,9 +75,17 @@ variable "huggingface-secret" {
   default     = "huggingface-secret"
 }
 
-variable "template_path" {
-  description = "Path where manifest templates will be read from."
+variable "gcs_model_path" {
+  description = "Path to the GCS repository where the model engine is stored."
   type        = string
-  nullable    = false
+  nullable    = true
+  default     = null
+}
+
+variable "server_launch_command_string" {
+  description = "Command used to launch the Triton Inference Server."
+  type        = string
+  nullable    = true
+  default     = "pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_"
 }
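
If your TensorRT LLM engine was built with tensor parallelism across more than one GPU, the single-server default launch command above may not be sufficient; the original script used a two-rank `mpirun` launch. A minimal sketch of how that could be restored by overriding the variables in `terraform.tfvars` (this assumes a hypothetical 2-GPU engine; the exact command depends on how your engine was built):

```bash
# Hypothetical override for a 2-GPU engine: reuse the two-rank mpirun launch
# from the original script so that each rank serves one GPU.
gpu_count                    = 2
server_launch_command_string = "pip install sentencepiece protobuf && huggingface-cli login --token $HUGGINGFACE_TOKEN && mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_ : -n 1 /opt/tritonserver/bin/tritonserver --model-repository=/all_models/inflight_batcher_llm --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix1_"
```

The `:` in the `mpirun` invocation separates the two application contexts, giving each Triton rank its own shared-memory region prefix.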