0.11 release blog (#310)

* Add 0.11 release blog
* Update blog
* Add vllm runtime doc
* Add vllm example doc
* Update blog link
* Add vLLM intro
* add python runtime open inference protocol tutorials
* Fix warning
* Add warning
* Address comments
* Fix newline

Signed-off-by: Dan Sun <dsun20@bloomberg.net>
kserve · Nov 2, 2023 · c8f6a1e · c8f6a1e
1 parent 551763b
commit c8f6a1e
Showing 6 changed files with 538 additions and 1 deletion.
diff --git a/docs/blog/articles/2023-10-08-KServe-0.11-release.md b/docs/blog/articles/2023-10-08-KServe-0.11-release.md
@@ -0,0 +1,143 @@
+# Announcing: KServe v0.11
+
+We are excited to announce the release of KServe 0.11. In this release we introduced Large Language Model (LLM) runtimes and made enhancements to the KServe control plane, the Python SDK's Open Inference Protocol support, and dependency management.
+For ModelMesh, we have added PVC support, HPA support, and payload logging to ensure feature parity with KServe.
+
+
+Here is a summary of the key changes:
+
+## KServe Core Inference Enhancements
+
+- Added support for path-based routing as an alternative to host-based routing. With it enabled, the URL of the `InferenceService` looks like `http://<ingress_domain>/serving/<namespace>/<isvc_name>`.
+  Please refer to the [doc](https://github.com/kserve/kserve/blob/294a10495b6b5cda9c64d3e1573b60aec62aceb9/config/configmap/inferenceservice.yaml#L237) for how to enable path-based routing.
+
+- Introduced a `priority` field for the `ServingRuntime` custom resource to handle the case where multiple serving runtimes support the same model format. See [the serving runtime doc](https://kserve.github.io/website/0.11/modelserving/servingruntimes/#priority) for more details.
+
+- Introduced the Custom Storage Container CRD (`ClusterStorageContainer`) to allow customized storage implementations for the storage URI prefixes they declare. An example use case is private model registry integration:
+  ```yaml
+    apiVersion: "serving.kserve.io/v1alpha1"
+    kind: ClusterStorageContainer
+    metadata:
+      name: default
+    spec:
+      container:
+        name: storage-initializer
+        image: kserve/model-registry:latest
+        resources:
+          requests:
+            memory: 100Mi
+            cpu: 100m
+          limits:
+            memory: 1Gi
+            cpu: "1"
+      supportedUriFormats:
+        - prefix: model-registry://
+  ```
+
+- Enhanced the Inference Graph API spec to support pod affinity and resource requirement fields.
+  A `Dependency` field with options `Soft` and `Hard` has been introduced to decide whether to short-circuit the request when an inference step returns an error. The following example uses hard dependencies on the node steps:
+
+  ```yaml
+    apiVersion: serving.kserve.io/v1alpha1
+    kind: InferenceGraph
+    metadata:
+      name: graph-with-switch-node  # Kubernetes object names cannot contain underscores
+    spec:
+      nodes:
+        root:
+          routerType: Sequence
+          steps:
+            - name: "rootStep1"
+              nodeName: node1
+              dependency: Hard
+            - name: "rootStep2"
+              serviceName: {{ success_200_isvc_id }}
+        node1:
+          routerType: Switch
+          steps:
+            - name: "node1Step1"
+              serviceName: {{ error_404_isvc_id }}
+              condition: "[@this].#(decision_picker==ERROR)"
+              dependency: Hard
+  ```
+  For more details please refer to the [issue](https://github.com/kserve/kserve/issues/2484).
+
+- Improved the InferenceService debugging experience by adding the aggregated `RoutesReady` status and the `LastDeploymentReady` condition to the InferenceService status, to differentiate the endpoint status from the deployment status.
+  This applies to serverless mode. For more details, refer to the [API docs](https://pkg.go.dev/github.com/kserve/kserve@v0.11.1/pkg/apis/serving/v1beta1#InferenceServiceStatus).
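+
+To illustrate the `priority` field mentioned above, here is a minimal sketch of the relevant part of a `ServingRuntime` spec. The runtime name `my-sklearn-runtime` and the priority value are hypothetical, and the `containers` section is omitted for brevity, so this is a fragment rather than a complete resource. The serving runtime doc linked above remains the authoritative reference.
+
+```yaml
+apiVersion: serving.kserve.io/v1alpha1
+kind: ServingRuntime
+metadata:
+  name: my-sklearn-runtime   # hypothetical name
+spec:
+  supportedModelFormats:
+    - name: sklearn
+      version: "1"
+      autoSelect: true   # priority is only relevant for auto-selected runtimes
+      priority: 2        # hypothetical value, set per supported model format
+  # containers: ...      # omitted in this sketch
+```
+
+When several auto-selectable runtimes support the same model format, this value is what KServe uses to choose among them.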
+
+### Enhanced Python SDK Dependency Management
+
+- KServe has adopted [Poetry](https://python-poetry.org/docs/) to manage Python dependencies. You can now install the KServe SDK with locked dependencies using `poetry install`.
+  While `pip install` still works, we highly recommend using Poetry to ensure predictable dependency management.
+
+- The KServe SDK has also been slimmed down by making the cloud storage dependency optional. If you require the storage dependency for custom serving runtimes, you can still install it with `pip install kserve[storage]`.
+
+
+### KServe Python Runtimes Improvements
+- KServe Python runtimes, including [sklearnserver](../../modelserving/v1beta1/sklearn/v2/README.md), [lgbserver](../../modelserving/v1beta1/lightgbm/README.md), and [xgbserver](../../modelserving/v1beta1/xgboost/README.md),
+  now support the Open Inference Protocol for both REST and gRPC.
+
+- Logging improvements including adding Uvicorn access logging and a default KServe logger.
+
+- The `Postprocess` handler has been aligned with the Open Inference Protocol, hiding the complexity of the underlying transport protocol.
+
+
+## LLM Runtimes
+
+### TorchServe LLM Runtime
+KServe now integrates with TorchServe 0.8, offering support for [LLMs](https://pytorch.org/serve/large_model_inference.html) that may not fit onto a single GPU.
+Hugging Face Accelerate and DeepSpeed are available options for splitting the model into multiple partitions across multiple GPUs. See the [detailed example](../../modelserving/v1beta1/llm/) for how to serve an LLM on KServe with the TorchServe runtime.
+
+### vLLM Runtime
+Serving LLMs can be surprisingly slow even on high-end GPUs. [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use LLM inference engine that can achieve 10x-20x higher throughput than Hugging Face Transformers.
+It supports [continuous batching](https://www.anyscale.com/blog/continuous-batching-llm-inference) for increased throughput and GPU utilization, and
+[paged attention](https://vllm.ai) to address the memory bottleneck of autoregressive decoding, where all the attention key and value tensors (the KV cache) are kept in GPU memory to generate the next tokens.
+
+The [example](../../modelserving/v1beta1/llm/vllm/README.md) shows how to deploy vLLM on KServe. We expect further integration in KServe 0.12 with the proposed [generate endpoint](https://github.com/kserve/open-inference-protocol/pull/7) for the Open Inference Protocol.
+
+## ModelMesh Updates
+
+### Storing Models on Kubernetes Persistent Volumes (PVC)
+ModelMesh now allows you to [directly mount model files onto serving runtime pods](https://github.com/kserve/modelmesh-serving/blob/main/docs/predictors/setup-storage.md#deploy-a-model-stored-on-a-persistent-volume-claim)
+using [Kubernetes Persistent Volumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/). Depending on the selected [storage solution](https://kubernetes.io/docs/concepts/storage/storage-classes/), this approach can significantly reduce latency when deploying new predictors
+and potentially remove the need for additional cloud object storage like AWS S3, GCS, or Azure Blob Storage altogether.
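+
+As a rough sketch of what this can look like, the `InferenceService` below points a predictor at a model file on a PVC. The service name, the PVC name `my-models-pvc`, and the model path are placeholders. The PVC must first be made available to ModelMesh as described in the linked storage setup doc, and the field layout shown here is an assumption to check against that doc.
+
+```yaml
+apiVersion: serving.kserve.io/v1beta1
+kind: InferenceService
+metadata:
+  name: sklearn-pvc-example              # placeholder name
+  annotations:
+    serving.kserve.io/deploymentMode: ModelMesh
+spec:
+  predictor:
+    model:
+      modelFormat:
+        name: sklearn
+      storage:
+        parameters:
+          type: pvc
+          name: my-models-pvc            # placeholder PVC name
+        path: sklearn/mnist-svm.joblib   # placeholder path inside the PVC
+```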
+
+
+### Horizontal Pod Autoscaling (HPA)
+Kubernetes [Horizontal Pod Autoscaling](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/) can now be used at the serving runtime pod level. With HPA enabled, the ModelMesh controller no longer manages the number of replicas. Instead, a `HorizontalPodAutoscaler` automatically updates the serving
+runtime deployment with the number of Pods to best match the demand.
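+
+A minimal sketch of opting a serving runtime into HPA follows. It assumes the autoscaling annotation keys from the ModelMesh Serving docs; the runtime name and all values are illustrative and should be verified against the docs for your version.
+
+```yaml
+apiVersion: serving.kserve.io/v1alpha1
+kind: ServingRuntime
+metadata:
+  name: mlserver-1.x                                      # illustrative runtime name
+  annotations:
+    serving.kserve.io/autoscalerClass: hpa                # hand replica management to an HPA
+    serving.kserve.io/metrics: "cpu"                      # illustrative metric
+    serving.kserve.io/targetUtilizationPercentage: "75"   # illustrative target
+    serving.kserve.io/min-scale: "2"                      # illustrative lower bound
+    serving.kserve.io/max-scale: "3"                      # illustrative upper bound
+```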
+
+### Model Metrics, Metrics Dashboard, Payload Event Logging
+ModelMesh v0.11 introduces a new configuration option to emit a subset of useful metrics at the individual model level. These metrics can help identify outlier or "heavy hitter" models and consequently fine-tune the deployments of those inference services, for example by allocating more resources or increasing the number of replicas to improve responsiveness or avoid frequent cache misses.
+
+A new [Grafana dashboard](https://github.com/kserve/modelmesh-serving/blob/main/docs/monitoring.md#import-the-grafana-dashboard) was added to display the comprehensive set of [Prometheus metrics](https://github.com/kserve/modelmesh-serving/blob/main/docs/monitoring.md) like model loading
+and unloading rates, internal queuing delays, capacity and usage, cache state, etc. to monitor the general health of the ModelMesh Serving deployment.
+
+The new [`PayloadProcessor` interface](https://github.com/kserve/modelmesh/blob/main/src/main/java/com/ibm/watson/modelmesh/payload/) can be implemented to log prediction requests and responses, to create data sinks for data visualization, for model quality assessment, or for drift and outlier detection by external monitoring systems.
+
+## What's Changed? :warning:
+- To allow longer InferenceService names within the DNS maximum length limit (see this [issue](https://github.com/kserve/kserve/issues/1397)), the `Default` suffix in the inference service component (predictor/transformer/explainer) name has been removed for newly created InferenceServices.
+  This affects clients that use the component URL directly instead of the top-level InferenceService URL.
+
+- `Status.address.url` is now consistent across serverless and raw deployment modes: the URL path portion is dropped in serverless mode.
+
+- Raw bytes are now accepted in the v1 protocol. If a `Content-Type` header is specified, it must be set to `application/json` for the JSON payload to be recognized and decoded.
+```bash
+curl -v -H "Content-Type: application/json" http://sklearn-iris.kserve-test.${CUSTOM_DOMAIN}/v1/models/sklearn-iris:predict -d @./iris-input.json
+```
+
+
+For a complete change list please read the release notes from [KServe v0.11](https://github.com/kserve/kserve/releases/tag/v0.11.0) and
+[ModelMesh v0.11](https://github.com/kserve/modelmesh-serving/releases/tag/v0.11.0).
+
+## Join the community
+
+- Visit our [Website](https://kserve.github.io/website/) or [GitHub](https://github.com/kserve)
+- Join the Slack ([#kserve](https://kubeflow.slack.com/?redir=%2Farchives%2FCH6E58LNP))
+- Attend our community meeting by subscribing to the [KServe calendar](https://wiki.lfaidata.foundation/display/kserve/calendars).
+- View our [community GitHub repository](https://github.com/kserve/community) to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!
+
+
+Thanks to all the contributors who made commits to the 0.11 release!
+
+The KServe Working Group
diff --git a/docs/modelserving/v1beta1/llm/vllm/README.md b/docs/modelserving/v1beta1/llm/vllm/README.md
@@ -0,0 +1,81 @@
+## Deploy the LLaMA model with vLLM Runtime
+Serving LLMs can be surprisingly slow even on high-end GPUs. [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use LLM inference engine that can achieve 10x-20x higher throughput than Hugging Face Transformers.
+It supports [continuous batching](https://www.anyscale.com/blog/continuous-batching-llm-inference) for increased throughput and GPU utilization, and
+[paged attention](https://vllm.ai) to address the memory bottleneck of autoregressive decoding, where all the attention key and value tensors (the KV cache) are kept in GPU memory to generate the next tokens.
+
+You can deploy the LLaMA model with the built vLLM inference server container image using the `InferenceService` YAML API spec.
+Work is in progress to integrate `vLLM` with the `Open Inference Protocol` and the KServe observability stack.
+
+The LLaMA model can be downloaded from [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b) and uploaded to your cloud storage.
+
+=== "Yaml"
+    ```yaml
+    apiVersion: serving.kserve.io/v1beta1
+    kind: InferenceService
+    metadata:
+      name: llama-2-7b
+    spec:
+      predictor:
+        containers:
+          - args:
+              - --port
+              - "8080"
+              - --model
+              - /mnt/models
+            command:
+              - python3
+              - -m
+              - vllm.entrypoints.api_server
+            env:
+              - name: STORAGE_URI
+                value: gcs://kfserving-examples/llm/huggingface/llama
+            image: kserve/vllmserver:latest
+            name: kserve-container
+            resources:
+              limits:
+                cpu: "4"
+                memory: 50Gi
+                nvidia.com/gpu: "1"
+              requests:
+                cpu: "1"
+                memory: 50Gi
+                nvidia.com/gpu: "1"
+    ```
+
+!!! Warning
+    The vLLM runtime is still experimental; please expect API changes and further integration in the next KServe release.
+
+Save the spec above as `vllm.yaml` and apply it to the `kserve-test` namespace:
+
+=== "kubectl"
+    ```bash
+    kubectl apply -n kserve-test -f ./vllm.yaml
+    ```
+
+## Benchmarking vLLM Runtime
+
+You can download the benchmark testing data set by running
+```bash
+wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+```
+
+The tokenizer can be found in the downloaded LLaMA model.
+
+The following assumes that your ingress can be accessed at
+`${INGRESS_HOST}:${INGRESS_PORT}`. If you have not set these variables, follow [these instructions](../../../../get_started/first_isvc.md#4-determine-the-ingress-ip-and-ports)
+to find your ingress IP and port.
+
+You can run the [benchmarking script](./benchmark.py) and send the inference request to the exposed URL.
+
+```bash
+python benchmark.py --backend vllm --port ${INGRESS_PORT} --host ${INGRESS_HOST} --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer ./tokenizer --request-rate 5
+```
+
+!!! success "Expected Output"
+
+    ```{ .json .no-copy }
+       Total time: 216.81 s
+       Throughput: 4.61 requests/s
+       Average latency: 7.96 s
+       Average latency per token: 0.02 s
+       Average latency per output token: 0.04 s
+    ```