Skip to content

Commit

Permalink
0.11 release blog (#310)
Browse files Browse the repository at this point in the history
* Add 0.11 release blog

Signed-off-by: Dan Sun <dsun20@bloomberg.net>

* Update blog

Signed-off-by: Dan Sun <dsun20@bloomberg.net>

* Add vllm runtime doc

Signed-off-by: Dan Sun <dsun20@bloomberg.net>

* Add vllm example doc

Signed-off-by: Dan Sun <dsun20@bloomberg.net>

* Update blog link

Signed-off-by: Dan Sun <dsun20@bloomberg.net>

* Add vLLM intro

Signed-off-by: Dan Sun <dsun20@bloomberg.net>

* add python runtime open inference protocol tutorials

Signed-off-by: Dan Sun <dsun20@bloomberg.net>

* Fix warning

Signed-off-by: Dan Sun <dsun20@bloomberg.net>

* Add warning

Signed-off-by: Dan Sun <dsun20@bloomberg.net>

* Address comments

Signed-off-by: Dan Sun <dsun20@bloomberg.net>

* Fix newline

Signed-off-by: Dan Sun <dsun20@bloomberg.net>

---------

Signed-off-by: Dan Sun <dsun20@bloomberg.net>
  • Loading branch information
yuzisun authored Nov 2, 2023
1 parent 551763b commit c8f6a1e
Show file tree
Hide file tree
Showing 6 changed files with 538 additions and 1 deletion.
143 changes: 143 additions & 0 deletions docs/blog/articles/2023-10-08-KServe-0.11-release.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,143 @@
# Announcing: KServe v0.11

We are excited to announce the release of KServe 0.11, in this release we introduced Large Language Model (LLM) runtimes, made enhancements to the KServe control plane, Python SDK Open Inference Protocol support and dependency managemenet.
For ModelMesh we have added features PVC, HPA, payload logging to ensure feature parity with KServe.


Here is a summary of the key changes:

## KServe Core Inference Enhancements

- Support path based routing which is served as an alternative way to the host based routing, the URL of the `InferenceService` could look like `http://<ingress_domain>/serving/<namespace>/<isvc_name>`.
Please refer to the [doc](https://github.com/kserve/kserve/blob/294a10495b6b5cda9c64d3e1573b60aec62aceb9/config/configmap/inferenceservice.yaml#L237) for how to enable path based routing.

- Introduced priority field for `Serving Runtime` custom resource to handle the case when you have multiple serving runtimes which support the same model formats, see more details from [the serving runtime doc](https://kserve.github.io/website/0.11/modelserving/servingruntimes/#priority).

- Introduced Custom Storage Container CRD to allow customized implementations with supported storage URI prefixes, example use cases are private model registry integration:
```yaml
apiVersion: "serving.kserve.io/v1alpha1"
kind: ClusterStorageContainer
metadata:
name: default
spec:
container:
name: storage-initializer
image: kserve/model-registry:latest
resources:
requests:
memory: 100Mi
cpu: 100m
limits:
memory: 1Gi
cpu: "1"
supportedUriFormats:
- prefix: model-registry://
```
- Inference Graph enhancements for improving the API spec to support pod affinity and resource requirement fields.
`Dependency` field with options `Soft` and `Hard` is introduced to handle error responses from the inference steps to decide whether to short-circuit the request in case of errors, see the following example with hard dependency with the node steps:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
name: graph_with_switch_node
spec:
nodes:
root:
routerType: Sequence
steps:
- name: "rootStep1"
nodeName: node1
dependency: Hard
- name: "rootStep2"
serviceName: {{ success_200_isvc_id }}
node1:
routerType: Switch
steps:
- name: "node1Step1"
serviceName: {{ error_404_isvc_id }}
condition: "[@this].#(decision_picker==ERROR)"
dependency: Hard
```
For more details please refer to the [issue](https://github.com/kserve/kserve/issues/2484).

- Improved InferenceService debugging experience by adding the aggregated `RoutesReady` status and `LastDeploymentReady` condition to the InferenceService Status to differentiate the endpoint and deployment status.
This applies to the serverless mode and for more details refer to the [API docs](https://pkg.go.dev/github.com/kserve/kserve@v0.11.1/pkg/apis/serving/v1beta1#InferenceServiceStatus).

### Enhanced Python SDK Dependency Management

- KServe has adopted [poetry](https://python-poetry.org/docs/) to manage python dependencies. You can now install the KServe SDK with locked dependencies using `poetry install`.
While `pip install` still works, we highly recommend using poetry to ensure predictable dependency management.

- The KServe SDK is also slimmed down by making the cloud storage dependency optional, if you require storage dependency for custom serving runtimes you can still install with `pip install kserve[storage]`.


### KServe Python Runtimes Improvements
- KServe Python Runtimes including [sklearnserver](../../modelserving/v1beta1/sklearn/v2/README.md), [lgbserver](../../modelserving/v1beta1/lightgbm/README.md), [xgbserver](../../modelserving/v1beta1/xgboost/README.md)
now support the open inference protocol for both REST and gRPC.

- Logging improvements including adding Uvicorn access logging and a default KServe logger.

- `Postprocess` handler has been aligned with open inference protocol, simplifying the underlying transportation protocol complexities.


### LLM Runtimes

### TorchServe LLM Runtime
KServe now integrates with TorchServe 0.8, offering the support for [LLM models](https://pytorch.org/serve/large_model_inference.html) that may not fit onto a single GPU.
Huggingface Accelerate and Deepspeed are available options to split the model into multiple partitions over multiple GPUs. You can see the [detailed example](../../modelserving/v1beta1/llm/) for how to serve the LLM on KServe with TorchServe runtime.

### vLLM Runtime
Serving LLM models can be surprisingly slow even on high end GPUs, [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use LLM inference engine. It can achieve 10x-20x higher throughput than Huggingface transformers.
It supports [continuous batching](https://www.anyscale.com/blog/continuous-batching-llm-inference) for increased throughput and GPU utilization,
[paged attention](https://vllm.ai) to address the memory bottleneck where in the autoregressive decoding process all the attention key value tensors(KV Cache) are kept in the GPU memory to generate next tokens.

In the [example](../../modelserving/v1beta1/llm/vllm/README.md) we show how to deploy vLLM on KServe and expects further integration in KServe 0.12 with proposed [generate endpoint](https://github.com/kserve/open-inference-protocol/pull/7) for open inference protocol.

## ModelMesh Updates

### Storing Models on Kubernetes Persistent Volumes (PVC)
ModelMesh now allows to [directly mount model files onto serving runtimes pods](https://github.com/kserve/modelmesh-serving/blob/main/docs/predictors/setup-storage.md#deploy-a-model-stored-on-a-persistent-volume-claim)
using [Kubernetes Persistent Volumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/). Depending on the selected [storage solution](https://kubernetes.io/docs/concepts/storage/storage-classes/) this approach can significantly reduce latency when deploying new predictors,
potentially remove the need for additional S3 cloud object storage like AWS S3, GCS, or Azure Blob Storage altogether.


### Horizontal Pod Autoscaling (HPA)
Kubernetes [Horizontal Pod Autoscaling](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/) can now be used at the serving runtime pod level. With HPA enabled, the ModelMesh controller no longer manages the number of replicas. Instead, a `HorizontalPodAutoscaler` automatically updates the serving
runtime deployment with the number of Pods to best match the demand.

### Model Metrics, Metrics Dashboard, Payload Event Logging
ModelMesh v0.11 introduces a new configuration option to emit a subset of useful metrics at the individual model level. These metrics can help identify outlier or "heavy hitter" models and consequently fine-tune the deployments of those inference services, like allocating more resources or increasing the number of replicas for improved responsiveness or avoid frequent cache misses.

A new [Grafana dashboard](https://github.com/kserve/modelmesh-serving/blob/main/docs/monitoring.md#import-the-grafana-dashboard) was added to display the comprehensive set of [Prometheus metrics](https://github.com/kserve/modelmesh-serving/blob/main/docs/monitoring.md) like model loading
and unloading rates, internal queuing delays, capacity and usage, cache state, etc. to monitor the general health of the ModelMesh Serving deployment.

The new [`PayloadProcessor` interface](https://github.com/kserve/modelmesh/blob/main/src/main/java/com/ibm/watson/modelmesh/payload/) can be implemented to log prediction requests and responses, to create data sinks for data visualization, for model quality assessment, or for drift and outlier detection by external monitoring systems.

## What's Changed? :warning:
- To allow longer InferenceService name due to DNS max length limits from [issue](https://github.com/kserve/kserve/issues/1397), the `Default` suffix in the inference service component(predictor/transformer/explainer) name has been removed for newly created InferenceServices.
This affects the client that is using the component url directly instead of the top level InferenceService url.

- Status.address.url is now consistent for both serverless and raw deployment mode, the url path portion is dropped in serverless mode.

- Raw bytes are now accepted in v1 protocol, setting the right content-type header to `application/json` is required to recognize and decode the json payload if `content-type` is specified.
```bash
curl -v -H "Content-Type: application/json" http://sklearn-iris.kserve-test.${CUSTOM_DOMAIN}/v1/models/sklearn-iris:predict -d @./iris-input.json
```


For a complete change list please read the release notes from [KServe v0.11](https://github.com/kserve/kserve/releases/tag/v0.11.0) and
[ModelMesh v0.11](https://github.com/kserve/modelmesh-serving/releases/tag/v0.11.0).

## Join the community

- Visit our [Website](https://kserve.github.io/website/) or [GitHub](https://github.com/kserve)
- Join the Slack ([#kserve](https://kubeflow.slack.com/?redir=%2Farchives%2FCH6E58LNP))
- Attend our community meeting by subscribing to the [KServe calendar](https://wiki.lfaidata.foundation/display/kserve/calendars).
- View our [community github repository](https://github.com/kserve/community) to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!


Thanks for all the contributors who have made the commits to 0.11 release!

The KServe Working Group
81 changes: 81 additions & 0 deletions docs/modelserving/v1beta1/llm/vllm/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
## Deploy the LLaMA model with vLLM Runtime
Serving LLM models can be surprisingly slow even on high end GPUs, [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use LLM inference engine. It can achieve 10x-20x higher throughput than Huggingface transformers.
It supports [continuous batching](https://www.anyscale.com/blog/continuous-batching-llm-inference) for increased throughput and GPU utilization,
[paged attention](https://vllm.ai) to address the memory bottleneck where in the autoregressive decoding process all the attention key value tensors(KV Cache) are kept in the GPU memory to generate next tokens.

You can deploy the LLaMA model with built vLLM inference server container image using the `InferenceService` yaml API spec.
We have work in progress integrating `vLLM` with `Open Inference Protocol` and KServe observability stack.

The LLaMA model can be downloaded from [huggingface](https://huggingface.co/meta-llama/Llama-2-7b) and upload to your cloud storage.

=== "Yaml"
```yaml
kubectl apply -n kserve-test -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-2-7b
spec:
predictor:
containers:
- args:
- --port
- "8080"
- --model
- /mnt/models
command:
- python3
- -m
- vllm.entrypoints.api_server
env:
- name: STORAGE_URI
value: gcs://kfserving-examples/llm/huggingface/llama
image: kserve/vllmserver:latest
name: kserve-container
resources:
limits:
cpu: "4"
memory: 50Gi
nvidia.com/gpu: "1"
requests:
cpu: "1"
memory: 50Gi
nvidia.com/gpu: "1"
```

!!! Warning
vLLM runtime is still experimental, please expect API changes and further integration in the next KServe release.

=== "kubectl"
```bash
kubectl apply -f ./vllm.yaml
```

## Benchmarking vLLM Runtime

You can download the benchmark testing data set by running
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

The tokenizer can be found from the downloaded llama model.

Now, assuming that your ingress can be accessed at
`${INGRESS_HOST}:${INGRESS_PORT}` or you can follow [this instruction](../../../../get_started/first_isvc.md#4-determine-the-ingress-ip-and-ports)
to find out your ingress IP and port.

You can run the [benchmarking script](./benchmark.py) and send the inference request to the exposed URL.

```bash
python benchmark.py --backend vllm --port ${INGRESS_PORT} --host ${INGRESS_HOST} --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer ./tokenizer --request-rate 5
```

!!! success "Expected Output"

```{ .json .no-copy }
Total time: 216.81 s
Throughput: 4.61 requests/s
Average latency: 7.96 s
Average latency per token: 0.02 s
Average latency per output token: 0.04 s
```
Loading

0 comments on commit c8f6a1e

Please sign in to comment.