
2024 OSPP: Lightweight AI Model Deployment Based on an AI Gateway #1

@YJQ1101 YJQ1101 commented Aug 28, 2024

Introduction

This feature provides AI web application capabilities on top of an AI model running on ordinary machines such as personal computers or commodity cloud hosts. Concretely, it supports loading a model from a local file, parsing HTTP requests, and invoking the model to generate the response. https://summer-ospp.ac.cn/org/prodetail/241f80032?lang=zh&list=pro

Implementation

An HTTP filter is implemented. The filter parses the inference request and hands it off to an asynchronous inference thread that runs the inference; it also passes that thread a callback through which partial results are delivered back, enabling Envoy's streaming response.
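The actual filter is C++ inside Envoy, but the hand-off described above can be sketched language-neutrally. The sketch below (Python, with hypothetical names; `run_inference` stands in for the real llama.cpp call) shows the pattern: the filter starts an inference worker, the worker pushes partial results through a callback, and the filter drains them as a stream:

```python
import queue
import threading

def run_inference(prompt, on_chunk):
    # Stand-in for the real model call: emit the answer token by token.
    for token in ["Hello", ", ", "world"]:
        on_chunk(token, done=False)
    on_chunk("", done=True)

class InferenceSession:
    """Mimics the filter: start an asynchronous inference thread and
    receive results through a callback, as the filter does to feed
    Envoy's streamed response body."""

    def __init__(self, prompt):
        self.chunks = queue.Queue()
        self.worker = threading.Thread(
            target=run_inference, args=(prompt, self._on_chunk))
        self.worker.start()

    def _on_chunk(self, text, done):
        # Invoked on the inference thread; the real filter re-posts the
        # result to the Envoy worker dispatcher before writing it out.
        self.chunks.put((text, done))

    def stream(self):
        while True:
            text, done = self.chunks.get()
            if done:
                break
            yield text

session = InferenceSession("Hi")
print("".join(session.stream()))  # prints "Hello, world"
```

This is only a sketch of the control flow, not the filter's API; in the real filter the queue/dispatch step is handled by Envoy's event loop rather than a Python queue.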

Build and run
bazel build //contrib/exe:envoy-static # build
bazel-bin/contrib/exe/envoy-static -c "envoy.yaml"  --concurrency 1 # run
Configuration

Filter-level configuration:

- name: envoy.filters.http.llm_inference
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.llm_inference.v3.modelParameter
    n_threads: 100
    n_parallel: 5
    chat_modelpath: {
      "qwen2": "/home/yuanjq/model/qwen2-7b-instruct-q5_k_m.gguf",
      "llama3": "/home/yuanjq/model/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"
    }
    embedding_modelpath: {
      "bge": "/home/yuanjq/model/bge-small-zh-v1.5-f32.gguf"
    }

n_threads: maximum number of threads the inference thread may use
n_parallel: maximum number of concurrent requests the inference service accepts
chat_modelpath: local path(s) of the chat model(s)
embedding_modelpath: local path(s) of the embedding model(s)

Route-level configuration:

route_config:
  name: route
  virtual_hosts:
  - name: llm_inference_service
    domains: ["api.openai.com"]
    routes:
    - match:
        prefix: "/v1/chat/completions"
      typed_per_filter_config:
        envoy.filters.http.llm_inference:
          "@type": type.googleapis.com/envoy.extensions.filters.http.llm_inference.v3.modelChosen
          usemodel: "qwen2"
          first_byte_timeout: 4
          inference_timeout: 90
      direct_response:
        status: 504
        body:
          inline_string: "inference timeout"

usemodel: the model to use; the name must match a key defined in the modelpath settings above
first_byte_timeout: timeout for receiving the first byte of the inference result
inference_timeout: overall inference timeout
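For reference, a request matching the route above would be an OpenAI-style chat completion POSTed to /v1/chat/completions with Host api.openai.com. The snippet below only builds a plausible JSON body (the exact accepted fields are an assumption based on the OpenAI API shape; note the route's usemodel already pins the model to "qwen2"):

```python
import json

# Hypothetical OpenAI-style request body; the route above matches the
# "/v1/chat/completions" prefix, and usemodel selects the "qwen2" model.
payload = {
    "model": "qwen2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,  # the filter streams partial results back
}
body = json.dumps(payload)
print(body)
```

With Envoy running, such a body could be sent with e.g. `curl -H 'Host: api.openai.com' -d '<body>' http://<listener-address>/v1/chat/completions`; the listener address comes from envoy.yaml and is not shown in this PR.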

Supported models
Demo

[animated GIF: 2024-09-2520 38 13-ezgif com-video-to-gif-converter]

@CLAassistant commented Aug 28, 2024

CLA assistant check
All committers have signed the CLA.

@johnlanni left a comment

Please add a README.md explaining how to configure and use the filter, along with usage caveats. I also suggest benchmarking against ollama: compare resource overhead and response latency with the same model, the same prompts, and the same concurrency.

@johnlanni left a comment
LGTM

@johnlanni johnlanni merged commit b354184 into higress-group:envoy-1.27 Oct 8, 2024
2 checks passed