
2024 OSPP: Lightweight AI Model Deployment Based on an AI Gateway #1

@YJQ1101 YJQ1101 commented Aug 28, 2024

Introduction

This feature provides AI web application capabilities on top of an AI model running on ordinary machines such as personal computers or commodity cloud hosts. Concretely, it supports loading a model from a local file, parsing HTTP requests, and invoking the model to generate the response. https://summer-ospp.ac.cn/org/prodetail/241f80032?lang=zh&list=pro

Implementation

An HTTP filter is implemented. The filter parses the inference request and hands it off to an asynchronous inference thread that runs the inference; it also passes that thread a callback through which partial results are delivered back, enabling Envoy's streaming response.
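The actual filter is C++ inside Envoy, but the hand-off described above can be sketched language-neutrally. The sketch below (Python, with hypothetical names; `run_inference` stands in for the real llama.cpp call) shows the pattern: the filter starts an inference worker, the worker pushes partial results through a callback, and the filter drains them as a stream:

```python
import queue
import threading

def run_inference(prompt, on_chunk):
    # Stand-in for the real model call: emit the answer token by token.
    for token in ["Hello", ", ", "world"]:
        on_chunk(token, done=False)
    on_chunk("", done=True)

class InferenceSession:
    """Mimics the filter: start an asynchronous inference thread and
    receive results through a callback, as the filter does to feed
    Envoy's streamed response body."""

    def __init__(self, prompt):
        self.chunks = queue.Queue()
        self.worker = threading.Thread(
            target=run_inference, args=(prompt, self._on_chunk))
        self.worker.start()

    def _on_chunk(self, text, done):
        # Invoked on the inference thread; the real filter re-posts the
        # result to the Envoy worker dispatcher before writing it out.
        self.chunks.put((text, done))

    def stream(self):
        while True:
            text, done = self.chunks.get()
            if done:
                break
            yield text

session = InferenceSession("Hi")
print("".join(session.stream()))  # prints "Hello, world"
```

This is only a sketch of the control flow, not the filter's API; in the real filter the queue/dispatch step is handled by Envoy's event loop rather than a Python queue.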

Build and run
bazel build //contrib/exe:envoy-static # build
bazel-bin/contrib/exe/envoy-static -c "envoy.yaml"  --concurrency 1 # run
Configuration

Filter-level configuration:

- name: envoy.filters.http.llm_inference
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.llm_inference.v3.modelParameter
    n_threads: 100
    n_parallel: 5
    chat_modelpath: {
      "qwen2": "/home/yuanjq/model/qwen2-7b-instruct-q5_k_m.gguf",
      "llama3": "/home/yuanjq/model/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"
    }
    embedding_modelpath: {
      "bge": "/home/yuanjq/model/bge-small-zh-v1.5-f32.gguf"
    }

n_threads: maximum number of threads the inference thread may use
n_parallel: maximum number of concurrent requests the inference service accepts
chat_modelpath: local path(s) of the chat model(s)
embedding_modelpath: local path(s) of the embedding model(s)

Route-level configuration:

route_config:
  name: route
  virtual_hosts:
  - name: llm_inference_service
    domains: ["api.openai.com"]
    routes:
    - match:
        prefix: "/v1/chat/completions"
      typed_per_filter_config:
        envoy.filters.http.llm_inference:
          "@type": type.googleapis.com/envoy.extensions.filters.http.llm_inference.v3.modelChosen
          usemodel: "qwen2"
          first_byte_timeout: 4
          inference_timeout: 90
      direct_response:
        status: 504
        body:
          inline_string: "inference timeout"

usemodel: the model to use; the name must match a key defined in the modelpath settings above
first_byte_timeout: timeout for receiving the first byte of the inference result
inference_timeout: overall inference timeout
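For reference, a request matching the route above would be an OpenAI-style chat completion POSTed to /v1/chat/completions with Host api.openai.com. The snippet below only builds a plausible JSON body (the exact accepted fields are an assumption based on the OpenAI API shape; note the route's usemodel already pins the model to "qwen2"):

```python
import json

# Hypothetical OpenAI-style request body; the route above matches the
# "/v1/chat/completions" prefix, and usemodel selects the "qwen2" model.
payload = {
    "model": "qwen2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,  # the filter streams partial results back
}
body = json.dumps(payload)
print(body)
```

With Envoy running, such a body could be sent with e.g. `curl -H 'Host: api.openai.com' -d '<body>' http://<listener-address>/v1/chat/completions`; the listener address comes from envoy.yaml and is not shown in this PR.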

Supported models
Demo

[animated GIF: 2024-09-2520 38 13-ezgif com-video-to-gif-converter]

@CLAassistant commented Aug 28, 2024

CLA assistant check
All committers have signed the CLA.

@johnlanni left a comment

Please add a README.md explaining how to configure and use the filter, along with usage caveats. I also suggest benchmarking against ollama: compare resource overhead and response latency with the same model, the same prompts, and the same concurrency.

@johnlanni left a comment
LGTM

@johnlanni johnlanni merged commit b354184 into higress-group:envoy-1.27 Oct 8, 2024
2 checks passed