MindSpore LLM-Serving

LLM-Serving is a fast and easy-to-use LLM inference framework.


Features

  • model-parallel deployment
  • token streaming via Server-Sent Events
  • custom model inputs
  • static/continuous batching of prompts
  • post-processing (sampling) on the NPU
  • PagedAttention
  • KBK (kernel-by-kernel) inference

Supports the most popular LLMs, including the following architectures:

  • LLaMA-2
  • InternLM
  • Baichuan2
  • WizardCoder

Get Started

Environment dependencies

One-step installation of the wheel package:

bash build.sh
pip install mindspore-serving-xxx.whl

Note: post-processing currently runs as part of the graph. Before using serving, re-export the post-processing models with post_sampling_model.py so that their data type matches the output type of the LLM model.

python tools/post_sampling_model.py --output_dir ./target_dir
# args
#   output_dir: directory where the post-processing models are generated

Modify the model's configuration files

Configuration with PagedAttention
YAML file

In the model's configuration file configs/<model name>/xxx.yaml, you can adjust the model settings yourself and enable PagedAttention via the page_attention option (set it to True to enable PA, then add a pa_config section with parameters chosen according to the model).

model_path:
    prefill_model: ["/path/to/baichuan2/output_serving/mindir_full_checkpoint/rank_0_graph.mindir"]
    decode_model: ["/path/to/baichuan2/output_serving/mindir_inc_checkpoint/rank_0_graph.mindir"]
    argmax_model: "/path/to/serving/target_dir/argmax.mindir"
    topk_model: "/path/to/target_dir/topk.mindir"
    prefill_ini: ["/path/to/baichuan_ini/910b_default_ctx.cfg"]
    decode_ini: ["/path/to/baichuan_ini/910b_default_inc.cfg"]
    post_model_ini: "/path/to/serving/target_dir/config.ini"
model_config:
    model_name: 'baichuan2'
    max_generate_length: 4096
    end_token: 2
    seq_length: [1024, 2048, 4096]	# multiple gears (sequence-length buckets) supported
    vocab_size: 125696
    prefill_batch_size: [1]     # with PA enabled, only a single batch size is supported
    decode_batch_size: [16]  	# with PA enabled, multiple batches are supported, but not multiple gears
    zactivate_len: [4096]
    model_type: 'dyn'
    page_attention: True   # True enables PagedAttention
    current_index: False
    model_dtype: "DataType.FLOAT32"
    pad_token_id: 0   
    
pa_config:  # required when PA is enabled; set parameters according to the model
    num_blocks: 512     
    block_size: 16
    decode_seq_length: 4096
    
serving_config:
    agent_ports: [61166]
    start_device_id: 0
    server_ip: 'localhost'
    server_port: 61155    
    
tokenizer:
    type: LlamaTokenizer
    vocab_file: '/path/to/llama_pa_models/output/tokenizer_llama2_13b.model'

basic_inputs:
    type: LlamaBasicInputs

extra_inputs:
    type: LlamaExtraInputs

warmup_inputs:
    type: LlamaWarmupInputs
prefill_ini
[ascend_context]
provider=ge

[ge_session_options]
ge.externalWeight=1
ge.exec.atomicCleanPolicy=1
ge.event=notify
ge.exec.staticMemoryPolicy=2
ge.exec.formatMode=1
ge.exec.precision_mode=must_keep_origin_dtype

[graph_kernel_param]
opt_level=2
enable_cce_lib=true
disable_cce_lib_ops=MatMul
disable_cluster_ops=MatMul,Reshape

[ge_graph_options]
ge.inputShape=batch_valid_length:1;slot_mapping:-1;tokens:1,-1
ge.dynamicDims=64,64;128,128;256,256;512,512;1024,1024;2048,2048;4096,4096
# must include the seq_length gears [1024, 2048, 4096] from model_config in the yaml file (see the sketch below)
ge.dynamicNodeType=1
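
The prefill gear string above can be derived mechanically from the yaml: ge.inputShape has two dynamic ("-1") axes (slot_mapping and the token sequence), so each dynamicDims group repeats one sequence-length gear twice. A minimal sketch, assuming the gear list shown in the config above (the helper is illustrative, not part of the repo):

# Sketch: assemble the prefill ge.dynamicDims string from sequence-length gears.
# Each group carries one value per dynamic ("-1") axis in ge.inputShape (two here).
seq_gears = [64, 128, 256, 512, 1024, 2048, 4096]  # must include model_config.seq_length
dynamic_dims = ";".join(f"{g},{g}" for g in seq_gears)
print(dynamic_dims)  # 64,64;128,128;256,256;512,512;1024,1024;2048,2048;4096,4096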
decode_ini
[ascend_context]
provider=ge

[ge_session_options]
ge.externalWeight=1
ge.exec.atomicCleanPolicy=1
ge.event=notify
ge.exec.staticMemoryPolicy=2
ge.exec.formatMode=1
ge.exec.precision_mode=must_keep_origin_dtype

[graph_kernel_param]
opt_level=2
enable_cce_lib=true
disable_cce_lib_ops=MatMul
disable_cluster_ops=MatMul,Reshape

[ge_graph_options]
ge.inputShape=batch_valid_length:-1;block_tables:-1,256;slot_mapping:-1;tokens:-1,1		
# the 256 in block_tables comes from decode_seq_length / block_size under pa_config in the yaml
ge.dynamicDims=1,1,1,1;2,2,2,2;8,8,8,8;16,16,16,16;64,64,64,64
# each group has as many values as there are "-1" axes in ge.inputShape (four here, hence four 1s, four 2s, ...), and the groups must include the batch sizes from decode_batch_size under model_config in the yaml (see the sketch below)
ge.dynamicNodeType=1
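
Both derivations above are simple arithmetic over the yaml values. A minimal sketch, using the numbers from the config above (illustrative only):

# Sketch: derive the decode-side GE options from pa_config and the batch gears.
decode_seq_length = 4096                 # pa_config.decode_seq_length
block_size = 16                          # pa_config.block_size
batch_gears = [1, 2, 8, 16, 64]          # must include model_config.decode_batch_size

block_tables_dim = decode_seq_length // block_size
print(block_tables_dim)                  # 256, the second dimension of block_tables

# four dynamic ("-1") axes in ge.inputShape, so each group repeats the batch gear four times
dynamic_dims = ";".join(",".join([str(b)] * 4) for b in batch_gears)
print(dynamic_dims)                      # 1,1,1,1;2,2,2,2;8,8,8,8;16,16,16,16;64,64,64,64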
Configuration without PagedAttention
model_path:
    prefill_model: ["/path/to/llama_pa_models/no_act/output_no_act_len/output/mindir_full_checkpoint/rank_0_graph.mindir"]
    decode_model: ["/path/to/llama_pa_models/no_act/output_no_act_len/output/mindir_inc_checkpoint/rank_0_graph.mindir"]
    argmax_model: "/path/to/serving_dev/extends_13b/argmax.mindir"
    topk_model: "/path/to/serving_dev/extends_13b/topk.mindir"
    prefill_ini: ["/path/to/llama_pa_models/no_act/ini/910b_default_prefill.cfg"]
    decode_ini: ["/path/to/llama_pa_models/no_act/ini/910_inc.cfg"]
    post_model_ini: "/path/to/baichuan/congfig/config.ini"
model_config:
    model_name: 'llama_dyn'
    max_generate_length: 8192
    end_token: 2
    seq_length: [512, 1024]
    vocab_size: 32000
    prefill_batch_size: [64]			# multiple gears not supported; only multiple batch sizes
    decode_batch_size: [1,4,8,16,30,64]	# multiple gears supported
    zactivate_len: [512]
    model_type: "dyn"	#若无此字段默认为“dyn”,若有此字段需指定model_type
    current_index: False
    model_dtype: "DataType.FLOAT32"
    pad_token_id: 0  
    
serving_config:
    agent_ports: [11330]
    start_device_id: 5
    server_ip: 'localhost'
    server_port: 19200
    
tokenizer:
    type: LlamaTokenizer
    vocab_file: '/path/to/llama_pa_models/output/tokenizer_llama2_13b.model'

basic_inputs:
    type: LlamaBasicInputs

extra_inputs:
    type: LlamaExtraInputs

warmup_inputs:
    type: LlamaWarmupInputs
prefill_ini
[ge_session_options]
ge.externalWeight=1
ge.exec.atomicCleanPolicy=1
ge.event=notify
ge.exec.staticMemoryPolicy=2
ge.exec.formatMode=1
ge.exec.precision_mode=must_keep_origin_dtype
decode_ini
[ascend_context]
provider=ge
[ge_session_options]
ge.externalWeight=1
ge.exec.atomicCleanPolicy=1
ge.event=notify
ge.exec.staticMemoryPolicy=2
ge.exec.formatMode=1
ge.exec.precision_mode=must_keep_origin_dtype
[ge_graph_options]
ge.inputShape=batch_index:-1;batch_valid_length:-1;tokens:-1,1;zactivate_len:-1
ge.dynamicDims=1,1,1,512;4,4,4,512;8,8,8,512;16,16,16,512;30,30,30,512;64,64,64,512
# the decode gears are determined by decode_batch_size under model_config in the yaml; the "512" comes from zactivate_len in the yaml (see the sketch below)
ge.dynamicNodeType=1
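
The same pattern applies without PA, with zactivate_len as the last value in each group. A minimal sketch, using the values from the yaml above (illustrative only):

# Sketch: non-PA decode gears are (batch, batch, batch, zactivate_len) per group.
batch_gears = [1, 4, 8, 16, 30, 64]   # model_config.decode_batch_size
zactivate_len = 512                   # model_config.zactivate_len
print(";".join(f"{b},{b},{b},{zactivate_len}" for b in batch_gears))
# 1,1,1,512;4,4,4,512;8,8,8,512;16,16,16,512;30,30,30,512;64,64,64,512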
WizardCoder configuration (static shape)
model_path:
    prefill_model: ["/path/to/prefill_model.mindir"]
    decode_model: ["/path/to/decode_model.mindir"]
    argmax_model: "/path/to/argmax.mindir"
    topk_model: "/path/to/topk.mindir"
    prefill_ini : ['/path/to/lite.ini']
    decode_ini: ['/path/to/lite.ini']
    post_model_ini: '/path/to/lite.ini'

model_config:
    model_name: 'llama_dyn'
    max_generate_length: 4096
    end_token: 0
    seq_length: [2048]
    vocab_size: 49153
    prefill_batch_size: [1]
    decode_batch_size: [1]
    zactivate_len: [2048]
    model_type: 'static'
    seq_type: 'static'
    batch_waiting_time: 0.0
    decode_batch_waiting_time: 0.0
    batching_strategy: 'continuous'
    current_index: False
    page_attention: False
    model_dtype: "DataType.FLOAT32"
    pad_token_id: 49152

serving_config:
    agent_ports: [9980]
    start_device_id: 0
    server_ip: 'localhost'
    server_port: 12359

tokenizer:
    type: WizardCoderTokenizer
    vocab_file: '/path/to/transformers_config'

basic_inputs:
    type: LlamaBasicInputs

extra_inputs:
    type: LlamaExtraInputs

warmup_inputs:
    type: LlamaWarmupInputs
lite_ini
[ascend_context]
plugin_custom_ops=All
provider=ge
[ge_session_options]
ge.exec.formatMode=1
ge.exec.precision_mode=must_keep_origin_dtype
ge.externalWeight=1
ge.exec.atomicCleanPolicy=1
YAML configuration supporting MindSpore KBK unified training and inference
YAML configuration with PA
model_path:
    prefill_model: ["/path/llama2-13b-mindir/full_graph.mindir"]
    decode_model: ["/path/llama2-13b-mindir/inc_graph.mindir"]
    argmax_model: "/path/post_process/argmax.mindir"
    topk_model: "/path/post_process/topk.mindir"
    prefill_ini : ['/path/llma2_13b_pa_dyn_prefill.ini']
    decode_ini: ['/path/llma2_13b_pa_dyn_decode.ini']
    post_model_ini: '/path/post_config.ini'

model_config:
    model_name: 'llama_7b'
    max_generate_length: 4096
    end_token: 2
    seq_length: [4096]
    vocab_size: 32000
    prefill_batch_size: [1]
    decode_batch_size: [1]
    zactivate_len: [512, 1024, 2048, 4096]
    model_type: 'dyn'
    seq_type: 'static'
    batch_waiting_time: 0.0
    decode_batch_waiting_time: 0.0
    batching_strategy: 'continuous'
    current_index: False
    page_attention: True
    model_dtype: "DataType.FLOAT32"
    pad_token_id: 0
    backend: 'kbk' # 'ge'
    model_cfg_path: 'checkpoint_download/llama/llama_7b.yaml'

serving_config:
    agent_ports: [14002]
    start_device_id: 7
    server_ip: 'localhost'
    server_port: 19291

pa_config:
    num_blocks: 2048
    block_size: 16
    decode_seq_length: 4096

tokenizer:
    type: LlamaTokenizer
    vocab_file: '/home/wsc/llama/tokenizer/tokenizer.model'

basic_inputs:
    type: LlamaBasicInputs

extra_inputs:
    type: LlamaExtraInputs

warmup_inputs:
    type: LlamaWarmupInputs

Set the environment variables as follows.

Option 1: start with an existing script
source /path/to/xxx-serving.sh
export PYTHONPATH=/path/to/mindformers-ft-2:/path/to/serving/:$PYTHONPATH
Option 2: Docker image

After downloading the Docker image, create a container:

# --device selects which NPU devices the container may use
# -v mounts host directories into the container
# --name sets a custom container name
# the value before /bin/bash is the image ID, which can be listed with `docker images`

docker run -it -u root \
--ipc=host \
--network host \
--device=/dev/davinci0 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /etc/localtime:/etc/localtime \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /var/log/npu/:/usr/slog \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
--name {enter a container name} \
XXX /bin/bash
export PYTHONPATH=/path/to/mindformers-ft-2:/path/to/serving/:$PYTHONPATH

Start

python examples/start.py --config configs/xxx.yaml   # launches the model processes and then the serving process

Startup argument: config: the model's yaml file; refer to model.yaml

Send requests

Requests are made through "/models/model_name/generate" and "/models/model_name/generate_stream":

curl 127.0.0.1:9800/models/llama2/generate \
     -X POST \
     -d '{"inputs":"Hello?","parameters":{"max_new_tokens":55, "do_sample":"False", "return_full_text":"True"}, "stream":"True"}' \
     -H 'Content-Type: application/json'
curl 127.0.0.1:9800/models/llama2/generate_stream \
     -X POST \
     -d '{"inputs":"Hello?","parameters":{"max_new_tokens":55, "do_sample":"False", "return_full_text":"True"}, "stream":"True"}' \
     -H 'Content-Type: application/json'
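
The generate_stream endpoint delivers tokens via Server-Sent Events, so the stream can also be consumed from Python with plain HTTP. A minimal sketch, assuming the server is reachable at 127.0.0.1:9800 and that the requests package is installed; the payload format of each event is defined by the server, so the raw lines are printed as-is:

# Sketch: read the SSE token stream with plain HTTP.
import json
import requests

payload = {
    "inputs": "Hello?",
    "parameters": {"max_new_tokens": 55, "do_sample": "False", "return_full_text": "True"},
    "stream": "True",
}
with requests.post(
    "http://127.0.0.1:9800/models/llama2/generate_stream",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line:                      # skip SSE keep-alive blank lines
            print(line.decode("utf-8"))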

Or use the Python client API:

from mindspore_serving.client import MindsporeInferenceClient

client = MindsporeInferenceClient(model_type="llama2", server_url="http://127.0.0.1:8080")

# 1. test generate
text = client.generate("what is Monetary Policy?").generated_text
print('text: ', text)

# 2. test generate_stream
text = ""
for response in client.generate_stream("what is Monetary Policy?", do_sample=False, max_new_tokens=200):
    print("response 0", response)
    if response.token:
        text += response.token.text
    else:
        text = response.generated_text
print(text)
