Update llm_inference.md
Signed-off-by: alabulei1 <vivian.xiage@gmail.com>
alabulei1 authored and juntao committed Jul 29, 2024
1 parent e3cca88 commit e053703
Showing 1 changed file with 31 additions and 57 deletions.
88 changes: 31 additions & 57 deletions docs/develop/rust/wasinn/llm_inference.md
@@ -2,44 +2,11 @@
sidebar_position: 1
---

# Llama 2 inference

WasmEdge now supports running open source models in Rust. We will use [this example project](https://github.com/second-state/LlamaEdge/tree/main/chat) to show how to make AI inferences with the llama2 model in WasmEdge and Rust.

WasmEdge now supports the following models:

1. Llama-2-7B-Chat
1. Llama-2-13B-Chat
1. CodeLlama-13B-Instruct
1. Mistral-7B-Instruct-v0.1
1. Mistral-7B-Instruct-v0.2
1. MistralLite-7B
1. OpenChat-3.5-0106
1. OpenChat-3.5-1210
1. OpenChat-3.5
1. Wizard-Vicuna-13B-Uncensored-GGUF
1. TinyLlama-1.1B-Chat-v1.0
1. Baichuan2-13B-Chat
1. OpenHermes-2.5-Mistral-7B
1. Dolphin-2.2-Yi-34B
1. Dolphin-2.6-Mistral-7B
1. Samantha-1.2-Mistral-7B
1. Samantha-1.11-CodeLlama-34B
1. WizardCoder-Python-7B-V1.0
1. Zephyr-7B-Alpha
1. WizardLM-13B-V1.0-Uncensored
1. Orca-2-13B
1. Neural-Chat-7B-v3-1
1. Yi-34B-Chat
1. Starling-LM-7B-alpha
1. DeepSeek-Coder-6.7B
1. DeepSeek-LLM-7B-Chat
1. SOLAR-10.7B-Instruct-v1.0
1. Mixtral-8x7B-Instruct-v0.1
1. Nous-Hermes-2-Mixtral-8x7B-DPO
1. Nous-Hermes-2-Mixtral-8x7B-SFT

And more, please check [the supported models](https://github.com/second-state/LlamaEdge/blob/main/models.md) for details.
# LLM inference

WasmEdge now supports running open-source Large Language Models (LLMs) in Rust. We will use [this example project](https://github.com/second-state/LlamaEdge/tree/main/chat) to show how to make AI inferences with the llama-3.1-8B model in WasmEdge and Rust.

In general, WasmEdge can support any open-source LLM. Please check [the supported models](https://github.com/second-state/LlamaEdge/blob/main/models.md) for details.

## Prerequisite

@@ -55,23 +22,23 @@ First, get the latest llama-chat wasm application
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm
```

Next, let's get the model. In this example, we are going to use the llama2 7b chat model in GGUF format. You can also use other kinds of llama2 models, check out [here](https://github.com/second-state/llamaedge/blob/main/chat/README.md#get-model).
Next, let's get the model. In this example, we are going to use the llama-3.1-8B model in GGUF format. You can also use other LLMs; see [the chat README](https://github.com/second-state/llamaedge/blob/main/chat/README.md#get-model) for how to get them.

```bash
curl -LO https://huggingface.co/wasmedge/llama2/resolve/main/llama-2-7b-chat-q5_k_m.gguf
curl -LO https://huggingface.co/second-state/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
```

Run the inference application in WasmEdge.

```bash
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm
wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf llama-chat.wasm -p llama-3-chat
```

After executing the command, you may need to wait a moment for the input prompt to appear. You can enter your question once you see the `[USER]:` prompt:

```bash
[USER]:
I have two apples, each costing 5 dollars. What is the total cost of these apple
I have two apples, each costing 5 dollars. What is the total cost of these apples?
[ASSISTANT]:
The total cost of the two apples is 10 dollars.
[USER]:
@@ -95,19 +62,26 @@ Second, use `cargo` to build the example project.
cargo build --target wasm32-wasi --release
```

The output WASM file is `target/wasm32-wasi/release/llama-chat.wasm`. Next, use WasmEdge to load the llama-2-7b model and then ask the model to questions.
The output WASM file is `target/wasm32-wasi/release/llama-chat.wasm`. Next, use WasmEdge to load the llama-3.1-8b model and then ask the model questions.

```bash
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm
wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf llama-chat.wasm -p llama-3-chat
```

After executing the command, you may need to wait a moment for the input prompt to appear. You can enter your question once you see the `[USER]:` prompt:
After executing the command, you may need to wait a moment for the input prompt to appear. You can enter your question once you see the `[You]:` prompt:

```bash
[USER]:
Who is Robert Oppenheimer?
[ASSISTANT]:
Robert Oppenheimer was an American theoretical physicist and director of the Manhattan Project, which developed the atomic bomb during World War II. He is widely regarded as one of the most important physicists of the 20th century and is known for his contributions to the development of quantum mechanics and the theory of the atomic nucleus. Oppenheimer was also a prominent figure in the post-war nuclear weapons debate and was a strong advocate for international cooperation on nuclear weapons control.
[You]:
Which one is greater? 9.11 or 9.8?

[Bot]:
9.11 is greater.

[You]:
why

[Bot]:
11 is greater than 8.
```

## Options
@@ -118,13 +92,13 @@ You can configure the chat inference application through CLI options.
-m, --model-alias <ALIAS>
Model alias [default: default]
-c, --ctx-size <CTX_SIZE>
Size of the prompt context [default: 4096]
Size of the prompt context [default: 512]
-n, --n-predict <N_PREDICT>
Number of tokens to predict [default: 1024]
-g, --n-gpu-layers <N_GPU_LAYERS>
Number of layers to run on the GPU [default: 100]
-b, --batch-size <BATCH_SIZE>
Batch size for prompt processing [default: 4096]
Batch size for prompt processing [default: 512]
-r, --reverse-prompt <REVERSE_PROMPT>
Halt generation at PROMPT, return control.
-s, --system-prompt <SYSTEM_PROMPT>
@@ -148,8 +122,8 @@ The `--prompt-template` option is perhaps the most interesting. It allows the ap
Furthermore, the following command tells WasmEdge to print out logs and statistics of the model at runtime.

```bash
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
llama-chat.wasm --prompt-template llama-2-chat --log-stat
wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
llama-chat.wasm --prompt-template llama-3-chat --log-stat
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
@@ -173,7 +147,7 @@ You can make the inference program run faster by AOT compiling the wasm file fir
```bash
wasmedge compile llama-chat.wasm llama-chat.wasm
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm
wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf llama-chat.wasm
```
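
Every `wasmedge` command in this document passes the model through a single colon-separated `--nn-preload` value such as `default:GGML:AUTO:Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf`. The sketch below shows one way to read that value, assuming its four fields are a model alias, a backend encoding, an execution target, and a model file path. The `NnPreload` struct, its field names, and `parse_nn_preload` are hypothetical and exist only for this illustration; they are not taken from the WasmEdge source.

```rust
/// Illustrative container for the fields of an `--nn-preload` value.
/// The struct and its field names are assumptions made for this sketch.
#[derive(Debug, PartialEq)]
struct NnPreload {
    alias: String,
    encoding: String,
    target: String,
    path: String,
}

/// Split `alias:encoding:target:path` into its four parts.
/// `splitn(4, ':')` stops after three separators, so any further
/// colons stay inside the path.
fn parse_nn_preload(value: &str) -> Option<NnPreload> {
    let mut parts = value.splitn(4, ':');
    let alias = parts.next()?;
    let encoding = parts.next()?;
    let target = parts.next()?;
    let path = parts.next()?;
    // Reject values with an empty field, e.g. "default::AUTO:model.gguf".
    if alias.is_empty() || encoding.is_empty() || target.is_empty() || path.is_empty() {
        return None;
    }
    Some(NnPreload {
        alias: alias.to_string(),
        encoding: encoding.to_string(),
        target: target.to_string(),
        path: path.to_string(),
    })
}

fn main() {
    let spec = "default:GGML:AUTO:Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf";
    match parse_nn_preload(spec) {
        Some(p) => println!(
            "alias={} encoding={} target={} path={}",
            p.alias, p.encoding, p.target, p.path
        ),
        None => eprintln!("malformed --nn-preload value: {}", spec),
    }
}
```

Read this way, the alias `default` in the commands matches the default of the `--model-alias` option, which is why the examples do not need to pass `-m` explicitly.
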
## Understand the code
@@ -185,7 +159,7 @@ The [main.rs](https://github.com/second-state/llamaedge/blob/main/chat/src/main.
curl -LO https://github.com/second-state/llamaedge/releases/latest/download/llama-simple.wasm
# Give it a prompt and ask it to use the model to complete it.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-simple.wasm \
wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf llama-simple.wasm \
--prompt 'Robert Oppenheimer most important achievement is ' --ctx-size 512
output: in 1942, when he led the team that developed the first atomic bomb, which was dropped on Hiroshima, Japan in 1945.
@@ -275,7 +249,7 @@ Next, execute the model inference.
context.compute().expect("Failed to complete inference");
```
After the inference is finished, extract the result from the computation context and lose invalid UTF8 sequences handled by converting the output to a string using `String::from_utf8_lossy`.
After the inference is finished, extract the result from the computation context and convert it to a string with `String::from_utf8_lossy`, which replaces any invalid UTF-8 sequences instead of failing.
```rust
let mut output_buffer = vec![0u8; *CTX_SIZE.get().unwrap()];
@@ -296,5 +270,5 @@ println!("\noutput: {}", output);
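
The lossy conversion step described above can be tried in isolation, without a model or the WASI-NN plugin. The following is a minimal sketch: `decode_output` is a hypothetical helper written for this illustration (it is not part of the example project), and the byte buffers stand in for whatever the inference call would have written.

```rust
/// Hypothetical helper for this sketch: keep only the first `len` bytes
/// that were actually written into `buffer` and convert them to a `String`.
/// Invalid UTF-8 sequences are replaced with U+FFFD rather than causing an error.
fn decode_output(buffer: &[u8], len: usize) -> String {
    // Clamp the reported length so slicing can never go out of bounds.
    let end = len.min(buffer.len());
    String::from_utf8_lossy(&buffer[..end]).into_owned()
}

fn main() {
    // A zero-initialised buffer that is only partly filled, like the
    // fixed-size output buffer in the example code.
    let mut buffer = vec![0u8; 16];
    buffer[..5].copy_from_slice(b"hello");
    println!("{}", decode_output(&buffer, 5));

    // A stray 0xFF byte is never valid UTF-8; it becomes U+FFFD.
    let broken = [b'o', b'k', 0xFF];
    println!("{}", decode_output(&broken, broken.len()));
}
```

Slicing the buffer to the reported length before converting avoids carrying the unused trailing zero bytes into the resulting string.
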
## Resources
* If you're looking for multi-turn conversations with llama 2 models, please check out the above mentioned chat example source code [here](https://github.com/second-state/llamaedge/tree/main/chat).
* If you want to construct OpenAI-compatible APIs specifically for any open-source LLMs, please check out the source code [for the API server](https://github.com/second-state/llamaedge/tree/main/api-server).
* If you want to construct OpenAI-compatible APIs specifically for your llama2 model, or the Llama2 model itself, please check out the source code [for the API server](https://github.com/second-state/llamaedge/tree/main/api-server).
* To learn more, please check out [this article](https://medium.com/stackademic/fast-and-portable-llama2-inference-on-the-heterogeneous-edge-a62508e82359).
