fast-llama is a high-performance inference engine for LLMs such as LLaMA, written in pure C++. It can run an 8-bit quantized LLaMA2-7B model on a CPU with 56 cores at ~25 tokens/s. It outperforms current open-source CPU inference engines, including the well-known llama.cpp, with roughly 2.5x higher inference speed.
| Feature Name | Current Support | Future Support |
|---|---|---|
| Model Types | ✅ LLaMA2 | Other LLMs like Baichuan, StableDiffusion |
| Quantization | ✅ INT16, ✅ INT8 | INT4 |
| Model Formats | ✅ HuggingFace, ✅ gguf (by llama.cpp), ✅ flm | |
| Systems | ✅ Linux, ✅ Windows | macOS, Android, iOS |
| CPU/GPU | ✅ x86-64 CPU | ARM, Apple Mx CPUs, GPU, CPU+GPU |
| Architectures | ✅ UMA, ✅ NUMA | |
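The table lists INT8 (and INT16) weight quantization as a supported feature. fast-llama's exact quantization scheme is not described here; the sketch below is only a generic illustration of how symmetric per-row 8-bit weight quantization typically works (the names `QuantizedRow`, `quantize_row_int8`, and `dot_int8` are hypothetical, not part of fast-llama's API):

```cpp
#include <cstdint>
#include <cmath>
#include <vector>
#include <algorithm>

// Illustrative symmetric per-row INT8 quantization: each float row is stored
// as int8 values plus a single float scale, ~1 byte per weight instead of 4.
struct QuantizedRow {
    std::vector<int8_t> q;  // quantized weights
    float scale;            // multiply back by this to approximate the floats
};

QuantizedRow quantize_row_int8(const std::vector<float>& w) {
    float max_abs = 0.0f;
    for (float x : w) max_abs = std::max(max_abs, std::fabs(x));
    QuantizedRow out;
    out.scale = max_abs / 127.0f;  // maps [-max_abs, max_abs] onto [-127, 127]
    const float inv = out.scale > 0.0f ? 1.0f / out.scale : 0.0f;
    out.q.reserve(w.size());
    for (float x : w)
        out.q.push_back(static_cast<int8_t>(std::lround(x * inv)));
    return out;
}

// Dot product against float activations, applying the scale once at the end.
float dot_int8(const QuantizedRow& row, const std::vector<float>& x) {
    float acc = 0.0f;
    for (size_t i = 0; i < x.size(); ++i)
        acc += static_cast<float>(row.q[i]) * x[i];
    return acc * row.scale;
}
```

Storing weights as INT8 cuts their memory footprint and bandwidth by about 4x compared to FP32, which is a large part of why 8-bit inference is attractive on CPUs.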
## Why you should use Fast-LLaMA?

- **Fast**
  - Extremely fast on CPU: faster than any other engine on GitHub, including llama.cpp.
- **Simple**
  - Fewer than 7k lines of well-organized C++ code, with no dependencies except libnuma (only needed for multi-CPU machines).
- **"Easy To Use"** (target ☺️)
Only Linux is supported currently. Support for other platforms, including Windows, macOS, and GPU, is coming soon.

Requirements:
- GCC 10.x or a newer version
- `libnuma-dev` if your computer has more than one physical CPU
- Linux kernel v5.x or higher (needed for NUMA)
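libnuma only matters on multi-socket machines. As a rough illustration of why it is required (this is a generic libnuma sketch, not fast-llama's actual code), an engine can detect NUMA support and place a buffer on a specific node so that worker threads pinned to that socket read local memory instead of crossing the interconnect:

```cpp
#include <numa.h>    // from libnuma-dev; link with -lnuma
#include <cstdio>

int main() {
    // numa_available() returns -1 when the kernel has no NUMA support,
    // which is why a recent kernel and libnuma-dev are listed as requirements.
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA not available, falling back to regular allocation\n");
        return 1;
    }
    std::printf("NUMA nodes (sockets): %d\n", numa_num_configured_nodes());

    // Allocate a buffer on node 0 so threads running on that socket get
    // local-memory accesses instead of cross-socket traffic.
    size_t bytes = 64u * 1024 * 1024;
    void* buf = numa_alloc_onnode(bytes, 0);
    if (buf == nullptr) return 1;

    // ... run inference threads bound to node 0 against buf ...

    numa_free(buf, bytes);
    return 0;
}
```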
Method 1. Using the provided build script:

```bash
bash ./build.sh
```

Method 2. Using Make:

```bash
make -j 4
```
Run a tiny model (e.g., stories110M):

**Step 1**: Download a model. See llama2.c.

**Step 2**: Run the model

```bash
./main -c ./models/stories110M.bin -z ./models/tokenizer.bin -j 14 -q int8 -n 200 -i 'That was a long long story happened in the ancient China.'
```
Run a full-size model (e.g., LLaMA2-7B):

**Step 1**: Download a model

**Step 2**: Convert the model into FLM format

```bash
python3 ./tools/convert_flm.py -m /path/to/model-directory -o ./models/model-name-int8.flm -t int8
```

**Step 3**: Run the model

```bash
./main -c ./models/model-name-int8.flm -j 40 -n 200 -i 'That was a long long story happened in the ancient China.'
```
All supported command-line options are as follows:

- `-c`: Path to the model file
- `-f`: Model file format (e.g., gguf)
- `-j`: Number of threads to use (e.g., 56)
- `-q`: Quantization mode (e.g., int8)
- `-n`: Number of tokens to generate (e.g., 200)
- `-i`: Input text (e.g., 'That was a long long story happened in the ancient China.')
- `-h`: Show usage information
Below are some preliminary test results:

| Model | Model Size | Output Speed (8 threads) | Output Speed (28 threads) | Output Speed (56 threads) |
|---|---|---|---|---|
| stories110M | 110M | 237 tps | 400 tps | 440 tps |
| Chinese-LLaMA-1.3B | 1.3B | 38.9 tps | 127 tps | 155 tps |
| Chinese-LLaMA-7B | 7B | 7.4 tps | 17.4 tps | 23.5 tps |
- Note: tps = tokens per second
- Testing prompt: "That was a long long story happened in the ancient Europe. It was about a brave boy name Oliver. Oliver lived in a small village among many big moutains. It was a beautiful village."
- Quantization: int8
- NUMA: 2 sockets (make sure NUMA is truly available if you expect to accelerate with NUMA)
- System (`uname -a`): Linux coderlsf 5.15.0-72-generic #79-Ubuntu SMP Wed Apr 19 08:22:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- CPU: 56 physical cores with AVX-512
  - Architecture: x86_64
  - Model name: Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
  - CPU(s): 112 (56 physical cores)
  - Thread(s) per core: 2
  - Core(s) per socket: 28
  - Socket(s): 2
Latency of the first token will be optimized later.
## Why is it so fast?

- Ultimate memory efficiency
  - Zero memory allocations and frees during inference (see the sketch after this list).
  - Maximization of memory locality.
- Well-designed thread scheduling algorithm
- Optimized operators
  - Fuse all operators that can be fused together.
  - Optimize the computation of several operators.
- Proper quantization
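As an illustration of the zero-allocation point above (a minimal generic sketch, not fast-llama's actual implementation; `ScratchArena` and `generate_tokens` are hypothetical names), an inference loop can carve all per-token scratch buffers out of one arena that is allocated once at startup and simply reset each token, so the hot path never calls malloc or free:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bump-pointer arena: all memory is allocated once up front; per-token
// scratch buffers are sliced out of it and "freed" by resetting one offset.
class ScratchArena {
public:
    explicit ScratchArena(size_t bytes) : buf_(bytes), offset_(0) {}

    // Hand out a slice of the preallocated buffer (alignment relative to its start).
    void* alloc(size_t bytes, size_t align = 64) {
        size_t p = (offset_ + align - 1) & ~(align - 1);
        assert(p + bytes <= buf_.size() && "arena sized too small");
        offset_ = p + bytes;
        return buf_.data() + p;
    }

    // Called once per token: makes the whole arena reusable in O(1).
    void reset() { offset_ = 0; }

private:
    std::vector<uint8_t> buf_;
    size_t offset_;
};

// Usage sketch: size the arena for the worst-case layer, then reuse it forever.
void generate_tokens(ScratchArena& arena, int n_tokens) {
    for (int t = 0; t < n_tokens; ++t) {
        arena.reset();
        float* hidden = static_cast<float*>(arena.alloc(4096 * sizeof(float)));
        float* logits = static_cast<float*>(arena.alloc(32000 * sizeof(float)));
        (void)hidden; (void)logits;  // ... run the transformer layers here ...
    }
}
```

Reusing the same buffers every token also keeps the working set hot in cache, which is the memory-locality point made above.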
fast-llama is licensed under the MIT License.
Special thanks to AlpinDale for his professional, meticulous, and patient guidance and assistance.
Email: 📩 topcoderlsf@gmail.com

Contact me if you have any questions.