diff --git a/README.md b/README.md
index 866ee5b..9c546c3 100644
--- a/README.md
+++ b/README.md
@@ -32,22 +32,22 @@
 Take a first glance of Llama-2-7B Model Performance Metrics Across Different Precision and Inference Engines. Metric used: `tokens/sec`

-| Engine                                     | float32       | float16       | int8          | int4           |
-| ------------------------------------------ | ------------- | ------------- | ------------- | -------------- |
-| [candle](/bench_candle/)                   | -             | 36.78 ± 2.17  | -             | -              |
-| [llama.cpp](/bench_llamacpp/)              | -             | -             | 79.15 ± 1.20  | 100.90 ± 1.46  |
-| [ctranslate](/bench_ctranslate/)           | 35.23 ± 4.01  | 55.72 ± 16.66 | 35.73 ± 10.87 | -              |
-| [onnx](/bench_onnxruntime/)                | -             | 54.16 ± 3.15  | -             | -              |
-| [transformers (pytorch)](/bench_pytorch/)  | 43.79 ± 0.61  | 46.39 ± 0.28  | 6.98 ± 0.05   | 21.72 ± 0.11   |
-| [vllm](/bench_vllm/)                       | 90.78 ± 1.60  | 90.54 ± 2.22  | -             | 114.69 ± 11.20 |
-| [exllamav2](/bench_exllamav2/)             | -             | -             | 121.63 ± 0.74 | 130.16 ± 0.35  |
-| [ctransformers](/bench_ctransformers/)     | -             | -             | 76.75 ± 10.36 | 84.26 ± 5.79   |
-| [AutoGPTQ](/bench_autogptq/)               | 42.01 ± 1.03  | 30.24 ± 0.41  | -             | -              |
-| [AutoAWQ](/bench_autoawq/)                 | -             | -             | -             | 109.20 ± 3.28  |
-| [DeepSpeed](/bench_deepspeed/)             | -             | 81.44 ± 8.13  | -             |                |
-| [PyTorch Lightning](/bench_lightning/)     | 24.85 ± 0.07  | 44.56 ± 2.89  | 10.50 ± 0.12  | 24.83 ± 0.05   |
-| [Optimum Nvidia](/bench_optimum_nvidia/)   | 110.36 ± 0.52 | 109.09 ± 4.26 | -             | -              |
-| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03  | 85.03 ± 0.62  | 167.66 ± 2.05 | 235.18 ± 3.20  |
+| Engine                                     | float32       | float16       | int8          | int4           |
+|--------------------------------------------|---------------|---------------|---------------|----------------|
+| [candle](/bench_candle/)                   | -             | 36.78 ± 2.17  | -             | -              |
+| [llama.cpp](/bench_llamacpp/)              | -             | -             | 79.15 ± 1.20  | 100.90 ± 1.46  |
+| [ctranslate](/bench_ctranslate/)           | 35.23 ± 4.01  | 55.72 ± 16.66 | 35.73 ± 10.87 | -              |
+| [onnx](/bench_onnxruntime/)                | -             | 54.16 ± 3.15  | -             | -              |
+| [transformers (pytorch)](/bench_pytorch/)  | 43.79 ± 0.61  | 46.39 ± 0.28  | 6.98 ± 0.05   | 21.72 ± 0.11   |
+| [vllm](/bench_vllm/)                       | 90.78 ± 1.60  | 90.54 ± 2.22  | -             | 114.69 ± 11.20 |
+| [exllamav2](/bench_exllamav2/)             | -             | -             | 121.63 ± 0.74 | 130.16 ± 0.35  |
+| [ctransformers](/bench_ctransformers/)     | -             | -             | 76.75 ± 10.36 | 84.26 ± 5.79   |
+| [AutoGPTQ](/bench_autogptq/)               | 42.01 ± 1.03  | 30.24 ± 0.41  | -             | -              |
+| [AutoAWQ](/bench_autoawq/)                 | -             | -             | -             | 109.20 ± 3.28  |
+| [DeepSpeed](/bench_deepspeed/)             | -             | 81.44 ± 8.13  | -             | -              |
+| [PyTorch Lightning](/bench_lightning/)     | 24.85 ± 0.07  | 44.56 ± 2.89  | 10.50 ± 0.12  | 24.83 ± 0.05   |
+| [Optimum Nvidia](/bench_optimum_nvidia/)   | 110.36 ± 0.52 | 109.09 ± 4.26 | -             | -              |
+| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03  | 85.03 ± 0.62  | 167.66 ± 2.05 | 235.18 ± 3.20  |

 *(Data updated: `05th April 2024`)
diff --git a/docs/llama2.md b/docs/llama2.md
index fbc81f5..d1fd4bb 100644
--- a/docs/llama2.md
+++ b/docs/llama2.md
@@ -9,22 +9,22 @@
 **Performance Metrics:** (unit: Tokens / second)

-| Engine                                     | float32       | float16       | int8          | int4           |
-| ------------------------------------------ | ------------- | ------------- | ------------- | -------------- |
-| [candle](/bench_candle/)                   | -             | 36.78 ± 2.17  | -             | -              |
-| [llama.cpp](/bench_llamacpp/)              | -             | -             | 79.15 ± 1.20  | 100.90 ± 1.46  |
-| [ctranslate](/bench_ctranslate/)           | 35.23 ± 4.01  | 55.72 ± 16.66 | 35.73 ± 10.87 | -              |
-| [onnx](/bench_onnxruntime/)                | -             | 54.16 ± 3.15  | -             | -              |
-| [transformers (pytorch)](/bench_pytorch/)  | 43.79 ± 0.61  | 46.39 ± 0.28  | 6.98 ± 0.05   | 21.72 ± 0.11   |
-| [vllm](/bench_vllm/)                       | 90.78 ± 1.60  | 90.54 ± 2.22  | -             | 114.69 ± 11.20 |
-| [exllamav2](/bench_exllamav2/)             | -             | -             | 121.63 ± 0.74 | 130.16 ± 0.35  |
-| [ctransformers](/bench_ctransformers/)     | -             | -             | 76.75 ± 10.36 | 84.26 ± 5.79   |
-| [AutoGPTQ](/bench_autogptq/)               | 42.01 ± 1.03  | 30.24 ± 0.41  | -             | -              |
-| [AutoAWQ](/bench_autoawq/)                 | -             | -             | -             | 109.20 ± 3.28  |
-| [DeepSpeed](/bench_deepspeed/)             | -             | 81.44 ± 8.13  | -             |                |
-| [PyTorch Lightning](/bench_lightning/)     | 24.85 ± 0.07  | 44.56 ± 2.89  | 10.50 ± 0.12  | 24.83 ± 0.05   |
-| [Optimum Nvidia](/bench_optimum_nvidia/)   | 110.36 ± 0.52 | 109.09 ± 4.26 | -             | -              |
-| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03  | 85.03 ± 0.62  | 167.66 ± 2.05 | 235.18 ± 3.20  |
+| Engine                                     | float32       | float16       | int8          | int4           |
+|--------------------------------------------|---------------|---------------|---------------|----------------|
+| [candle](/bench_candle/)                   | -             | 36.78 ± 2.17  | -             | -              |
+| [llama.cpp](/bench_llamacpp/)              | -             | -             | 79.15 ± 1.20  | 100.90 ± 1.46  |
+| [ctranslate](/bench_ctranslate/)           | 35.23 ± 4.01  | 55.72 ± 16.66 | 35.73 ± 10.87 | -              |
+| [onnx](/bench_onnxruntime/)                | -             | 54.16 ± 3.15  | -             | -              |
+| [transformers (pytorch)](/bench_pytorch/)  | 43.79 ± 0.61  | 46.39 ± 0.28  | 6.98 ± 0.05   | 21.72 ± 0.11   |
+| [vllm](/bench_vllm/)                       | 90.78 ± 1.60  | 90.54 ± 2.22  | -             | 114.69 ± 11.20 |
+| [exllamav2](/bench_exllamav2/)             | -             | -             | 121.63 ± 0.74 | 130.16 ± 0.35  |
+| [ctransformers](/bench_ctransformers/)     | -             | -             | 76.75 ± 10.36 | 84.26 ± 5.79   |
+| [AutoGPTQ](/bench_autogptq/)               | 42.01 ± 1.03  | 30.24 ± 0.41  | -             | -              |
+| [AutoAWQ](/bench_autoawq/)                 | -             | -             | -             | 109.20 ± 3.28  |
+| [DeepSpeed](/bench_deepspeed/)             | -             | 81.44 ± 8.13  | -             | -              |
+| [PyTorch Lightning](/bench_lightning/)     | 24.85 ± 0.07  | 44.56 ± 2.89  | 10.50 ± 0.12  | 24.83 ± 0.05   |
+| [Optimum Nvidia](/bench_optimum_nvidia/)   | 110.36 ± 0.52 | 109.09 ± 4.26 | -             | -              |
+| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03  | 85.03 ± 0.62  | 167.66 ± 2.05 | 235.18 ± 3.20  |

 *(Data updated: `05th April 2024`)
@@ -39,12 +39,12 @@
 - Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cpu --prompt 'Write an essay about the transformer model architecture'`

 **Performance Metrics:** (unit: Tokens / second)
-| Engine                                  | float32 | float16     | int8         | int4         |
-| --------------------------------------- | ------- | ----------- | ------------ | ------------ |
-| [candle](/bench_candle/)                | -       | 3.43 ± 0.02 | -            | -            |
-| [llama.cpp](/bench_llamacpp/)           | -       | -           | 13.24 ± 0.62 | 21.43 ± 0.47 |
-| [ctranslate](/bench_ctranslate/)        | -       | -           | 1.87 ± 0.14  | -            |
-| [ctransformers](/bench_ctransformers/)  | -       | -           | 13.50 ± 0.48 | 20.57 ± 2.50 |
+| Engine                                  | float32      | float16      | int8         | int4         |
+|-----------------------------------------|--------------|--------------|--------------|--------------|
+| [candle](/bench_candle/)                | -            | 3.43 ± 0.02  | -            | -            |
+| [llama.cpp](/bench_llamacpp/)           | -            | -            | 13.24 ± 0.62 | 21.43 ± 0.47 |
+| [ctranslate](/bench_ctranslate/)        | -            | -            | 1.87 ± 0.14  | -            |
+| [ctransformers](/bench_ctransformers/)  | -            | -            | 13.50 ± 0.48 | 20.57 ± 2.50 |

 ### GPU (Metal)
@@ -52,9 +52,9 @@
 **Command:** `./benchmark.sh --repetitions 10 --max_tokens 512 --device metal --prompt 'Write an essay about the transformer model architecture'`

 **Performance Metrics:** (unit: Tokens / second)
-| Engine                                  | float32 | float16 | int8         | int4         |
-| --------------------------------------- | ------- | ------- | ------------ | ------------ |
-| [llama.cpp](/bench_llamacpp/)           | -       | -       | 30.11 ± 0.45 | 44.27 ± 0.12 |
-| [ctransformers](/bench_ctransformers/)  | -       | -       | 20.75 ± 0.36 | 34.04 ± 2.11 |
+| Engine                                  | float32      | float16      | int8         | int4         |
+|-----------------------------------------|--------------|--------------|--------------|--------------|
+| [llama.cpp](/bench_llamacpp/)           | -            | -            | 30.11 ± 0.45 | 44.27 ± 0.12 |
+| [ctransformers](/bench_ctransformers/)  | -            | -            | 20.75 ± 0.36 | 34.04 ± 2.11 |

 *(Data updated: `05th April 2024`)
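
A note on reading the cells: each value appears to be the mean ± standard deviation of `tokens/sec` across the `--repetitions 10` runs of a given engine/precision pair. Below is a minimal sketch of that aggregation; the `summarize` helper and the sample throughput numbers are hypothetical illustrations, not part of the benchmark scripts in this repo.

```python
# Hypothetical sketch: collapse per-repetition throughput readings into the
# "mean ± std" format used by the tables above. `summarize` and the sample
# numbers are made up for illustration; they are not part of benchmark.sh.
import statistics

def summarize(tokens_per_sec: list[float]) -> str:
    """Format a series of tokens/sec readings as 'mean ± std'."""
    mean = statistics.mean(tokens_per_sec)
    std = statistics.stdev(tokens_per_sec)  # sample (n-1) standard deviation
    return f"{mean:.2f} ± {std:.2f}"

# e.g. ten repetitions of one engine/precision pair:
runs = [79.9, 78.4, 80.1, 79.2, 77.8, 79.5, 80.6, 78.9, 79.0, 78.1]
print(summarize(runs))  # -> 79.15 ± 0.89
```

Whether the suite uses the sample or the population standard deviation is an assumption here; with only 10 repetitions the choice of denominator visibly affects the ± term, so it is worth confirming against the actual aggregation code before comparing error bars across engines.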