diff --git a/bench_vllm/README.md b/bench_vllm/README.md
index bf65feb..16c6869 100644
--- a/bench_vllm/README.md
+++ b/bench_vllm/README.md
@@ -5,30 +5,66 @@
 [vLLM](https://github.com/vllm-project/vllm) is a high-performance library designed for efficient large language model (LLM) inference and serving. With a focus on speed, it incorporates state-of-the-art features such as PagedAttention for memory management, continuous batching, and optimized CUDA kernels. It supports various models such as Llama and Falcon, is highly flexible, and offers multiple decoding methods as well as tensor-parallel and distributed inference.
 
-### 🚀 Running the vLLM Benchmark.
+## 🚀 Running the vLLM Benchmark
 
-You can run the vLLM benchmark using the following command:
+We can run the vLLM benchmark for two models:
+
+- [Llama2 7B Chat AWQ](https://huggingface.co/TheBloke/Llama-2-7B-Chat-AWQ) for `int4-float16` mixed precision and [Llama2 7B Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) for `float16` and `float32` precision.
+
+- [Mistral-7B v0.1 Instruct AWQ](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-AWQ) for `int4-float16` mixed precision and [Mistral-7B v0.1 Instruct](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) for `float16` and `float32` precision.
+
+Here is how we run the vLLM benchmark:
 
 ```bash
 ./bench_vllm/bench.sh \
-  --prompt \       # Enter a prompt string
-  --max_tokens \   # Maximum number of tokens to output
-  --repetitions \  # Number of repititions to be made for the prompt.
-  --log_file \     # A .log file underwhich we want to write the results.
-  --device \       # The device in which we want to benchmark.
-  --models_dir     # The directory in which model weights are present
+  --prompt \       # Enter a prompt string
+  --max_tokens \   # Maximum number of tokens to output
+  --repetitions \  # Number of repetitions to be made for the prompt.
+  --device \       # The device on which we want to benchmark.
+  --model_name     # The name of the model (options: 'llama' for Llama 2 and 'mistral' for Mistral-7B-v0.1)
 ```
 
 To get started quickly you can simply run:
 
 ```bash
-./bench_vllm/bench.sh -d cuda
+./bench_vllm/bench.sh -d cuda -n llama
 ```
 
-This will take all the default values (see in the [bench.sh](/bench_vllm/bench.sh) file) and perform the benchmarks. You can find all the benchmarks results for vLLM [here](/docs/llama2.md).
-**Note:** vLLM seems to very dependent on cuda versions. So it is recommended to either have cuda version `>= 12.0`. We also do have support for cuda version `< 12.0`.
+This will use all the default values (see the [bench.sh](/bench_vllm/bench.sh) file) and run the benchmarks. You can find all the benchmark results for Llama 2 7B Chat [here](/docs/llama2.md) and for Mistral 7B Instruct [here](/docs/mistral.md).
+
+## 💎 Quality Checks
+
+Please note: we consider the output of the raw (float32 precision) HuggingFace Transformers (PyTorch backend) weights as our ground truth. We compare this ground truth with the model's vLLM output in three precisions, i.e. `float32`, `float16` and `int4` (a sketch of such a comparison appears after the tables below). This is our system prompt:
+
+> You answers should always be to the point, precise and not more than 2 sentences strictly
+
+### Model: Mistral 7B Instruct v0.1
+
+| Question | float32 | float16 | int4 | Ground Truth |
+| --- | --- | --- | --- | --- |
+| I'm making pancakes for breakfast. I added a cup of flour, a teaspoon of salt, and a few tablespoons of sugar to a bowl. I stirred it together, then added a cup of milk, a beaten egg, and a few tablespoons of oil, and stirred until just mixed. Then I put 1/4 a cup on a hot frying pan, and flipped it when brown. But they're terrible! Why? List the main reason. Answer as much precise as possible with one sentence. | The main reason for the terrible pancakes could be that the batter was too thick, which resulted in a dry and tough texture. | The main reason for the terrible pancakes could be that the batter was too thick, which resulted in a dry and tough texture. | The main reason for the terrible pancakes could be that the batter was over-mixed, causing it to become tough and dense. | The main reason for the terrible pancakes could be that the batter was too thick, which resulted in a dry and tough texture. |
+| 42 birds are sitting on a tree branch. A hunter passes, shoots one dead, and misses two. How many birds are left on the branch? Answer as much precise as possible with one sentence. | One bird is left on the branch. | One bird is left on the branch. | One bird is left on the branch. | One bird is left on the branch. |
+| How many mistakes can you find in the following sentence? Sentence: Its to much to loose if your talking about hundred’s of dollars. Answer as much precise as possible with one sentence. | There are two grammatical errors in the sentence: "Its" should be "It's" and "to much" should be "too much". | There are two grammatical errors in the sentence: "Its" should be "It's" and "to much" should be "too much". | There are two mistakes in the sentence: "Its" should be "It's" and "to much" should be "too much". | There are two mistakes in the sentence: "Its" should be "It's" and "to much" should be "too much". |
+| You are an expert on agricultural chemical compounds. Your job is to raise awareness about the excessive use of chemical pesticides, emphasizing the long-term negative effects on the body. Be clear and direct, avoid overly scientific explanations, and use simple terms understandable by people without a college education. Answer as much precise as possible with one sentence. | Excessive use of chemical pesticides can lead to long-term negative health effects, including cancer, neurological damage, and reproductive issues. | Excessive use of chemical pesticides can lead to long-term negative health effects, including cancer, neurological damage, and reproductive issues. | Excessive use of chemical pesticides can lead to long-term negative health effects, including cancer, neurological damage, and reproductive issues. | Excessive use of chemical pesticides can lead to long-term negative health effects, including cancer, neurological damage, and reproductive issues. |
+| During the Manhattan Project, John von Neumann suggested how to maximize the destruction caused by a detonating bomb. Which was his idea? Answer as much precise as possible with one sentence. | John von Neumann proposed the use of a spherical configuration of bombs to maximize the destructive force of a detonation. | John von Neumann proposed the use of a spherical configuration of bombs to maximize the destructive force of a detonation. | John von Neumann proposed using a spherical configuration of bombs to maximize the destructive force of a detonation. | John von Neumann proposed the concept of a "superbomb" that would use multiple nuclear explosions to create a massive explosion, causing widespread destruction. |
+
+### Model: Llama-2 7B Chat
+
+| Question | int4 | float32 | float16 | Ground Truth |
+| --- | --- | --- | --- | --- |
+| I'm making pancakes for breakfast. I added a cup of flour, a teaspoon of salt, and a few tablespoons of sugar to a bowl. I stirred it together, then added a cup of milk, a beaten egg, and a few tablespoons of oil, and stirred until just mixed. Then I put 1/4 a cup on a hot frying pan, and flipped it when brown. But they're terrible! Why? List the main reason. Answer as much precise as possible with one sentence. | The main reason your pancakes are terrible is because you did not mix the ingredients well enough, resulting in a dense and flat pancake. | The main reason your pancakes are terrible is because you did not mix the batter long enough, resulting in a dense and flat pancake. | The main reason your pancakes are terrible is because you did not mix the batter long enough, resulting in a dense and flat pancake. | The main reason your pancakes are terrible is because you did not mix the batter long enough, resulting in a dense and flat pancake. |
+| 42 birds are sitting on a tree branch. A hunter passes, shoots one dead, and misses two. How many birds are left on the branch? Answer as much precise as possible with one sentence. | After the hunter shoots and misses two birds, there will be 40 birds left on the branch. | After the hunter shoots and kills one bird, 41 birds remain on the branch. | After the hunter shoots and kills one bird, 41 birds remain on the branch. | After the hunter shoots and kills one bird, 41 birds remain on the branch. |
+| How many mistakes can you find in the following sentence? Sentence: Its to much to loose if your talking about hundred’s of dollars. Answer as much precise as possible with one sentence. | There are 2 mistakes in the sentence: "It's" should be "Too much" and "loose" should be "lose". | There are 3 mistakes in the sentence:<br><br>* "to loose" should be "to lose"<br>* "hundred's" should be "hundreds"<br>* "if your" should be "if you're" | There are 3 mistakes in the sentence:<br><br>* "to loose" should be "to lose"<br>* "hundred's" should be "hundreds"<br>* "if your" should be "if you're" | There are 2 mistakes in the sentence: "to loose" should be "to lose" and "hundred's" should be "hundreds". |
+| You are an expert on agricultural chemical compounds. Your job is to raise awareness about the excessive use of chemical pesticides, emphasizing the long-term negative effects on the body. Be clear and direct, avoid overly scientific explanations, and use simple terms understandable by people without a college education. Answer as much precise as possible with one sentence. | Excessive use of chemical pesticides can have severe long-term negative effects on the human body, including hormone disruption, neurological problems, and even cancer. Prolonged exposure to these toxic substances can lead to a range of health issues, from headaches and fatigue to more serious conditions like Parkinson's disease and Alzheimer's. | Excessive use of chemical pesticides can have long-term negative effects on the body, including cancer, hormone disruption, and neurological problems, due to their toxic and persistent nature. Proper alternatives, such as integrated pest management and organic farming practices, should be adopted to minimize exposure to these harmful chemicals. | Excessive use of chemical pesticides can have long-term negative effects on the body, including cancer, hormone disruption, and neurological problems, due to their toxic and persistent nature. Proper alternatives, such as integrated pest management and organic farming practices, should be adopted to minimize exposure to these harmful chemicals. | Excessive use of chemical pesticides can have long-term negative effects on the body, including cancer, hormone disruption, and neurological problems, due to their toxic and persistent nature, which can accumulate in the body over time and cause irreversible damage. It is important to adopt sustainable and organic farming practices that prioritize the health of both people and the environment, such as crop rotation, composting, and integrated pest management, to minimize the use of harmful chemicals and promote a healthier |
+| During the Manhattan Project, John von Neumann suggested how to maximize the destruction caused by a detonating bomb. Which was his idea? Answer as much precise as possible with one sentence. | John von Neumann proposed the concept of a "fusing the core" strategy, which involved detonating a nuclear bomb at the center of a nuclear reactor to maximize the destruction caused by the explosion. This idea was intended to prevent the enemy from using nuclear reactors as a weapon by rendering them inoperable. | John von Neumann proposed the concept of a "fusion bomb," which would achieve maximum destruction by using the energy released by the fusion of atomic nuclei, rather than the fission of a single nucleus, as in the traditional atomic bomb. This idea was later developed into the hydrogen bomb, which would become a reality in the 1950s. | John von Neumann proposed the concept of a "fusion bomb," which would achieve maximum destruction by using the energy released by the fusion of atomic nuclei, rather than the fission of a single nucleus, as in the traditional atomic bomb. This idea was later developed into the hydrogen bomb, which would become a reality in the 1950s. | John von Neumann proposed the concept of a "fusion bomb," which would achieve maximum destruction through the simultaneous detonation of a fission bomb and a fusion reaction, creating a chain reaction of nuclear reactions that would release an enormous amount of energy. This idea was later developed into the hydrogen bomb. |
+
-### 👀 Some points to note:
+## 👀 Some points to note:
 
 1. Running this benchmark requires [HuggingFace Llama2-7B weights](https://huggingface.co/meta-llama/Llama-2-7b). So running this benchmark assumes that you have already agreed to the required [terms and conditions](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and been verified to download the weights.
 2. vLLM does not support CPU (check [this](https://github.com/vllm-project/vllm/issues/176) issue) or Metal devices (check [this](https://github.com/vllm-project/vllm/issues/1441) issue).
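The quality checks above amount to comparing each precision's output string against the float32 Transformers ground truth. Below is a minimal, illustrative sketch of such a comparison; the `answers` values and the 0.9 similarity threshold are assumptions for demonstration, not part of the benchmark code.

```python
# Sketch: compare outputs at each precision against the float32 ground truth.
# The `answers` data and the 0.9 threshold are illustrative assumptions only.
from difflib import SequenceMatcher

answers = {
    "ground_truth": "One bird is left on the branch.",
    "float32": "One bird is left on the branch.",
    "float16": "One bird is left on the branch.",
    "int4": "After the hunter shoots, 40 birds are left on the branch.",
}

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means the strings are identical.
    return SequenceMatcher(None, a, b).ratio()

for precision in ("float32", "float16", "int4"):
    score = similarity(answers["ground_truth"], answers[precision])
    flag = "OK" if score >= 0.9 else "DRIFT"
    print(f"{precision:>8}: similarity={score:.2f} [{flag}]")
```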
diff --git a/bench_vllm/bench.py b/bench_vllm/bench.py
index 6edefc3..945c530 100644
--- a/bench_vllm/bench.py
+++ b/bench_vllm/bench.py
@@ -1,67 +1,98 @@
-import argparse
 import gc
-import logging
+import os
 import sys
-import time
-from collections import defaultdict
 
-import numpy as np
 import torch
-from vllm import LLM
+from transformers import AutoTokenizer
+from vllm import LLM, SamplingParams
 from vllm.model_executor.parallel_utils import parallel_state
 
-logging.getLogger("vllm").setLevel(logging.ERROR)
-logging.basicConfig(
-    stream=sys.stdout,
-    level=logging.INFO,
-    format="%(asctime)s - %(levelname)s - %(message)s",
-)
+sys.path.append(os.getcwd())
 
+from common.base import BaseBenchmarkClass  # noqa
+from common.utils import launch_cli, make_report  # noqa
 
-class LlamaVLLMBenchmark:
-    def __init__(self, model_path: str, device: str, precision: str):
-        # VLLM is not supported for CPU issue: https://github.com/vllm-project/vllm/issues/176
-        # VLLM also not supports Metal, issue: https://github.com/vllm-project/vllm/issues/1441
-        assert device == "cuda", ValueError("Supported device is cuda only.")
-        assert precision in ["fp16", "fp32", "int4"], ValueError(
-            "supported precision are: fp16, fp32 and int4"
+
+class VLLMBenchmark(BaseBenchmarkClass):
+    def __init__(
+        self,
+        model_path: str,
+        model_name: str,
+        benchmark_name: str,
+        precision: str,
+        device: str,
+        experiment_name: str,
+    ) -> None:
+        assert device == "cuda", "The only supported device is 'cuda'"
+        assert precision in ["float16", "float32", "int4"], (
+            "Supported precisions are: 'float16', 'float32' and 'int4'"
         )
 
-        self.model_path, self.precision, self.device = model_path, precision, device
-        self.results = []
-        self.precision_map = {"fp16": "float16", "fp32": "float32"}
+        super().__init__(
+            model_name=model_name,
+            model_path=model_path,
+            benchmark_name=benchmark_name,
+            experiment_name=experiment_name,
+            precision=precision,
+            device=device,
+        )
 
-    def load_model(self):
-        if self.precision != "int4":
-            self.model = LLM(model=self.model_path)
-            self.model.dtype = self.precision_map[precision]
+        if model_name == "llama":
+            self.tokenizer_folder = os.path.join(
+                os.getcwd(), "models", "llama-2-7b-chat-hf"
+            )
         else:
+            self.tokenizer_folder = os.path.join(
+                os.getcwd(), "models", "mistral-7b-v0.1-instruct-hf"
+            )
+
+    def load_model_and_tokenizer(self):
+        if self.precision == "int4":
             self.model = LLM(
                 model=self.model_path, quantization="AWQ", tensor_parallel_size=1
             )
+        else:
+            # dtype must be passed at construction time; assigning it to the
+            # LLM object afterwards has no effect on the engine.
+            self.model = LLM(model=self.model_path, dtype=self.precision)
 
+        self.tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_folder)
         return self
 
-    def run_model(self, prompt: str, max_tokens: int) -> float:
-        self.model.max_num_seqs = max_tokens
-        start = time.time()
-        output = self.model.generate(prompts=[prompt])
-        delta = time.time() - start
-        return len(output[0].outputs[0].token_ids) / delta
-
-    def benchmark(
-        self,
-        prompt: str,
-        max_tokens: int,
-        repetitions: int,
-    ) -> None:
-        for i in range(repetitions):
-            logging.info(
-                f"Running repetition [{str(i+1).zfill(len(str(repetitions)))}/{repetitions}]"
+    def preprocess(
+        self, prompt: str, chat_mode: bool = True, for_benchmarks: bool = True
+    ):
+        if chat_mode:
+            template = self.get_chat_template_with_instruction(
+                prompt=prompt, for_benchmarks=for_benchmarks
             )
-            tokens_per_second = self.run_model(prompt, max_tokens)
-            self.results.append(tokens_per_second)
+            prompt = self.tokenizer.apply_chat_template(template, tokenize=False)
+
+        tokenized_input = self.tokenizer.encode(text=prompt)
+        return {
+            "prompt": prompt,
+            "input_tokens": tokenized_input,
+            "tensor": None,
+            "num_input_tokens": len(tokenized_input),
+        }
+
+    def run_model(self, inputs: dict, max_tokens: int, temperature: float) -> dict:
+        prompt = [inputs["prompt"]]
+
+        sampling_params = SamplingParams(max_tokens=max_tokens, temperature=temperature)
+        output = self.model.generate(prompt, sampling_params)
+
+        generated_text = output[0].outputs[0].text
+        generated_tokens = output[0].outputs[0].token_ids
+        return {
+            "output_tokens": generated_tokens,
+            "num_output_tokens": len(generated_tokens),
+            "output_prompt": generated_text,
+        }
+
+    def postprocess(self, output: dict) -> str:
+        return output["output_prompt"]
+
+    def on_exit(self):
         if self.device == "cuda":
             parallel_state.destroy_model_parallel()
             del self.model
@@ -74,67 +105,37 @@
 
 if __name__ == "__main__":
-    parser = argparse.ArgumentParser(description="vllm Benchmark.")
-    parser.add_argument(
-        "--prompt",
-        type=str,
-        help="The prompt for the model.",
-    )
-    parser.add_argument("--max_tokens", type=int, help="The maximum number of tokens.")
-    parser.add_argument(
-        "--repetitions",
-        type=int,
-        help="The number of repetitions for the benchmark.",
-    )
-    parser.add_argument(
-        "--device",
-        help="Device to use for the benchmark.",
-    )
-    parser.add_argument(
-        "--log_file",
-        type=str,
-        help="Path to the log file for writing logs (in append mode).",
-    )
-    parser.add_argument(
-        "--models_dir",
-        type=str,
-        help="Path to the models directory.",
-    )
+    parser = launch_cli(description="vLLM Benchmark.")
     args = parser.parse_args()
 
-    logging.info(
-        f"Running benchmark with: max_tokens={args.max_tokens} prompt={args.prompt} "
-        + f"repetitions={args.repetitions} device={args.device}"
-    )
-    report = defaultdict(lambda: defaultdict(float))
-    for precision in ("fp32", "fp16", "int4"):
-        logging.info(f"Running VLLM benchmark on Llama on {precision} precision.")
-        llama_vllm_bench = LlamaVLLMBenchmark(
-            f"{args.models_dir}/llama-2-7b-hf"
-            if precision != "int4"
-            else f"{args.models_dir}/llama-2-7b-autoawq",
-            device=args.device,
-            precision=precision,
-        ).load_model()
-        llama_vllm_bench.benchmark(
-            max_tokens=args.max_tokens, prompt=args.prompt, repetitions=args.repetitions
-        )
-        report["llama_vllm"][precision] = {
-            "mean": np.mean(llama_vllm_bench.results),
-            "std": np.std(llama_vllm_bench.results),
-        }
+    model_folder = os.path.join(os.getcwd(), "models")
+    model_name = (
+        f"{args.model_name}-2-7b-chat-"
+        if args.model_name == "llama"
+        else f"{args.model_name}-7b-v0.1-instruct-"
+    )
 
-    logging.info("Benchmark report")
-    with open(args.log_file, "a") as file:
-        for framework, quantizations in report.items():
-            for quantization, stats in quantizations.items():
-                logging.info(
-                    f"{framework}, {quantization}: {stats['mean']:.2f} ± {stats['std']:.2f}"
-                )
-                print(
-                    f"{framework}, {quantization}: {stats['mean']:.2f} ± {stats['std']:.2f}",
-                    file=file,
-                )
+    runner_dict = {
+        "cuda": [
+            {
+                "precision": "float32",
+                "model_path": os.path.join(model_folder, model_name + "hf"),
+            },
+            {
+                "precision": "float16",
+                "model_path": os.path.join(model_folder, model_name + "hf"),
+            },
+            {
+                "precision": "int4",
+                "model_path": os.path.join(model_folder, model_name + "autoawq"),
+            },
+        ]
+    }
+
+    make_report(
+        args=args,
+        benchmark_class=VLLMBenchmark,
+        runner_dict=runner_dict,
+        benchmark_name="vLLM",
+        is_bench_pytorch=False,
+    )
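The throughput figure this benchmark ultimately reports is tokens generated divided by wall-clock time, as the removed `run_model` computed explicitly. A standalone sketch of that measurement using the same vLLM calls as the patch above; the local model path and sampling settings are illustrative assumptions:

```python
# Sketch: measure tokens/sec with vLLM directly. The model path and the
# sampling settings below are assumptions for illustration only.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="models/llama-2-7b-chat-hf")  # hypothetical local weights folder
params = SamplingParams(max_tokens=512, temperature=0.1)

start = time.time()
output = llm.generate(
    ["Write an essay about the transformer model architecture"], params
)
delta = time.time() - start

generated = output[0].outputs[0].token_ids
print(f"{len(generated) / delta:.2f} tokens/sec")
```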
diff --git a/bench_vllm/bench.sh b/bench_vllm/bench.sh
index eb2afad..7898c95 100755
--- a/bench_vllm/bench.sh
+++ b/bench_vllm/bench.sh
@@ -2,24 +2,21 @@
 ########################################################################################################
 # Script: bench.sh
-# Description: This script runs benchmarks VLLM Llama2 benchmark.
+# Description: This script runs the vLLM benchmark.
 #
 # Usage: ./bench.sh [OPTIONS]
 # OPTIONS:
-#   -p, --prompt        Prompt for benchmarks (default: 'Write an essay about the transformer model architecture')
-#   -r, --repetitions   Number of repetitions for benchmarks (default: 10)
-#   -m, --max_tokens    Maximum number of tokens for benchmarks (default: 512)
-#   -d, --device        Device for benchmarks (possible values: 'metal', 'cuda', and 'cpu', default: 'cuda')
-#   -lf, --log_file     Logging file name.
-#   -md, --models_dir   Models directory.
-#   -h, --help          Show this help message
+#   -p, --prompt        Prompt for benchmarks (default: 'Write an essay about the transformer model architecture')
+#   -r, --repetitions   Number of repetitions for benchmarks (default: 10)
+#   -m, --max_tokens    Maximum number of tokens for benchmarks (default: 512)
+#   -d, --device        Device for benchmarks (possible values: 'metal', 'cuda', and 'cpu', default: 'cuda')
+#   -n, --model_name    The name of the model to benchmark (possible values: 'llama' for Llama 2, 'mistral' for Mistral 7B v0.1)
+#   -lf, --log_file     Logging file name.
+#   -h, --help          Show this help message
 ########################################################################################################
 
 set -euo pipefail
 
-CURRENT_DIR="$(pwd)"
-SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
-
 print_usage() {
     echo "Usage: $0 [OPTIONS]"
     echo "OPTIONS:"
@@ -27,12 +24,13 @@ print_usage() {
     echo "  -r, --repetitions   Number of repetitions for benchmarks (default: 10)"
     echo "  -m, --max_tokens    Maximum number of tokens for benchmarks (default: 512)"
     echo "  -d, --device        Device for benchmarks (possible values: 'metal', 'cuda', and 'cpu', default: 'cuda')"
+    echo "  -n, --model_name    The name of the model to benchmark (possible values: 'llama' for Llama 2, 'mistral' for Mistral 7B v0.1)"
     echo "  -lf, --log_file     Logging file name."
-    echo "  -md, --models_dir   Models directory."
     echo "  -h, --help          Show this help message"
     exit 1
 }
 
+SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
 
 check_cuda() {
     if command -v nvcc &> /dev/null
@@ -71,20 +69,10 @@ check_python() {
 
 setup() {
-
-    # Check if Logs folder exists else Make the logs folder
-    LOGS_FOLDER="$CURRENT_DIR/Logs"
-
-    if [ -d "$LOGS_FOLDER" ]; then
-        echo "Folder '$LOGS_FOLDER' already exists. Skipping."
-    else
-        # Create the folder
-        mkdir "$LOGS_FOLDER"
-        echo "'$LOGS_FOLDER' created."
-    fi
-
+    local DEVICE="$1"
+    local MODEL_NAME="${2:-llama}"
+
     echo -e "\nSetting up with $SCRIPT_DIR/setup.sh..."
-    bash "$SCRIPT_DIR"/setup.sh "$1"
+    bash "$SCRIPT_DIR/setup.sh" "$DEVICE" "$MODEL_NAME"
 }
 
 run_benchmarks() {
@@ -92,8 +80,7 @@ run_benchmarks() {
     local REPETITIONS="$2"
     local MAX_TOKENS="$3"
     local DEVICE="$4"
-    local LOG_FILENAME="$5"
-    local MODELS_DIR="$6"
+    local MODEL_NAME="$5"
 
     # shellcheck disable=SC1091
     source "$SCRIPT_DIR/venv/bin/activate"
@@ -101,8 +88,7 @@ run_benchmarks() {
         --prompt "$PROMPT" \
         --repetitions "$REPETITIONS" \
         --max_tokens "$MAX_TOKENS" \
-        --log_file "$LOG_FILENAME" \
-        --models_dir "$MODELS_DIR" \
+        --model_name "$MODEL_NAME" \
         --device "$DEVICE"
 }
 
@@ -127,25 +113,20 @@ while [ "$#" -gt 0 ]; do
             "cuda" | "metal" | "cpu")
                 ;;
             *)
-                echo "Invalid value for --device. Please use 'cuda', 'metal' or 'cpu'."
+ echo "Invalid value for --device. Please use 'cuda', 'cpu' or 'metal'." print_usage ;; esac - if [ "$DEVICE" == "metal" ] || [ "$DEVICE" == "cpu" ]; then - echo "$DEVICE not supported" - exit 1 - fi if [ "$DEVICE" == "cuda" ]; then check_cuda + else + echo "Not supported for $DEVICE" + exit 1 fi shift 2 ;; - -lf|--log_file) - LOG_FILENAME="$2" - shift 2 - ;; - -md|--models_dir) - MODELS_DIR="$2" + -n|--model_name) + MODEL_NAME="$2" shift 2 ;; -h|--help) @@ -160,14 +141,13 @@ done check_platform check_python -setup "$DEVICE" # Set default values if not provided PROMPT="${PROMPT:-"Write an essay about the transformer model architecture"}" REPETITIONS="${REPETITIONS:-10}" MAX_TOKENS="${MAX_TOKENS:-512}" DEVICE="${DEVICE:-'cuda'}" -LOG_FILENAME="${LOG_FILENAME:-"$LOGS_FOLDER/benchmark_vllm_$(date +'%Y%m%d%H%M%S').log"}" -MODELS_DIR="${MODELS_DIR:-"./models"}" +MODEL_NAME="${MODEL_NAME:-"llama"}" -run_benchmarks "$PROMPT" "$REPETITIONS" "$MAX_TOKENS" "$DEVICE" "$LOG_FILENAME" "$MODELS_DIR" +setup "$DEVICE" "$MODEL_NAME" +run_benchmarks "$PROMPT" "$REPETITIONS" "$MAX_TOKENS" "$DEVICE" "$MODEL_NAME" diff --git a/bench_vllm/setup.sh b/bench_vllm/setup.sh index 806df56..922bf36 100755 --- a/bench_vllm/setup.sh +++ b/bench_vllm/setup.sh @@ -8,7 +8,13 @@ set -euo pipefail -AWQ_WEIGHTS_FOLDER="${AWQ_WEIGHTS_FOLDER:-"./models/llama-2-7b-awq"}" +CURRENT_DIR="$(pwd)" +SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" + +# Set default folder paths for AWQ weights +LLAMA2_AWQ_WEIGHTS_FOLDER="$CURRENT_DIR/models/llama-2-7b-chat-autoawq" +MISTRAL_AWQ_WEIGHTS_FOLDER="$CURRENT_DIR/models/mistral-7b-v0.1-instruct-autoawq" + check_python() { if command -v python &> /dev/null; then @@ -36,7 +42,7 @@ install_vllm_cuda() { if [ "$CUDA_MAJOR" -ge 12 ] || { [ "$CUDA_MAJOR" -eq 12 ] && [ "$CUDA_MINOR" -ge 0 ]; }; then echo "Detected CUDA version >= 12.2" - "$PYTHON_CMD" -m pip install vllm + "$PYTHON_CMD" -m pip install vllm==0.4.0 transformers==4.39.2 else echo "Detected CUDA version < 12.2" PY_VERSION=$(get_python_version) @@ -48,7 +54,7 @@ install_vllm_cuda() { # Download vllm for CUDA 11.8 and specified Python version "$PYTHON_CMD" -m pip install https://github.com/vllm-project/vllm/releases/download/v0.2.2/vllm-0.2.2+cu118-"$PY_VERSION"-"$PY_VERSION"-manylinux1_x86_64.whl "$PYTHON_CMD" -m pip install torch --upgrade --index-url https://download.pytorch.org/whl/cu118 - "$PYTHON_CMD" -m pip install huggingface-cli==0.1 + "$PYTHON_CMD" -m pip install huggingface-cli==0.1 transformers==4.39.2 fi } @@ -96,26 +102,40 @@ install_device_specific_vllm() { } download_awq_weights() { - # download the sample file if not exists - if [ ! -d "$AWQ_WEIGHTS_FOLDER" ]; then - huggingface-cli download TheBloke/Llama-2-7B-AWQ --local-dir ./models/llama-2-7b-autoawq --exclude "*.git*" "*.md" "Notice" "LICENSE" + local MODEL_NAME="$1" + + # Set download directory based on MODEL_NAME + if [ "$MODEL_NAME" = "llama" ]; then + DOWNLOAD_DIR="$LLAMA2_AWQ_WEIGHTS_FOLDER" + MODEL_IDENTIFIER="TheBloke/Llama-2-7B-Chat-AWQ" + elif [ "$MODEL_NAME" = "mistral" ]; then + DOWNLOAD_DIR="$MISTRAL_AWQ_WEIGHTS_FOLDER" + MODEL_IDENTIFIER="TheBloke/Mistral-7B-Instruct-v0.1-AWQ" else - echo "Weights already downloaded!" + echo "Invalid MODEL_NAME. Supported values: 'llama', 'mistral'" + exit 1 fi -} + # Check if weights folder exists + echo "$DOWNLOAD_DIR" -# Main script starts here. + if [ ! -d "$DOWNLOAD_DIR" ]; then + # Download weights using huggingface-cli + echo "Downloading weights to $DOWNLOAD_DIR..." 
diff --git a/bench_vllm/setup.sh b/bench_vllm/setup.sh
index 806df56..922bf36 100755
--- a/bench_vllm/setup.sh
+++ b/bench_vllm/setup.sh
@@ -8,7 +8,13 @@
 set -euo pipefail
 
-AWQ_WEIGHTS_FOLDER="${AWQ_WEIGHTS_FOLDER:-"./models/llama-2-7b-awq"}"
+CURRENT_DIR="$(pwd)"
+SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
+
+# Set default folder paths for AWQ weights
+LLAMA2_AWQ_WEIGHTS_FOLDER="$CURRENT_DIR/models/llama-2-7b-chat-autoawq"
+MISTRAL_AWQ_WEIGHTS_FOLDER="$CURRENT_DIR/models/mistral-7b-v0.1-instruct-autoawq"
+
 
 check_python() {
     if command -v python &> /dev/null; then
@@ -36,7 +42,7 @@ install_vllm_cuda() {
     if [ "$CUDA_MAJOR" -ge 12 ] || { [ "$CUDA_MAJOR" -eq 12 ] && [ "$CUDA_MINOR" -ge 0 ]; }; then
         echo "Detected CUDA version >= 12.2"
-        "$PYTHON_CMD" -m pip install vllm
+        "$PYTHON_CMD" -m pip install vllm==0.4.0 transformers==4.39.2
     else
         echo "Detected CUDA version < 12.2"
         PY_VERSION=$(get_python_version)
@@ -48,7 +54,7 @@ install_vllm_cuda() {
         # Download vllm for CUDA 11.8 and specified Python version
         "$PYTHON_CMD" -m pip install https://github.com/vllm-project/vllm/releases/download/v0.2.2/vllm-0.2.2+cu118-"$PY_VERSION"-"$PY_VERSION"-manylinux1_x86_64.whl
         "$PYTHON_CMD" -m pip install torch --upgrade --index-url https://download.pytorch.org/whl/cu118
-        "$PYTHON_CMD" -m pip install huggingface-cli==0.1
+        "$PYTHON_CMD" -m pip install huggingface-cli==0.1 transformers==4.39.2
     fi
 }
 
@@ -96,26 +102,40 @@ install_device_specific_vllm() {
 }
 
 download_awq_weights() {
-    # download the sample file if not exists
-    if [ ! -d "$AWQ_WEIGHTS_FOLDER" ]; then
-        huggingface-cli download TheBloke/Llama-2-7B-AWQ --local-dir ./models/llama-2-7b-autoawq --exclude "*.git*" "*.md" "Notice" "LICENSE"
+    local MODEL_NAME="$1"
+
+    # Set download directory based on MODEL_NAME
+    if [ "$MODEL_NAME" = "llama" ]; then
+        DOWNLOAD_DIR="$LLAMA2_AWQ_WEIGHTS_FOLDER"
+        MODEL_IDENTIFIER="TheBloke/Llama-2-7B-Chat-AWQ"
+    elif [ "$MODEL_NAME" = "mistral" ]; then
+        DOWNLOAD_DIR="$MISTRAL_AWQ_WEIGHTS_FOLDER"
+        MODEL_IDENTIFIER="TheBloke/Mistral-7B-Instruct-v0.1-AWQ"
     else
-        echo "Weights already downloaded!"
+        echo "Invalid MODEL_NAME. Supported values: 'llama', 'mistral'"
+        exit 1
     fi
-}
 
+    # Check if weights folder exists
+    echo "$DOWNLOAD_DIR"
 
-# Main script starts here.
+    if [ ! -d "$DOWNLOAD_DIR" ]; then
+        # Download weights using huggingface-cli
+        echo "Downloading weights to $DOWNLOAD_DIR..."
+        huggingface-cli download "$MODEL_IDENTIFIER" --local-dir "$DOWNLOAD_DIR" --exclude "*.git*" "*.md" "Notice" "LICENSE"
+    else
+        echo "Weights already downloaded"
+    fi
+}
 
-if [ "$#" -ne 1 ]; then
-    echo "Usage: $0 <DEVICE>"
-    exit 1
-fi
 
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-DEVICE="$1"
 VENV_DIR="$SCRIPT_DIR/venv"
 
+DEVICE="$1"
+MODEL_NAME="$2"
+
 
 # Build and activate the virtual environment.
@@ -130,4 +150,4 @@
 else
     source "$VENV_DIR/bin/activate"
 fi
 
-download_awq_weights
+download_awq_weights "$MODEL_NAME"
diff --git a/common/base.py b/common/base.py
index f6b0049..9a0bd17 100644
--- a/common/base.py
+++ b/common/base.py
@@ -213,3 +213,4 @@ def get_answers(self):
                     "expected": expected,
                 }
             )
+        self.on_exit()
diff --git a/docs/llama2.md.template b/docs/llama2.md.template
index 37240e2..23b94e2 100644
--- a/docs/llama2.md.template
+++ b/docs/llama2.md.template
@@ -18,6 +18,7 @@
 | [ctransformers](/bench_ctransformers/) | - | - | 85.50 ± 1.00 | 86.66 ± 1.06 |
 | [llama.cpp](/bench_llamacpp/) | - | - | 89.90 ± 2.26 | 97.35 ± 4.71 |
 | [ctranslate](/bench_ctranslate/) | 46.26 ± 1.59 | 79.41 ± 0.37 | 48.20 ± 0.14 | - |
+| [vllm](/bench_vllm/) | 89.40 ± 0.22 | 89.43 ± 0.19 | - | 115.52 ± 0.49 |
 | [exllamav2](/bench_exllamav2/) | - | - | 125.58 ± 1.23 | 159.68 ± 1.85 |
 
@@ -32,6 +33,7 @@
 | [ctransformers](/bench_ctransformers/) | - | - | 9774.83 | 6889.14 |
 | [llama.cpp](/bench_llamacpp/) | - | - | 8797.55 | 5783.95 |
 | [ctranslate](/bench_ctranslate/) | 29951.52 | 16282.29 | 9470.74 | - |
+| [vllm](/bench_vllm/) | 77928.07 | 77928.07 | - | 77768.69 |
 | [exllamav2](/bench_exllamav2/) | - | - | 16582.18 | 7201.62 |
 
 *(Data updated: ``)
diff --git a/docs/mistral.md.template b/docs/mistral.md.template
index 3c82377..ecfa022 100644
--- a/docs/mistral.md.template
+++ b/docs/mistral.md.template
@@ -18,6 +18,7 @@
 | [ctransformers](/bench_ctransformers/) | - | - | 86.14 ± 1.40 | 87.22 ± 1.54 |
 | [llama.cpp](/bench_llamacpp/) | - | - | 88.27 ± 0.72 | 95.33 ± 5.54 |
 | [ctranslate](/bench_ctranslate/) | 43.17 ± 2.97 | 68.03 ± 0.27 | 45.14 ± 0.24 | - |
+| [vllm](/bench_vllm/) | 84.91 ± 0.27 | 84.89 ± 0.28 | - | 106.03 ± 0.53 |
 | [exllamav2](/bench_exllamav2/) | - | - | 114.81 ± 1.47 | 126.29 ± 3.05 |
 
 **Performance Metrics:** GPU Memory Consumption (unit: MB)
@@ -31,6 +32,7 @@
 | [ctransformers](/bench_ctransformers/) | - | - | 10255.07 | 6966.74 |
 | [llama.cpp](/bench_llamacpp/) | - | - | 9141.49 | 5880.41 |
 | [ctranslate](/bench_ctranslate/) | 32602.32 | 17523.8 | 10074.72 | - |
+| [vllm](/bench_vllm/) | 73568.09 | 73790.39 | - | 74016.88 |
 | [exllamav2](/bench_exllamav2/) | - | - | 21483.23 | 9460.25 |
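The `mean ± std` cells added to these doc templates are aggregates over the benchmark repetitions, matching the `np.mean`/`np.std` report format in the removed bench.py code. A minimal sketch with made-up per-repetition throughputs:

```python
# Sketch: how a "mean ± std" table cell is produced from per-repetition
# throughput results. The sample numbers below are made up for illustration.
import numpy as np

tokens_per_second = [89.12, 89.40, 89.55, 89.31, 89.62]  # one value per repetition

mean = np.mean(tokens_per_second)
std = np.std(tokens_per_second)
print(f"vllm, float32: {mean:.2f} ± {std:.2f}")
```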