This is the artifact of the paper "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving". We are going to guide you through the process of reproducing the main results in the paper.
Here is a high level overview of the whole process:
- Environment Setup: Create a GPU instance on RunPod from our provided template, which has the environment already set up.
- Kick-the-tires: Run some toy examples to verify DistServe and vLLM are working.
- Full evaluation: Reproduce all the main results in the paper.
We use the cloud provider RunPod to create compute instances and run all the experiments on them. We have provided credentials for you to log in to RunPod in HotCRP, and you can start an instance from a template with the environment already set up. Different experiments require different amounts of compute resources, so please follow the detailed guidance in each section below when creating the instance.
Please stop the instance each time you finish a review session; we pay real dollars for the GPU hours :)
To save your time, we've preprocessed the datasets in advance and saved them to `/app/dataset` in the template. If you want to reproduce the dataset, please follow this instruction.
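If you want a quick sanity check that the preprocessed data is actually present before launching any experiment, a minimal sketch like the one below (our own addition; it assumes nothing beyond the `/app/dataset` path mentioned above) lists the directory contents from inside the instance:

```python
# Sanity check (optional, our addition): list whatever the template ships in /app/dataset.
# We make no assumption about the exact file names, only about the directory path.
import os

dataset_dir = "/app/dataset"
for name in sorted(os.listdir(dataset_dir)):
    path = os.path.join(dataset_dir, name)
    if os.path.isfile(path):
        print(f"{name}\t{os.path.getsize(path) / 1e6:.1f} MB")
    else:
        print(f"{name}/")
```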
15 human-minutes + 15 compute-minutes
Follow the steps below to create an instance with two `A100 SXM 80GB` GPUs on RunPod with the template `DistServe-AE-GPU`:
- Log in to RunPod with the credentials provided in HotCRP.
- Switch the account from `osdi24ae` to `Hao Lab@UCSD` using the upper right button.
- Click `Pods` in the left toolbar.
- Click `+ Deploy`.
- Choose `A100 SXM 80GB`.
- Click `Change Template` and choose `DistServe-AE-GPU`.
- Choose `GPU Count`: for Kick-the-tires, 2 GPUs are sufficient and are usually available on RunPod.
- Click `Deploy On-Demand`: if the button is grey, this resource is not currently available.
- We suggest renaming the instance to `DistServe-AE-GPU-<your_reviewer_id>` to distinguish it from other reviewers' instances and avoid conflicts. To do so, navigate to the `Pods` page, click the down arrow on the right side of the instance, click the pencil icon next to the instance name, and change the name to `DistServe-AE-GPU-<your_reviewer_id>`.
When the instance has started, you can SSH into it from your terminal. Remember to provide your public key on HotCRP so that we can grant you access to the instance you create. Here are some high-level notes:
- From now on, we need to use two terminals simultaneously: one for the server (i.e. the inference engine) and one for the client (i.e. the load generator). We will refer to them as the `S-terminal` and the `C-terminal` respectively.
- We will use a wrapper script, `/app/distserve/distserve/evaluation/2-benchmark-serving/2-start-api-server.py`, to launch the API server. The script prints the command it uses to launch the server, which can be used to inspect the startup parameters.
- The load generator is located at `/app/distserve/distserve/evaluation/2-benchmark-serving/2-benchmark-serving.py`. Given a target dataset and a list of (num_prompt, request_rate) pairs, it runs several rounds of experiments, one per (num_prompt, request_rate) pair, and saves the result of each round to a file at `/workspace/exp-results/<model-name>-<dataset-name>/<backend>-<num_prompt>-<request_rate>.exp`, for example `/workspace/exp-results/opt-13b-sharegpt/vllm-50-2.exp` (see the sketch after this list for a quick way to enumerate such files).
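Because each round simply drops one `.exp` file following the naming convention above, a small helper like the sketch below (our own addition, not one of the artifact scripts; it assumes only that naming convention) can be used at any point to see which experiments have completed:

```python
# Sketch (our addition): enumerate finished experiment results, assuming the layout
#   /workspace/exp-results/<model-name>-<dataset-name>/<backend>-<num_prompt>-<request_rate>.exp
from pathlib import Path

results_root = Path("/workspace/exp-results")
for exp_file in sorted(results_root.glob("*/*.exp")):
    # rsplit keeps working even if the backend name itself contains a dash
    backend, num_prompt, request_rate = exp_file.stem.rsplit("-", 2)
    print(f"{exp_file.parent.name}: backend={backend}, "
          f"num_prompt={num_prompt}, request_rate={request_rate}")
```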
Now we can run some toy examples to verify DistServe and vLLM are working:
On the `S-terminal`, execute `bash /app/distserve/distserve/evaluation/ae-scripts/kick-the-tires/vllm-server.sh`. Wait until the server is ready (i.e. `# GPU blocks: XXX, # CPU blocks: XXX` pops up).
On the `C-terminal`, execute `bash /app/distserve/distserve/evaluation/ae-scripts/kick-the-tires/vllm-client.sh`. In this script we add the `--verbose` flag to print out all prompts and responses for a simple correctness check; we will not use this flag in the full evaluation section.
Ideally it should run without any error and generate a file `/workspace/exp-results/opt-125m-sharegpt/vllm-10-1.exp`.
On the `S-terminal`, execute `bash /app/distserve/distserve/evaluation/ae-scripts/kick-the-tires/distllm-server.sh`. Wait until the server is ready (i.e. the engine begins to print its status once per second).
On the `C-terminal`, execute `bash /app/distserve/distserve/evaluation/ae-scripts/kick-the-tires/distllm-client.sh`.
Ideally it should generate a file `/workspace/exp-results/opt-125m-sharegpt/distserve-10-1.exp`. The file should contain a JSON array that looks like:
`[{"prompt_len": 1135, "output_len": 12, "start_time": 200915.496689009, "end_time": 200915.565055445, "token_timestamps": [...]}, ...]`
15 human-minutes + 90 compute-minutes
The OPT-175B experiment of DistServe requires four 8xA100-SXM-80GB machines. On common cloud providers like AWS or RunPod, this experiment costs over $2,000 per run. Due to our limited budget, it is too expensive for us to reproduce the OPT-175B experiment (Figure 8c), so we reuse the data in our paper. However, we do provide the scripts for interested reviewers who have enough resources to reproduce the results themselves.
For the OPT-13B and OPT-66B end-to-end experiments, 8 GPUs are required. We provide a script to grab the machine automatically, because an 8xA100-SXM machine is an extremely popular resource on the cloud and it usually takes over a day to obtain one. For instructions on how to use this script, please refer to this file.
For reviewers who do not want to go through this tedious machine-grabbing process, we provide a screencast of producing the results in each figure.
If you successfully obtain one 8xA100-SXM-80GB machine, please follow the instructions below to reproduce the results in Figure 8 and Figure 9.
Let's start with the OPT-13B experiment in Figure 8:
First for vLLM:
On the `S-terminal`, execute `bash /app/distserve/distserve/evaluation/ae-scripts/e2e/opt-13b-vllm-server.sh`. Wait until the server is ready (i.e. `# GPU blocks: XXX, # CPU blocks: XXX` pops up).
On the `C-terminal`, execute `bash /app/distserve/distserve/evaluation/ae-scripts/e2e/opt-13b-vllm-client.sh`. Wait until the client finishes (i.e. exits without any error).
Then for DistServe:
On the `S-terminal`, execute `bash /app/distserve/distserve/evaluation/ae-scripts/e2e/opt-13b-distllm-server.sh`. Wait until the server is ready (i.e. the engine begins to print its status once per second).
On the `C-terminal`, execute `bash /app/distserve/distserve/evaluation/ae-scripts/e2e/opt-13b-distllm-client.sh`. Wait until the client finishes (i.e. exits without any error).
Then let's move on to the OPT-66B experiment in Figure 8:
First for vLLM:
On the `S-terminal`, execute `bash /app/distserve/distserve/evaluation/ae-scripts/e2e/opt-66b-vllm-server.sh`. Wait until the server is ready (i.e. `# GPU blocks: XXX, # CPU blocks: XXX` pops up).
On the `C-terminal`, execute `bash /app/distserve/distserve/evaluation/ae-scripts/e2e/opt-66b-vllm-client.sh`. This script runs all three datasets (ShareGPT, HumanEval, LongBench) in sequence, which will take a while (~30 minutes). Wait until the client finishes (i.e. exits without any error).
Then for DistServe:
On the `S-terminal`, execute `bash /app/distserve/distserve/evaluation/ae-scripts/e2e/opt-66b-distllm-server.sh`. Wait until the server is ready (i.e. the engine begins to print its status once per second).
On the `C-terminal`, execute `bash /app/distserve/distserve/evaluation/ae-scripts/e2e/opt-66b-distllm-client.sh`. It will also take a while (~30 minutes). Wait until the client finishes (i.e. exits without any error).
Finally, run the plotting script: execute `bash /app/distserve/distserve/evaluation/ae-scripts/e2e/plot-fig-8-and-9.sh`. Plots will be saved under `/workspace/plots`.
For the same budget reason, we cannot afford to reproduce the OPT-175B experiment in the left plot of Figure 10. However, we provide an OPT-66B version, which also verifies our claim in this section that the transmission time is negligible compared to computation in DistServe.
We also provide a screencast of producing the results in Figure 10 in case reviewers do not want to go through the machine-grabbing process.
If you have successfully obtained one 8xA100-SXM-80GB machine, then after running the end-to-end experiments above you can execute `bash /app/distserve/distserve/evaluation/ae-scripts/e2e/plot-fig-10.sh` to generate Figure 10. Plots will be saved under `/workspace/plots`.
Compute Time: 5 min
The ablation study is CPU-only. We recommend allocating an `RTX 3090` or `L40S` instance, where 32 vCPUs are available.
Follow the steps below to create an instance with one `RTX 3090` GPU on RunPod with the template `DistServe-AE-GPU`:
- Log in to RunPod with the credentials provided in HotCRP.
- Switch the account from `osdi24ae` to `Hao Lab@UCSD` using the upper right button.
- Click `Pods` in the left toolbar.
- Click `+ Deploy`.
- Choose `RTX 3090`. Note that `vCPU` is 32.
- Click `Change Template` and choose `DistServe-AE-GPU`.
- Choose `GPU Count`: 1 GPU is sufficient for this experiment.
- Click `Deploy On-Demand`: if the button is grey, this resource is not currently available.
Once you have successfully logged into the machine, execute the following commands to reproduce the results in Figure 11:
micromamba activate distserve
bash /app/distserve/simdistserve/setup.sh
bash /app/distserve/simdistserve/benchmarks/figure11-ablation/run_ablation.sh
The plot will be saved at `/workspace/ablation.pdf`.