
Commit

Merge branch 'amd_docker' into legion_workflow
goliaro committed Aug 31, 2023
2 parents 7e81ec2 + d483ed2 commit 8873c49
Showing 283 changed files with 39,269 additions and 1,752 deletions.
3 changes: 0 additions & 3 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -10,6 +10,3 @@ Linked Issues:
Issues closed by this PR:
- Closes #

**Before merging:**

- [ ] Did you update the [flexflow-third-party](https://github.com/flexflow/flexflow-third-party) repo, if modifying any of the Cmake files, the build configs, or the submodules?
230 changes: 230 additions & 0 deletions .github/README.md
@@ -0,0 +1,230 @@
# FlexFlow Serve: Low-Latency, High-Performance LLM Serving
![build](https://github.com/flexflow/flexflow/workflows/build/badge.svg?branch=inference) ![gpu tests](https://github.com/flexflow/flexflow/workflows/gpu-ci/badge.svg?branch=inference) ![multinode gpu tests](https://github.com/flexflow/flexflow/workflows/multinode-test/badge.svg?branch=master) ![docker](https://github.com/flexflow/flexflow/workflows/docker-build/badge.svg?branch=inference) ![pip](https://github.com/flexflow/flexflow/workflows/pip-install/badge.svg?branch=inference) ![shell-check](https://github.com/flexflow/flexflow/workflows/Shell%20Check/badge.svg?branch=inference) ![clang-format](https://github.com/flexflow/flexflow/workflows/clang-format%20Check/badge.svg?branch=inference) [![Documentation Status](https://readthedocs.org/projects/flexflow/badge/?version=latest)](https://flexflow.readthedocs.io/en/latest/?badge=latest)


---

## News🔥:

* [08/16/2023] Added StarCoder model support
* [08/14/2023] Released Dockerfile for different CUDA versions

## What is FlexFlow Serve

The high computational and memory requirements of generative large language
models (LLMs) make it challenging to serve them quickly and cheaply.
FlexFlow Serve is an open-source compiler and distributed system for
__low latency__, __high performance__ LLM serving. FlexFlow Serve outperforms
existing systems by 1.3-2.0x for single-node, multi-GPU inference and by
1.4-2.4x for multi-node, multi-GPU inference.

<p align="center">
<img src="https://github.com/flexflow/FlexFlow/blob/inference/img/performance.png?raw=true" alt="Performance comparison" height="320"/>
</p>


## Install FlexFlow Serve


### Requirements
* OS: Linux
* GPU backend: HIP/ROCm or CUDA
* CUDA version: 10.2 – 12.0
* NVIDIA compute capability: 6.0 or higher
* Python: 3.6 or higher
* Package dependencies: [see here](https://github.com/flexflow/FlexFlow/blob/inference/requirements.txt)

### Install with pip
You can install FlexFlow Serve using pip:

```bash
pip install flexflow
```

### Try it in Docker
If you run into any issues during installation, or if you would like to use the C++ API without building from source, you can also use our pre-built Docker packages for different CUDA versions and for the `hip_rocm` backend. To download and run our pre-built Docker container:

```bash
docker run --gpus all -it --rm --shm-size=8g ghcr.io/flexflow/flexflow-cuda-11.8:latest
```

To download a Docker container for a backend other than CUDA v11.8, replace the `cuda-11.8` suffix with any of the following: `cuda-11.1`, `cuda-11.2`, `cuda-11.3`, `cuda-11.5`, `cuda-11.6`, `cuda-11.7`, or `hip_rocm`. More information on the Docker images, including instructions for building a new image from source or running with additional configurations, can be found [here](../docker/README.md).

### Build from source

You can install FlexFlow Serve from source code by building the inference branch of FlexFlow. Please follow these [instructions](https://flexflow.readthedocs.io/en/latest/installation.html).

## Quickstart
The following example shows how to deploy an LLM using FlexFlow Serve and accelerate its serving using [speculative inference](#speculative-inference). First, we import `flexflow.serve` and initialize the FlexFlow Serve runtime. Note that `memory_per_gpu` and `zero_copy_memory_per_node` specify the size of device memory on each GPU (in MB) and zero-copy memory on each node (in MB), respectively.
Make sure that the aggregate GPU memory and zero-copy memory are **both** sufficient to store the LLM parameters when serving without offloading. FlexFlow Serve combines tensor and pipeline model parallelism for LLM serving.
```python
import flexflow.serve as ff

ff.init(
    num_gpus=4,
    memory_per_gpu=14000,
    zero_copy_memory_per_node=30000,
    tensor_parallelism_degree=4,
    pipeline_parallelism_degree=1
)
```
Second, we specify the LLM to serve and the SSM(s) used to accelerate LLM serving. The list of supported LLMs and SSMs is available at [supported models](#supported-llms-and-ssms).
```python
# Specify the LLM
llm = ff.LLM("decapoda-research/llama-7b-hf")

# Specify a list of SSMs (just one in this case)
ssms = []
ssm = ff.SSM("JackFram/llama-68m")
ssms.append(ssm)
```
Next, we declare the generation configuration and compile both the LLM and SSMs. Note that all SSMs should run in **beam search** mode, and the LLM should run in **tree verification** mode to verify the tokens speculated by the SSMs.
```python
# Create the sampling configs
generation_config = ff.GenerationConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)

# Compile the SSMs for inference and load the weights into memory
for ssm in ssms:
    ssm.compile(generation_config)

# Compile the LLM for inference and load the weights into memory
llm.compile(generation_config, ssms=ssms)
```
Finally, we call `llm.generate` to generate the output, which is organized as a list of `GenerationResult` objects, each containing the output tokens and text.
```python
result = llm.generate("Here are some travel tips for Tokyo:\n")
```

### Incremental decoding
<details>
<summary>Expand here</summary>
<br>

```python
import flexflow.serve as ff

# Initialize the FlexFlow runtime. ff.init() takes a dictionary or the path to a JSON file with the configs
ff.init(
    num_gpus=4,
    memory_per_gpu=14000,
    zero_copy_memory_per_node=30000,
    tensor_parallelism_degree=4,
    pipeline_parallelism_degree=1
)

# Create the FlexFlow LLM
llm = ff.LLM("decapoda-research/llama-7b-hf")

# Create the sampling configs
generation_config = ff.GenerationConfig(
    do_sample=True, temperature=0.9, topp=0.8, topk=1
)

# Compile the LLM for inference and load the weights into memory
llm.compile(generation_config)

# Generation begins!
result = llm.generate("Here are some travel tips for Tokyo:\n")
```

</details>

### C++ interface
If you'd like to use the C++ interface (mostly used for development and benchmarking purposes), you should install from source, and follow the instructions below.

<details>
<summary>Expand here</summary>
<br>

#### Downloading models
Before running FlexFlow Serve, you should manually download the LLM and SSM model(s) of interest using the [inference/utils/download_hf_model.py](https://github.com/flexflow/FlexFlow/blob/inference/inference/utils/download_hf_model.py) script (see the example below). By default, the script downloads all of a model's assets (weights, configs, tokenizer files, etc.) into the cache folder `~/.cache/flexflow`. If you would like to use a different folder, you can specify it with the `--cache-folder` parameter.

```bash
python3 ./inference/utils/download_hf_model.py <HF model 1> <HF model 2> ...
```

#### Running the C++ examples
A C++ example is available at [this folder](../inference/spec_infer/). After building FlexFlow Serve, the executable will be available at `/build_dir/inference/spec_infer/spec_infer`. You can use the following command-line arguments to run FlexFlow Serve:

* `-ll:gpu`: number of GPU processors to use on each node for serving an LLM (default: 0)
* `-ll:fsize`: size of device memory on each GPU in MB
* `-ll:zsize`: size of zero-copy memory (pinned DRAM with direct GPU access) in MB. FlexFlow Serve keeps a replica of the LLM parameters on zero-copy memory, and therefore requires that the zero-copy memory is sufficient for storing the LLM parameters.
* `-llm-model`: the LLM model ID from HuggingFace (e.g. "decapoda-research/llama-7b-hf")
* `-ssm-model`: the SSM model ID from HuggingFace (e.g. "JackFram/llama-160m"). You can use multiple `-ssm-model`s in the command line to launch multiple SSMs.
* `-cache-folder`: the folder containing the downloaded model assets (default: `~/.cache/flexflow`)
* `-data-parallelism-degree`, `-tensor-parallelism-degree` and `-pipeline-parallelism-degree`: parallelization degrees in the data, tensor, and pipeline dimensions. Their product must equal the number of GPUs available on the machine. When any of the three parallelism degree arguments is omitted, a default value of 1 will be used.
* `-prompt`: (optional) path to the prompt file. FlexFlow Serve expects a JSON file containing the prompts (a minimal sketch of creating such a file is shown after this list). In addition to the prompt file, requests can also be registered programmatically through FlexFlow Serve's API.
* `-output-file`: (optional) path of the file in which to save the model's output, together with the generation latency
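
For reference, here is a minimal sketch of creating a prompt file; it assumes the expected format is a plain JSON array of prompt strings (matching the hosted prompt datasets linked below), so adjust it if your setup expects a different schema:

```python
# Hypothetical helper: write a prompt file for the -prompt flag, assuming the
# expected format is a JSON array of prompt strings.
import json

prompts = [
    "Here are some travel tips for Tokyo:\n",
    "Give three tips for staying healthy.\n",
]

with open("prompt.json", "w") as f:
    json.dump(prompts, f, indent=2)
```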

For example, you can use the following command line to serve a LLaMA-7B or LLaMA-13B model on 4 GPUs, using a collectively boost-tuned LLaMA-68M model for speculative inference.

```bash
./inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 -llm-model decapoda-research/llama-7b-hf -ssm-model JackFram/llama-68m -prompt /path/to/prompt.json -tensor-parallelism-degree 4 --fusion
```
</details>

## Speculative Inference
A key technique that enables FlexFlow Serve to accelerate LLM serving is speculative
inference, which combines various collectively boost-tuned small speculative
models (SSMs) to jointly predict the LLM’s outputs; the predictions are organized as a
token tree, whose nodes each represent a candidate token sequence. The correctness
of all candidate token sequences represented by a token tree is verified against the
LLM’s output in parallel using a novel tree-based parallel decoding mechanism.
FlexFlow Serve uses an LLM as a token tree verifier rather than an incremental decoder,
which significantly reduces the end-to-end inference latency and computational requirements
of serving generative LLMs while provably preserving model quality.

<p align="center">
<img src="https://github.com/flexflow/FlexFlow/blob/inference/img/spec_infer_demo.gif?raw=true" alt="A Speculative Inference Demo" width="630"/>
</p>
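
As a rough illustration of the draft-and-verify idea described above (not FlexFlow Serve's actual algorithm, kernels, or API), the toy sketch below uses a deterministic stand-in for the LLM and a cheaper, imperfect stand-in for the SSM: the SSM expands a small token tree of candidate continuations, and the LLM is used only to verify them, accepting the longest candidate prefix that it would have produced itself.

```python
# Toy sketch of tree-based speculative decoding; the hash-based "models" below are
# stand-ins for real networks and are not part of FlexFlow Serve.
import zlib

VOCAB = list("abcdefgh")

def llm_next(prefix: str) -> str:
    # Stand-in "LLM": a deterministic next-token rule.
    return VOCAB[zlib.crc32(prefix.encode()) % len(VOCAB)]

def ssm_guesses(prefix: str) -> list:
    # Stand-in "SSM": usually agrees with the LLM, but is wrong about 1/4 of the time,
    # and always proposes a second-best alternative to branch the token tree.
    h = zlib.crc32(b"ssm:" + prefix.encode())
    top = llm_next(prefix) if h % 4 else VOCAB[h % len(VOCAB)]
    alt = VOCAB[(VOCAB.index(top) + 1 + h % 3) % len(VOCAB)]
    return [top, alt]

def propose_tree(prefix: str, depth: int = 3) -> list:
    # Expand the per-step guesses into all root-to-leaf candidate token sequences.
    paths = [[]]
    for _ in range(depth):
        paths = [p + [g] for p in paths for g in ssm_guesses(prefix + "".join(p))]
    return paths

def verify(prefix: str, paths: list) -> list:
    # Keep the longest candidate prefix the LLM itself would have generated.
    # (A real system scores all tree nodes in a single batched forward pass.)
    best = []
    for path in paths:
        accepted = []
        for tok in path:
            if tok != llm_next(prefix + "".join(accepted)):
                break
            accepted.append(tok)
        best = max(best, accepted, key=len)
    return best

prefix = "a"
while len(prefix) < 16:
    accepted = verify(prefix, propose_tree(prefix))
    # If nothing is accepted, fall back to one verified LLM token, so every round
    # makes progress and the output always matches incremental decoding.
    prefix += "".join(accepted) or llm_next(prefix)
    print(f"accepted {len(accepted)} speculative tokens -> {prefix!r}")
```

Because the verifier only ever keeps tokens the LLM would have chosen itself, the output matches plain incremental decoding; the speedup comes from accepting several tokens per verification step.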

### Supported LLMs and SSMs

FlexFlow Serve currently supports all HuggingFace models with the following architectures (a quick way to check a model's architecture is shown after this list):
* `LlamaForCausalLM` / `LLaMAForCausalLM` (e.g. LLaMA/LLaMA-2, Guanaco, Vicuna, Alpaca, ...)
* `OPTForCausalLM` (models from the OPT family)
* `RWForCausalLM` (models from the Falcon family)
* `GPTBigCodeForCausalLM` (models from the Starcoder family)
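
To check which architecture a given HuggingFace model uses, you can inspect its configuration, for example with the `transformers` package (assumed to be installed; the printed value reflects the model card at the time of writing):

```python
from transformers import AutoConfig

# trust_remote_code is needed for models (e.g. Falcon) that ship custom model code.
config = AutoConfig.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)
print(config.architectures)  # e.g. ['RWForCausalLM']
```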

Below is a list of models that we have explicitly tested and for which an SSM may be available:

| Model | Model id on HuggingFace | Boost-tuned SSMs |
| :---- | :---- | :---- |
| LLaMA-7B | decapoda-research/llama-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
| LLaMA-13B | decapoda-research/llama-13b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
| LLaMA-30B | decapoda-research/llama-30b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
| LLaMA-65B | decapoda-research/llama-65b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
| LLaMA-2-7B | meta-llama/Llama-2-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
| LLaMA-2-13B | meta-llama/Llama-2-13b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
| LLaMA-2-70B | meta-llama/Llama-2-70b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m) , [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
| OPT-6.7B | facebook/opt-6.7b | [OPT-125M](https://huggingface.co/facebook/opt-125m) |
| OPT-13B | facebook/opt-13b | [OPT-125M](https://huggingface.co/facebook/opt-125m) |
| OPT-30B | facebook/opt-30b | [OPT-125M](https://huggingface.co/facebook/opt-125m) |
| OPT-66B | facebook/opt-66b | [OPT-125M](https://huggingface.co/facebook/opt-125m) |
| Falcon-7B | tiiuae/falcon-7b | |
| Falcon-40B | tiiuae/falcon-40b | |
| StarCoder-7B | bigcode/starcoderbase-7b | |
| StarCoder-15.5B | bigcode/starcoder | |

### CPU Offloading
FlexFlow Serve also offers offloading-based inference for running large models (e.g., LLaMA-7B) on a single GPU. With CPU offloading, tensors are kept in CPU memory and copied to the GPU only when they are needed for computation. Currently, we selectively offload only the largest weight tensors (the weights of the Linear and Attention layers). Since the small SSMs occupy considerably less space and do not pose a bottleneck for GPU memory, while offloading adds extra data movement and computational cost, offloading is applied only to the large model. You can enable offloading with the `-offload` and `-offload-reserve-space-size` flags.
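
The sketch below illustrates the general idea in plain PyTorch (a conceptual illustration only, not FlexFlow Serve's implementation): the weight lives in pinned host memory and is copied to the GPU just for the duration of the matmul.

```python
# Conceptual illustration of weight offloading (not FlexFlow Serve's code): keep a
# large weight in pinned CPU memory and copy it to the GPU only when it is needed.
import torch

def offloaded_linear(x: torch.Tensor, weight_cpu: torch.Tensor) -> torch.Tensor:
    weight_gpu = weight_cpu.to(x.device, non_blocking=True)  # copied just-in-time
    return x @ weight_gpu.t()                                # GPU copy freed afterwards

if torch.cuda.is_available():
    weight = torch.randn(4096, 4096).pin_memory()  # stays in (pinned) host DRAM
    x = torch.randn(8, 4096, device="cuda")
    print(offloaded_linear(x, weight).shape)       # torch.Size([8, 4096])
```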

### Quantization
FlexFlow Serve supports int4 and int8 quantization. The compressed tensors are stored on the CPU side; once copied to the GPU, they are decompressed and converted back to their original precision. You can find the compressed weight files in our S3 bucket, or use [this script](../inference/utils/compress_llama_weights.py) from the [FlexGen](https://github.com/FMInference/FlexGen) project to perform the compression manually.
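
For illustration, here is a rough sketch of the round trip that group-wise absmax quantization performs (a generic example, not the FlexGen script or FlexFlow Serve's kernels): weights are stored in a compact integer format and scaled back to floating point before being used.

```python
# Generic group-wise absmax quantization round trip (illustration only).
import numpy as np

def quantize(w: np.ndarray, bits: int = 8, group: int = 64):
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8, 7 for int4
    groups = w.reshape(-1, group)
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scale), -qmax, qmax).astype(np.int8)
    return q, scale          # int4 values kept in an int8 container; a real kernel packs two per byte

def dequantize(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(shape)

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize(w, bits=4)
w_hat = dequantize(q, scale, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```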

### Prompt Datasets
We provide five prompt datasets for evaluating FlexFlow Serve: [Chatbot instruction prompts](https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatbot.json), [ChatGPT Prompts](https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatgpt.json), [WebQA](https://specinfer.s3.us-east-2.amazonaws.com/prompts/webqa.json), [Alpaca](https://specinfer.s3.us-east-2.amazonaws.com/prompts/alpaca.json), and [PIQA](https://specinfer.s3.us-east-2.amazonaws.com/prompts/piqa.json).
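
As a small sketch, the snippet below downloads one of these datasets and prints a prompt; it assumes each dataset is a JSON array of prompt strings, and the commented line assumes an `llm` compiled as in the Quickstart above.

```python
# Fetch a hosted prompt dataset (assumed to be a JSON array of prompt strings).
import json
import urllib.request

url = "https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatbot.json"
with urllib.request.urlopen(url) as response:
    prompts = json.load(response)

print(f"loaded {len(prompts)} prompts; first prompt:\n{prompts[0]}")
# results = [llm.generate(prompt) for prompt in prompts]  # with a compiled `llm`
```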

## TODOs

FlexFlow Serve is under active development. We are currently focusing on the following tasks and strongly welcome all contributions, from bug fixes to new features and extensions.

* AMD support. We are actively working on supporting FlexFlow Serve on AMD GPUs and welcome any contributions to this effort.

## Acknowledgements
This project was initiated by members from CMU, Stanford, and UCSD. We will continue to develop and support FlexFlow Serve.

## License
FlexFlow is licensed under the Apache License 2.0.
1 change: 1 addition & 0 deletions .github/workflows/build-skip.yml
@@ -3,6 +3,7 @@ on:
pull_request:
paths-ignore:
- "include/**"
- "inference/**"
- "cmake/**"
- "config/**"
- "deps/**"
57 changes: 40 additions & 17 deletions .github/workflows/build.yml
@@ -3,6 +3,7 @@ on:
pull_request:
paths:
- "include/**"
- "inference/**"
- "cmake/**"
- "config/**"
- "deps/**"
@@ -15,6 +16,7 @@ on:
- "master"
paths:
- "include/**"
- "inference/**"
- "cmake/**"
- "config/**"
- "deps/**"
@@ -146,6 +148,8 @@ jobs:
matrix:
gpu_backend: ["cuda", "hip_rocm"]
fail-fast: false
env:
FF_GPU_BACKEND: ${{ matrix.gpu_backend }}
steps:
- name: Checkout Git Repository
uses: actions/checkout@v3
@@ -157,14 +161,15 @@ jobs:

- name: Install CUDA
uses: Jimver/cuda-toolkit@v0.2.11
if: ${{ matrix.gpu_backend == 'cuda' }}
id: cuda-toolkit
with:
cuda: "11.8.0"
# Disable caching of the CUDA binaries, since it does not give us any significant performance improvement
use-github-cache: "false"

- name: Install system dependencies
run: FF_GPU_BACKEND=${{ matrix.gpu_backend }} .github/workflows/helpers/install_dependencies.sh
run: .github/workflows/helpers/install_dependencies.sh

- name: Install conda and FlexFlow dependencies
uses: conda-incubator/setup-miniconda@v2
@@ -178,17 +183,25 @@ jobs:
export CUDNN_DIR="$CUDA_PATH"
export CUDA_DIR="$CUDA_PATH"
export FF_HOME=$(pwd)
export FF_GPU_BACKEND=${{ matrix.gpu_backend }}
export FF_CUDA_ARCH=70
export FF_HIP_ARCH=gfx1100,gfx1036
export hip_version=5.6
export FF_BUILD_ALL_INFERENCE_EXAMPLES=ON
if [[ "${FF_GPU_BACKEND}" == "cuda" ]]; then
export FF_BUILD_ALL_EXAMPLES=ON
export FF_BUILD_UNIT_TESTS=ON
else
export FF_BUILD_ALL_EXAMPLES=OFF
export FF_BUILD_UNIT_TESTS=OFF
fi
cores_available=$(nproc --all)
n_build_cores=$(( cores_available -1 ))
if (( $n_build_cores < 1 )) ; then n_build_cores=1 ; fi
mkdir build
cd build
if [[ "${FF_GPU_BACKEND}" == "cuda" ]]; then
export FF_BUILD_ALL_EXAMPLES=ON
export FF_BUILD_UNIT_TESTS=ON
fi
../config/config.linux
make -j $n_build_cores
@@ -197,35 +210,45 @@ jobs:
export CUDNN_DIR="$CUDA_PATH"
export CUDA_DIR="$CUDA_PATH"
export FF_HOME=$(pwd)
export FF_GPU_BACKEND=${{ matrix.gpu_backend }}
export FF_CUDA_ARCH=70
cd build
export FF_HIP_ARCH=gfx1100,gfx1036
export hip_version=5.6
export FF_BUILD_ALL_INFERENCE_EXAMPLES=ON
if [[ "${FF_GPU_BACKEND}" == "cuda" ]]; then
export FF_BUILD_ALL_EXAMPLES=ON
export FF_BUILD_ALL_EXAMPLES=ON
export FF_BUILD_UNIT_TESTS=ON
else
export FF_BUILD_ALL_EXAMPLES=OFF
export FF_BUILD_UNIT_TESTS=OFF
fi
cd build
../config/config.linux
sudo make install
sudo ldconfig
- name: Check availability of Python flexflow.core module
if: ${{ matrix.gpu_backend == 'cuda' }}
run: |
export LD_LIBRARY_PATH="$CUDA_PATH/lib64/stubs:$LD_LIBRARY_PATH"
sudo ln -s "$CUDA_PATH/lib64/stubs/libcuda.so" "$CUDA_PATH/lib64/stubs/libcuda.so.1"
export CPU_ONLY_TEST=1
python -c "import flexflow.core; exit()"
- name: Run C++ unit tests
if: ${{ matrix.gpu_backend == 'cuda' }}
run: |
export CUDNN_DIR="$CUDA_PATH"
export CUDA_DIR="$CUDA_PATH"
export LD_LIBRARY_PATH="$CUDA_PATH/lib64/stubs:$LD_LIBRARY_PATH"
export FF_HOME=$(pwd)
sudo ln -s "$CUDA_PATH/lib64/stubs/libcuda.so" "$CUDA_PATH/lib64/stubs/libcuda.so.1"
cd build
./tests/unit/unit-test
- name: Check availability of Python flexflow.core module
run: |
if [[ "${FF_GPU_BACKEND}" == "cuda" ]]; then
export LD_LIBRARY_PATH="$CUDA_PATH/lib64/stubs:$LD_LIBRARY_PATH"
fi
# Remove build folder to check that the installed version can run independently of the build files
rm -rf build
export CPU_ONLY_TEST=1
python -c "import flexflow.core; exit()"
makefile-build:
name: Build FlexFlow with the Makefile
runs-on: ubuntu-20.04
1 change: 1 addition & 0 deletions .github/workflows/clang-format-check.yml
@@ -10,6 +10,7 @@ jobs:
- check: "src"
exclude: '\.proto$'
- check: "include"
- check: "inference"
- check: "nmt"
- check: "python"
- check: "scripts"