diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index 183028b022..e8177cd9b7 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -10,6 +10,3 @@ Linked Issues: Issues closed by this PR: - Closes # -**Before merging:** - -- [ ] Did you update the [flexflow-third-party](https://github.com/flexflow/flexflow-third-party) repo, if modifying any of the Cmake files, the build configs, or the submodules? diff --git a/.github/README.md b/.github/README.md new file mode 100644 index 0000000000..56434f6bf9 --- /dev/null +++ b/.github/README.md @@ -0,0 +1,230 @@ +# FlexFlow Serve: Low-Latency, High-Performance LLM Serving +![build](https://github.com/flexflow/flexflow/workflows/build/badge.svg?branch=inference) ![gpu tests](https://github.com/flexflow/flexflow/workflows/gpu-ci/badge.svg?branch=inference) ![multinode gpu tests](https://github.com/flexflow/flexflow/workflows/multinode-test/badge.svg?branch=master) ![docker](https://github.com/flexflow/flexflow/workflows/docker-build/badge.svg?branch=inference) ![pip](https://github.com/flexflow/flexflow/workflows/pip-install/badge.svg?branch=inference) ![shell-check](https://github.com/flexflow/flexflow/workflows/Shell%20Check/badge.svg?branch=inference) ![clang-format](https://github.com/flexflow/flexflow/workflows/clang-format%20Check/badge.svg?branch=inference) [![Documentation Status](https://readthedocs.org/projects/flexflow/badge/?version=latest)](https://flexflow.readthedocs.io/en/latest/?badge=latest) + + +--- + +## News🔥: + +* [08/16/2023] Adding Starcoder model support +* [08/14/2023] Released Dockerfile for different CUDA versions + +## What is FlexFlow Serve + +The high computational and memory requirements of generative large language +models (LLMs) make it challenging to serve them quickly and cheaply. +FlexFlow Serve is an open-source compiler and distributed system for +__low latency__, __high performance__ LLM serving. FlexFlow Serve outperforms +existing systems by 1.3-2.0x for single-node, multi-GPU inference and by +1.4-2.4x for multi-node, multi-GPU inference. + +

+[Figure: Performance comparison]

+ + +## Install FlexFlow Serve + + +### Requirements +* OS: Linux +* GPU backend: Hip-ROCm or CUDA + * CUDA version: 10.2 – 12.0 + * NVIDIA compute capability: 6.0 or higher +* Python: 3.6 or higher +* Package dependencies: [see here](https://github.com/flexflow/FlexFlow/blob/inference/requirements.txt) + +### Install with pip +You can install FlexFlow Serve using pip: + +```bash +pip install flexflow +``` + +### Try it in Docker +If you run into any issue during the install, or if you would like to use the C++ API without needing to install from source, you can also use our pre-built Docker package for different CUDA versions and the `hip_rocm` backend. To download and run our pre-built Docker container: + +```bash +docker run --gpus all -it --rm --shm-size=8g ghcr.io/flexflow/flexflow-cuda-11.8:latest +``` + +To download a Docker container for a backend other than CUDA v11.8, you can replace the `cuda-11.8` suffix with any of the following backends: `cuda-11.1`, `cuda-11.2`, `cuda-11.3`, `cuda-11.5`, `cuda-11.6`, `cuda-11.7`, `cuda-11.8`, and `hip_rocm`). More info on the Docker images, with instructions to build a new image from source, or run with additional configurations, can be found [here](../docker/README.md). + +### Build from source + +You can install FlexFlow Serve from source code by building the inference branch of FlexFlow. Please follow these [instructions](https://flexflow.readthedocs.io/en/latest/installation.html). + +## Quickstart +The following example shows how to deploy an LLM using FlexFlow Serve and accelerate its serving using [speculative inference](#speculative-inference). First, we import `flexflow.serve` and initialize the FlexFlow Serve runtime. Note that `memory_per_gpu` and `zero_copy_memory_per_node` specify the size of device memory on each GPU (in MB) and zero-copy memory on each node (in MB), respectively. +We need to make sure the aggregated GPU memory and zero-copy memory are **both** sufficient to store LLM parameters in non-offloading serving. FlexFlow Serve combines tensor and pipeline model parallelism for LLM serving. +```python +import flexflow.serve as ff + +ff.init( + num_gpus=4, + memory_per_gpu=14000, + zero_copy_memory_per_node=30000, + tensor_parallelism_degree=4, + pipeline_parallelism_degree=1 + ) +``` +Second, we specify the LLM to serve and the SSM(s) used to accelerate LLM serving. The list of supported LLMs and SSMs is available at [supported models](#supported-llms-and-ssms). +```python +# Specify the LLM +llm = ff.LLM("decapoda-research/llama-7b-hf") + +# Specify a list of SSMs (just one in this case) +ssms=[] +ssm = ff.SSM("JackFram/llama-68m") +ssms.append(ssm) +``` +Next, we declare the generation configuration and compile both the LLM and SSMs. Note that all SSMs should run in the **beam search** mode, and the LLM should run in the **tree verification** mode to verify the speculated tokens from SSMs. +```python +# Create the sampling configs +generation_config = ff.GenerationConfig( + do_sample=False, temperature=0.9, topp=0.8, topk=1 +) + +# Compile the SSMs for inference and load the weights into memory +for ssm in ssms: + ssm.compile(generation_config) + +# Compile the LLM for inference and load the weights into memory +llm.compile(generation_config, ssms=ssms) +``` +Finally, we call `llm.generate` to generate the output, which is organized as a list of `GenerationResult`, which include the output tokens and text. +```python +result = llm.generate("Here are some travel tips for Tokyo:\n") +``` + +### Incremental decoding +
+Expand here +
+ +```python +import flexflow.serve as ff + +# Initialize the FlexFlow runtime. ff.init() takes a dictionary or the path to a JSON file with the configs +ff.init( + num_gpus=4, + memory_per_gpu=14000, + zero_copy_memory_per_node=30000, + tensor_parallelism_degree=4, + pipeline_parallelism_degree=1 + ) + +# Create the FlexFlow LLM +llm = ff.LLM("decapoda-research/llama-7b-hf") + +# Create the sampling configs +generation_config = ff.GenerationConfig( + do_sample=True, temperature=0.9, topp=0.8, topk=1 +) + +# Compile the LLM for inference and load the weights into memory +llm.compile(generation_config) + +# Generation begins! +result = llm.generate("Here are some travel tips for Tokyo:\n") +``` + +
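+Each call to `llm.generate` returns the output as a list of `GenerationResult` objects containing the generated tokens and text. As a minimal follow-up sketch, continuing from the example above (the field names `output_text` and `output_tokens` are assumptions; check the `GenerationResult` definition in your FlexFlow version):
+
+```python
+# Continuing from the incremental-decoding example above:
+# `result` holds one GenerationResult per prompt.
+for r in result:
+    print(r.output_text)             # assumed field: the generated text
+    print(len(r.output_tokens))      # assumed field: the generated token ids
+```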
+ +### C++ interface +If you'd like to use the C++ interface (mostly used for development and benchmarking purposes), you should install from source, and follow the instructions below. + +
+Expand here +
+
+#### Downloading models
+Before running FlexFlow Serve, you should manually download the LLM and SSM(s) of interest using the [inference/utils/download_hf_model.py](https://github.com/flexflow/FlexFlow/blob/inference/inference/utils/download_hf_model.py) script (see example below). By default, the script downloads all of a model's assets (weights, configs, tokenizer files, etc.) into the cache folder `~/.cache/flexflow`. If you would like to use a different folder, you can specify it via the parameter `--cache-folder`.
+
+```bash
+python3 ./inference/utils/download_hf_model.py ...
+```
+
+#### Running the C++ examples
+A C++ example is available at [this folder](../inference/spec_infer/). After building FlexFlow Serve, the executable will be available at `/build_dir/inference/spec_infer/spec_infer`. You can use the following command-line arguments to run FlexFlow Serve:
+
+* `-ll:gpu`: number of GPU processors to use on each node for serving an LLM (default: 0)
+* `-ll:fsize`: size of device memory on each GPU, in MB
+* `-ll:zsize`: size of zero-copy memory (pinned DRAM with direct GPU access), in MB. FlexFlow Serve keeps a replica of the LLM parameters in zero-copy memory, so the zero-copy memory must be large enough to store the LLM parameters.
+* `-llm-model`: the LLM model ID from HuggingFace (e.g. "decapoda-research/llama-7b-hf")
+* `-ssm-model`: the SSM model ID from HuggingFace (e.g. "JackFram/llama-160m"). You can pass multiple `-ssm-model`s on the command line to launch multiple SSMs.
+* `-cache-folder`: path to the folder where model assets are cached (default: `~/.cache/flexflow`)
+* `-data-parallelism-degree`, `-tensor-parallelism-degree` and `-pipeline-parallelism-degree`: parallelization degrees in the data, tensor, and pipeline dimensions. Their product must equal the number of GPUs available on the machine. When any of the three parallelism degree arguments is omitted, a default value of 1 is used.
+* `-prompt`: (optional) path to the prompt file. FlexFlow Serve expects a JSON-format file for prompts. In addition, users can also register requests programmatically through the FlexFlow Serve API.
+* `-output-file`: (optional) filepath where the output of the model, together with the generation latency, is saved
+
+For example, you can use the following command line to serve a LLaMA-7B or LLaMA-13B model on 4 GPUs and use two collectively boost-tuned LLaMA-68M models for speculative inference.
+
+```bash
+./inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 -llm-model decapoda-research/llama-7b-hf -ssm-model JackFram/llama-68m -prompt /path/to/prompt.json -tensor-parallelism-degree 4 --fusion
+```
+
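+As an illustration, the sketch below writes a small prompt file for use with `-prompt`, assuming the expected format is a plain JSON list of prompt strings (see the datasets linked under [Prompt Datasets](#prompt-datasets) for concrete examples of the schema):
+
+```python
+# Hypothetical helper: create a prompt file to pass via `-prompt`.
+# Assumption: the file is a JSON list of prompt strings.
+import json
+
+prompts = [
+    "Here are some travel tips for Tokyo:\n",
+    "Three tips for staying healthy are:\n",
+]
+with open("prompt.json", "w") as f:
+    json.dump(prompts, f, indent=2)
+```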
+ +## Speculative Inference +A key technique that enables FlexFlow Serve to accelerate LLM serving is speculative +inference, which combines various collectively boost-tuned small speculative +models (SSMs) to jointly predict the LLM’s outputs; the predictions are organized as a +token tree, whose nodes each represent a candidate token sequence. The correctness +of all candidate token sequences represented by a token tree is verified against the +LLM’s output in parallel using a novel tree-based parallel decoding mechanism. +FlexFlow Serve uses an LLM as a token tree verifier instead of an incremental decoder, +which largely reduces the end-to-end inference latency and computational requirement +for serving generative LLMs while provably preserving model quality. + +

+[Figure: A Speculative Inference Demo]
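+To make the speculate-then-verify idea concrete, here is a deliberately simplified, framework-free sketch: a linear draft chain with greedy acceptance and toy stand-in models, not FlexFlow Serve's actual tree-based parallel decoding, which verifies a whole token tree in a single parallel LLM pass.
+
+```python
+# Toy illustration of speculate-then-verify decoding (not the FlexFlow implementation).
+def llm_next(prefix):
+    # stand-in for the large model's greedy next-token choice
+    return (sum(prefix) * 31 + 7) % 50
+
+def ssm_draft(prefix, k):
+    # stand-in for a small draft model: right most of the time, wrong occasionally
+    out, ctx = [], list(prefix)
+    for _ in range(k):
+        guess = llm_next(ctx)
+        if len(ctx) % 3 == 0:          # inject an occasional wrong guess
+            guess = (guess + 1) % 50
+        out.append(guess)
+        ctx.append(guess)
+    return out
+
+def speculative_step(prefix, k=4):
+    draft = ssm_draft(prefix, k)
+    accepted = []
+    for tok in draft:
+        if llm_next(prefix + accepted) == tok:
+            accepted.append(tok)                          # draft verified: keep it
+        else:
+            accepted.append(llm_next(prefix + accepted))  # mismatch: take the LLM's token, stop
+            break
+    return prefix + accepted
+
+tokens = [3]
+for _ in range(4):
+    tokens = speculative_step(tokens)   # several tokens accepted per LLM verification
+print(tokens)
+```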

+
+
+### Supported LLMs and SSMs
+
+FlexFlow Serve currently supports all HuggingFace models with the following architectures:
+* `LlamaForCausalLM` / `LLaMAForCausalLM` (e.g. LLaMA/LLaMA-2, Guanaco, Vicuna, Alpaca, ...)
+* `OPTForCausalLM` (models from the OPT family)
+* `RWForCausalLM` (models from the Falcon family)
+* `GPTBigCodeForCausalLM` (models from the Starcoder family)
+
+Below is a list of models that we have explicitly tested and for which an SSM may be available:
+
+| Model | Model id on HuggingFace | Boost-tuned SSMs |
+| :---- | :---- | :---- |
+| LLaMA-7B | decapoda-research/llama-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m), [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
+| LLaMA-13B | decapoda-research/llama-13b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m), [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
+| LLaMA-30B | decapoda-research/llama-30b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m), [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
+| LLaMA-65B | decapoda-research/llama-65b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m), [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
+| LLaMA-2-7B | meta-llama/Llama-2-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m), [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
+| LLaMA-2-13B | meta-llama/Llama-2-13b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m), [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
+| LLaMA-2-70B | meta-llama/Llama-2-70b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m), [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) |
+| OPT-6.7B | facebook/opt-6.7b | [OPT-125M](https://huggingface.co/facebook/opt-125m) |
+| OPT-13B | facebook/opt-13b | [OPT-125M](https://huggingface.co/facebook/opt-125m) |
+| OPT-30B | facebook/opt-30b | [OPT-125M](https://huggingface.co/facebook/opt-125m) |
+| OPT-66B | facebook/opt-66b | [OPT-125M](https://huggingface.co/facebook/opt-125m) |
+| Falcon-7B | tiiuae/falcon-7b | |
+| Falcon-40B | tiiuae/falcon-40b | |
+| StarCoder-7B | bigcode/starcoderbase-7b | |
+| StarCoder-15.5B | bigcode/starcoder | |
+
+### CPU Offloading
+FlexFlow Serve also offers offloading-based inference for running large models (e.g., llama-7B) on a single GPU. With CPU offloading, tensors are kept in CPU memory and copied to the GPU only when needed for computation. Note that we currently offload only the largest weight tensors (the weight tensors in the Linear and Attention layers). Since the small model occupies considerably less space and does not pose a bottleneck for GPU memory, while offloading adds extra runtime and computational cost, we only offload the weights of the large model. You can run the offloading example by enabling the `-offload` and `-offload-reserve-space-size` flags.
+
+### Quantization
+FlexFlow Serve supports int4 and int8 quantization. The compressed tensors are stored on the CPU side. Once copied to the GPU, these tensors undergo decompression and conversion back to their original precision. Please find the compressed weight files in our s3 bucket, or use [this script](../inference/utils/compress_llama_weights.py) from the [FlexGen](https://github.com/FMInference/FlexGen) project to do the compression manually.
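+To illustrate what this kind of weight compression involves, here is a generic per-tensor absmax int8 round-trip sketch; it is illustrative only and is not the exact scheme used by FlexFlow Serve or FlexGen:
+
+```python
+# Generic absmax int8 quantization round-trip (illustrative only).
+import numpy as np
+
+def quantize_int8(w):
+    scale = np.abs(w).max() / 127.0        # map the float range onto [-127, 127]
+    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
+    return q, scale
+
+def dequantize_int8(q, scale):
+    return q.astype(np.float32) * scale    # recover an approximation before use on the GPU
+
+w = np.random.randn(256, 256).astype(np.float32)
+q, scale = quantize_int8(w)
+w_hat = dequantize_int8(q, scale)
+print("max abs reconstruction error:", np.max(np.abs(w - w_hat)))
+```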
+ +### Prompt Datasets +We provide five prompt datasets for evaluating FlexFlow Serve: [Chatbot instruction prompts](https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatbot.json), [ChatGPT Prompts](https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatgpt.json), [WebQA](https://specinfer.s3.us-east-2.amazonaws.com/prompts/webqa.json), [Alpaca](https://specinfer.s3.us-east-2.amazonaws.com/prompts/alpaca.json), and [PIQA](https://specinfer.s3.us-east-2.amazonaws.com/prompts/piqa.json). + +## TODOs + +FlexFlow Serve is under active development. We currently focus on the following tasks and strongly welcome all contributions from bug fixes to new features and extensions. + +* AMD support. We are actively working on supporting FlexFlow Serve on AMD GPUs and welcome any contributions to this effort. + +## Acknowledgements +This project is initiated by members from CMU, Stanford, and UCSD. We will be continuing developing and supporting FlexFlow Serve. + +## License +FlexFlow uses Apache License 2.0. diff --git a/.github/workflows/build-skip.yml b/.github/workflows/build-skip.yml index b3ab69e9c1..8635c0d137 100644 --- a/.github/workflows/build-skip.yml +++ b/.github/workflows/build-skip.yml @@ -3,6 +3,7 @@ on: pull_request: paths-ignore: - "include/**" + - "inference/**" - "cmake/**" - "config/**" - "deps/**" diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml index 1e7081a613..4e457ada1b 100644 --- a/.github/workflows/build.yml +++ b/.github/workflows/build.yml @@ -3,6 +3,7 @@ on: pull_request: paths: - "include/**" + - "inference/**" - "cmake/**" - "config/**" - "deps/**" @@ -15,6 +16,7 @@ on: - "master" paths: - "include/**" + - "inference/**" - "cmake/**" - "config/**" - "deps/**" @@ -146,6 +148,8 @@ jobs: matrix: gpu_backend: ["cuda", "hip_rocm"] fail-fast: false + env: + FF_GPU_BACKEND: ${{ matrix.gpu_backend }} steps: - name: Checkout Git Repository uses: actions/checkout@v3 @@ -157,6 +161,7 @@ jobs: - name: Install CUDA uses: Jimver/cuda-toolkit@v0.2.11 + if: ${{ matrix.gpu_backend == 'cuda' }} id: cuda-toolkit with: cuda: "11.8.0" @@ -164,7 +169,7 @@ jobs: use-github-cache: "false" - name: Install system dependencies - run: FF_GPU_BACKEND=${{ matrix.gpu_backend }} .github/workflows/helpers/install_dependencies.sh + run: .github/workflows/helpers/install_dependencies.sh - name: Install conda and FlexFlow dependencies uses: conda-incubator/setup-miniconda@v2 @@ -178,17 +183,25 @@ jobs: export CUDNN_DIR="$CUDA_PATH" export CUDA_DIR="$CUDA_PATH" export FF_HOME=$(pwd) - export FF_GPU_BACKEND=${{ matrix.gpu_backend }} export FF_CUDA_ARCH=70 + export FF_HIP_ARCH=gfx1100,gfx1036 + export hip_version=5.6 + export FF_BUILD_ALL_INFERENCE_EXAMPLES=ON + + if [[ "${FF_GPU_BACKEND}" == "cuda" ]]; then + export FF_BUILD_ALL_EXAMPLES=ON + export FF_BUILD_UNIT_TESTS=ON + else + export FF_BUILD_ALL_EXAMPLES=OFF + export FF_BUILD_UNIT_TESTS=OFF + fi + cores_available=$(nproc --all) n_build_cores=$(( cores_available -1 )) if (( $n_build_cores < 1 )) ; then n_build_cores=1 ; fi mkdir build cd build - if [[ "${FF_GPU_BACKEND}" == "cuda" ]]; then - export FF_BUILD_ALL_EXAMPLES=ON - export FF_BUILD_UNIT_TESTS=ON - fi + ../config/config.linux make -j $n_build_cores @@ -197,25 +210,24 @@ jobs: export CUDNN_DIR="$CUDA_PATH" export CUDA_DIR="$CUDA_PATH" export FF_HOME=$(pwd) - export FF_GPU_BACKEND=${{ matrix.gpu_backend }} export FF_CUDA_ARCH=70 - cd build + export FF_HIP_ARCH=gfx1100,gfx1036 + export hip_version=5.6 + export FF_BUILD_ALL_INFERENCE_EXAMPLES=ON + if [[ "${FF_GPU_BACKEND}" 
== "cuda" ]]; then - export FF_BUILD_ALL_EXAMPLES=ON + export FF_BUILD_ALL_EXAMPLES=ON export FF_BUILD_UNIT_TESTS=ON + else + export FF_BUILD_ALL_EXAMPLES=OFF + export FF_BUILD_UNIT_TESTS=OFF fi + + cd build ../config/config.linux sudo make install sudo ldconfig - - name: Check availability of Python flexflow.core module - if: ${{ matrix.gpu_backend == 'cuda' }} - run: | - export LD_LIBRARY_PATH="$CUDA_PATH/lib64/stubs:$LD_LIBRARY_PATH" - sudo ln -s "$CUDA_PATH/lib64/stubs/libcuda.so" "$CUDA_PATH/lib64/stubs/libcuda.so.1" - export CPU_ONLY_TEST=1 - python -c "import flexflow.core; exit()" - - name: Run C++ unit tests if: ${{ matrix.gpu_backend == 'cuda' }} run: | @@ -223,9 +235,20 @@ jobs: export CUDA_DIR="$CUDA_PATH" export LD_LIBRARY_PATH="$CUDA_PATH/lib64/stubs:$LD_LIBRARY_PATH" export FF_HOME=$(pwd) + sudo ln -s "$CUDA_PATH/lib64/stubs/libcuda.so" "$CUDA_PATH/lib64/stubs/libcuda.so.1" cd build ./tests/unit/unit-test + - name: Check availability of Python flexflow.core module + run: | + if [[ "${FF_GPU_BACKEND}" == "cuda" ]]; then + export LD_LIBRARY_PATH="$CUDA_PATH/lib64/stubs:$LD_LIBRARY_PATH" + fi + # Remove build folder to check that the installed version can run independently of the build files + rm -rf build + export CPU_ONLY_TEST=1 + python -c "import flexflow.core; exit()" + makefile-build: name: Build FlexFlow with the Makefile runs-on: ubuntu-20.04 diff --git a/.github/workflows/clang-format-check.yml b/.github/workflows/clang-format-check.yml index 46c9bf3be2..1601da86b3 100644 --- a/.github/workflows/clang-format-check.yml +++ b/.github/workflows/clang-format-check.yml @@ -10,6 +10,7 @@ jobs: - check: "src" exclude: '\.proto$' - check: "include" + - check: "inference" - check: "nmt" - check: "python" - check: "scripts" diff --git a/.github/workflows/docker-build.yml b/.github/workflows/docker-build.yml index d059a0605f..b0ca251510 100644 --- a/.github/workflows/docker-build.yml +++ b/.github/workflows/docker-build.yml @@ -7,6 +7,7 @@ on: - ".github/workflows/docker-build.yml" push: branches: + - "inference" - "master" schedule: # Run every week on Sunday at midnight PT (3am ET / 8am UTC) to keep the docker images updated @@ -25,25 +26,42 @@ jobs: strategy: matrix: gpu_backend: ["cuda", "hip_rocm"] - cuda_version: ["11.1", "11.2", "11.3", "11.5", "11.6", "11.7", "11.8"] + gpu_backend_version: ["11.1", "11.2", "11.3", "11.4", "11.5", "11.6", "11.7", "11.8", "12.0", "5.3", "5.4", "5.5", "5.6"] # The CUDA version doesn't matter when building for hip_rocm, so we just pick one arbitrarily (11.8) to avoid building for hip_rocm once per number of CUDA version supported exclude: + - gpu_backend: "cuda" + gpu_backend_version: "5.3" + - gpu_backend: "cuda" + gpu_backend_version: "5.4" + - gpu_backend: "cuda" + gpu_backend_version: "5.5" + - gpu_backend: "cuda" + gpu_backend_version: "5.6" - gpu_backend: "hip_rocm" - cuda_version: "11.1" + gpu_backend_version: "11.1" - gpu_backend: "hip_rocm" - cuda_version: "11.2" + gpu_backend_version: "11.2" - gpu_backend: "hip_rocm" - cuda_version: "11.3" + gpu_backend_version: "11.3" - gpu_backend: "hip_rocm" - cuda_version: "11.5" + gpu_backend_version: "11.4" - gpu_backend: "hip_rocm" - cuda_version: "11.6" + gpu_backend_version: "11.5" - gpu_backend: "hip_rocm" - cuda_version: "11.7" + gpu_backend_version: "11.6" + - gpu_backend: "hip_rocm" + gpu_backend_version: "11.7" + - gpu_backend: "hip_rocm" + gpu_backend_version: "11.8" + - gpu_backend: "hip_rocm" + gpu_backend_version: "12.0" fail-fast: false env: FF_GPU_BACKEND: ${{ matrix.gpu_backend 
}} - cuda_version: ${{ matrix.cuda_version }} + gpu_backend_version: ${{ matrix.gpu_backend_version }} + # one of the two variables below will be unused + cuda_version: ${{ matrix.gpu_backend_version }} + hip_version: ${{ matrix.gpu_backend_version }} branch_name: ${{ github.head_ref || github.ref_name }} steps: - name: Checkout Git Repository @@ -53,8 +71,8 @@ jobs: - name: Free additional space on runner env: - deploy_needed: ${{ ( github.event_name == 'push' || github.event_name == 'schedule' ) && env.branch_name == 'inference' }} - build_needed: ${{ matrix.gpu_backend == 'hip_rocm' || ( matrix.gpu_backend == 'cuda' && matrix.cuda_version == '11.8' ) }} + deploy_needed: ${{ ( github.event_name == 'push' || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' ) && env.branch_name == 'inference' }} + build_needed: ${{ ( matrix.gpu_backend == 'hip_rocm' && matrix.gpu_backend_version == '5.6' ) || ( matrix.gpu_backend == 'cuda' && matrix.gpu_backend_version == '11.8' ) }} run: | if [[ $deploy_needed == "true" || $build_needed == "true" ]]; then .github/workflows/helpers/free_space_on_runner.sh @@ -64,17 +82,19 @@ jobs: - name: Build Docker container env: - deploy_needed: ${{ ( github.event_name == 'push' || github.event_name == 'schedule' ) && env.branch_name == 'inference' }} - build_needed: ${{ matrix.gpu_backend == 'hip_rocm' || ( matrix.gpu_backend == 'cuda' && matrix.cuda_version == '11.8' ) }} + deploy_needed: ${{ ( github.event_name == 'push' || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' ) && env.branch_name == 'inference' }} + build_needed: ${{ ( matrix.gpu_backend == 'hip_rocm' && matrix.gpu_backend_version == '5.6' ) || ( matrix.gpu_backend == 'cuda' && matrix.gpu_backend_version == '11.8' ) }} run: | # On push to inference, build for all compatible architectures, so that we can publish # a pre-built general-purpose image. On all other cases, only build for one architecture # to save time. 
if [[ $deploy_needed == "true" ]] ; then export FF_CUDA_ARCH=all + export FF_HIP_ARCH=all ./docker/build.sh flexflow elif [[ $build_needed == "true" ]]; then export FF_CUDA_ARCH=70 + export FF_HIP_ARCH=gfx1100,gfx1036 ./docker/build.sh flexflow else echo "Skipping build to save time" @@ -83,11 +103,15 @@ jobs: - name: Check availability of Python flexflow.core module if: ${{ matrix.gpu_backend == 'cuda' }} env: - deploy_needed: ${{ ( github.event_name == 'push' || github.event_name == 'schedule' ) && env.branch_name == 'inference' }} - build_needed: ${{ matrix.gpu_backend == 'hip_rocm' || ( matrix.gpu_backend == 'cuda' && matrix.cuda_version == '11.8' ) }} + deploy_needed: ${{ ( github.event_name == 'push' || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' ) && env.branch_name == 'inference' }} + build_needed: ${{ ( matrix.gpu_backend == 'hip_rocm' && matrix.gpu_backend_version == '5.6' ) || ( matrix.gpu_backend == 'cuda' && matrix.gpu_backend_version == '11.8' ) }} run: | if [[ $deploy_needed == "true" || $build_needed == "true" ]]; then - docker run --env CPU_ONLY_TEST=1 --entrypoint /bin/bash flexflow-cuda-${cuda_version}:latest -c "export LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH; sudo ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1; python -c 'import flexflow.core; exit()'" + if [[ $FF_GPU_BACKEND == "cuda" ]]; then + docker run --env CPU_ONLY_TEST=1 --entrypoint /bin/bash flexflow-${FF_GPU_BACKEND}-${gpu_backend_version}:latest -c "export LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH; sudo ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1; python -c 'import flexflow.core; exit()'" + else + docker run --env CPU_ONLY_TEST=1 --entrypoint /bin/bash flexflow-${FF_GPU_BACKEND}-${gpu_backend_version}:latest -c "python -c 'import flexflow.core; exit()'" + fi else echo "Skipping test to save time" fi @@ -96,7 +120,7 @@ jobs: if: github.repository_owner == 'flexflow' env: FLEXFLOW_CONTAINER_TOKEN: ${{ secrets.FLEXFLOW_CONTAINER_TOKEN }} - deploy_needed: ${{ ( github.event_name == 'push' || github.event_name == 'schedule' ) && env.branch_name == 'inference' }} + deploy_needed: ${{ ( github.event_name == 'push' || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' ) && env.branch_name == 'inference' }} run: | if [[ $deploy_needed == "true" ]]; then ./docker/publish.sh flexflow-environment diff --git a/.github/workflows/gpu-ci-skip.yml b/.github/workflows/gpu-ci-skip.yml index 157f3c271a..6a18e56bd1 100644 --- a/.github/workflows/gpu-ci-skip.yml +++ b/.github/workflows/gpu-ci-skip.yml @@ -8,9 +8,15 @@ on: - "python/**" - "setup.py" - "include/**" + - "inference/**" - "src/**" + - "tests/inference/**" + - "conda/flexflow.yml" - ".github/workflows/gpu-ci.yml" + - "tests/cpp_gpu_tests.sh" + - "tests/inference_tests.sh" - "tests/multi_gpu_tests.sh" + - "tests/python_interface_test.sh" workflow_dispatch: concurrency: @@ -30,10 +36,18 @@ jobs: needs: gpu-ci-concierge steps: - run: 'echo "No gpu-ci required"' + + inference-tests: + name: Inference Tests + runs-on: ubuntu-20.04 + needs: gpu-ci-concierge + steps: + - run: 'echo "No gpu-ci required"' gpu-ci-flexflow: name: Single Machine, Multiple GPUs Tests runs-on: ubuntu-20.04 - needs: gpu-ci-concierge + # if: ${{ github.event_name != 'pull_request' || github.base_ref != 'inference' }} + needs: inference-tests steps: - run: 'echo "No gpu-ci required"' diff --git a/.github/workflows/gpu-ci.yml 
b/.github/workflows/gpu-ci.yml index 3b679e9f20..d604a7cea9 100644 --- a/.github/workflows/gpu-ci.yml +++ b/.github/workflows/gpu-ci.yml @@ -8,9 +8,13 @@ on: - "python/**" - "setup.py" - "include/**" + - "inference/**" - "src/**" + - "tests/inference/**" + - "conda/flexflow.yml" - ".github/workflows/gpu-ci.yml" - "tests/cpp_gpu_tests.sh" + - "tests/inference_tests.sh" - "tests/multi_gpu_tests.sh" - "tests/python_interface_test.sh" push: @@ -23,9 +27,13 @@ on: - "python/**" - "setup.py" - "include/**" + - "inference/**" - "src/**" + - "tests/inference/**" + - "conda/flexflow.yml" - ".github/workflows/gpu-ci.yml" - "tests/cpp_gpu_tests.sh" + - "tests/inference_tests.sh" - "tests/multi_gpu_tests.sh" - "tests/python_interface_test.sh" workflow_dispatch: @@ -77,7 +85,7 @@ jobs: with: miniconda-version: "latest" activate-environment: flexflow - environment-file: conda/flexflow-cpu.yml + environment-file: conda/flexflow.yml auto-activate-base: false auto-update-conda: false @@ -89,7 +97,7 @@ jobs: run: | export PATH=$CONDA_PREFIX/bin:$PATH export FF_HOME=$(pwd) - export FF_USE_PREBUILT_LEGION=OFF + export FF_USE_PREBUILT_LEGION=OFF #remove this after fixing python path issue in Legion mkdir build cd build ../config/config.linux @@ -106,6 +114,7 @@ jobs: run: | export PATH=$CONDA_PREFIX/bin:$PATH export FF_HOME=$(pwd) + export FF_USE_PREBUILT_LEGION=OFF #remove this after fixing python path issue in Legion cd build ../config/config.linux make install @@ -124,27 +133,119 @@ jobs: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib ./tests/align/test_all_operators.sh + inference-tests: + name: Inference Tests + runs-on: self-hosted + defaults: + run: + shell: bash -l {0} # required to use an activated conda environment + env: + CONDA: "3" + needs: gpu-ci-concierge + container: + image: ghcr.io/flexflow/flexflow-environment-cuda-11.8:latest + options: --gpus all --shm-size=8192m + steps: + - name: Install updated git version + run: sudo add-apt-repository ppa:git-core/ppa -y && sudo apt update -y && sudo apt install -y --no-install-recommends git + + - name: Checkout Git Repository + uses: actions/checkout@v3 + with: + submodules: recursive + + - name: Install conda and FlexFlow dependencies + uses: conda-incubator/setup-miniconda@v2 + with: + miniconda-version: "latest" + activate-environment: flexflow + environment-file: conda/flexflow.yml + auto-activate-base: false + + - name: Build FlexFlow + run: | + export PATH=$CONDA_PREFIX/bin:$PATH + export FF_HOME=$(pwd) + export FF_USE_PREBUILT_LEGION=OFF #remove this after fixing python path issue in Legion + export FF_BUILD_ALL_INFERENCE_EXAMPLES=ON + mkdir build + cd build + ../config/config.linux + make -j + + - name: Run inference tests + env: + CPP_INFERENCE_TESTS: ${{ vars.CPP_INFERENCE_TESTS }} + run: | + export PATH=$CONDA_PREFIX/bin:$PATH + export FF_HOME=$(pwd) + export CUDNN_DIR=/usr/local/cuda + export CUDA_DIR=/usr/local/cuda + export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib + + # GPT tokenizer test + ./tests/gpt_tokenizer_test.sh + + # Inference tests + source ./build/set_python_envs.sh + ./tests/inference_tests.sh + + - name: Save inference output as an artifact + if: always() + run: | + cd inference + tar -zcvf output.tar.gz ./output + + - name: Upload artifact + uses: actions/upload-artifact@v3 + if: always() + with: + name: output + path: inference/output.tar.gz + + # Github persists the .cache folder across different runs/containers + - name: Clear cache + if: always() + run: sudo rm -rf ~/.cache + gpu-ci-flexflow: 
name: Single Machine, Multiple GPUs Tests runs-on: self-hosted - needs: python-interface-check + # skip this time-consuming test for PRs to the inference branch + # if: ${{ github.event_name != 'pull_request' || github.base_ref != 'inference' }} + defaults: + run: + shell: bash -l {0} # required to use an activated conda environment + env: + CONDA: "3" + needs: inference-tests container: - image: ghcr.io/flexflow/flexflow-environment-cuda-11.8:latest + image: ghcr.io/flexflow/flexflow-environment-cuda:latest options: --gpus all --shm-size=8192m steps: - name: Install updated git version run: sudo add-apt-repository ppa:git-core/ppa -y && sudo apt update -y && sudo apt install -y --no-install-recommends git + - name: Checkout Git Repository uses: actions/checkout@v3 with: submodules: recursive + + - name: Install conda and FlexFlow dependencies + uses: conda-incubator/setup-miniconda@v2 + with: + miniconda-version: "latest" + activate-environment: flexflow + environment-file: conda/flexflow.yml + auto-activate-base: false - name: Build and Install FlexFlow run: | export PATH=/opt/conda/bin:$PATH export FF_HOME=$(pwd) export FF_BUILD_ALL_EXAMPLES=ON - export FF_USE_PREBUILT_LEGION=OFF + export FF_BUILD_ALL_INFERENCE_EXAMPLES=ON + export FF_USE_PREBUILT_LEGION=OFF #remove this after fixing python path issue in Legion pip install . --verbose - name: Check FlexFlow Python interface (pip) diff --git a/.github/workflows/helpers/install_cudnn.sh b/.github/workflows/helpers/install_cudnn.sh index 318134e331..75e59109eb 100755 --- a/.github/workflows/helpers/install_cudnn.sh +++ b/.github/workflows/helpers/install_cudnn.sh @@ -44,6 +44,9 @@ elif [[ "$cuda_version" == "11.7" ]]; then elif [[ "$cuda_version" == "11.8" ]]; then CUDNN_LINK=https://developer.download.nvidia.com/compute/redist/cudnn/v8.7.0/local_installers/11.8/cudnn-linux-x86_64-8.7.0.84_cuda11-archive.tar.xz CUDNN_TARBALL_NAME=cudnn-linux-x86_64-8.7.0.84_cuda11-archive.tar.xz +elif [[ "$cuda_version" == "11.8" ]]; then + echo "CUDNN support for CUDA version 12.0 not yet added" + exit 1 fi wget -c -q $CUDNN_LINK if [[ "$cuda_version" == "11.6" || "$cuda_version" == "11.7" || "$cuda_version" == "11.8" ]]; then diff --git a/.github/workflows/helpers/install_dependencies.sh b/.github/workflows/helpers/install_dependencies.sh index 5ab211c962..1357882b5d 100755 --- a/.github/workflows/helpers/install_dependencies.sh +++ b/.github/workflows/helpers/install_dependencies.sh @@ -10,21 +10,56 @@ echo "Installing apt dependencies..." sudo apt-get update && sudo apt-get install -y --no-install-recommends wget binutils git zlib1g-dev libhdf5-dev && \ sudo rm -rf /var/lib/apt/lists/* -# Install CUDNN -./install_cudnn.sh - -# Install HIP dependencies if needed FF_GPU_BACKEND=${FF_GPU_BACKEND:-"cuda"} +hip_version=${hip_version:-"5.6"} if [[ "${FF_GPU_BACKEND}" != @(cuda|hip_cuda|hip_rocm|intel) ]]; then echo "Error, value of FF_GPU_BACKEND (${FF_GPU_BACKEND}) is invalid." exit 1 -elif [[ "$FF_GPU_BACKEND" == "hip_cuda" || "$FF_GPU_BACKEND" = "hip_rocm" ]]; then +fi +# Install CUDNN if needed +if [[ "$FF_GPU_BACKEND" == "cuda" || "$FF_GPU_BACKEND" = "hip_cuda" ]]; then + # Install CUDNN + ./install_cudnn.sh +fi +# Install HIP dependencies if needed +if [[ "$FF_GPU_BACKEND" == "hip_cuda" || "$FF_GPU_BACKEND" = "hip_rocm" ]]; then echo "FF_GPU_BACKEND: ${FF_GPU_BACKEND}. 
Installing HIP dependencies" - wget https://repo.radeon.com/amdgpu-install/22.20.5/ubuntu/focal/amdgpu-install_22.20.50205-1_all.deb - sudo apt-get install -y ./amdgpu-install_22.20.50205-1_all.deb - rm ./amdgpu-install_22.20.50205-1_all.deb + # Check that hip_version is one of 5.3,5.4,5.5,5.6 + if [[ "$hip_version" != "5.3" && "$hip_version" != "5.4" && "$hip_version" != "5.5" && "$hip_version" != "5.6" ]]; then + echo "hip_version '${hip_version}' is not supported, please choose among {5.3, 5.4, 5.5, 5.6}" + exit 1 + fi + # Compute script name and url given the version + AMD_GPU_SCRIPT_NAME=amdgpu-install_5.6.50600-1_all.deb + if [ "$hip_version" = "5.3" ]; then + AMD_GPU_SCRIPT_NAME=amdgpu-install_5.3.50300-1_all.deb + elif [ "$hip_version" = "5.4" ]; then + AMD_GPU_SCRIPT_NAME=amdgpu-install_5.4.50400-1_all.deb + elif [ "$hip_version" = "5.5" ]; then + AMD_GPU_SCRIPT_NAME=amdgpu-install_5.5.50500-1_all.deb + fi + AMD_GPU_SCRIPT_URL="https://repo.radeon.com/amdgpu-install/${hip_version}/ubuntu/focal/${AMD_GPU_SCRIPT_NAME}" + # Download and install AMD GPU software with ROCM and HIP support + wget "$AMD_GPU_SCRIPT_URL" + sudo apt-get install -y ./${AMD_GPU_SCRIPT_NAME} + sudo rm ./${AMD_GPU_SCRIPT_NAME} sudo amdgpu-install -y --usecase=hip,rocm --no-dkms - sudo apt-get install -y hip-dev hipblas miopen-hip rocm-hip-sdk + sudo apt-get install -y hip-dev hipblas miopen-hip rocm-hip-sdk rocm-device-libs + + # Install protobuf v3.20.x manually + sudo apt-get update -y && sudo apt-get install -y pkg-config zip g++ zlib1g-dev unzip python autoconf automake libtool curl make + git clone -b 3.20.x https://github.com/protocolbuffers/protobuf.git + cd protobuf/ + git submodule update --init --recursive + ./autogen.sh + ./configure + cores_available=$(nproc --all) + n_build_cores=$(( cores_available -1 )) + if (( n_build_cores < 1 )) ; then n_build_cores=1 ; fi + make -j $n_build_cores + sudo make install + sudo ldconfig + cd .. else echo "FF_GPU_BACKEND: ${FF_GPU_BACKEND}. Skipping installing HIP dependencies" fi diff --git a/.github/workflows/pip-install-skip.yml b/.github/workflows/pip-install-skip.yml index f2606b94d8..92c3223e32 100644 --- a/.github/workflows/pip-install-skip.yml +++ b/.github/workflows/pip-install-skip.yml @@ -7,6 +7,7 @@ on: - "deps/**" - "python/**" - "setup.py" + - "requirements.txt" - ".github/workflows/helpers/install_dependencies.sh" - ".github/workflows/pip-install.yml" workflow_dispatch: diff --git a/.github/workflows/pip-install.yml b/.github/workflows/pip-install.yml index 7d60d3bf52..695ed9857b 100644 --- a/.github/workflows/pip-install.yml +++ b/.github/workflows/pip-install.yml @@ -7,6 +7,7 @@ on: - "deps/**" - "python/**" - "setup.py" + - "requirements.txt" - ".github/workflows/helpers/install_dependencies.sh" - ".github/workflows/pip-install.yml" push: @@ -18,6 +19,7 @@ on: - "deps/**" - "python/**" - "setup.py" + - "requirements.txt" - ".github/workflows/helpers/install_dependencies.sh" - ".github/workflows/pip-install.yml" workflow_dispatch: @@ -64,6 +66,8 @@ jobs: export FF_HOME=$(pwd) export FF_CUDA_ARCH=70 pip install . 
--verbose + # Remove build folder to check that the installed version can run independently of the build files + rm -rf build - name: Check availability of Python flexflow.core module run: | diff --git a/.gitignore b/.gitignore index 20d3979b08..be0266c9b5 100644 --- a/.gitignore +++ b/.gitignore @@ -15,6 +15,11 @@ __pycache__/ # C extensions *.so +/inference/weights/* +/inference/tokenizer/* +/inference/prompt/* +/inference/output/* + # Distribution / packaging .Python build/ @@ -83,10 +88,7 @@ docs/build/ # Doxygen documentation docs/doxygen/output/ - -# Exhale documentation -docs/source/_doxygen/ -docs/source/c++_api/ +docs/doxygen/cpp_api/ # PyBuilder .pybuilder/ @@ -179,6 +181,7 @@ train-labels-idx1-ubyte # Logs logs/ +gpt_tokenizer # pip version python/flexflow/version.txt diff --git a/.gitmodules b/.gitmodules index b8419fda94..c68582d4ac 100644 --- a/.gitmodules +++ b/.gitmodules @@ -19,3 +19,7 @@ [submodule "deps/json"] path = deps/json url = https://github.com/nlohmann/json.git +[submodule "deps/tokenizers-cpp"] + path = deps/tokenizers-cpp + url = https://github.com/mlc-ai/tokenizers-cpp.git + fetchRecurseSubmodules = true \ No newline at end of file diff --git a/CMakeLists.txt b/CMakeLists.txt index 894be712e4..90df628a79 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -12,7 +12,16 @@ if (CMAKE_VERSION VERSION_GREATER_EQUAL "3.24.0") endif() set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} ${CMAKE_CURRENT_LIST_DIR}/cmake) set(FLEXFLOW_ROOT ${CMAKE_CURRENT_LIST_DIR}) -set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -UNDEBUG") +set(CMAKE_CXX_FLAGS "-std=c++17 ${CMAKE_CXX_FLAGS} -fPIC -UNDEBUG") + +option(INFERENCE_TESTS "Run inference tests" OFF) +set(LIBTORCH_PATH "${CMAKE_CURRENT_SOURCE_DIR}/../libtorch" CACHE STRING "LibTorch Path") +if (INFERENCE_TESTS) + find_package(Torch REQUIRED PATHS ${LIBTORCH_PATH} NO_DEFAULT_PATH) + set(CMAKE_CXX_FLAGS "-std=c++17 ${CMAKE_CXX_FLAGS} -fPIC ${TORCH_CXX_FLAGS}") + message(STATUS "LIBTORCH_PATH: ${LIBTORCH_PATH}") + message(STATUS "TORCH_LIBRARIES: ${TORCH_LIBRARIES}") +endif() # Set a default build type if none was specified set(default_build_type "Debug") @@ -154,9 +163,14 @@ set_property(CACHE FF_GPU_BACKEND PROPERTY STRINGS ${FF_GPU_BACKENDS}) # option for cuda arch set(FF_CUDA_ARCH "autodetect" CACHE STRING "Target CUDA Arch") -if (FF_CUDA_ARCH STREQUAL "") +if ((FF_GPU_BACKEND STREQUAL "cuda" OR FF_GPU_BACKEND STREQUAL "hip_cuda") AND FF_CUDA_ARCH STREQUAL "") message(FATAL_ERROR "FF_CUDA_ARCH cannot be an empty string. Set it to `autodetect`, `all`, or pass one or multiple valid CUDA archs.") endif() +# option for hip arch +set(FF_HIP_ARCH "all" CACHE STRING "Target HIP Arch") +if (FF_GPU_BACKEND STREQUAL "hip_rocm" AND FF_CUDA_ARCH STREQUAL "") + message(FATAL_ERROR "FF_HIP_ARCH cannot be an empty string. 
Set it to `all`, or pass one or multiple valid HIP archs.") +endif() # option for nccl option(FF_USE_NCCL "Run FlexFlow with NCCL" OFF) @@ -173,6 +187,7 @@ set(FF_MAX_DIM "4" CACHE STRING "Maximum dimention of tensors") # option for legion option(FF_USE_EXTERNAL_LEGION "Use pre-installed Legion" OFF) +set(LEGION_MAX_RETURN_SIZE "32768" CACHE STRING "Maximum Legion return size") set(FLEXFLOW_EXT_LIBRARIES "") set(FLEXFLOW_INCLUDE_DIRS "") @@ -184,10 +199,9 @@ set(LD_FLAGS $ENV{LD_FLAGS}) # Set global FLAGS list(APPEND CC_FLAGS - -std=c++11) - + -std=c++17) list(APPEND NVCC_FLAGS - -std=c++11) + -std=c++17) add_compile_options(${CC_FLAGS}) set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} ${NVCC_FLAGS}) @@ -220,12 +234,25 @@ if (FF_GPU_BACKEND STREQUAL "cuda" OR FF_GPU_BACKEND STREQUAL "hip_cuda") include(cuda) endif() +# HIP +if (FF_GPU_BACKEND STREQUAL "hip_rocm" OR FF_GPU_BACKEND STREQUAL "hip_cuda") + include(hip) +endif() + # CUDNN if (FF_GPU_BACKEND STREQUAL "cuda" OR FF_GPU_BACKEND STREQUAL "hip_cuda") include(cudnn) endif() -# legion +# Inference tests +if(INFERENCE_TESTS) + list(APPEND FF_CC_FLAGS + -DINFERENCE_TESTS) + list(APPEND FF_NVCC_FLAGS + -DINFERENCE_TESTS) +endif() + +# Legion include(legion) # Not build FlexFlow if BUILD_LEGION_ONLY is ON @@ -275,9 +302,11 @@ if(NOT BUILD_LEGION_ONLY) endif() message(STATUS "FlexFlow MAX_DIM: ${FF_MAX_DIM}") + message(STATUS "LEGION_MAX_RETURN_SIZE: ${LEGION_MAX_RETURN_SIZE}") list(APPEND FF_CC_FLAGS - -DMAX_TENSOR_DIM=${FF_MAX_DIM}) + -DMAX_TENSOR_DIM=${FF_MAX_DIM} + -DLEGION_MAX_RETURN_SIZE=${LEGION_MAX_RETURN_SIZE}) if(FF_USE_AVX2) list(APPEND FF_CC_FLAGS @@ -287,12 +316,14 @@ if(NOT BUILD_LEGION_ONLY) list(APPEND FF_NVCC_FLAGS -Wno-deprecated-gpu-targets - -DMAX_TENSOR_DIM=${FF_MAX_DIM}) + -DMAX_TENSOR_DIM=${FF_MAX_DIM} + -DLEGION_MAX_RETURN_SIZE=${LEGION_MAX_RETURN_SIZE}) list(APPEND FF_LD_FLAGS -lrt -ldl - -rdynamic) + -rdynamic + -lstdc++fs) # Set FF FLAGS add_compile_options(${FF_CC_FLAGS}) @@ -306,11 +337,15 @@ if(NOT BUILD_LEGION_ONLY) file(GLOB_RECURSE FLEXFLOW_HDR LIST_DIRECTORIES False ${FLEXFLOW_ROOT}/include/*.h) + + list(APPEND FLEXFLOW_HDR ${FLEXFLOW_ROOT}/inference/file_loader.h) file(GLOB_RECURSE FLEXFLOW_SRC LIST_DIRECTORIES False ${FLEXFLOW_ROOT}/src/*.cc) + list(REMOVE_ITEM FLEXFLOW_SRC "${FLEXFLOW_ROOT}/src/runtime/cpp_driver.cc") + list(APPEND FLEXFLOW_SRC ${FLEXFLOW_ROOT}/inference/file_loader.cc) set(FLEXFLOW_CPP_DRV_SRC ${FLEXFLOW_ROOT}/src/runtime/cpp_driver.cc) @@ -379,6 +414,18 @@ if(NOT BUILD_LEGION_ONLY) add_compile_definitions(FF_USE_HIP_ROCM) + if (FF_HIP_ARCH STREQUAL "") + message(FATAL_ERROR "FF_HIP_ARCH is undefined") + endif() + set_property(TARGET flexflow PROPERTY HIP_ARCHITECTURES "${HIP_ARCH_LIST}") + + message(STATUS "FF_GPU_BACKEND: ${FF_GPU_BACKEND}") + message(STATUS "FF_HIP_ARCH: ${FF_HIP_ARCH}") + message(STATUS "HIP_ARCH_LIST: ${HIP_ARCH_LIST}") + get_property(CHECK_HIP_ARCHS TARGET flexflow PROPERTY HIP_ARCHITECTURES) + message(STATUS "CHECK_HIP_ARCHS: ${CHECK_HIP_ARCHS}") + message(STATUS "HIP_CLANG_PATH: ${HIP_CLANG_PATH}") + # The hip cmake config module defines three targets, # hip::amdhip64, hip::host, and hip::device. 
# @@ -456,30 +503,38 @@ if(NOT BUILD_LEGION_ONLY) endif() endif() - # build binary - option(FF_BUILD_RESNET "build resnet example" OFF) - option(FF_BUILD_RESNEXT "build resnext example" OFF) - option(FF_BUILD_ALEXNET "build alexnet example" OFF) - option(FF_BUILD_DLRM "build DLRM example" OFF) - option(FF_BUILD_XDL "build XDL example" OFF) - option(FF_BUILD_INCEPTION "build inception example" OFF) - option(FF_BUILD_CANDLE_UNO "build candle uno example" OFF) - option(FF_BUILD_TRANSFORMER "build transformer example" OFF) - option(FF_BUILD_MOE "build mixture of experts example" OFF) - option(FF_BUILD_MLP_UNIFY "build mlp unify example" OFF) - option(FF_BUILD_SPLIT_TEST "build split test example" OFF) - option(FF_BUILD_SPLIT_TEST_2 "build split test 2 example" OFF) - option(FF_BUILD_ALL_EXAMPLES "build all examples. Overrides others" OFF) - option(FF_BUILD_UNIT_TESTS "build non-operator unit tests" OFF) - option(FF_BUILD_SUBSTITUTION_TOOL "build substitution conversion tool" OFF) - option(FF_BUILD_VISUALIZATION_TOOL "build substitution visualization tool" OFF) - - if(FF_BUILD_UNIT_TESTS) - set(BUILD_GMOCK OFF) - add_subdirectory(deps/googletest) - enable_testing() - add_subdirectory(tests/unit) - endif() +if (INFERENCE_TESTS) + target_link_libraries(flexflow "${TORCH_LIBRARIES}") + set_property(TARGET flexflow PROPERTY CXX_STANDARD 14) +endif() + +# build binary +option(FF_BUILD_TOKENIZER "build tokenizer=cpp for LLM serving" ON) +option(FF_BUILD_RESNET "build resnet example" OFF) +option(FF_BUILD_RESNEXT "build resnext example" OFF) +option(FF_BUILD_ALEXNET "build alexnet example" OFF) +option(FF_BUILD_DLRM "build DLRM example" OFF) +option(FF_BUILD_XDL "build XDL example" OFF) +option(FF_BUILD_INCEPTION "build inception example" OFF) +option(FF_BUILD_CANDLE_UNO "build candle uno example" OFF) +option(FF_BUILD_TRANSFORMER "build transformer example" OFF) +option(FF_BUILD_MOE "build mixture of experts example" OFF) +option(FF_BUILD_MLP_UNIFY "build mlp unify example" OFF) +option(FF_BUILD_SPLIT_TEST "build split test example" OFF) +option(FF_BUILD_SPLIT_TEST_2 "build split test 2 example" OFF) +option(FF_BUILD_MLP_UNIFY_INFERENCE "build mlp unify inference example" OFF) +option(FF_BUILD_ALL_INFERENCE_EXAMPLES "build all inference examples. Overrides others" OFF) +option(FF_BUILD_ALL_EXAMPLES "build all examples. 
Overrides others" OFF) +option(FF_BUILD_UNIT_TESTS "build non-operator unit tests" OFF) +option(FF_BUILD_SUBSTITUTION_TOOL "build substitution conversion tool" OFF) +option(FF_BUILD_VISUALIZATION_TOOL "build substitution visualization tool" OFF) + +if(FF_BUILD_UNIT_TESTS) + set(BUILD_GMOCK OFF) + add_subdirectory(deps/googletest) + enable_testing() + add_subdirectory(tests/unit) +endif() if(FF_BUILD_SUBSTITUTION_TOOL) add_subdirectory(tools/protobuf_to_json) @@ -489,86 +544,113 @@ if(NOT BUILD_LEGION_ONLY) add_subdirectory(tools/substitutions_to_dot) endif() - if(FF_BUILD_RESNET OR FF_BUILD_ALL_EXAMPLES) - add_subdirectory(examples/cpp/ResNet) +if(FF_BUILD_ALL_INFERENCE_EXAMPLES OR FF_BUILD_TOKENIZER) + if (FF_GPU_BACKEND STREQUAL "hip_rocm") + SET(SPM_USE_BUILTIN_PROTOBUF OFF CACHE BOOL "Use builtin version of protobuf to compile SentencePiece") endif() - - if(FF_BUILD_RESNEXT OR FF_BUILD_ALL_EXAMPLES) - add_subdirectory(examples/cpp/resnext50) + # Ensure Rust is installed + execute_process(COMMAND rustc --version + RESULT_VARIABLE RUST_COMMAND_RESULT + OUTPUT_VARIABLE RUSTC_OUTPUT + ERROR_QUIET) + if(NOT RUST_COMMAND_RESULT EQUAL 0) + message(FATAL_ERROR "Rust is not installed on the system. Please install it by running: 'curl https://sh.rustup.rs -sSf | sh -s -- -y' and following the instructions on the screen.") endif() - - if(FF_BUILD_ALEXNET OR FF_BUILD_ALL_EXAMPLES) - add_subdirectory(examples/cpp/AlexNet) + # Ensure Cargo is installed + execute_process(COMMAND cargo --version + RESULT_VARIABLE CARGO_RESULT + OUTPUT_QUIET ERROR_QUIET) + if(NOT CARGO_RESULT EQUAL 0) + message(FATAL_ERROR "Rust is installed, but cargo is not. Please install it by running: 'curl https://sh.rustup.rs -sSf | sh -s -- -y' and following the instructions on the screen.") endif() + add_subdirectory(deps/tokenizers-cpp tokenizers EXCLUDE_FROM_ALL) + target_include_directories(flexflow PUBLIC deps/tokenizers-cpp/include) + target_link_libraries(flexflow tokenizers_cpp) +endif() +if(FF_BUILD_RESNET OR FF_BUILD_ALL_EXAMPLES) + add_subdirectory(examples/cpp/ResNet) +endif() - if(FF_BUILD_MLP_UNIFY OR FF_BUILD_ALL_EXAMPLES) - add_subdirectory(examples/cpp/MLP_Unify) - endif() +if(FF_BUILD_RESNEXT OR FF_BUILD_ALL_EXAMPLES) + add_subdirectory(examples/cpp/resnext50) +endif() - if(FF_BUILD_SPLIT_TEST OR FF_BUILD_ALL_EXAMPLES) - add_subdirectory(examples/cpp/split_test) - endif() +if(FF_BUILD_ALEXNET OR FF_BUILD_ALL_EXAMPLES) + add_subdirectory(examples/cpp/AlexNet) +endif() - if(FF_BUILD_SPLIT_TEST_2 OR FF_BUILD_ALL_EXAMPLES) - add_subdirectory(examples/cpp/split_test_2) - endif() +if(FF_BUILD_MLP_UNIFY OR FF_BUILD_ALL_EXAMPLES) + add_subdirectory(examples/cpp/MLP_Unify) +endif() - if(FF_BUILD_INCEPTION OR FF_BUILD_ALL_EXAMPLES) - add_subdirectory(examples/cpp/InceptionV3) - endif() +if(FF_BUILD_SPLIT_TEST OR FF_BUILD_ALL_EXAMPLES) + add_subdirectory(examples/cpp/split_test) +endif() - #TODO: Once functional add to BUILD_ALL_EXAMPLES - if(FF_BUILD_CANDLE_UNO OR FF_BUILD_ALL_EXAMPLES) - add_subdirectory(examples/cpp/candle_uno) - endif() +if(FF_BUILD_SPLIT_TEST_2 OR FF_BUILD_ALL_EXAMPLES) + add_subdirectory(examples/cpp/split_test_2) +endif() - if(FF_BUILD_DLRM OR FF_BUILD_ALL_EXAMPLES) - add_subdirectory(examples/cpp/DLRM) +if(FF_BUILD_INCEPTION OR FF_BUILD_ALL_EXAMPLES) + add_subdirectory(examples/cpp/InceptionV3) +endif() - #add_executable(generate_dlrm_hetero_strategy src/runtime/dlrm_strategy_hetero.cc) - #target_include_directories(generate_dlrm_hetero_strategy PUBLIC ${FLEXFLOW_INCLUDE_DIRS}) +#TODO: Once 
functional add to BUILD_ALL_EXAMPLES +if(FF_BUILD_CANDLE_UNO OR FF_BUILD_ALL_EXAMPLES) + add_subdirectory(examples/cpp/candle_uno) +endif() - #add_executable(generate_dlrm_strategy src/runtime/dlrm_strategy.cc) - #target_include_directories(generate_dlrm_strategy PUBLIC ${FLEXFLOW_INCLUDE_DIRS}) - endif() +if(FF_BUILD_DLRM OR FF_BUILD_ALL_EXAMPLES) + add_subdirectory(examples/cpp/DLRM) - if(FF_BUILD_XDL OR FF_BUILD_ALL_EXAMPLES) - add_subdirectory(examples/cpp/XDL) - endif() + #add_executable(generate_dlrm_hetero_strategy src/runtime/dlrm_strategy_hetero.cc) + #target_include_directories(generate_dlrm_hetero_strategy PUBLIC ${FLEXFLOW_INCLUDE_DIRS}) - if(FF_BUILD_TRANSFORMER OR FF_BUILD_ALL_EXAMPLES) - add_subdirectory(examples/cpp/Transformer) - endif() + #add_executable(generate_dlrm_strategy src/runtime/dlrm_strategy.cc) + #target_include_directories(generate_dlrm_strategy PUBLIC ${FLEXFLOW_INCLUDE_DIRS}) +endif() - if(FF_BUILD_MOE OR FF_BUILD_ALL_EXAMPLES) - add_subdirectory(examples/cpp/mixture_of_experts) - endif() +if(FF_BUILD_XDL OR FF_BUILD_ALL_EXAMPLES) + add_subdirectory(examples/cpp/XDL) +endif() - # installation - set(INCLUDE_DEST "include") - set(LIB_DEST "lib") - install(FILES ${FLEXFLOW_HDR} DESTINATION ${INCLUDE_DEST}) - install(TARGETS flexflow DESTINATION ${LIB_DEST}) - # install python - if (FF_USE_PYTHON) - execute_process(COMMAND ${PYTHON_EXECUTABLE} -c "from distutils import sysconfig; print(sysconfig.get_python_lib(plat_specific=False,standard_lib=False))" OUTPUT_VARIABLE PY_DEST OUTPUT_STRIP_TRAILING_WHITESPACE) - if (NOT FF_BUILD_FROM_PYPI) - install( - DIRECTORY ${FLEXFLOW_ROOT}/python/flexflow/ - DESTINATION ${PY_DEST}/flexflow - FILES_MATCHING - PATTERN "*.py") - else() - # pip automatically installs all *.py files in the python/flexflow folder, but because flexflow_cffi_header.py is generated at build time, we have to install it manually. - install( - PROGRAMS ${FLEXFLOW_ROOT}/python/flexflow/core/flexflow_cffi_header.py - DESTINATION ${PY_DEST}/flexflow/core - ) - # Use setup.py script to re-install the Python bindings library with the right library paths. - # Need to put the instructions in a subfolder because of issue below: - # https://stackoverflow.com/questions/43875499/do-post-processing-after-make-install-in-cmake - add_subdirectory(cmake/pip_install) - endif() - endif() +if(FF_BUILD_TRANSFORMER OR FF_BUILD_ALL_EXAMPLES) + add_subdirectory(examples/cpp/Transformer) +endif() + +if(FF_BUILD_MOE OR FF_BUILD_ALL_EXAMPLES) + add_subdirectory(examples/cpp/mixture_of_experts) +endif() +if(FF_BUILD_ALL_INFERENCE_EXAMPLES OR FF_BUILD_ALL_EXAMPLES) + add_subdirectory(inference/spec_infer) + add_subdirectory(inference/incr_decoding) +endif() + + +# installation +set(INCLUDE_DEST "include") +set(LIB_DEST "lib") +install(FILES ${FLEXFLOW_HDR} DESTINATION ${INCLUDE_DEST}) +install(TARGETS flexflow DESTINATION ${LIB_DEST}) +# install python +if (FF_USE_PYTHON) + execute_process(COMMAND ${PYTHON_EXECUTABLE} -c "from distutils import sysconfig; print(sysconfig.get_python_lib(plat_specific=False,standard_lib=False))" OUTPUT_VARIABLE PY_DEST OUTPUT_STRIP_TRAILING_WHITESPACE) + if (NOT FF_BUILD_FROM_PYPI) + install( + DIRECTORY ${FLEXFLOW_ROOT}/python/flexflow/ + DESTINATION ${PY_DEST}/flexflow + FILES_MATCHING + PATTERN "*.py") + else() + # pip automatically installs all *.py files in the python/flexflow folder, but because flexflow_cffi_header.py is generated at build time, we have to install it manually. 
+ install( + PROGRAMS ${FLEXFLOW_ROOT}/python/flexflow/core/flexflow_cffi_header.py + DESTINATION ${PY_DEST}/flexflow/core + ) + # Use setup.py script to re-install the Python bindings library with the right library paths. + # Need to put the instructions in a subfolder because of issue below: + # https://stackoverflow.com/questions/43875499/do-post-processing-after-make-install-in-cmake + add_subdirectory(cmake/pip_install) + endif() endif() \ No newline at end of file diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index e607fddb1a..c3c0b5173f 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -119,7 +119,26 @@ After adding the DNN layers, the next step before compiling the model for traini #### Model compilation -TODO +Model compilation consists of the following steps: + +1. We initialize an operator for each layer in the model, via the function `create_operators_from_layers()`. Layers work with `Tensor` input/weights/outputs, and are created directly by the user when writing a FlexFlow program. Operators work with `ParallelTensor` objects and they are responsible for running computations by launching kernels on GPUs. +2. Launch the graph optimize task (`GRAPH_OPTIMIZE_TASK_ID`), implemented by`PCG::Graph::graph_optimize_task`, which returns `PCG::GraphOptimalViewSerialized` + 1. call `deserialize_graph_optimal_view(...)` to get `PCG::Graph *best_graph` and `std::unordered_map optimal_views` from deserialized `PCG::GraphOptimalViewSerialized` + 2. `convert_graph_to_operators()` + 3. print the dot of the best graph obtained + 4. map inputs to parallel tensor and weights to parallel tensor? -> strange for loop to understand better +3. Init performance metrics via the `FFModel::update_metrics_task` +4. Perform inplace optimizations (if enabled) +5. Loop through the operators to do the following (to be understood better): + 1. `parameters.push_back(op->weights[i]);` for each weight in each operator + 2. `op->map_output_tensors(*this);` + 3. `((ParallelOp *)op)->create_input_partition(*this);` if the operator is a parallel operator +6. Check correctness of the operator's input and output tensors' settings +7. Perform fusion optimizations, if enabled +8. Print all operators and their input and output regions +9. Create the tensor for the label +10. Initialize the optimizer +11. In training mode, if NCCL is enabled, initialize all the communicators and other objects ## Continuous Integration @@ -281,6 +300,10 @@ We want to make contributing to this project as easy and transparent as possible ### Formatting We use `clang-format` to format our C++ code. If you make changes to the code and the Clang format CI test is failing, you can lint your code by running: `./scripts/format.sh` from the main folder of this repo. +### Documenting the code +We follow the Python Docstring conventions for documenting the Python code. We document the C++ code using comments in any of the conventioned supported by Doxygen [see here](https://doxygen.nl/manual/docblocks.html). + + ### Pull Requests We actively welcome your pull requests. 
diff --git a/FlexFlow.mk b/FlexFlow.mk index b434045893..14f32a7639 100644 --- a/FlexFlow.mk +++ b/FlexFlow.mk @@ -59,7 +59,8 @@ GEN_SRC += $(shell find $(FF_HOME)/src/loss_functions/ -name '*.cc')\ $(shell find $(FF_HOME)/src/runtime/ -name '*.cc')\ $(shell find $(FF_HOME)/src/utils/dot/ -name '*.cc')\ $(shell find $(FF_HOME)/src/dataloader/ -name '*.cc')\ - $(shell find $(FF_HOME)/src/c/ -name '*.cc') + $(shell find $(FF_HOME)/src/c/ -name '*.cc')\ + $(shell find $(FF_HOME)/inference/ -name 'file_loader.cc') GEN_SRC := $(filter-out $(FF_HOME)/src/runtime/cpp_driver.cc, $(GEN_SRC)) FF_CUDA_SRC += $(shell find $(FF_HOME)/src/loss_functions/ -name '*.cu')\ @@ -94,15 +95,17 @@ ifneq ($(strip $(FF_USE_PYTHON)), 1) endif -INC_FLAGS += -I${FF_HOME}/include -I${FF_HOME}/deps/optional/include -I${FF_HOME}/deps/variant/include -I${FF_HOME}/deps/json/include +INC_FLAGS += -I${FF_HOME}/include -I${FF_HOME}/inference -I${FF_HOME}/deps/optional/include -I${FF_HOME}/deps/variant/include -I${FF_HOME}/deps/json/include -I${FF_HOME}/deps/tokenizers-cpp/include -I${FF_HOME}/deps/tokenizers-cpp/sentencepiece/src CC_FLAGS += -DMAX_TENSOR_DIM=$(MAX_DIM) -DLEGION_MAX_RETURN_SIZE=32768 NVCC_FLAGS += -DMAX_TENSOR_DIM=$(MAX_DIM) -DLEGION_MAX_RETURN_SIZE=32768 HIPCC_FLAGS += -DMAX_TENSOR_DIM=$(MAX_DIM) -DLEGION_MAX_RETURN_SIZE=32768 GASNET_FLAGS += # For Point and Rect typedefs -CC_FLAGS += -std=c++11 -NVCC_FLAGS += -std=c++11 -HIPCC_FLAGS += -std=c++11 +CC_FLAGS += -std=c++17 +NVCC_FLAGS += -std=c++17 +HIPCC_FLAGS += -std=c++17 + +LD_FLAGS += -L$(FF_HOME)/deps/tokenizers-cpp/example/tokenizers -ltokenizers_cpp -ltokenizers_c -L$(FF_HOME)/deps/tokenizers-cpp/example/tokenizers/sentencepiece/src -lsentencepiece ifeq ($(strip $(FF_USE_NCCL)), 1) INC_FLAGS += -I$(MPI_HOME)/include -I$(NCCL_HOME)/include diff --git a/INSTALL.md b/INSTALL.md index d2e3c1d2f6..8d33770c92 100644 --- a/INSTALL.md +++ b/INSTALL.md @@ -1,4 +1,4 @@ -# Installing FlexFlow +# Building from source To build and install FlexFlow, follow the instructions below. ## 1. Download the source code @@ -85,10 +85,11 @@ export FF_HOME=/path/to/FlexFlow ### Run FlexFlow Python examples The Python examples are in the [examples/python](https://github.com/flexflow/FlexFlow/tree/master/examples/python). The native, Keras integration and PyTorch integration examples are listed in `native`, `keras` and `pytorch` respectively. -To run the Python examples, you have two options: you can use the `flexflow_python` interpreter, available in the `build` folder, or you can use the native Python interpreter. If you choose to use the native Python interpreter, you should either install FlexFlow, or, if you prefer to build without installing, export the following flags: +To run the Python examples, you have two options: you can use the `flexflow_python` interpreter, available in the `build` folder, or you can use the native Python interpreter. 
If you choose to use the native Python interpreter, you should either install FlexFlow, or, if you prefer to build without installing, export the required environment flags by running the following command (edit the path if your build folder is not named `build`): -* `export PYTHONPATH="${FF_HOME}/python:${FF_HOME}/build/deps/legion/bindings/python:${PYTHONPATH}"` -* `export LD_LIBRARY_PATH="${FF_HOME}/build:${FF_HOME}/build/deps/legion/lib:${LD_LIBRARY_PATH}"` +``` +source ./build/set_python_envs.sh +``` **We recommend that you run the** `mnist_mlp` **test under** `native` **using the following cmd to check if FlexFlow has been installed correctly:** diff --git a/MULTI-NODE.md b/MULTI-NODE.md index a8fd2fb705..4bae47cfa6 100644 --- a/MULTI-NODE.md +++ b/MULTI-NODE.md @@ -68,4 +68,4 @@ ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOy5NKYdE8Cwgid59rx6xMqyj9vLaWuXIwy/BSRiK4su Follow step 6 in [INSTALL.md](INSTALL.md) to set environment variables. -A script to run a Python example on multiple nodes is available at `scripts/mnist_mlp_run.sh`. You can run the script using [`mpirun`](https://www.open-mpi.org/doc/current/man1/mpirun.1.php) (if you configured it in step 3) or [`srun`](https://slurm.schedmd.com/srun.html). \ No newline at end of file +A script to run a Python example on multiple nodes is available at `scripts/mnist_mlp_run.sh`. You can run the script using [`mpirun`](https://www.open-mpi.org/doc/current/man1/mpirun.1.php) (if you configured it in step 3) or [`srun`](https://slurm.schedmd.com/srun.html). diff --git a/README.md b/README.md index 9ad900fb3c..e84bf20605 100644 --- a/README.md +++ b/README.md @@ -1,72 +1,53 @@ -# FlexFlow -![build](https://github.com/flexflow/flexflow/workflows/build/badge.svg?branch=master) ![gpu tests](https://github.com/flexflow/flexflow/workflows/gpu-ci/badge.svg?branch=master) ![multinode gpu tests](https://github.com/flexflow/flexflow/workflows/multinode-test/badge.svg?branch=master) ![docker](https://github.com/flexflow/flexflow/workflows/docker-build/badge.svg?branch=master) ![pip](https://github.com/flexflow/flexflow/workflows/pip-install/badge.svg?branch=master) ![shell-check](https://github.com/flexflow/flexflow/workflows/Shell%20Check/badge.svg?branch=master) ![clang-format](https://github.com/flexflow/flexflow/workflows/clang-format%20Check/badge.svg?branch=master) [![Documentation Status](https://readthedocs.org/projects/flexflow/badge/?version=latest)](https://flexflow.readthedocs.io/en/latest/?badge=latest) +# FlexFlow: Low-Latency, High-Performance Training and Serving +![build](https://github.com/flexflow/flexflow/workflows/build/badge.svg?branch=inference) ![gpu tests](https://github.com/flexflow/flexflow/workflows/gpu-ci/badge.svg?branch=inference) ![multinode gpu tests](https://github.com/flexflow/flexflow/workflows/multinode-test/badge.svg?branch=master) ![docker](https://github.com/flexflow/flexflow/workflows/docker-build/badge.svg?branch=inference) ![pip](https://github.com/flexflow/flexflow/workflows/pip-install/badge.svg?branch=inference) ![shell-check](https://github.com/flexflow/flexflow/workflows/Shell%20Check/badge.svg?branch=inference) ![clang-format](https://github.com/flexflow/flexflow/workflows/clang-format%20Check/badge.svg?branch=inference) [![Documentation Status](https://readthedocs.org/projects/flexflow/badge/?version=latest)](https://flexflow.readthedocs.io/en/latest/?badge=latest) -FlexFlow is a deep learning framework that accelerates distributed DNN training by automatically searching for efficient parallelization 
strategies. FlexFlow provides a drop-in replacement for PyTorch and TensorFlow Keras. Running existing PyTorch and Keras programs in FlexFlow only requires [a few lines of changes to the program](https://flexflow.ai/keras). -## Install FlexFlow -To install FlexFlow from source code, please read the [instructions](https://flexflow.readthedocs.io/en/latest/installation.html). If you would like to quickly try FlexFlow, we also provide pre-built Docker packages for several versions of CUDA and for the `hip_rocm` backend, together with [Dockerfiles](./docker) if you wish to build the containers manually. More info on the Docker images can be found [here](./docker/README.md). You can also use `conda` to install the FlexFlow Python package (coming soon). +--- -## PyTorch Support -Users can also use FlexFlow to optimize the parallelization performance of existing PyTorch models in two steps. First, a PyTorch model can be exported to the FlexFlow model format using `flexflow.torch.fx.torch_to_flexflow`. -```python -import torch -import flexflow.torch.fx as fx +## News 🔥: -model = MyPyTorchModule() -fx.torch_to_flexflow(model, "mymodel.ff") -``` +* [08/16/2023] Adding Starcoder model support +* [08/14/2023] Released Dockerfile for different CUDA versions + +## Install FlexFlow -Second, a FlexFlow program can directly import a previously saved PyTorch model and [autotune](https://www.usenix.org/conference/osdi22/presentation/unger) the parallelization performance for a given parallel machine. -```python -from flexflow.pytorch.model import PyTorchModel +### Requirements +* OS: Linux +* GPU backend: Hip-ROCm or CUDA + * CUDA version: 10.2 – 12.0 + * NVIDIA compute capability: 6.0 or higher +* Python: 3.6 or higher +* Package dependencies: [see here](https://github.com/flexflow/FlexFlow/blob/inference/requirements.txt) -def top_level_task(): - torch_model = PyTorchModel("mymodel.ff") - output_tensor = torch_model.apply(ffmodel, input_tensor) - ## Model compilation - ffmodel.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy']) - ## Model training - (x_train, y_train) = cifar10.load_data() - ffmodel.fit(x_train, y_train, epochs=30) +### Install with pip +You can install FlexFlow using pip: + +```bash +pip install flexflow ``` -**More FlexFlow PyTorch examples**: see the [pytorch examples folder](https://github.com/flexflow/FlexFlow/tree/master/examples/python/pytorch). +### Try it in Docker +If you run into any issue during the install, or if you would like to use the C++ API without needing to install from source, you can also use our pre-built Docker package for different CUDA versions and the `hip_rocm` backend. To download and run our pre-built Docker container: + +```bash +docker run --gpus all -it --rm --shm-size=8g ghcr.io/flexflow/flexflow-cuda-11.8:latest +``` -## TensorFlow Keras and ONNX Support -FlexFlow prioritizes PyTorch compatibility, but also includes frontends for [Tensorflow Keras](./docs/source/keras.rst) and [ONNX](./docs/source/onnx.rst) models. +To download a Docker container for a backend other than CUDA v11.8, you can replace the `cuda-11.8` suffix with any of the following backends: `cuda-11.1`, `cuda-11.2`, `cuda-11.3`, `cuda-11.5`, `cuda-11.6`, `cuda-11.7`, `cuda-11.8`, and `hip_rocm`). More info on the Docker images, with instructions to build a new image from source, or run with additional configurations, can be found [here](../docker/README.md). -## C++ Interface -For users that prefer to program in C/C++. 
FlexFlow supports a C++ program inference that is equivalent to its Python APIs. +### Build from source -**More FlexFlow C++ examples**: see the [C++ examples folder](https://github.com/flexflow/FlexFlow/tree/master/examples/cpp). +You can install FlexFlow Serve from source code by building the inference branch of FlexFlow. Please follow these [instructions](https://flexflow.readthedocs.io/en/latest/installation.html). -## Command-Line Flags -In addition to setting runtime configurations in a FlexFlow Python/C++ program, the FlexFlow runtime also accepts command-line arguments for various runtime parameters: +## Get Started! -FlexFlow training flags: -* `-e` or `--epochs`: number of total epochs to run (default: 1) -* `-b` or `--batch-size`: global batch size in each iteration (default: 64) -* `-p` or `--print-freq`: print frequency (default: 10) -* `-d` or `--dataset`: path to the training dataset. If not set, synthetic data is used to conduct training. +To get started, check out the quickstart guides below for the FlexFlow training and serving libraries. -Legion runtime flags: -* `-ll:gpu`: number of GPU processors to use on each node (default: 0) -* `-ll:fsize`: size of device memory on each GPU (in MB) -* `-ll:zsize`: size of zero-copy memory (pinned DRAM with direct GPU access) on each node (in MB). This is used for prefecthing training images from disk. -* `-ll:cpu`: number of data loading workers (default: 4) -* `-ll:util`: number of utility threads to create per process (default: 1) -* `-ll:bgwork`: number of background worker threads to create per process (default: 1) +* [FlexFlow Train](./TRAIN.md) +* [FlexFlow Serve](./SERVE.md) -Performance auto-tuning flags: -* `--search-budget` or `--budget`: the number of iterations for the MCMC search (default: 0) -* `--search-alpha` or `--alpha`: a hyper-parameter for the search procedure (default: 0.05) -* `--export-strategy` or `--export`: path to export the best discovered strategy (default: None) -* `--import-strategy` or `--import`: path to import a previous saved strategy (default: None) -* `--enable-parameter-parallel`: allow FlexFlow to explore parameter parallelism for performance auto-tuning. (By default FlexFlow only considers data and model parallelism.) -* `--enable-attribute-parallel`: allow FlexFlow to explore attribute parallelism for performance auto-tuning. (By default FlexFlow only considers data and model parallelism.) -For performance tuning related flags: see [performance autotuning](https://flexflow.ai/search). ## Contributing @@ -75,6 +56,14 @@ Please let us know if you encounter any bugs or have any suggestions by [submitt We welcome all contributions to FlexFlow from bug fixes to new features and extensions. ## Citations + +**FlexFlow Serve:** + +* Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia. [SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification](https://arxiv.org/abs/2305.09781). In ArXiV, May 2023. + + +**FlexFlow Train:** + * Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, Xi Luo, Dheevatsa Mudigere, Jongsoo Park, Misha Smelyanskiy, and Alex Aiken. 
[Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization](https://www.usenix.org/conference/osdi22/presentation/unger). In Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI), July 2022. * Zhihao Jia, Matei Zaharia, and Alex Aiken. [Beyond Data and Model Parallelism for Deep Neural Networks](https://cs.stanford.edu/~zhihao/papers/sysml19a.pdf). In Proceedings of the 2nd Conference on Machine Learning and Systems (MLSys), April 2019. @@ -86,3 +75,4 @@ FlexFlow is developed and maintained by teams at CMU, Facebook, Los Alamos Natio ## License FlexFlow uses Apache License 2.0. + diff --git a/SERVE.md b/SERVE.md new file mode 100644 index 0000000000..e716392b32 --- /dev/null +++ b/SERVE.md @@ -0,0 +1,209 @@ +# FlexFlow Serve: Low-Latency, High-Performance LLM Serving + + +## What is FlexFlow Serve + +The high computational and memory requirements of generative large language +models (LLMs) make it challenging to serve them quickly and cheaply. +FlexFlow Serve is an open-source compiler and distributed system for +__low latency__, __high performance__ LLM serving. FlexFlow Serve outperforms +existing systems by 1.3-2.0x for single-node, multi-GPU inference and by +1.4-2.4x for multi-node, multi-GPU inference. + +

+Performance comparison +

+ + +## Quickstart +The following example shows how to deploy an LLM using FlexFlow Serve and accelerate its serving using [speculative inference](#speculative-inference). First, we import `flexflow.serve` and initialize the FlexFlow Serve runtime. Note that `memory_per_gpu` and `zero_copy_memory_per_node` specify the size of device memory on each GPU (in MB) and zero-copy memory on each node (in MB), respectively. FlexFlow Serve combines tensor and pipeline model parallelism for LLM serving. +```python +import flexflow.serve as ff + +ff.init( + { + "num_gpus": 4, + "memory_per_gpu": 14000, + "zero_copy_memory_per_node": 30000, + "tensor_parallelism_degree": 4, + "pipeline_parallelism_degree": 1, + } +) +``` +Second, we specify the LLM to serve and the SSM(s) used to accelerate LLM serving. The list of supported LLMs and SSMs is available at [supported models](#supported-llms-and-ssms). +```python +# Specify the LLM +llm = ff.LLM("decapoda-research/llama-7b-hf") + +# Specify a list of SSMs (just one in this case) +ssms=[] +ssm = ff.SSM("JackFram/llama-68m") +ssms.append(ssm) +``` +Next, we declare the generation configuration and compile both the LLM and SSMs. Note that all SSMs should run in the **beam search** mode, and the LLM should run in the **tree verification** mode to verify the speculated tokens from SSMs. +```python +# Create the sampling configs +generation_config = ff.GenerationConfig( + do_sample=False, temperature=0.9, topp=0.8, topk=1 +) + +# Compile the SSMs for inference and load the weights into memory +for ssm in ssms: + ssm.compile(generation_config) + +# Compile the LLM for inference and load the weights into memory +llm.compile(generation_config, ssms=ssms) +``` +Finally, we call `llm.generate` to generate the output, which is organized as a list of `GenerationResult`, which include the output tokens and text. +```python +result = llm.generate("Here are some travel tips for Tokyo:\n") +``` + +### Incremental decoding + +
+Expand here +
+ +```python + +import flexflow.serve as ff + +# Initialize the FlexFlow runtime. ff.init() takes a dictionary or the path to a JSON file with the configs +ff.init( + { + "num_gpus": 4, + "memory_per_gpu": 14000, + "zero_copy_memory_per_node": 30000, + "tensor_parallelism_degree": 4, + "pipeline_parallelism_degree": 1, + } +) + +# Create the FlexFlow LLM +llm = ff.LLM("decapoda-research/llama-7b-hf") + +# Create the sampling configs +generation_config = ff.GenerationConfig( + do_sample=True, temperature=0.9, topp=0.8, topk=1 +) + +# Compile the LLM for inference and load the weights into memory +llm.compile(generation_config) + +# Generation begins! +result = llm.generate("Here are some travel tips for Tokyo:\n") + +``` + 
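As the comment in the incremental-decoding example above notes, `ff.init()` accepts either a dictionary or the path to a JSON file holding the same configuration keys. The snippet below is a minimal sketch of the JSON-file variant; the file name `ff_config.json` is an arbitrary choice made here for illustration.

```python
import json

import flexflow.serve as ff

# Write the runtime configuration to a JSON file (hypothetical file name).
configs = {
    "num_gpus": 4,
    "memory_per_gpu": 14000,
    "zero_copy_memory_per_node": 30000,
    "tensor_parallelism_degree": 4,
    "pipeline_parallelism_degree": 1,
}
with open("ff_config.json", "w") as f:
    json.dump(configs, f)

# Initialize the FlexFlow Serve runtime from the JSON file instead of a dictionary.
ff.init("ff_config.json")
```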
+ +### C++ interface +If you'd like to use the C++ interface (mostly used for development and benchmarking purposes), you should install from source, and follow the instructions below. + +
+Expand here +
+ +#### Downloading models + +Before running FlexFlow Serve, you should manually download the LLM and SSM(s) of interest using the [inference/utils/download_hf_model.py](https://github.com/flexflow/FlexFlow/blob/inference/inference/utils/download_hf_model.py) script (see example below). By default, the script will download all of a model's assets (weights, configs, tokenizer files, etc.) into the cache folder `~/.cache/flexflow`. If you would like to use a different folder, you can request that via the parameter `--cache-folder`. + +```bash +python3 ./inference/utils/download_hf_model.py ... +``` + +#### Running the C++ examples +A C++ example is available at [this folder](./inference/spec_infer/). After building FlexFlow Serve, the executable will be available at `/build_dir/inference/spec_infer/spec_infer`. You can use the following command-line arguments to run FlexFlow Serve: + +* `-ll:gpu`: number of GPU processors to use on each node for serving an LLM (default: 0) +* `-ll:fsize`: size of device memory on each GPU, in MB +* `-ll:zsize`: size of zero-copy memory (pinned DRAM with direct GPU access), in MB. FlexFlow Serve keeps a replica of the LLM parameters on zero-copy memory, and therefore requires that the zero-copy memory is sufficient for storing the LLM parameters. +* `-llm-model`: the LLM model ID from HuggingFace (e.g. "decapoda-research/llama-7b-hf") +* `-ssm-model`: the SSM model ID from HuggingFace (e.g. "JackFram/llama-160m"). You can use multiple `-ssm-model`s in the command line to launch multiple SSMs. +* `-cache-folder`: the path to the cache folder containing the downloaded model assets (default: `~/.cache/flexflow`) +* `-data-parallelism-degree`, `-tensor-parallelism-degree` and `-pipeline-parallelism-degree`: parallelization degrees in the data, tensor, and pipeline dimensions. Their product must equal the number of GPUs available on the machine. When any of the three parallelism degree arguments is omitted, a default value of 1 will be used. +* `-prompt`: (optional) path to the prompt file. FlexFlow Serve expects a JSON-format file for prompts (see the sketch at the end of this section for one way to create such a file). In addition to a prompt file, requests can also be registered programmatically through the FlexFlow Serve API (e.g. `llm.generate(...)` in Python). +* `-output-file`: (optional) path to the file where the output of the model will be saved, together with the generation latency + +For example, you can use the following command line to serve a LLaMA-7B or LLaMA-13B model on 4 GPUs, using a boost-tuned LLaMA-68M model as the SSM for speculative inference. + +```bash +./inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 -llm-model decapoda-research/llama-7b-hf -ssm-model JackFram/llama-68m -prompt /path/to/prompt.json -tensor-parallelism-degree 4 --fusion +``` +
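As a companion to the `-prompt` flag described above, the sketch below shows one way to create a prompt file. It assumes the prompt file is a plain JSON array of prompt strings; check the prompt datasets linked in the Prompt Datasets section for the exact schema used by the provided benchmarks.

```python
import json

# A few example prompts to serve; the resulting file can be passed to
# spec_infer via `-prompt /path/to/prompt.json`.
prompts = [
    "Here are some travel tips for Tokyo:\n",
    "Give me three reasons to learn C++:\n",
]

# Assumption: FlexFlow Serve reads the prompt file as a JSON array of strings.
with open("prompt.json", "w") as f:
    json.dump(prompts, f, indent=2)
```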
+ +## Speculative Inference +A key technique that enables FlexFlow Serve to accelerate LLM serving is speculative +inference, which combines various collectively boost-tuned small speculative +models (SSMs) to jointly predict the LLM’s outputs; the predictions are organized as a +token tree, whose nodes each represent a candidate token sequence. The correctness +of all candidate token sequences represented by a token tree is verified against the +LLM’s output in parallel using a novel tree-based parallel decoding mechanism. +FlexFlow Serve uses an LLM as a token tree verifier instead of an incremental decoder, +which largely reduces the end-to-end inference latency and computational requirement +for serving generative LLMs while provably preserving model quality. + +
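As a concrete illustration of the paragraph above, the sketch below reuses the quickstart API to register two collectively boost-tuned SSMs alongside a single LLM verifier; the model IDs come from the table in the next section, and `ff.init(...)` is assumed to have been called as in the quickstart.

```python
import flexflow.serve as ff

# Assumes ff.init(...) has already been called, as in the quickstart above.

# The LLM acts as the token tree verifier.
llm = ff.LLM("meta-llama/Llama-2-7b-hf")

# Two boost-tuned SSMs jointly propose candidate token sequences.
ssms = [ff.SSM("JackFram/llama-68m"), ff.SSM("JackFram/llama-160m")]

generation_config = ff.GenerationConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)

# SSMs run in beam search mode; the LLM runs in tree verification mode.
for ssm in ssms:
    ssm.compile(generation_config)
llm.compile(generation_config, ssms=ssms)

result = llm.generate("Here are some travel tips for Tokyo:\n")
```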

+A Speculative Inference Demo +

+ + +### Supported LLMs and SSMs + +FlexFlow Serve currently supports all HuggingFace models with the following architectures: +* `LlamaForCausalLM` / `LLaMAForCausalLM` (e.g. LLaMA/LLaMA-2, Guanaco, Vicuna, Alpaca, ...) +* `OPTForCausalLM` (models from the OPT family) +* `RWForCausalLM` (models from the Falcon family) +* `GPTBigCodeForCausalLM` (models from the Starcoder family) + +Below is a list of models that we have explicitly tested and for which an SSM may be available (a sketch showing how to serve a model that does not yet have a boost-tuned SSM is included after the Quantization subsection below): + +| Model | Model id on HuggingFace | Boost-tuned SSMs | +| :---- | :---- | :---- | +| LLaMA-7B | decapoda-research/llama-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m), [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) | +| LLaMA-13B | decapoda-research/llama-13b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m), [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) | +| LLaMA-30B | decapoda-research/llama-30b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m), [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) | +| LLaMA-65B | decapoda-research/llama-65b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m), [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) | +| LLaMA-2-7B | meta-llama/Llama-2-7b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m), [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) | +| LLaMA-2-13B | meta-llama/Llama-2-13b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m), [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) | +| LLaMA-2-70B | meta-llama/Llama-2-70b-hf | [LLaMA-68M](https://huggingface.co/JackFram/llama-68m), [LLaMA-160M](https://huggingface.co/JackFram/llama-160m) | +| OPT-6.7B | facebook/opt-6.7b | [OPT-125M](https://huggingface.co/facebook/opt-125m) | +| OPT-13B | facebook/opt-13b | [OPT-125M](https://huggingface.co/facebook/opt-125m) | +| OPT-30B | facebook/opt-30b | [OPT-125M](https://huggingface.co/facebook/opt-125m) | +| OPT-66B | facebook/opt-66b | [OPT-125M](https://huggingface.co/facebook/opt-125m) | +| Falcon-7B | tiiuae/falcon-7b | | +| Falcon-40B | tiiuae/falcon-40b | | +| StarCoder-15.5B | bigcode/starcoder | | + + +### CPU Offloading +FlexFlow Serve also offers offloading-based inference for running large models (e.g., LLaMA-7B) on a single GPU. With CPU offloading, tensors are kept in CPU memory and only copied to the GPU when they are needed for computation. We currently offload only the largest weight tensors (the weight tensors of the Linear and Attention operators). Since offloading adds extra data-movement and computational cost at runtime, and the small model occupies considerably less space and does not pose a bottleneck for GPU memory, we only offload the weights of the large model. [TODO: update instructions] You can run the offloading example by enabling the `-offload` and `-offload-reserve-space-size` flags. + +### Quantization +FlexFlow Serve supports int4 and int8 quantization. The compressed tensors are stored on the CPU side. Once copied to the GPU, these tensors undergo decompression and conversion back to their original precision. Please find the compressed weight files in our s3 bucket, or use [this script](./inference/utils/compress_llama_weights.py) from the [FlexGen](https://github.com/FMInference/FlexGen) project to do the compression manually. [TODO: update instructions for quantization].
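For models in the table above that do not yet have boost-tuned SSMs (e.g. the Falcon and StarCoder families), incremental decoding can be used instead of speculative inference. The sketch below mirrors the incremental-decoding example earlier in this document and simply swaps in a model ID from the table; `ff.init(...)` is assumed to have been called as in the quickstart.

```python
import flexflow.serve as ff

# Assumes ff.init(...) has already been called, as in the quickstart.

# Falcon-7B is listed in the table above without a boost-tuned SSM, so we
# serve it with incremental decoding (no SSMs passed to compile()).
llm = ff.LLM("tiiuae/falcon-7b")

generation_config = ff.GenerationConfig(
    do_sample=True, temperature=0.9, topp=0.8, topk=1
)

llm.compile(generation_config)
result = llm.generate("Here are some travel tips for Tokyo:\n")
```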
+ +### Prompt Datasets +We provide five prompt datasets for evaluating FlexFlow Serve: [Chatbot instruction prompts](https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatbot.json), [ChatGPT Prompts](https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatgpt.json), [WebQA](https://specinfer.s3.us-east-2.amazonaws.com/prompts/webqa.json), [Alpaca](https://specinfer.s3.us-east-2.amazonaws.com/prompts/alpaca.json), and [PIQA](https://specinfer.s3.us-east-2.amazonaws.com/prompts/piqa.json). + +## TODOs + +FlexFlow Serve is still under active development. We currently focus on the following tasks and strongly welcome all contributions from bug fixes to new features and extensions. + +* AMD support. We are actively working on supporting FlexFlow Serve on AMD GPUs and welcome any contributions to this effort. + +## Acknowledgements +This project was initiated by members from CMU, Stanford, and UCSD. We will continue developing and supporting FlexFlow Serve. Please cite FlexFlow Serve as: + +```bibtex +@misc{miao2023specinfer, + title={SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification}, + author={Xupeng Miao and Gabriele Oliaro and Zhihao Zhang and Xinhao Cheng and Zeyu Wang and Rae Ying Yee Wong and Alan Zhu and Lijie Yang and Xiaoxiang Shi and Chunan Shi and Zhuoming Chen and Daiyaan Arfeen and Reyna Abhyankar and Zhihao Jia}, + year={2023}, + eprint={2305.09781}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} +``` + +## License +FlexFlow uses Apache License 2.0. diff --git a/TRAIN.md b/TRAIN.md new file mode 100644 index 0000000000..1595274a4c --- /dev/null +++ b/TRAIN.md @@ -0,0 +1,65 @@ +# FlexFlow Train: Distributed DNN Training with Flexible Parallelization Strategies +FlexFlow Train is a deep learning framework that accelerates distributed DNN training by automatically searching for efficient parallelization strategies. FlexFlow Train provides a drop-in replacement for PyTorch and TensorFlow Keras. Running existing PyTorch and Keras programs in FlexFlow Train only requires [a few lines of changes to the program](https://flexflow.ai/keras). + + +## PyTorch Support +Users can use FlexFlow Train to optimize the parallelization performance of existing PyTorch models in two steps. First, a PyTorch model can be exported to the FlexFlow model format using `flexflow.torch.fx.torch_to_flexflow`. +```python +import torch +import flexflow.torch.fx as fx + +model = MyPyTorchModule() +fx.torch_to_flexflow(model, "mymodel.ff") +``` + +Second, a FlexFlow Train program can directly import a previously saved PyTorch model and [autotune](https://www.usenix.org/conference/osdi22/presentation/unger) the parallelization performance for a given parallel machine. + +```python +from flexflow.pytorch.model import PyTorchModel + +def top_level_task(): + torch_model = PyTorchModel("mymodel.ff") + output_tensor = torch_model.apply(ffmodel, input_tensor) + ## Model compilation + ffmodel.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy']) + ## Model training + (x_train, y_train) = cifar10.load_data() + ffmodel.fit(x_train, y_train, epochs=30) +``` + +**More FlexFlow PyTorch examples**: see the [pytorch examples folder](https://github.com/flexflow/FlexFlow/tree/master/examples/python/pytorch).
+ +## TensorFlow Keras and ONNX Support +FlexFlow Train prioritizes PyTorch compatibility, but also includes frontends for [TensorFlow Keras](./docs/source/keras.rst) and [ONNX](./docs/source/onnx.rst) models. + +## C++ Interface +For users who prefer to program in C/C++, FlexFlow Train supports a C++ programming interface that is equivalent to its Python APIs. + +**More FlexFlow C++ examples**: see the [C++ examples folder](https://github.com/flexflow/FlexFlow/tree/master/examples/cpp). + + +## Command-Line Flags +In addition to setting runtime configurations in a FlexFlow Train Python/C++ program, the FlexFlow Train runtime also accepts command-line arguments for various runtime parameters: + +FlexFlow training flags: +* `-e` or `--epochs`: number of total epochs to run (default: 1) +* `-b` or `--batch-size`: global batch size in each iteration (default: 64) +* `-p` or `--print-freq`: print frequency (default: 10) +* `-d` or `--dataset`: path to the training dataset. If not set, synthetic data is used to conduct training. + +Legion runtime flags: +* `-ll:gpu`: number of GPU processors to use on each node (default: 0) +* `-ll:fsize`: size of device memory on each GPU (in MB) +* `-ll:zsize`: size of zero-copy memory (pinned DRAM with direct GPU access) on each node (in MB). This is used for prefetching training images from disk. +* `-ll:cpu`: number of data loading workers (default: 4) +* `-ll:util`: number of utility threads to create per process (default: 1) +* `-ll:bgwork`: number of background worker threads to create per process (default: 1) + +Performance auto-tuning flags: +* `--search-budget` or `--budget`: the number of iterations for the MCMC search (default: 0) +* `--search-alpha` or `--alpha`: a hyper-parameter for the search procedure (default: 0.05) +* `--export-strategy` or `--export`: path to export the best discovered strategy (default: None) +* `--import-strategy` or `--import`: path to import a previously saved strategy (default: None) +* `--enable-parameter-parallel`: allow FlexFlow Train to explore parameter parallelism for performance auto-tuning. (By default FlexFlow Train only considers data and model parallelism.) +* `--enable-attribute-parallel`: allow FlexFlow Train to explore attribute parallelism for performance auto-tuning. (By default FlexFlow Train only considers data and model parallelism.) +For performance-tuning-related flags, see [performance autotuning](https://flexflow.ai/search).
diff --git a/cmake/hip.cmake b/cmake/hip.cmake new file mode 100644 index 0000000000..b32d68d608 --- /dev/null +++ b/cmake/hip.cmake @@ -0,0 +1,11 @@ +if (NOT FF_HIP_ARCH STREQUAL "") + if (FF_HIP_ARCH STREQUAL "all") + set(FF_HIP_ARCH "gfx900,gfx902,gfx904,gfx906,gfx908,gfx909,gfx90a,gfx90c,gfx940,gfx1010,gfx1011,gfx1012,gfx1013,gfx1030,gfx1031,gfx1032,gfx1033,gfx1034,gfx1035,gfx1036,gfx1100,gfx1101,gfx1102,gfx1103") + endif() + string(REPLACE "," " " HIP_ARCH_LIST "${FF_HIP_ARCH}") +endif() + +message(STATUS "FF_HIP_ARCH: ${FF_HIP_ARCH}") +if(FF_GPU_BACKEND STREQUAL "hip_rocm") + set(HIP_CLANG_PATH ${ROCM_PATH}/llvm/bin CACHE STRING "Path to the clang compiler by ROCM" FORCE) +endif() diff --git a/cmake/legion.cmake b/cmake/legion.cmake index b4cfad20e2..b83cbc52f2 100644 --- a/cmake/legion.cmake +++ b/cmake/legion.cmake @@ -142,8 +142,11 @@ else() set(Legion_USE_HIP ON CACHE BOOL "enable Legion_USE_HIP" FORCE) if (FF_GPU_BACKEND STREQUAL "hip_cuda") set(Legion_HIP_TARGET "CUDA" CACHE STRING "Legion_HIP_TARGET CUDA" FORCE) + set(Legion_CUDA_ARCH ${FF_CUDA_ARCH} CACHE STRING "Legion CUDA ARCH" FORCE) elseif(FF_GPU_BACKEND STREQUAL "hip_rocm") set(Legion_HIP_TARGET "ROCM" CACHE STRING "Legion HIP_TARGET ROCM" FORCE) + set(Legion_HIP_ARCH ${FF_HIP_ARCH} CACHE STRING "Legion HIP ARCH" FORCE) + message(STATUS "Legion_HIP_ARCH: ${Legion_HIP_ARCH}") endif() endif() set(Legion_REDOP_COMPLEX OFF CACHE BOOL "disable complex") diff --git a/conda/environment.yml b/conda/environment.yml index 2069acccdf..c1acd7b3da 100644 --- a/conda/environment.yml +++ b/conda/environment.yml @@ -7,6 +7,7 @@ dependencies: - cffi>=1.11.0 - Pillow - pybind11 + - rust - cmake-build-extension - pip - pip: diff --git a/conda/flexflow-cpu.yml b/conda/flexflow-cpu.yml deleted file mode 100644 index cc6fcf4667..0000000000 --- a/conda/flexflow-cpu.yml +++ /dev/null @@ -1,20 +0,0 @@ -name: flexflow -channels: - - defaults - - conda-forge -dependencies: - - python>=3.6 - - cffi>=1.11.0 - - Pillow - - pybind11 - - cmake-build-extension - - pytest - - pip - - pip: - - qualname>=0.1.0 - - keras_preprocessing>=1.1.2 - - numpy>=1.16.0 - - torch --index-url https://download.pytorch.org/whl/cpu - - torchaudio --index-url https://download.pytorch.org/whl/cpu - - torchvision --index-url https://download.pytorch.org/whl/cpu - - requests diff --git a/conda/flexflow.yml b/conda/flexflow.yml new file mode 100644 index 0000000000..9ff7f3957a --- /dev/null +++ b/conda/flexflow.yml @@ -0,0 +1,26 @@ +name: flexflow +channels: + - defaults + - conda-forge +dependencies: + - python>=3.6 + - cffi>=1.11.0 + - Pillow + - pybind11 + - rust + - cmake-build-extension + - pytest + - pip + - pip: + - qualname>=0.1.0 + - keras_preprocessing>=1.1.2 + - numpy>=1.16.0 + - torch>=1.13.1 --index-url https://download.pytorch.org/whl/cpu + - torchaudio>=0.13.1 --index-url https://download.pytorch.org/whl/cpu + - torchvision>=0.14.1 --index-url https://download.pytorch.org/whl/cpu + - regex + - onnx + - transformers>=4.31.0 + - sentencepiece + - einops + - requests diff --git a/config/config.inc b/config/config.inc index b146d228d5..7f1f0ffcf4 100644 --- a/config/config.inc +++ b/config/config.inc @@ -27,6 +27,19 @@ if [ -n "$INSTALL_DIR" ]; then SET_INSTALL_DIR="-DCMAKE_INSTALL_PREFIX=${INSTALL_DIR}" fi +if [ "$INFERENCE_TESTS" = "ON" ]; then + SET_INFERENCE_TESTS="-DINFERENCE_TESTS=ON" +else + SET_INFERENCE_TESTS="-DINFERENCE_TESTS=OFF" +fi + +#set cmake prefix path dir +if [ -n "$LIBTORCH_PATH" ]; then + SET_LIBTORCH_PATH="-DLIBTORCH_PATH=${LIBTORCH_PATH}" 
+else + SET_LIBTORCH_PATH="" +fi + # set build type if [ -n "$BUILD_TYPE" ]; then SET_BUILD="-DCMAKE_BUILD_TYPE=${BUILD_TYPE}" @@ -37,6 +50,11 @@ if [ -n "$FF_CUDA_ARCH" ]; then SET_CUDA_ARCH="-DFF_CUDA_ARCH=${FF_CUDA_ARCH}" fi +# set HIP Arch +if [ -n "$FF_HIP_ARCH" ]; then + SET_HIP_ARCH="-DFF_HIP_ARCH=${FF_HIP_ARCH}" +fi + # set CUDA dir if [ -n "$CUDA_DIR" ]; then SET_CUDA="-DCUDA_PATH=${CUDA_DIR}" @@ -106,6 +124,13 @@ elif [ "$FF_BUILD_ALL_EXAMPLES" = "OFF" ]; then else SET_EXAMPLES="-DFF_BUILD_ALL_EXAMPLES=ON" fi +if [ "$FF_BUILD_ALL_INFERENCE_EXAMPLES" = "ON" ]; then + SET_INFERENCE_EXAMPLES="-DFF_BUILD_ALL_INFERENCE_EXAMPLES=ON" +elif [ "$FF_BUILD_ALL_INFERENCE_EXAMPLES" = "OFF" ]; then + SET_INFERENCE_EXAMPLES="-DFF_BUILD_ALL_INFERENCE_EXAMPLES=OFF" +else + SET_INFERENCE_EXAMPLES="-DFF_BUILD_ALL_INFERENCE_EXAMPLES=ON" +fi # enable C++ unit tests if [ "$FF_BUILD_UNIT_TESTS" = "ON" ]; then @@ -154,6 +179,11 @@ if [ -n "$FF_MAX_DIM" ]; then SET_MAX_DIM="-DFF_MAX_DIM=${FF_MAX_DIM}" fi +#set LEGION_MAX_RETURN_SIZE +if [ -n "$LEGION_MAX_RETURN_SIZE" ]; then + SET_LEGION_MAX_RETURN_SIZE="-DLEGION_MAX_RETURN_SIZE=${LEGION_MAX_RETURN_SIZE}" +fi + # set ROCM path if [ -n "$ROCM_PATH" ]; then SET_ROCM_PATH="-DROCM_PATH=${ROCM_PATH}" @@ -197,7 +227,7 @@ if [ -n "$FF_GPU_BACKEND" ]; then fi fi -CMAKE_FLAGS="-DCUDA_USE_STATIC_CUDA_RUNTIME=OFF -DLegion_HIJACK_CUDART=OFF ${SET_CC} ${SET_CXX} ${SET_INSTALL_DIR} ${SET_BUILD} ${SET_CUDA_ARCH} ${SET_CUDA} ${SET_CUDNN} ${SET_PYTHON} ${SET_BUILD_LEGION_ONLY} ${SET_NCCL} ${SET_NCCL_DIR} ${SET_LEGION_NETWORKS} ${SET_EXAMPLES} ${SET_USE_PREBUILT_LEGION} ${SET_USE_PREBUILT_NCCL} ${SET_USE_ALL_PREBUILT_LIBRARIES} ${SET_BUILD_UNIT_TESTS} ${SET_AVX2} ${SET_MAX_DIM} ${SET_ROCM_PATH} ${SET_FF_GPU_BACKEND}" +CMAKE_FLAGS="-DCUDA_USE_STATIC_CUDA_RUNTIME=OFF -DLegion_HIJACK_CUDART=OFF ${SET_CC} ${SET_CXX} ${SET_INSTALL_DIR} ${SET_INFERENCE_TESTS} ${SET_LIBTORCH_PATH} ${SET_BUILD} ${SET_CUDA_ARCH} ${SET_CUDA} ${SET_CUDNN} ${SET_HIP_ARCH} ${SET_PYTHON} ${SET_BUILD_LEGION_ONLY} ${SET_NCCL} ${SET_NCCL_DIR} ${SET_LEGION_NETWORKS} ${SET_EXAMPLES} ${SET_INFERENCE_EXAMPLES} ${SET_USE_PREBUILT_LEGION} ${SET_USE_PREBUILT_NCCL} ${SET_USE_ALL_PREBUILT_LIBRARIES} ${SET_BUILD_UNIT_TESTS} ${SET_AVX2} ${SET_MAX_DIM} ${SET_LEGION_MAX_RETURN_SIZE} ${SET_ROCM_PATH} ${SET_FF_GPU_BACKEND}" function run_cmake() { SRC_LOCATION=${SRC_LOCATION:=`dirname $0`/../} diff --git a/config/config.linux b/config/config.linux index 509a713e66..056ebe0fed 100755 --- a/config/config.linux +++ b/config/config.linux @@ -1,5 +1,4 @@ #!/bin/bash - # set the CC and CXX, usually it is not needed as cmake can detect it # set CC and CXX to mpicc and mpic++ when enable gasnet # CC=mpicc @@ -16,11 +15,26 @@ # set build type BUILD_TYPE=${BUILD_TYPE:-Release} +INFERENCE_TESTS=${INFERENCE_TESTS:-OFF} +LIBTORCH_PATH=${LIBTORCH_PATH:-"$(realpath ../..)/libtorch"} +if [[ "$INFERENCE_TESTS" == "ON" && ! -d "$LIBTORCH_PATH" ]]; then + cwd="$(pwd)" + cd ../.. + wget https://download.pytorch.org/libtorch/nightly/cpu/libtorch-shared-with-deps-latest.zip + unzip libtorch-shared-with-deps-latest.zip + rm libtorch-shared-with-deps-latest.zip + LIBTORCH_PATH="$(pwd)/libtorch" + cd "$cwd" +fi + # set CUDA Arch to the desired GPU architecture(s) to target (e.g. pass "FF_CUDA_ARCH=60" for Pascal). # To pass more than one value, separate architecture numbers with a comma (e.g. FF_CUDA_ARCH=70,75). 
# Alternatively, set "FF_CUDA_ARCH=autodetect" to build FlexFlow for all architectures detected on the machine, # or set "FF_CUDA_ARCH=all" to build FlexFlow for all supported GPU architectures FF_CUDA_ARCH=${FF_CUDA_ARCH:-"autodetect"} +# FF_HIP_ARCH only supports building for a specific AMD architecture, a list of architectures separated by a comma +# or all available architectures. TODO: support autodetect +FF_HIP_ARCH=${FF_HIP_ARCH:-"all"} # set CUDNN dir in case cmake cannot autodetect a path CUDNN_DIR=${CUDNN_DIR:-"/usr/local/cuda"} @@ -45,6 +59,7 @@ FF_UCX_URL=${FF_UCX_URL:-""} # build C++ examples FF_BUILD_ALL_EXAMPLES=${FF_BUILD_ALL_EXAMPLES:-OFF} +FF_BUILD_ALL_INFERENCE_EXAMPLES=${FF_BUILD_ALL_INFERENCE_EXAMPLES:-ON} # build C++ unit tests FF_BUILD_UNIT_TESTS=${FF_BUILD_UNIT_TESTS:-OFF} @@ -65,6 +80,9 @@ FF_MAX_DIM=${FF_MAX_DIM:-5} # set BUILD_LEGION_ONLY BUILD_LEGION_ONLY=${BUILD_LEGION_ONLY:-OFF} +# set LEGION_MAX_RETURN_SIZE +LEGION_MAX_RETURN_SIZE=${LEGION_MAX_RETURN_SIZE:-262144} + # set ROCM path ROCM_PATH=${ROCM_PATH:-"/opt/rocm"} @@ -82,7 +100,7 @@ fi function get_build_configs() { # Create a string with the values of the variables set in this script - BUILD_CONFIGS="FF_CUDA_ARCH=${FF_CUDA_ARCH} CUDNN_DIR=${CUDNN_DIR} CUDA_DIR=${CUDA_DIR} NCCL_DIR=${NCCL_DIR} FF_USE_PYTHON=${FF_USE_PYTHON} BUILD_LEGION_ONLY=${BUILD_LEGION_ONLY} FF_GASNET_CONDUIT=${FF_GASNET_CONDUIT} FF_UCX_URL=${FF_UCX_URL} FF_LEGION_NETWORKS=${FF_LEGION_NETWORKS} FF_BUILD_ALL_EXAMPLES=${FF_BUILD_ALL_EXAMPLES} FF_BUILD_UNIT_TESTS=${FF_BUILD_UNIT_TESTS} FF_USE_PREBUILT_NCCL=${FF_USE_PREBUILT_NCCL} FF_USE_PREBUILT_LEGION=${FF_USE_PREBUILT_LEGION} FF_USE_ALL_PREBUILT_LIBRARIES=${FF_USE_ALL_PREBUILT_LIBRARIES} FF_USE_AVX2=${FF_USE_AVX2} FF_MAX_DIM=${FF_MAX_DIM} ROCM_PATH=${ROCM_PATH} FF_GPU_BACKEND=${FF_GPU_BACKEND} INSTALL_DIR=${INSTALL_DIR}" + BUILD_CONFIGS="FF_CUDA_ARCH=${FF_CUDA_ARCH} FF_HIP_ARCH=${FF_HIP_ARCH} CUDNN_DIR=${CUDNN_DIR} CUDA_DIR=${CUDA_DIR} NCCL_DIR=${NCCL_DIR} FF_USE_PYTHON=${FF_USE_PYTHON} BUILD_LEGION_ONLY=${BUILD_LEGION_ONLY} FF_GASNET_CONDUIT=${FF_GASNET_CONDUIT} FF_UCX_URL=${FF_UCX_URL} FF_LEGION_NETWORKS=${FF_LEGION_NETWORKS} FF_BUILD_ALL_EXAMPLES=${FF_BUILD_ALL_EXAMPLES} FF_BUILD_ALL_INFERENCE_EXAMPLES=${FF_BUILD_ALL_INFERENCE_EXAMPLES} FF_BUILD_UNIT_TESTS=${FF_BUILD_UNIT_TESTS} FF_USE_PREBUILT_NCCL=${FF_USE_PREBUILT_NCCL} FF_USE_PREBUILT_LEGION=${FF_USE_PREBUILT_LEGION} FF_USE_ALL_PREBUILT_LIBRARIES=${FF_USE_ALL_PREBUILT_LIBRARIES} FF_USE_AVX2=${FF_USE_AVX2} FF_MAX_DIM=${FF_MAX_DIM} ROCM_PATH=${ROCM_PATH} FF_GPU_BACKEND=${FF_GPU_BACKEND}" } if [[ -n "$1" && ( "$1" == "CMAKE_FLAGS" || "$1" == "CUDA_PATH" ) ]]; then diff --git a/deps/tokenizers-cpp b/deps/tokenizers-cpp new file mode 160000 index 0000000000..4f42c9fa74 --- /dev/null +++ b/deps/tokenizers-cpp @@ -0,0 +1 @@ +Subproject commit 4f42c9fa74946d70af86671a3804b6f2433e5dac diff --git a/docker/build.sh b/docker/build.sh index a254fb3116..6603d919f5 100755 --- a/docker/build.sh +++ b/docker/build.sh @@ -2,7 +2,7 @@ set -euo pipefail # Usage: ./build.sh -# Optional environment variables: FF_GPU_BACKEND, cuda_version +# Optional environment variables: FF_GPU_BACKEND, cuda_version, hip_version # Cd into $FF_HOME. Assumes this script is in $FF_HOME/docker cd "${BASH_SOURCE[0]%/*}/.." @@ -11,6 +11,7 @@ cd "${BASH_SOURCE[0]%/*}/.." 
image=${1:-flexflow} FF_GPU_BACKEND=${FF_GPU_BACKEND:-cuda} cuda_version=${cuda_version:-"empty"} +hip_version=${hip_version:-"empty"} python_version=${python_version:-latest} # Check docker image name @@ -29,58 +30,98 @@ else echo "Building $image docker image with default GPU backend: cuda" fi +# base image to use when building the flexflow environment docker image. +ff_environment_base_image="ubuntu:20.04" +# gpu backend version suffix for the docker image. +gpu_backend_version="" + if [[ "${FF_GPU_BACKEND}" == "cuda" || "${FF_GPU_BACKEND}" == "hip_cuda" ]]; then # Autodetect cuda version if not specified if [[ $cuda_version == "empty" ]]; then - cuda_version=$(command -v nvcc >/dev/null 2>&1 && nvcc --version | grep "release" | awk '{print $NF}') + # shellcheck disable=SC2015 + cuda_version=$(command -v nvcc >/dev/null 2>&1 && nvcc --version | grep "release" | awk '{print $NF}' || true) # Change cuda_version eg. V11.7.99 to 11.7 cuda_version=${cuda_version:1:4} + if [[ -z "$cuda_version" ]]; then + echo "Could not detect CUDA version. Please specify one manually by setting the 'cuda_version' env." + exit 1 + fi fi # Check that CUDA version is supported, and modify cuda version to include default subsubversion - if [[ "$cuda_version" == @(11.1|11.3|11.7) ]]; then + if [[ "$cuda_version" == @(11.1|11.3|11.7|12.0|12.1) ]]; then cuda_version_input=${cuda_version}.1 elif [[ "$cuda_version" == @(11.2|11.5|11.6) ]]; then cuda_version_input=${cuda_version}.2 - elif [[ "$cuda_version" == @(11.8) ]]; then + elif [[ "$cuda_version" == @(11.4) ]]; then + cuda_version_input=${cuda_version}.3 + elif [[ "$cuda_version" == @(11.8|12.2) ]]; then cuda_version_input=${cuda_version}.0 else - echo "cuda_version is not supported, please choose among {11.1|11.2|11.3|11.5|11.6|11.7|11.8}" + echo "cuda_version is not supported, please choose among {11.1|11.2|11.3|11.4|11.5|11.6|11.7|11.8|12.0|12.1|12.2}" exit 1 fi - # Set cuda version suffix to docker image name + # Use CUDA 12.0 for all versions greater or equal to 12.0 for now + if [[ "$cuda_version" == @(12.1|12.2|12.3|12.4|12.5|12.6|12.7|12.8|12.9) ]]; then + cuda_version=12.0 + cuda_version_input=${cuda_version}.1 + fi echo "Building $image docker image with CUDA $cuda_version" - cuda_version="-${cuda_version}" -else - # Empty cuda version suffix for non-CUDA images - cuda_version="" - # Pick a default CUDA version for the base docker image from NVIDIA - cuda_version_input="11.8.0" + ff_environment_base_image="nvidia/cuda:${cuda_version_input}-cudnn8-devel-ubuntu20.04" + gpu_backend_version="-${cuda_version}" +fi + +if [[ "${FF_GPU_BACKEND}" == "hip_rocm" || "${FF_GPU_BACKEND}" == "hip_cuda" ]]; then + # Autodetect HIP version if not specified + if [[ $hip_version == "empty" ]]; then + # shellcheck disable=SC2015 + hip_version=$(command -v hipcc >/dev/null 2>&1 && hipcc --version | grep "HIP version:" | awk '{print $NF}' || true) + # Change hip_version eg. 5.6.31061-8c743ae5d to 5.6 + hip_version=${hip_version:0:3} + if [[ -z "$hip_version" ]]; then + echo "Could not detect HIP version. Please specify one manually by setting the 'hip_version' env." + exit 1 + fi + fi + # Check that HIP version is supported + if [[ "$hip_version" != @(5.3|5.4|5.5|5.6) ]]; then + echo "hip_version is not supported, please choose among {5.3, 5.4, 5.5, 5.6}" + exit 1 + fi + echo "Building $image docker image with HIP $hip_version" + if [[ "${FF_GPU_BACKEND}" == "hip_rocm" ]]; then + gpu_backend_version="-${hip_version}" + fi fi +# Get number of cores available on the machine. 
Build with all cores but one, to prevent RAM choking +cores_available=$(nproc --all) +n_build_cores=$(( cores_available -1 )) + # check python_version if [[ "$python_version" != @(3.8|3.9|3.10|3.11|latest) ]]; then echo "python_version not supported!" exit 0 fi -docker build --build-arg "FF_GPU_BACKEND=${FF_GPU_BACKEND}" --build-arg "cuda_version=${cuda_version_input}" --build-arg "python_version=${python_version}" -t "flexflow-environment-${FF_GPU_BACKEND}${cuda_version}" -f docker/flexflow-environment/Dockerfile . +docker build --build-arg "ff_environment_base_image=${ff_environment_base_image}" --build-arg "N_BUILD_CORES=${n_build_cores}" --build-arg "FF_GPU_BACKEND=${FF_GPU_BACKEND}" --build-arg "hip_version=${hip_version}" --build-arg "python_version=${python_version}" -t "flexflow-environment-${FF_GPU_BACKEND}${gpu_backend_version}" -f docker/flexflow-environment/Dockerfile . # If the user only wants to build the environment image, we are done if [[ "$image" == "flexflow-environment" ]]; then exit 0 fi -# Gather arguments needed to build the FlexFlow image -# Get number of cores available on the machine. Build with all cores but one, to prevent RAM choking -cores_available=$(nproc --all) -n_build_cores=$(( cores_available -1 )) +# Done with flexflow-environment image + +########################################################################################### -# If FF_CUDA_ARCH is set to autodetect, we need to perform the autodetection here because the Docker -# image will not have access to GPUs during the build phase (due to a Docker restriction). In all other -# cases, we pass the value of FF_CUDA_ARCH directly to Cmake. -if [[ "${FF_CUDA_ARCH:-autodetect}" == "autodetect" ]]; then - # Get CUDA architecture(s), if GPUs are available - cat << EOF > ./get_gpu_arch.cu +# Build flexflow image if requested +if [[ "${FF_GPU_BACKEND}" == "cuda" || "${FF_GPU_BACKEND}" == "hip_cuda" ]]; then + # If FF_CUDA_ARCH is set to autodetect, we need to perform the autodetection here because the Docker + # image will not have access to GPUs during the build phase (due to a Docker restriction). In all other + # cases, we pass the value of FF_CUDA_ARCH directly to Cmake. + if [[ "${FF_CUDA_ARCH:-autodetect}" == "autodetect" ]]; then + # Get CUDA architecture(s), if GPUs are available + cat << EOF > ./get_gpu_arch.cu #include int main() { int count = 0; @@ -94,24 +135,25 @@ int main() { return 0; } EOF - gpu_arch_codes="" - if command -v nvcc &> /dev/null - then - nvcc ./get_gpu_arch.cu -o ./get_gpu_arch - gpu_arch_codes="$(./get_gpu_arch)" - fi - gpu_arch_codes="$(echo "$gpu_arch_codes" | xargs -n1 | sort -u | xargs)" - gpu_arch_codes="${gpu_arch_codes// /,}" - rm -f ./get_gpu_arch.cu ./get_gpu_arch - - if [[ -n "$gpu_arch_codes" ]]; then - echo "Host machine has GPUs with architecture codes: $gpu_arch_codes" - echo "Configuring FlexFlow to build for the $gpu_arch_codes code(s)." - FF_CUDA_ARCH="${gpu_arch_codes}" - export FF_CUDA_ARCH - else - echo "FF_CUDA_ARCH is set to 'autodetect', but the host machine does not have any compatible GPUs." 
- exit 1 + gpu_arch_codes="" + if command -v nvcc &> /dev/null + then + nvcc ./get_gpu_arch.cu -o ./get_gpu_arch + gpu_arch_codes="$(./get_gpu_arch)" + fi + gpu_arch_codes="$(echo "$gpu_arch_codes" | xargs -n1 | sort -u | xargs)" + gpu_arch_codes="${gpu_arch_codes// /,}" + rm -f ./get_gpu_arch.cu ./get_gpu_arch + + if [[ -n "$gpu_arch_codes" ]]; then + echo "Host machine has GPUs with architecture codes: $gpu_arch_codes" + echo "Configuring FlexFlow to build for the $gpu_arch_codes code(s)." + FF_CUDA_ARCH="${gpu_arch_codes}" + export FF_CUDA_ARCH + else + echo "FF_CUDA_ARCH is set to 'autodetect', but the host machine does not have any compatible GPUs." + exit 1 + fi fi fi @@ -121,4 +163,4 @@ fi # Set value of BUILD_CONFIGS get_build_configs -docker build --build-arg "N_BUILD_CORES=${n_build_cores}" --build-arg "FF_GPU_BACKEND=${FF_GPU_BACKEND}" --build-arg "BUILD_CONFIGS=${BUILD_CONFIGS}" --build-arg "cuda_version=${cuda_version}" -t "flexflow-${FF_GPU_BACKEND}${cuda_version}" -f docker/flexflow/Dockerfile . +docker build --build-arg "N_BUILD_CORES=${n_build_cores}" --build-arg "FF_GPU_BACKEND=${FF_GPU_BACKEND}" --build-arg "BUILD_CONFIGS=${BUILD_CONFIGS}" --build-arg "gpu_backend_version=${gpu_backend_version}" -t "flexflow-${FF_GPU_BACKEND}${gpu_backend_version}" -f docker/flexflow/Dockerfile . diff --git a/docker/flexflow-environment/Dockerfile b/docker/flexflow-environment/Dockerfile index 7132276afe..524f179e7a 100644 --- a/docker/flexflow-environment/Dockerfile +++ b/docker/flexflow-environment/Dockerfile @@ -1,12 +1,11 @@ -ARG cuda_version -FROM nvidia/cuda:${cuda_version}-cudnn8-devel-ubuntu20.04 -ARG python_version +ARG ff_environment_base_image +FROM ${ff_environment_base_image} LABEL org.opencontainers.image.source=https://github.com/flexflow/FlexFlow LABEL org.opencontainers.image.description="FlexFlow environment container" # Install basic dependencies -RUN apt-get update && apt-get install -y --no-install-recommends wget sudo binutils git zlib1g-dev lsb-release nano libhdf5-dev && \ +RUN apt-get update && apt-get install -y --no-install-recommends wget sudo binutils git zlib1g-dev lsb-release nano gdb libhdf5-dev && \ rm -rf /var/lib/apt/lists/* /etc/apt/sources.list.d/cuda.list /etc/apt/sources.list.d/nvidia-ml.list && \ apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends software-properties-common && \ apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends build-essential apt-utils \ @@ -17,6 +16,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends wget sudo binut apt-get upgrade -y libstdc++6 # Install Python3 with Miniconda +ARG python_version RUN echo "current python version is ${python_version}" RUN echo "downloading python from miniconda" RUN if [ "${python_version}" = "3.8" ]; then \ @@ -81,13 +81,39 @@ RUN if [ "${python_version}" = "3.8" ]; then \ # in the container. It also attempts to install packages for a graphical install. # For our container, we don't need `hip-runtime-nvidia` ARG FF_GPU_BACKEND "cuda" +ARG hip_version "5.6" +ARG N_BUILD_CORES +# set MAKEFLAGS to speedup any dependency that uses make +ENV MAKEFLAGS "${MAKEFLAGS} -j${N_BUILD_CORES}" + RUN if [ "$FF_GPU_BACKEND" = "hip_cuda" ] || [ "$FF_GPU_BACKEND" = "hip_rocm" ]; then \ echo "FF_GPU_BACKEND: ${FF_GPU_BACKEND}. 
Installing HIP dependencies"; \ - wget https://repo.radeon.com/amdgpu-install/22.20.5/ubuntu/bionic/amdgpu-install_22.20.50205-1_all.deb; \ - apt-get install -y ./amdgpu-install_22.20.50205-1_all.deb; \ - rm ./amdgpu-install_22.20.50205-1_all.deb; \ + # Check that hip_version is one of 5.3,5.4,5.5,5.6 + if [ "$hip_version" != "5.3" ] && [ "$hip_version" != "5.4" ] && [ "$hip_version" != "5.5" ] && [ "$hip_version" != "5.6" ]; then \ + echo "hip_version '${hip_version}' is not supported, please choose among {5.3, 5.4, 5.5, 5.6}"; \ + exit 1; \ + fi; \ + # Compute script name and url given the version + AMD_GPU_SCRIPT_NAME=amdgpu-install_5.6.50600-1_all.deb; \ + if [ "$hip_version" = "5.3" ]; then \ + AMD_GPU_SCRIPT_NAME=amdgpu-install_5.3.50300-1_all.deb; \ + elif [ "$hip_version" = "5.4" ]; then \ + AMD_GPU_SCRIPT_NAME=amdgpu-install_5.4.50400-1_all.deb; \ + elif [ "$hip_version" = "5.5" ]; then \ + AMD_GPU_SCRIPT_NAME=amdgpu-install_5.5.50500-1_all.deb; \ + fi; \ + AMD_GPU_SCRIPT_URL="https://repo.radeon.com/amdgpu-install/${hip_version}/ubuntu/focal/${AMD_GPU_SCRIPT_NAME}"; \ + # Download and install AMD GPU software with ROCM and HIP support + wget $AMD_GPU_SCRIPT_URL; \ + apt-get install -y ./${AMD_GPU_SCRIPT_NAME}; \ + rm ./${AMD_GPU_SCRIPT_NAME}; \ amdgpu-install -y --usecase=hip,rocm --no-dkms; \ - apt-get install -y hip-dev hipblas miopen-hip rocm-hip-sdk; \ + apt-get install -y hip-dev hipblas miopen-hip rocm-hip-sdk rocm-device-libs; \ + # Install protobuf v3.20.x manually + apt-get update -y && sudo apt-get install -y pkg-config zip g++ zlib1g-dev autoconf automake libtool make; \ + git clone -b 3.20.x https://github.com/protocolbuffers/protobuf.git; cd protobuf/ ; git submodule update --init --recursive; \ + ./autogen.sh; ./configure; cores_available=$(nproc --all); n_build_cores=$(( cores_available -1 )); \ + if (( n_build_cores < 1 )) ; then n_build_cores=1 ; fi; make -j $n_build_cores; make install; ldconfig; cd .. ; \ else \ echo "FF_GPU_BACKEND: ${FF_GPU_BACKEND}. 
Skipping installing HIP dependencies"; \ fi @@ -102,7 +128,11 @@ ENV CUDA_DIR /usr/local/cuda RUN conda install -c conda-forge cmake make pillow cmake-build-extension pybind11 numpy pandas keras-preprocessing # Install CPU-only Pytorch and related dependencies RUN conda install pytorch torchvision torchaudio cpuonly -c pytorch -RUN conda install -c conda-forge onnx transformers sentencepiece +RUN conda install -c conda-forge onnx transformers>=4.31.0 sentencepiece einops RUN pip3 install tensorflow -ENTRYPOINT ["/bin/bash"] \ No newline at end of file +# Install Rust +RUN curl https://sh.rustup.rs -sSf | sh -s -- -y +ENV PATH /root/.cargo/bin:$PATH + +ENTRYPOINT ["/bin/bash"] diff --git a/docker/flexflow/Dockerfile b/docker/flexflow/Dockerfile index d25ede4b3b..ba592e2626 100644 --- a/docker/flexflow/Dockerfile +++ b/docker/flexflow/Dockerfile @@ -1,6 +1,6 @@ ARG FF_GPU_BACKEND "cuda" -ARG cuda_version "" -FROM flexflow-environment-$FF_GPU_BACKEND$cuda_version:latest +ARG gpu_backend_version "" +FROM flexflow-environment-$FF_GPU_BACKEND$gpu_backend_version:latest LABEL org.opencontainers.image.source=https://github.com/flexflow/FlexFlow LABEL org.opencontainers.image.description="FlexFlow container" diff --git a/docker/publish.sh b/docker/publish.sh index b8668d3c0e..c70419a9cc 100755 --- a/docker/publish.sh +++ b/docker/publish.sh @@ -2,7 +2,7 @@ set -euo pipefail # Usage: ./publish.sh -# Optional environment variables: FF_GPU_BACKEND, cuda_version +# Optional environment variables: FF_GPU_BACKEND, cuda_version, hip_version # Cd into directory holding this script cd "${BASH_SOURCE[0]%/*}" @@ -11,6 +11,7 @@ cd "${BASH_SOURCE[0]%/*}" image=${1:-flexflow} FF_GPU_BACKEND=${FF_GPU_BACKEND:-cuda} cuda_version=${cuda_version:-"empty"} +hip_version=${hip_version:-"empty"} # Check docker image name if [[ "${image}" != @(flexflow-environment|flexflow) ]]; then @@ -18,6 +19,9 @@ if [[ "${image}" != @(flexflow-environment|flexflow) ]]; then exit 1 fi +# gpu backend version suffix for the docker image. +gpu_backend_version="" + # Check GPU backend if [[ "${FF_GPU_BACKEND}" != @(cuda|hip_cuda|hip_rocm|intel) ]]; then echo "Error, value of FF_GPU_BACKEND (${FF_GPU_BACKEND}) is invalid. Pick between 'cuda', 'hip_cuda', 'hip_rocm' or 'intel'." @@ -31,25 +35,50 @@ fi if [[ "${FF_GPU_BACKEND}" == "cuda" || "${FF_GPU_BACKEND}" == "hip_cuda" ]]; then # Autodetect cuda version if not specified if [[ $cuda_version == "empty" ]]; then - cuda_version=$(command -v nvcc >/dev/null 2>&1 && nvcc --version | grep "release" | awk '{print $NF}') + # shellcheck disable=SC2015 + cuda_version=$(command -v nvcc >/dev/null 2>&1 && nvcc --version | grep "release" | awk '{print $NF}' || true) # Change cuda_version eg. V11.7.99 to 11.7 cuda_version=${cuda_version:1:4} + if [[ -z "$cuda_version" ]]; then + echo "Could not detect CUDA version. Please specify one manually by setting the 'cuda_version' env." 
+ exit 1 + fi fi # Check that CUDA version is supported - if [[ "$cuda_version" != @(11.1|11.3|11.7|11.2|11.5|11.6|11.8) ]]; then - echo "cuda_version is not supported, please choose among {11.1|11.2|11.3|11.5|11.6|11.7|11.8}" + if [[ "$cuda_version" != @(11.1|11.2|11.3|11.4|11.5|11.6|11.7|11.8|12.0|12.1|12.2) ]]; then + echo "cuda_version is not supported, please choose among {11.1|11.2|11.3|11.4|11.5|11.6|11.7|11.8|12.0|12.1|12.2}" exit 1 fi # Set cuda version suffix to docker image name echo "Publishing $image docker image with CUDA $cuda_version" - cuda_version="-${cuda_version}" -else - # Empty cuda version suffix for non-CUDA images - cuda_version="" + gpu_backend_version="-${cuda_version}" +fi + +if [[ "${FF_GPU_BACKEND}" == "hip_rocm" || "${FF_GPU_BACKEND}" == "hip_cuda" ]]; then + # Autodetect HIP version if not specified + if [[ $hip_version == "empty" ]]; then + # shellcheck disable=SC2015 + hip_version=$(command -v hipcc >/dev/null 2>&1 && hipcc --version | grep "HIP version:" | awk '{print $NF}' || true) + # Change hip_version eg. 5.6.31061-8c743ae5d to 5.6 + hip_version=${hip_version:0:3} + if [[ -z "$hip_version" ]]; then + echo "Could not detect HIP version. Please specify one manually by setting the 'hip_version' env." + exit 1 + fi + fi + # Check that HIP version is supported + if [[ "$hip_version" != @(5.3|5.4|5.5|5.6) ]]; then + echo "hip_version is not supported, please choose among {5.3, 5.4, 5.5, 5.6}" + exit 1 + fi + echo "Pubilishing $image docker image with HIP $hip_version" + if [[ "${FF_GPU_BACKEND}" == "hip_rocm" ]]; then + gpu_backend_version="-${hip_version}" + fi fi # Check that image exists -docker image inspect "${image}-${FF_GPU_BACKEND}${cuda_version}":latest > /dev/null +docker image inspect "${image}-${FF_GPU_BACKEND}${gpu_backend_version}":latest > /dev/null # Log into container registry FLEXFLOW_CONTAINER_TOKEN=${FLEXFLOW_CONTAINER_TOKEN:-} @@ -59,8 +88,8 @@ echo "$FLEXFLOW_CONTAINER_TOKEN" | docker login ghcr.io -u flexflow --password-s # Tag image to be uploaded git_sha=${GITHUB_SHA:-$(git rev-parse HEAD)} if [ -z "$git_sha" ]; then echo "Commit hash cannot be detected, cannot publish the docker image to ghrc.io"; exit; fi -docker tag "${image}-${FF_GPU_BACKEND}${cuda_version}":latest ghcr.io/flexflow/"${image}-${FF_GPU_BACKEND}${cuda_version}":latest +docker tag "${image}-${FF_GPU_BACKEND}${gpu_backend_version}":latest ghcr.io/flexflow/"${image}-${FF_GPU_BACKEND}${gpu_backend_version}":latest # Upload image -docker push ghcr.io/flexflow/"${image}-${FF_GPU_BACKEND}${cuda_version}":latest +docker push ghcr.io/flexflow/"${image}-${FF_GPU_BACKEND}${gpu_backend_version}":latest diff --git a/docker/pull.sh b/docker/pull.sh index f8624a1072..e5b6f26f3c 100755 --- a/docker/pull.sh +++ b/docker/pull.sh @@ -2,7 +2,7 @@ set -euo pipefail # Usage: ./pull.sh -# Optional environment variables: FF_GPU_BACKEND, cuda_version +# Optional environment variables: FF_GPU_BACKEND, cuda_version, hip_version # Cd into directory holding this script cd "${BASH_SOURCE[0]%/*}" @@ -11,6 +11,7 @@ cd "${BASH_SOURCE[0]%/*}" image=${1:-flexflow} FF_GPU_BACKEND=${FF_GPU_BACKEND:-cuda} cuda_version=${cuda_version:-"empty"} +hip_version=${hip_version:-"empty"} # Check docker image name if [[ "${image}" != @(flexflow-environment|flexflow) ]]; then @@ -28,31 +29,63 @@ else echo "Downloading $image docker image with default GPU backend: cuda" fi +# gpu backend version suffix for the docker image. 
+gpu_backend_version="" + if [[ "${FF_GPU_BACKEND}" == "cuda" || "${FF_GPU_BACKEND}" == "hip_cuda" ]]; then # Autodetect cuda version if not specified if [[ $cuda_version == "empty" ]]; then - cuda_version=$(command -v nvcc >/dev/null 2>&1 && nvcc --version | grep "release" | awk '{print $NF}') + # shellcheck disable=SC2015 + cuda_version=$(command -v nvcc >/dev/null 2>&1 && nvcc --version | grep "release" | awk '{print $NF}' || true) # Change cuda_version eg. V11.7.99 to 11.7 cuda_version=${cuda_version:1:4} + if [[ -z "$cuda_version" ]]; then + echo "Could not detect CUDA version. Please specify one manually by setting the 'cuda_version' env." + exit 1 + fi fi # Check that CUDA version is supported - if [[ "$cuda_version" != @(11.1|11.3|11.7|11.2|11.5|11.6|11.8) ]]; then - echo "cuda_version is not supported, please choose among {11.1|11.2|11.3|11.5|11.6|11.7|11.8}" + if [[ "$cuda_version" != @(11.1|11.2|11.3|11.4|11.5|11.6|11.7|11.8|12.0|12.1|12.2) ]]; then + echo "cuda_version is not supported, please choose among {11.1|11.2|11.3|11.4|11.5|11.6|11.7|11.8|12.0|12.1|12.2}" exit 1 fi + # Use CUDA 12.0 for all versions greater or equal to 12.0 for now + if [[ "$cuda_version" == @(12.1|12.2|12.3|12.4|12.5|12.6|12.7|12.8|12.9) ]]; then + cuda_version=12.0 + fi # Set cuda version suffix to docker image name echo "Downloading $image docker image with CUDA $cuda_version" - cuda_version="-${cuda_version}" -else - # Empty cuda version suffix for non-CUDA images - cuda_version="" + gpu_backend_version="-${cuda_version}" +fi + +if [[ "${FF_GPU_BACKEND}" == "hip_rocm" || "${FF_GPU_BACKEND}" == "hip_cuda" ]]; then + # Autodetect HIP version if not specified + if [[ $hip_version == "empty" ]]; then + # shellcheck disable=SC2015 + hip_version=$(command -v hipcc >/dev/null 2>&1 && hipcc --version | grep "HIP version:" | awk '{print $NF}' || true) + # Change hip_version eg. 5.6.31061-8c743ae5d to 5.6 + hip_version=${hip_version:0:3} + if [[ -z "$hip_version" ]]; then + echo "Could not detect HIP version. Please specify one manually by setting the 'hip_version' env." 
+ exit 1 + fi + fi + # Check that HIP version is supported + if [[ "$hip_version" != @(5.3|5.4|5.5|5.6) ]]; then + echo "hip_version is not supported, please choose among {5.3, 5.4, 5.5, 5.6}" + exit 1 + fi + echo "Downloading $image docker image with HIP $hip_version" + if [[ "${FF_GPU_BACKEND}" == "hip_rocm" ]]; then + gpu_backend_version="-${hip_version}" + fi fi # Download image -docker pull ghcr.io/flexflow/"$image-${FF_GPU_BACKEND}${cuda_version}" +docker pull ghcr.io/flexflow/"$image-${FF_GPU_BACKEND}${gpu_backend_version}" # Tag downloaded image -docker tag ghcr.io/flexflow/"$image-${FF_GPU_BACKEND}${cuda_version}":latest "$image-${FF_GPU_BACKEND}${cuda_version}":latest +docker tag ghcr.io/flexflow/"$image-${FF_GPU_BACKEND}${gpu_backend_version}":latest "$image-${FF_GPU_BACKEND}${gpu_backend_version}":latest # Check that image exists -docker image inspect "${image}-${FF_GPU_BACKEND}${cuda_version}":latest > /dev/null +docker image inspect "${image}-${FF_GPU_BACKEND}${gpu_backend_version}":latest > /dev/null diff --git a/docker/run.sh b/docker/run.sh index 43571a252b..76ec1e1ceb 100755 --- a/docker/run.sh +++ b/docker/run.sh @@ -2,7 +2,7 @@ set -euo pipefail # Usage: ./run.sh -# Optional environment variables: FF_GPU_BACKEND, cuda_version, ATTACH_GPUS, SHM_SIZE +# Optional environment variables: FF_GPU_BACKEND, cuda_version, hip_version, ATTACH_GPUS, SHM_SIZE # Cd into directory holding this script cd "${BASH_SOURCE[0]%/*}" @@ -11,13 +11,16 @@ cd "${BASH_SOURCE[0]%/*}" image=${1:-flexflow} FF_GPU_BACKEND=${FF_GPU_BACKEND:-cuda} cuda_version=${cuda_version:-"empty"} -detached=${detached:-"OFF"} +hip_version=${hip_version:-"empty"} # Parameter controlling whether to attach GPUs to the Docker container ATTACH_GPUS=${ATTACH_GPUS:-true} gpu_arg="" if $ATTACH_GPUS ; then gpu_arg="--gpus all" ; fi +# Whether to attach inference weights / files (make sure to download the weights first) +ATTACH_INFERENCE_FILES=${ATTACH_INFERENCE_FILES:-false} + # Amount of shared memory to give the Docker container access to # If you get a Bus Error, increase this value. If you don't have enough memory # on your machine, decrease this value. @@ -39,36 +42,82 @@ else echo "Running $image docker image with default GPU backend: cuda" fi +# gpu backend version suffix for the docker image. +gpu_backend_version="" + if [[ "${FF_GPU_BACKEND}" == "cuda" || "${FF_GPU_BACKEND}" == "hip_cuda" ]]; then # Autodetect cuda version if not specified if [[ $cuda_version == "empty" ]]; then - cuda_version=$(command -v nvcc >/dev/null 2>&1 && nvcc --version | grep "release" | awk '{print $NF}') + # shellcheck disable=SC2015 + cuda_version=$(command -v nvcc >/dev/null 2>&1 && nvcc --version | grep "release" | awk '{print $NF}' || true) # Change cuda_version eg. V11.7.99 to 11.7 cuda_version=${cuda_version:1:4} + if [[ -z "$cuda_version" ]]; then + echo "Could not detect CUDA version. Please specify one manually by setting the 'cuda_version' env." 
+ exit 1 + fi fi # Check that CUDA version is supported - if [[ "$cuda_version" != @(11.1|11.3|11.7|11.2|11.5|11.6|11.8) ]]; then - echo "cuda_version is not supported, please choose among {11.1|11.2|11.3|11.5|11.6|11.7|11.8}" + if [[ "$cuda_version" != @(11.1|11.2|11.3|11.4|11.5|11.6|11.7|11.8|12.0|12.1|12.2) ]]; then + echo "cuda_version is not supported, please choose among {11.1|11.2|11.3|11.4|11.5|11.6|11.7|11.8|12.0|12.1|12.2}" exit 1 fi + # Use CUDA 12.0 for all versions greater or equal to 12.0 for now + if [[ "$cuda_version" == @(12.1|12.2|12.3|12.4|12.5|12.6|12.7|12.8|12.9) ]]; then + cuda_version=12.0 + fi # Set cuda version suffix to docker image name echo "Running $image docker image with CUDA $cuda_version" - cuda_version_hyphen="-${cuda_version}" -else - # Empty cuda version suffix for non-CUDA images - cuda_version_hyphen="" + gpu_backend_version="-${cuda_version}" +fi + +if [[ "${FF_GPU_BACKEND}" == "hip_rocm" || "${FF_GPU_BACKEND}" == "hip_cuda" ]]; then + # Autodetect HIP version if not specified + if [[ $hip_version == "empty" ]]; then + # shellcheck disable=SC2015 + hip_version=$(command -v hipcc >/dev/null 2>&1 && hipcc --version | grep "HIP version:" | awk '{print $NF}' || true) + # Change hip_version eg. 5.6.31061-8c743ae5d to 5.6 + hip_version=${hip_version:0:3} + if [[ -z "$hip_version" ]]; then + echo "Could not detect HIP version. Please specify one manually by setting the 'hip_version' env." + exit 1 + fi + fi + # Check that HIP version is supported + if [[ "$hip_version" != @(5.3|5.4|5.5|5.6) ]]; then + echo "hip_version is not supported, please choose among {5.3, 5.4, 5.5, 5.6}" + exit 1 + fi + echo "Running $image docker image with HIP $hip_version" + if [[ "${FF_GPU_BACKEND}" == "hip_rocm" ]]; then + gpu_backend_version="-${hip_version}" + fi fi # Check that image exists, if fails, print the default error message. -if [[ "$(docker images -q "$image"-"$FF_GPU_BACKEND""$cuda_version_hyphen":latest 2> /dev/null)" == "" ]]; then - echo "" - echo "To download the docker image, run:" - echo " FF_GPU_BACKEND=${FF_GPU_BACKEND} cuda_version=${cuda_version} $(pwd)/pull.sh $image" - echo "To build the docker image from source, run:" - echo " FF_GPU_BACKEND=${FF_GPU_BACKEND} cuda_version=${cuda_version} $(pwd)/build.sh $image" - echo "" +if [[ "$(docker images -q "${image}-${FF_GPU_BACKEND}${gpu_backend_version}":latest 2> /dev/null)" == "" ]]; then + echo "Error, ${image}-${FF_GPU_BACKEND}${gpu_backend_version}:latest does not exist!" 
+ if [[ "${FF_GPU_BACKEND}" == "cuda" ]]; then + echo "" + echo "To download the docker image, run:" + echo " FF_GPU_BACKEND=${FF_GPU_BACKEND} cuda_version=${cuda_version} $(pwd)/pull.sh $image" + echo "To build the docker image from source, run:" + echo " FF_GPU_BACKEND=${FF_GPU_BACKEND} cuda_version=${cuda_version} $(pwd)/build.sh $image" + echo "" + elif [[ "${FF_GPU_BACKEND}" == "hip_rocm" ]]; then + echo "" + echo "To download the docker image, run:" + echo " FF_GPU_BACKEND=${FF_GPU_BACKEND} hip_version=${hip_version} $(pwd)/pull.sh $image" + echo "To build the docker image from source, run:" + echo " FF_GPU_BACKEND=${FF_GPU_BACKEND} hip_version=${hip_version} $(pwd)/build.sh $image" + echo "" + fi exit 1 fi +inference_volumes="" +if $ATTACH_INFERENCE_FILES ; then + inference_volumes="-v ~/.cache/flexflow:/usr/FlexFlow/inference"; +fi -eval docker run -it "$gpu_arg" "--shm-size=${SHM_SIZE}" "${image}-${FF_GPU_BACKEND}${cuda_version_hyphen}:latest" +eval docker run -it "$gpu_arg" "--shm-size=${SHM_SIZE}" "${inference_volumes}" "${image}-${FF_GPU_BACKEND}${gpu_backend_version}:latest" diff --git a/docs/Makefile b/docs/Makefile index 5424c5bc9f..d14c2ef91f 100644 --- a/docs/Makefile +++ b/docs/Makefile @@ -15,7 +15,7 @@ help: .PHONY: help Makefile clean clean: - rm -rf build source/_doxygen/ source/c++_api/ doxygen/output + rm -rf build doxygen/output doxygen/cpp_api @$(SPHINXBUILD) -M clean "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) # Catch-all target: route all unknown targets to Sphinx using the new diff --git a/docs/doxygen/Doxyfile b/docs/doxygen/Doxyfile index b38bfc12b5..aafa65d79b 100644 --- a/docs/doxygen/Doxyfile +++ b/docs/doxygen/Doxyfile @@ -44,7 +44,7 @@ PROJECT_NUMBER = # for a project that appears at the top of each page and should give viewer a # quick idea about the purpose of the project. Keep the description short. -PROJECT_BRIEF = A distributed deep learning framework that supports flexible parallelization strategies. +PROJECT_BRIEF = "A distributed deep learning framework that supports flexible parallelization strategies." # With the PROJECT_LOGO tag one can specify a logo or an icon that is included # in the documentation. The maximum height of the logo should not exceed 55 @@ -150,7 +150,7 @@ INLINE_INHERITED_MEMB = NO # shortest path that makes the file name unique will be used # The default value is: YES. -FULL_PATH_NAMES = YES +FULL_PATH_NAMES = NO # The STRIP_FROM_PATH tag can be used to strip a user-defined part of the path. # Stripping is only done if one of the specified strings matches the left-hand @@ -874,12 +874,7 @@ WARN_LOGFILE = # spaces. See also FILE_PATTERNS and EXTENSION_MAPPING # Note: If this tag is empty the current directory is searched. -INPUT = $(FF_HOME)/align -INPUT += $(FF_HOME)/bootcamp_demo -INPUT += $(FF_HOME)/examples INPUT += $(FF_HOME)/include -INPUT += $(FF_HOME)/nmt -INPUT += $(FF_HOME)/python INPUT += $(FF_HOME)/src # This tag can be used to specify the character encoding of the source files @@ -911,12 +906,10 @@ INPUT_ENCODING = UTF-8 FILE_PATTERNS = *.c \ *.cc \ - *.cpp \ *.cu \ + *.cpp \ *.h \ - *.hpp \ - *.md \ - *.py + *.hpp # The RECURSIVE tag can be used to specify whether or not subdirectories should # be searched for input files as well. @@ -2110,7 +2103,7 @@ MAN_LINKS = NO # captures the structure of the code including all documentation. # The default value is: NO. -GENERATE_XML = YES +GENERATE_XML = NO # The XML_OUTPUT tag is used to specify where the XML pages will be put. 
If a # relative path is entered the value of OUTPUT_DIRECTORY will be put in front of diff --git a/docs/source/conf.py b/docs/source/conf.py index 0e614f37c2..f67c0dae01 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -13,28 +13,42 @@ import os import sys import subprocess +import shutil +import sphinx # only needed for the manual post processing +from pathlib import Path +from m2r2 import convert +from docutils.core import publish_string +import re def get_parent_dir_path(path): return os.path.abspath(os.path.join(path, "..")) docs_path = get_parent_dir_path(os.path.dirname(os.path.abspath(__file__))) doxygen_path = os.path.join(docs_path, "doxygen") +doxygen_output = os.path.join(doxygen_path, "output") +doxygen_cpp_api_out = os.path.join(doxygen_path, "cpp_api") FF_HOME = get_parent_dir_path(docs_path) python_package_path = os.path.join(FF_HOME, "python") sys.path.insert(0, os.path.abspath(python_package_path)) # Build the Doxygen docs -#subprocess.call(f'cd {doxygen_path}; FF_HOME={FF_HOME} doxygen', shell=True) +shutil.rmtree(doxygen_cpp_api_out, ignore_errors=True) +for gpu_backend in ("cuda", "hip"): + doxygen_dest = os.path.join(doxygen_cpp_api_out, f"{gpu_backend}_api") + os.makedirs(doxygen_dest, exist_ok=True) + exclude_extension = ".cu" if gpu_backend == "hip" else ".cpp" + doxygen_cmd = f'export FF_HOME={FF_HOME}; ( cat Doxyfile ; echo "EXCLUDE_PATTERNS+=*{exclude_extension}" ) | doxygen -' + subprocess.check_call(doxygen_cmd, cwd=doxygen_path, shell=True) + subprocess.check_call(f'mv {os.path.join(doxygen_output, "html")}/* {doxygen_dest}/', shell=True) import sphinx_rtd_theme # -- Project information ----------------------------------------------------- project = 'FlexFlow' -copyright = '2020, Stanford, LANL, CMU, Facebook' -author = 'Stanford, LANL, CMU, Facebook' - +copyright = '2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical)' +author = 'CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical)' # -- General configuration --------------------------------------------------- @@ -45,8 +59,6 @@ def get_parent_dir_path(path): 'sphinx_rtd_theme', 'sphinx.ext.autodoc', 'm2r2', - 'breathe', - 'exhale', ] # Theme options are theme-specific and customize the look and feel of a theme @@ -55,6 +67,7 @@ def get_parent_dir_path(path): html_theme_options = { "collapse_navigation" : False } +html_extra_path = [doxygen_cpp_api_out] # Add any paths that contain templates here, relative to this directory. # templates_path = ['_templates'] @@ -86,27 +99,50 @@ def get_parent_dir_path(path): # so a file named "default.css" will overwrite the builtin "default.css". 
# html_static_path = ['_static'] -# Breathe + Exhale configuration -# Setup the breathe extension -breathe_projects = { - "FlexFlow": "./_doxygen/xml" -} -breathe_default_project = "FlexFlow" - -c_plus_plus_src_dirs = " ".join([f"\"{os.path.join(FF_HOME, 'src', dirname)}\"" for dirname in ("loss_functions", "mapper", "metrics_functions", "ops", "parallel_ops", "recompile", "runtime", "utils")]) -# Setup the exhale extension -exhale_args = { - # These arguments are required - "containmentFolder": "./c++_api", - "rootFileName": "c++_api_root.rst", - "doxygenStripFromPath": "..", - # Heavily encouraged optional argument (see docs) - #"rootFileTitle": "Library API", - # Suggested optional arguments - "createTreeView": True, - # TIP: if using the sphinx-bootstrap-theme, you need - # "treeViewIsBootstrap": True, - "exhaleExecutesDoxygen": True, - "exhaleDoxygenStdin": f'INPUT = {c_plus_plus_src_dirs}' -} +def manual_post_processing(app, exception): + if exception is None and app.builder.name == 'html': # build succeeded + print(f'Post-processing HTML docs at path {app.outdir}') + build_dir = Path(app.outdir) + + # List of subfolders to search + folder_paths = [build_dir, build_dir / 'developers_guide'] + + for folder_path in folder_paths: + + # Only get HTML files in build dir, not subfolders + html_files = folder_path.glob('*.html') + + for html_file in html_files: + content = html_file.read_text() + + # Find dropdown menus, and manually convert their contents + pattern = r'
<details>\n<summary>Expand here</summary>\n(.*?)</details>' + blocks = re.findall(pattern, content, re.DOTALL) + + for block in blocks: + # Convert Markdown to HTML + rst = convert(block, github_markdown=True) + html = publish_string(rst, writer_name='html') + html_str = html.decode('utf-8') + + # Replace block with converted HTML + content = content.replace(block, html_str) + + # Add space after dropdown menu block + content = content.replace('</details>', + '</details>\n<br>
') + + # Replace incorrect links + content = content.replace('href="../docker/README.md"', 'href="docker.html"') + content = content.replace('href="./TRAIN.md"', 'href="train_overview.html"') + content = content.replace('href="./SERVE.md"', 'href="serve_overview.html"') + content = content.replace('href="./docs/source/keras.rst"', 'href="keras.html"') + content = content.replace('href="./docs/source/onnx.rst"', 'href="onnx.html"') + + + html_file.write_text(content) + + +def setup(app): + app.connect('build-finished', manual_post_processing) diff --git a/docs/source/cpp_api.rst b/docs/source/cpp_api.rst new file mode 100644 index 0000000000..b5d39be62e --- /dev/null +++ b/docs/source/cpp_api.rst @@ -0,0 +1,10 @@ +************* +C++ API +************* + +The FlexFlow backend is at the core of FlexFlow Train and FlexFlow Serve. It is written entirely in C/C++ and CUDA/HIP. This section documents the API, which is generated by Doxygen and it is available at the following links: + +* `CUDA version <./cuda_api/index.html>`_ (default version) +* `HIP version <./hip_api/index.html>`_ + +The two versions only differ when it comes to the GPU kernels, so the great majority of the entries are identical. If you are unsure which version to use, take a look at the CUDA version. diff --git a/docs/source/developers_guide.rst b/docs/source/developers_guide/developers_guide.rst similarity index 64% rename from docs/source/developers_guide.rst rename to docs/source/developers_guide/developers_guide.rst index 107135fae4..a125e60460 100644 --- a/docs/source/developers_guide.rst +++ b/docs/source/developers_guide/developers_guide.rst @@ -2,5 +2,5 @@ Developers Guide ****************** -.. mdinclude:: ../../CONTRIBUTING.md +.. mdinclude:: ../../../CONTRIBUTING.md :start-line: 2 diff --git a/docs/source/developers_guide/ff_internals.rst b/docs/source/developers_guide/ff_internals.rst new file mode 100644 index 0000000000..15c0804255 --- /dev/null +++ b/docs/source/developers_guide/ff_internals.rst @@ -0,0 +1,6 @@ +******************* +FlexFlow Internals +******************* + +.. mdinclude:: internals.md + :start-line: 2 diff --git a/docs/source/developers_guide/internals.md b/docs/source/developers_guide/internals.md new file mode 100644 index 0000000000..243b14a174 --- /dev/null +++ b/docs/source/developers_guide/internals.md @@ -0,0 +1,15 @@ +# FlexFlow Internals + +## The Parallel Computation Graph (PCG) + +FlexFlow uses a _Parallel Computation Graph (PCG)_ to simultaneously represent tensor operations, as well as parallelism choices and data movement across nodes. + +### Tensor representations + +There are two types of tensor representations in FlexFlow: a [Tensor](./cuda_api/de/da9/structFlexFlow_1_1TensorBase.html) and a [ParallelTensor](./cuda_api/d3/dfc/structFlexFlow_1_1ParallelTensorBase.html). The first variant is used when writing a FlexFlow DNN program, whereas the second is used by the runtime to run all the computations in a distributed fashion. `Tensor` and `ParallelTensor` are implemented as typedef-ed pointers to, respectively, the `TensorBase` (defined in `include/flexflow/tensor.h`) and `ParallelTensorBase` (defined in `include/flexflow/parallel_tensor.h`) structs. + +The `ParallelTensor` struct contains all the information that a `Tensor` also stores, but in addition, it also codifies how the tensor should be parallelized. 
For instance, a ParallelTensor records how each dimension is *partitioned*, how many *replicas* of the tensors have been created, and the *mapping* between the partitions of the tensors and the physical machines that will store them. + +## Transformation generation + +## Joint optimization diff --git a/docs/source/docker.rst b/docs/source/docker.rst index 4a457a8dcc..63f84e460c 100644 --- a/docs/source/docker.rst +++ b/docs/source/docker.rst @@ -1,3 +1,4 @@ +:tocdepth: 1 ************* Docker ************* diff --git a/docs/source/index.rst b/docs/source/index.rst index 7af62e417e..a7ea2ff3ac 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -7,47 +7,38 @@ Welcome to FlexFlow's documentation! ==================================== .. toctree:: - :maxdepth: 2 :caption: Getting Started welcome installation docker - jupyter + multinode .. toctree:: - :maxdepth: 2 - :caption: Interoperability + :caption: FlexFlow Serve - keras - pytorch - onnx + serve_overview .. toctree:: - :maxdepth: 2 - :caption: Examples - - mt5 + :caption: FlexFlow Train -.. toctree:: - :maxdepth: 3 - :caption: Python API + train_overview + train_interface + train_examples - python/models - python/layers - python/dataloader + train_python_api .. toctree:: - :maxdepth: 2 - :caption: C++ API + :caption: FlexFlow Backend - c++_api/c++_api_root + cpp_api .. toctree:: - :maxdepth: 2 + :maxdepth: 3 :caption: Developers Guide - developers_guide + developers_guide/developers_guide.rst +.. developers_guide/ff_internals.rst .. Indices and tables diff --git a/docs/source/installation.rst b/docs/source/installation.rst index 109b546834..95ec8596e6 100644 --- a/docs/source/installation.rst +++ b/docs/source/installation.rst @@ -1,5 +1,6 @@ +:tocdepth: 1 ************* -Installing FlexFlow +Building from source ************* .. mdinclude:: ../../INSTALL.md diff --git a/docs/source/keras.rst b/docs/source/keras.rst index eb4f2d7fa7..f1c0743c70 100644 --- a/docs/source/keras.rst +++ b/docs/source/keras.rst @@ -1,6 +1,7 @@ -************* -Keras Support -************* +:tocdepth: 1 +**************** +Keras Interface +**************** FlexFlow provides a drop-in replacement for TensorFlow Keras. Running an existing Keras program on the FlexFlow backend only requires a few lines of changes to the program. The detailed instructions are as follows: diff --git a/docs/source/mt5.rst b/docs/source/mt5.rst index c9c3af080a..8a632b90d6 100644 --- a/docs/source/mt5.rst +++ b/docs/source/mt5.rst @@ -1,6 +1,6 @@ -**************** -HuggingFace mT5 -**************** +************************ +mT5 Model +************************ .. mdinclude:: ../../examples/python/pytorch/mt5/README.md :start-line: 2 diff --git a/docs/source/multinode.rst b/docs/source/multinode.rst new file mode 100644 index 0000000000..8827200582 --- /dev/null +++ b/docs/source/multinode.rst @@ -0,0 +1,8 @@ +:tocdepth: 1 +****************** +Multinode tutorial +****************** + + +.. 
mdinclude:: ../../MULTI-NODE.md + :start-line: 3 diff --git a/docs/source/onnx.rst b/docs/source/onnx.rst index 91b314ac96..b6bc49b146 100644 --- a/docs/source/onnx.rst +++ b/docs/source/onnx.rst @@ -1,3 +1,4 @@ +:tocdepth: 1 ************* ONNX Support ************* diff --git a/docs/source/pytorch.rst b/docs/source/pytorch.rst index a6d4e23311..3dbe337d55 100644 --- a/docs/source/pytorch.rst +++ b/docs/source/pytorch.rst @@ -1,6 +1,7 @@ -*************** -PyTorch Support -*************** +:tocdepth: 1 +****************** +PyTorch Interface +****************** Users can use FlexFlow to optimize the parallelization performance of existing PyTorch models in two steps. The PyTorch support requires the `PyTorch FX module `_, so make sure your PyTorch is up to date. diff --git a/docs/source/serve_overview.rst b/docs/source/serve_overview.rst new file mode 100644 index 0000000000..35c992a853 --- /dev/null +++ b/docs/source/serve_overview.rst @@ -0,0 +1,7 @@ +:tocdepth: 1 +************* +Serving Overview +************* + +.. mdinclude:: ../../SERVE.md + :start-line: 3 diff --git a/docs/source/train_examples.rst b/docs/source/train_examples.rst new file mode 100644 index 0000000000..84d58c3465 --- /dev/null +++ b/docs/source/train_examples.rst @@ -0,0 +1,6 @@ +************* +Training Examples +************* + +.. toctree:: + mt5 \ No newline at end of file diff --git a/docs/source/train_interface.rst b/docs/source/train_interface.rst new file mode 100644 index 0000000000..ce81fc1f3c --- /dev/null +++ b/docs/source/train_interface.rst @@ -0,0 +1,8 @@ +******************* +Training Interface +******************* + +.. toctree:: + keras + pytorch + onnx \ No newline at end of file diff --git a/docs/source/train_overview.rst b/docs/source/train_overview.rst new file mode 100644 index 0000000000..58898ad35c --- /dev/null +++ b/docs/source/train_overview.rst @@ -0,0 +1,7 @@ +:tocdepth: 1 +************* +Training Overview +************* + +.. mdinclude:: ../../TRAIN.md + :start-line: 3 diff --git a/docs/source/train_python_api.rst b/docs/source/train_python_api.rst new file mode 100644 index 0000000000..40451dedf9 --- /dev/null +++ b/docs/source/train_python_api.rst @@ -0,0 +1,11 @@ +******************* +Python API +******************* +This section documents the Python API for FlexFlow Train. + +.. 
toctree:: + :maxdepth: 3 + + python/models + python/layers + python/dataloader \ No newline at end of file diff --git a/docs/source/welcome.rst b/docs/source/welcome.rst index 8108b1dd67..7f73f15563 100644 --- a/docs/source/welcome.rst +++ b/docs/source/welcome.rst @@ -1,3 +1,4 @@ +:tocdepth: 1 ************* Overview ************* diff --git a/img/overview.png b/img/overview.png new file mode 100644 index 0000000000..5264e2d41a Binary files /dev/null and b/img/overview.png differ diff --git a/img/performance.png b/img/performance.png new file mode 100644 index 0000000000..668e579197 Binary files /dev/null and b/img/performance.png differ diff --git a/img/spec_infer_demo.gif b/img/spec_infer_demo.gif new file mode 100644 index 0000000000..c0fda87b71 Binary files /dev/null and b/img/spec_infer_demo.gif differ diff --git a/include/flexflow/accessor.h b/include/flexflow/accessor.h index 6f95354823..65ab33b513 100644 --- a/include/flexflow/accessor.h +++ b/include/flexflow/accessor.h @@ -61,6 +61,7 @@ class GenericTensorAccessorW { float *get_float_ptr() const; double *get_double_ptr() const; half *get_half_ptr() const; + char *get_byte_ptr() const; DataType data_type; Legion::Domain domain; void *ptr; @@ -79,6 +80,7 @@ class GenericTensorAccessorR { float const *get_float_ptr() const; double const *get_double_ptr() const; half const *get_half_ptr() const; + char const *get_byte_ptr() const; DataType data_type; Legion::Domain domain; void const *ptr; diff --git a/include/flexflow/batch_config.h b/include/flexflow/batch_config.h new file mode 100644 index 0000000000..ce331d3e41 --- /dev/null +++ b/include/flexflow/batch_config.h @@ -0,0 +1,149 @@ +/* Copyright 2023 CMU, Stanford, Facebook, LANL + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#pragma once + +#include "flexflow/ffconst.h" +#include "legion.h" +#include +#include + +// #define MAX_SEQ_LEN 1024 +// #define BATCH_SIZE 2 +// #define BATCH_SIZE 16 +// #define MAX_REQUESTS 256 + +namespace FlexFlow { + +class InferenceResult; +class BeamInferenceResult; + +using BatchConfigFuture = Legion::Future; +using InferenceResultFuture = Legion::Future; +using BeamSearchBatchConfigFuture = Legion::Future; +using TreeVerifyBatchConfigFuture = Legion::Future; +using BeamInferenceResultFuture = Legion::Future; + +class BatchConfig { +public: + using RequestGuid = size_t; + using TokenId = int; + BatchConfig(); + int num_active_requests() const; + int num_active_tokens() const; + void print() const; + virtual InferenceMode get_mode() const; + static BatchConfig const *from_future(BatchConfigFuture const &future); + static int const MAX_NUM_REQUESTS = 1; + static int const MAX_NUM_TOKENS = 64; + static int const MAX_PROMPT_LENGTH = 62; + static int const MAX_SEQ_LENGTH = 256; + + // These are set by update + int num_tokens; + + struct PerRequestInfo { + int token_start_offset; + int num_tokens_in_batch; + int max_sequence_length; + RequestGuid request_guid; + }; + struct PerTokenInfo { + int abs_depth_in_request; + int request_index; + TokenId token_id; + }; + PerRequestInfo requestsInfo[MAX_NUM_REQUESTS]; + PerTokenInfo tokensInfo[MAX_NUM_TOKENS]; + + bool request_completed[MAX_NUM_REQUESTS]; +}; + +class TreeVerifyBatchConfig : public BatchConfig { +public: + TreeVerifyBatchConfig(); + ~TreeVerifyBatchConfig(); + InferenceMode get_mode() const; + void print() const; + struct CommittedTokensInfo { + int token_index; // the index of the token in the previous batch + int request_index; // request index in the batch + int token_depth; // position of the token in the request's sequence + }; + + int num_tokens_to_commit; + CommittedTokensInfo committed_tokens[MAX_NUM_TOKENS]; +}; + +struct InferenceResult { + static int const MAX_NUM_TOKENS = BatchConfig::MAX_NUM_TOKENS; + BatchConfig::TokenId token_ids[MAX_NUM_TOKENS]; +}; + +class BeamSearchBatchConfig : public BatchConfig { +public: + BeamSearchBatchConfig(); + BeamSearchBatchConfig(int model_id); + BeamSearchBatchConfig(size_t beam_width, size_t target_iterations); + BeamSearchBatchConfig(BeamSearchBatchConfig const &other, int model_id); + InferenceMode get_mode() const; + + ~BeamSearchBatchConfig(); + + void print() const; + bool done() const; + int max_beam_depth_all_requests() const; + int current_depth_all_requests() const; + + size_t beam_width; + size_t target_iterations; + inline static int const MAX_BEAM_WIDTH = 1; + inline static int const MAX_BEAM_DEPTH = 8; + + int model_id; + int max_init_length = 0; + + struct BeamSearchPerRequestInfo { + int beam_size; + int current_depth = -1; + int max_depth = MAX_BEAM_DEPTH; + + BatchConfig::TokenId tokens[BeamSearchBatchConfig::MAX_BEAM_WIDTH]; + float probs[BeamSearchBatchConfig::MAX_BEAM_WIDTH]; + int parent_id[BeamSearchBatchConfig::MAX_BEAM_WIDTH]; + }; + + struct BeamSearchPerTokenInfo { + int sub_request_index; + }; + + BeamSearchPerRequestInfo beamRequestsInfo[MAX_NUM_REQUESTS]; + BeamSearchPerTokenInfo beamTokenInfo[MAX_NUM_TOKENS * MAX_BEAM_WIDTH]; + // why is this == MAX_NUM_REQUESTS * MAX_BEAM_WIDTH? 
+ int sub_requests[MAX_NUM_REQUESTS * MAX_BEAM_WIDTH]; + +private: + size_t current_iteration; +}; + +struct BeamInferenceResult { + static int const MAX_NUM_TOKENS = BatchConfig::MAX_NUM_TOKENS; + BatchConfig::TokenId + token_ids[MAX_NUM_TOKENS * BeamSearchBatchConfig::MAX_BEAM_WIDTH]; + float probs[MAX_NUM_TOKENS * BeamSearchBatchConfig::MAX_BEAM_WIDTH]; + int parent_id[MAX_NUM_TOKENS * BeamSearchBatchConfig::MAX_BEAM_WIDTH]; +}; + +}; // namespace FlexFlow diff --git a/include/flexflow/config.h b/include/flexflow/config.h index d82b1377c7..be6c0d21da 100644 --- a/include/flexflow/config.h +++ b/include/flexflow/config.h @@ -37,14 +37,15 @@ namespace FlexFlow { // ======================================================== // Define Runtime Constants // ======================================================== -#define MAX_NUM_INPUTS 256 -#define MAX_NUM_WEIGHTS 64 -#define MAX_NUM_OUTPUTS 256 -#define MAX_NUM_FUSED_OPERATORS 64 -#define MAX_NUM_FUSED_TENSORS 64 +#define MAX_NUM_INPUTS 2048 +#define MAX_NUM_WEIGHTS 2048 +#define MAX_NUM_OUTPUTS 2048 +#define MAX_NUM_FUSED_OPERATORS 2048 +#define MAX_NUM_FUSED_TENSORS 2048 #define MAX_NUM_WORKERS 1024 #define MAX_FILENAME 200 #define MAX_OPNAME 128 +#define MAX_NUM_TRANSFORMER_LAYERS 100 // DataLoader #define MAX_SAMPLES_PER_LOAD 64 #define MAX_FILE_LENGTH 128 @@ -70,6 +71,9 @@ struct FFHandler { #endif void *workSpace; size_t workSpaceSize; + void *offload_reserve_space; + size_t offload_reserve_space_size; + DataType quantization_type; bool allowTensorOpMathConversion; #ifdef FF_USE_NCCL ncclComm_t ncclComm; @@ -78,6 +82,8 @@ struct FFHandler { struct FFInitInfo { size_t workSpaceSize; + size_t offload_reserve_space_size; + DataType quantization_type; bool allowTensorOpMathConversion; // int myRank, allRanks; }; @@ -122,19 +128,26 @@ class FFConfig { size_t workSpaceSize; Legion::Context lg_ctx; Legion::Runtime *lg_hlr; - Legion::FieldSpace field_space; + // Legion::FieldSpace field_space; bool syntheticInput, profiling, perform_fusion; size_t simulator_work_space_size; size_t search_budget; float search_alpha; bool search_overlap_backward_update; CompMode computationMode; + bool cpu_offload; + size_t offload_reserve_space_size; + DataType quantization_type; // Control parallelizable dimensions bool only_data_parallel; bool enable_sample_parallel; bool enable_parameter_parallel; bool enable_attribute_parallel; bool enable_inplace_optimizations; + // Control parallelism degrees in inference + int data_parallelism_degree; + int tensor_parallelism_degree; + int pipeline_parallelism_degree; // Control Tensor Op Math Conversion bool allow_tensor_op_math_conversion; std::string dataset_path; diff --git a/include/flexflow/ffconst.h b/include/flexflow/ffconst.h index 5658e2923d..2f97d48997 100644 --- a/include/flexflow/ffconst.h +++ b/include/flexflow/ffconst.h @@ -33,6 +33,8 @@ enum DataType { DT_HALF = 43, DT_FLOAT = 44, DT_DOUBLE = 45, + DT_INT4 = 46, + DT_INT8 = 47, DT_NONE = 49, }; @@ -64,6 +66,12 @@ enum MetricsType { METRICS_MEAN_ABSOLUTE_ERROR = 1032, }; +enum InferenceMode { + INC_DECODING_MODE = 2001, + BEAM_SEARCH_MODE = 2002, + TREE_VERIFY_MODE = 2003, +}; + // This is consistent with TASO's OpType // https://github.com/jiazhihao/TASO/blob/master/include/taso/ops.h#L75-L138 enum OperatorType { @@ -129,6 +137,7 @@ enum OperatorType { OP_SHAPE, // https://github.com/onnx/onnx/blob/master/docs/Operators.md#Shape OP_SIZE, // https://github.com/onnx/onnx/blob/master/docs/Operators.md#Size OP_TOPK, // 
https://github.com/onnx/onnx/blob/master/docs/Operators.md#TopK + OP_ARG_TOPK, OP_WHERE, // https://github.com/onnx/onnx/blob/master/docs/Operators.md#Where OP_CEIL, // https://github.com/onnx/onnx/blob/master/docs/Operators.md#Ceil OP_CAST, // https://github.com/onnx/onnx/blob/master/docs/Operators.md#Cast @@ -150,17 +159,35 @@ enum OperatorType { OP_POW, // https://pytorch.org/docs/stable/generated/torch.pow.html OP_MEAN, // https://pytorch.org/docs/stable/generated/torch.mean.html OP_LAYERNORM, + OP_EXPERTS, OP_GATHER, // https://pytorch.org/docs/stable/generated/torch.gather.html + OP_RMS_NORM, + OP_BEAM_TOPK, + OP_ARGMAX, + OP_INC_MULTIHEAD_SELF_ATTENTION, + OP_SPEC_INC_MULTIHEAD_SELF_ATTENTION, + OP_TREE_INC_MULTIHEAD_SELF_ATTENTION, + OP_SAMPLING, // Parallel Ops OP_REPARTITION, OP_COMBINE, OP_REPLICATE, OP_REDUCTION, OP_PIPELINE, + OP_ALLREDUCE, OP_FUSED_PARALLEL, OP_INVALID, }; +enum ModelType { + UNKNOWN = 3001, + LLAMA = 3002, + LLAMA2 = 3003, + OPT = 3004, + FALCON = 3005, + STARCODER = 3006 +}; + enum PMParameter { PM_OP_TYPE, // AnyOp PM_NUM_INPUTS, // AnyOp @@ -189,6 +216,7 @@ enum PMParameter { PM_COMBINE_DEGREE, // Combine PM_REDUCTION_DIM, // Reduction PM_REDUCTION_DEGREE, // Reduction + PM_ALLREDUCE_DIM, // AllReduce PM_SOFTMAX_DIM, // Softmax PM_NUM_HEADS, // MultiHeadAttention PM_INVALID, diff --git a/include/flexflow/ffconst_utils.h b/include/flexflow/ffconst_utils.h index fcd881e57e..421a139d57 100644 --- a/include/flexflow/ffconst_utils.h +++ b/include/flexflow/ffconst_utils.h @@ -8,8 +8,16 @@ namespace FlexFlow { std::string get_operator_type_name(OperatorType type); +size_t data_type_size(DataType type); + +#define INT4_NUM_OF_ELEMENTS_PER_GROUP 32 + +size_t get_quantization_to_byte_size(DataType type, + DataType quantization_type, + size_t num_elements); + std::ostream &operator<<(std::ostream &, OperatorType); }; // namespace FlexFlow -#endif // _FLEXFLOW_FFCONST_UTILS_H \ No newline at end of file +#endif // _FLEXFLOW_FFCONST_UTILS_H diff --git a/include/flexflow/fftype.h b/include/flexflow/fftype.h index a71c85dbc8..18ed6b8100 100644 --- a/include/flexflow/fftype.h +++ b/include/flexflow/fftype.h @@ -8,15 +8,16 @@ namespace FlexFlow { class LayerID { public: + static const LayerID NO_ID; LayerID(); - LayerID(size_t id); + LayerID(size_t id, size_t transformer_layer_id); bool is_valid_id() const; friend bool operator==(LayerID const &lhs, LayerID const &rhs); public: - size_t id; + size_t id, transformer_layer_id; }; }; // namespace FlexFlow -#endif // _FF_TYPE_H \ No newline at end of file +#endif // _FF_TYPE_H diff --git a/include/flexflow/flexflow_c.h b/include/flexflow/flexflow_c.h index 16ce3ac205..003533bb80 100644 --- a/include/flexflow/flexflow_c.h +++ b/include/flexflow/flexflow_c.h @@ -47,6 +47,14 @@ FF_NEW_OPAQUE_TYPE(flexflow_dlrm_config_t); FF_NEW_OPAQUE_TYPE(flexflow_dataloader_4d_t); FF_NEW_OPAQUE_TYPE(flexflow_dataloader_2d_t); FF_NEW_OPAQUE_TYPE(flexflow_single_dataloader_t); +// Inference +FF_NEW_OPAQUE_TYPE(flexflow_batch_config_t); +FF_NEW_OPAQUE_TYPE(flexflow_tree_verify_batch_config_t); +FF_NEW_OPAQUE_TYPE(flexflow_beam_search_batch_config_t); +FF_NEW_OPAQUE_TYPE(flexflow_inference_manager_t); +FF_NEW_OPAQUE_TYPE(flexflow_request_manager_t); +FF_NEW_OPAQUE_TYPE(flexflow_file_data_loader_t); +FF_NEW_OPAQUE_TYPE(flexflow_generation_result_t); // ----------------------------------------------------------------------- // FFConfig @@ -72,12 +80,31 @@ int flexflow_config_get_epochs(flexflow_config_t handle); bool 
flexflow_config_get_enable_control_replication(flexflow_config_t handle); +int flexflow_config_get_data_parallelism_degree(flexflow_config_t handle_); + +int flexflow_config_get_tensor_parallelism_degree(flexflow_config_t handle_); + +int flexflow_config_get_pipeline_parallelism_degree(flexflow_config_t handle_); + +void flexflow_config_set_data_parallelism_degree(flexflow_config_t handle_, + int value); + +void flexflow_config_set_tensor_parallelism_degree(flexflow_config_t handle_, + int value); + +void flexflow_config_set_pipeline_parallelism_degree(flexflow_config_t handle_, + int value); + int flexflow_config_get_python_data_loader_type(flexflow_config_t handle); + +bool flexflow_config_get_offload(flexflow_config_t handle); + // ----------------------------------------------------------------------- // FFModel // ----------------------------------------------------------------------- -flexflow_model_t flexflow_model_create(flexflow_config_t config); +flexflow_model_t flexflow_model_create(flexflow_config_t config, + bool cpu_offload); void flexflow_model_destroy(flexflow_model_t handle); @@ -197,9 +224,10 @@ flexflow_tensor_t flexflow_tensor_t flexflow_model_add_embedding(flexflow_model_t handle, const flexflow_tensor_t input, - int num_entires, + int num_entries, int out_dim, enum AggrMode aggr, + enum DataType dtype, flexflow_op_t shared_op, flexflow_initializer_t kernel_initializer, char const *name); @@ -371,6 +399,151 @@ flexflow_tensor_t flexflow_model_add_multihead_attention( flexflow_initializer_t kernel_initializer, char const *name); +flexflow_tensor_t flexflow_model_add_inc_multihead_self_attention( + flexflow_model_t handle_, + const flexflow_tensor_t input_, + int embed_dim, + int num_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + enum DataType data_type, + flexflow_initializer_t kernel_initializer_, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name); + +flexflow_tensor_t flexflow_model_add_spec_inc_multihead_self_attention( + flexflow_model_t handle_, + const flexflow_tensor_t input_, + int embed_dim, + int num_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + enum DataType data_type, + flexflow_initializer_t kernel_initializer_, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name); + +flexflow_tensor_t flexflow_model_add_inc_multihead_self_attention_verify( + flexflow_model_t handle_, + const flexflow_tensor_t input_, + int embed_dim, + int num_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + enum DataType data_type, + flexflow_initializer_t kernel_initializer_, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name); + +flexflow_tensor_t flexflow_model_add_inc_multiquery_self_attention( + flexflow_model_t handle_, + const flexflow_tensor_t input_, + int embed_dim, + int num_q_heads, + int num_kv_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + enum DataType data_type, + flexflow_initializer_t kernel_initializer_, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name); + +flexflow_tensor_t flexflow_model_add_spec_inc_multiquery_self_attention( + flexflow_model_t handle_, + const 
flexflow_tensor_t input_, + int embed_dim, + int num_q_heads, + int num_kv_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + enum DataType data_type, + flexflow_initializer_t kernel_initializer_, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name); + +flexflow_tensor_t flexflow_model_add_inc_multiquery_self_attention_verify( + flexflow_model_t handle_, + const flexflow_tensor_t input_, + int embed_dim, + int num_q_heads, + int num_kv_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + enum DataType data_type, + flexflow_initializer_t kernel_initializer_, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name); + +flexflow_tensor_t flexflow_model_add_rms_norm(flexflow_model_t handle_, + const flexflow_tensor_t input_, + float eps, + int dim, + char const *name); + +flexflow_tensor_t flexflow_model_add_arg_top_k(flexflow_model_t handle_, + const flexflow_tensor_t input_, + int k, + bool sorted, + char const *name); + +flexflow_tensor_t flexflow_model_add_beam_top_k(flexflow_model_t handle_, + const flexflow_tensor_t input_, + int max_beam_size, + bool sorted, + char const *name); + +flexflow_tensor_t flexflow_model_add_sampling(flexflow_model_t handle_, + const flexflow_tensor_t input_, + float top_p, + char const *name); + +flexflow_tensor_t flexflow_model_add_argmax(flexflow_model_t handle_, + const flexflow_tensor_t input_, + bool beam_search, + char const *name); + void flexflow_model_set_sgd_optimizer(flexflow_model_t handle, flexflow_sgd_optimizer_t optimizer); @@ -390,6 +563,18 @@ flexflow_tensor_t flexflow_model_get_parameter_by_id(flexflow_model_t handle, flexflow_perf_metrics_t flexflow_model_get_perf_metrics(flexflow_model_t handle); +void flexflow_model_set_transformer_layer_id(flexflow_model_t handle, int id); + +flexflow_generation_result_t + flexflow_model_generate(flexflow_model_t handle_, + char const *input_text, + int max_num_chars, + char *output_text, + int max_seq_length, + int *output_length_and_tokens); + +void flexflow_model_set_position_offset(flexflow_model_t handle, int offset); + // ----------------------------------------------------------------------- // Tensor // ----------------------------------------------------------------------- @@ -699,6 +884,92 @@ void flexflow_op_forward(flexflow_op_t handle, flexflow_model_t model); void flexflow_perform_registration(void); +// ----------------------------------------------------------------------- +// BatchConfig +// ----------------------------------------------------------------------- + +flexflow_batch_config_t flexflow_batch_config_create(void); + +void flexflow_batch_config_destroy(flexflow_batch_config_t handle); + +// ----------------------------------------------------------------------- +// TreeVerifyBatchConfig +// ----------------------------------------------------------------------- + +flexflow_tree_verify_batch_config_t + flexflow_tree_verify_batch_config_create(void); + +void flexflow_tree_verify_batch_config_destroy( + flexflow_tree_verify_batch_config_t handle); + +// ----------------------------------------------------------------------- +// BeamSearchBatchConfig +// ----------------------------------------------------------------------- + +flexflow_beam_search_batch_config_t + flexflow_beam_search_batch_config_create(void); + +void 
flexflow_beam_search_batch_config_destroy( + flexflow_beam_search_batch_config_t handle); + +// ----------------------------------------------------------------------- +// RequestManager +// ----------------------------------------------------------------------- + +flexflow_request_manager_t flexflow_request_manager_get_request_manager(void); + +// void flexflow_request_manager_destroy(flexflow_request_manager_t handle_); + +void flexflow_request_manager_register_tokenizer( + flexflow_request_manager_t handle_, + enum ModelType model_type, + int bos_token_id, + int eos_token_id, + char const *tokenizer_filepath); + +void flexflow_request_manager_register_output_filepath( + flexflow_request_manager_t handle_, char const *output_filepath); + +int flexflow_request_manager_register_ssm_model( + flexflow_request_manager_t handle_, flexflow_model_t model_handle_); + +// ----------------------------------------------------------------------- +// InferenceManager +// ----------------------------------------------------------------------- + +flexflow_inference_manager_t + flexflow_inference_manager_get_inference_manager(void); + +// void flexflow_inference_manager_destroy(flexflow_inference_manager_t +// handle_); + +void flexflow_inference_manager_compile_model_and_allocate_buffer( + flexflow_inference_manager_t handle_, flexflow_model_t model_handle); + +void flexflow_inference_manager_init_operators_inference( + flexflow_inference_manager_t handle_, flexflow_model_t model_handle); + +// ----------------------------------------------------------------------- +// FileDataLoader +// ----------------------------------------------------------------------- + +flexflow_file_data_loader_t + flexflow_file_data_loader_create(char const *weight_file_path, + int num_q_heads, + int num_kv_heads, + int hidden_dim, + int qkv_inner_dim, + int tensor_parallelism_degree); + +void flexflow_file_data_loader_destroy(flexflow_file_data_loader_t handle_); + +void flexflow_file_data_loader_load_weights(flexflow_file_data_loader_t handle_, + flexflow_model_t model_handle_, + int num_layers, + char const **layer_names, + flexflow_op_t *layers, + bool use_full_precision); + #ifdef __cplusplus } #endif diff --git a/include/flexflow/gpt_tokenizer.h b/include/flexflow/gpt_tokenizer.h new file mode 100644 index 0000000000..ec08435809 --- /dev/null +++ b/include/flexflow/gpt_tokenizer.h @@ -0,0 +1,221 @@ +// version 0.1 +// Licensed under the MIT License . +// SPDX-License-Identifier: MIT +// Copyright (c) 2019-2020 zili wang . 
+ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using json = nlohmann::json; + +typedef std::pair bigram_pair; +typedef std::pair wbigram_pair; + +struct hash_pair { + template + size_t operator()(std::pair const &p) const { + auto hash1 = std::hash{}(p.first); + auto hash2 = std::hash{}(p.second); + return hash1 ^ hash2; + } +}; + +enum tokenizer_mode { GPT2_TOKENIZER, OPT_TOKENIZER }; + +class GPT_Tokenizer { + +public: + GPT_Tokenizer(tokenizer_mode mode_, + std::string const &vocab_file, + std::string const &merge_file, + std::string const &bos_token_str = "", + const std::string eos_token_str = "", + const std::string pad_token_str = "", + const std::string unk_token_str = "", + const std::string mask_token_str = "") { + mode = mode_; + load_vocab(vocab_file); + load_merge(merge_file); + bos_token = bos_token_str; + eos_token = eos_token_str; + pad_token = pad_token_str; + unk_token = unk_token_str; + mask_token = mask_token_str; + bytes_encoder = bytes_to_unicode(); + unicode_to_bytes(); + }; + // ~GPT_Tokenizer(); + std::vector bpe(std::wstring token); + std::vector tokenize(std::string str); + int32_t convert_token_to_id(std::string token); + void encode(std::string str, + size_t max_length, + std::vector *input_ids, + std::vector *mask_ids); + std::string decode(std::vector input_ids, + std::vector mask_ids); + tokenizer_mode mode; + std::string bos_token; + std::string eos_token; + std::string pad_token; + std::string unk_token; + std::string mask_token; + std::string strip(std::string const &inpt); + +private: + std::unordered_map vocab; + std::unordered_map inverse_vocab; + std::unordered_map bpe_ranks; + wchar_t *bytes_to_unicode(); + void unicode_to_bytes(); + wchar_t *bytes_encoder; + std::unordered_map bytes_decoder; + uint32_t cache_max_size = 500000; + uint32_t cache_word_max_length = 30; + std::string unicode_letter_expr = + "\\u0041-\\u005A\\u0061-\\u007A\\u00AA-\\u00AA\\u00B5-\\u00B5" + "\\u00BA-\\u00BA\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u02C1" + "\\u02C6-\\u02D1\\u02E0-\\u02E4\\u02EC-\\u02EC\\u02EE-\\u02EE" + "\\u0370-\\u0374\\u0376-\\u0377\\u037A-\\u037D\\u037F-\\u037F" + "\\u0386-\\u0386\\u0388-\\u038A\\u038C-\\u038C\\u038E-\\u03A1" + "\\u03A3-\\u03F5\\u03F7-\\u0481\\u048A-\\u052F\\u0531-\\u0556" + "\\u0559-\\u0559\\u0560-\\u0588\\u05D0-\\u05EA\\u05EF-\\u05F2" + "\\u0620-\\u064A\\u066E-\\u066F\\u0671-\\u06D3\\u06D5-\\u06D5" + "\\u06E5-\\u06E6\\u06EE-\\u06EF\\u06FA-\\u06FC\\u06FF-\\u06FF" + "\\u0710-\\u0710\\u0712-\\u072F\\u074D-\\u07A5\\u07B1-\\u07B1" + "\\u07CA-\\u07EA\\u07F4-\\u07F5\\u07FA-\\u07FA\\u0800-\\u0815" + "\\u081A-\\u081A\\u0824-\\u0824\\u0828-\\u0828\\u0840-\\u0858" + "\\u0860-\\u086A\\u08A0-\\u08B4\\u08B6-\\u08C7\\u0904-\\u0939" + "\\u093D-\\u093D\\u0950-\\u0950\\u0958-\\u0961\\u0971-\\u0980" + "\\u0985-\\u098C\\u098F-\\u0990\\u0993-\\u09A8\\u09AA-\\u09B0" + "\\u09B2-\\u09B2\\u09B6-\\u09B9\\u09BD-\\u09BD\\u09CE-\\u09CE" + "\\u09DC-\\u09DD\\u09DF-\\u09E1\\u09F0-\\u09F1\\u09FC-\\u09FC" + "\\u0A05-\\u0A0A\\u0A0F-\\u0A10\\u0A13-\\u0A28\\u0A2A-\\u0A30" + "\\u0A32-\\u0A33\\u0A35-\\u0A36\\u0A38-\\u0A39\\u0A59-\\u0A5C" + "\\u0A5E-\\u0A5E\\u0A72-\\u0A74\\u0A85-\\u0A8D\\u0A8F-\\u0A91" + "\\u0A93-\\u0AA8\\u0AAA-\\u0AB0\\u0AB2-\\u0AB3\\u0AB5-\\u0AB9" + "\\u0ABD-\\u0ABD\\u0AD0-\\u0AD0\\u0AE0-\\u0AE1\\u0AF9-\\u0AF9" + "\\u0B05-\\u0B0C\\u0B0F-\\u0B10\\u0B13-\\u0B28\\u0B2A-\\u0B30" + "\\u0B32-\\u0B33\\u0B35-\\u0B39\\u0B3D-\\u0B3D\\u0B5C-\\u0B5D" + 
"\\u0B5F-\\u0B61\\u0B71-\\u0B71\\u0B83-\\u0B83\\u0B85-\\u0B8A" + "\\u0B8E-\\u0B90\\u0B92-\\u0B95\\u0B99-\\u0B9A\\u0B9C-\\u0B9C" + "\\u0B9E-\\u0B9F\\u0BA3-\\u0BA4\\u0BA8-\\u0BAA\\u0BAE-\\u0BB9" + "\\u0BD0-\\u0BD0\\u0C05-\\u0C0C\\u0C0E-\\u0C10\\u0C12-\\u0C28" + "\\u0C2A-\\u0C39\\u0C3D-\\u0C3D\\u0C58-\\u0C5A\\u0C60-\\u0C61" + "\\u0C80-\\u0C80\\u0C85-\\u0C8C\\u0C8E-\\u0C90\\u0C92-\\u0CA8" + "\\u0CAA-\\u0CB3\\u0CB5-\\u0CB9\\u0CBD-\\u0CBD\\u0CDE-\\u0CDE" + "\\u0CE0-\\u0CE1\\u0CF1-\\u0CF2\\u0D04-\\u0D0C\\u0D0E-\\u0D10" + "\\u0D12-\\u0D3A\\u0D3D-\\u0D3D\\u0D4E-\\u0D4E\\u0D54-\\u0D56" + "\\u0D5F-\\u0D61\\u0D7A-\\u0D7F\\u0D85-\\u0D96\\u0D9A-\\u0DB1" + "\\u0DB3-\\u0DBB\\u0DBD-\\u0DBD\\u0DC0-\\u0DC6\\u0E01-\\u0E30" + "\\u0E32-\\u0E33\\u0E40-\\u0E46\\u0E81-\\u0E82\\u0E84-\\u0E84" + "\\u0E86-\\u0E8A\\u0E8C-\\u0EA3\\u0EA5-\\u0EA5\\u0EA7-\\u0EB0" + "\\u0EB2-\\u0EB3\\u0EBD-\\u0EBD\\u0EC0-\\u0EC4\\u0EC6-\\u0EC6" + "\\u0EDC-\\u0EDF\\u0F00-\\u0F00\\u0F40-\\u0F47\\u0F49-\\u0F6C" + "\\u0F88-\\u0F8C\\u1000-\\u102A\\u103F-\\u103F\\u1050-\\u1055" + "\\u105A-\\u105D\\u1061-\\u1061\\u1065-\\u1066\\u106E-\\u1070" + "\\u1075-\\u1081\\u108E-\\u108E\\u10A0-\\u10C5\\u10C7-\\u10C7" + "\\u10CD-\\u10CD\\u10D0-\\u10FA\\u10FC-\\u1248\\u124A-\\u124D" + "\\u1250-\\u1256\\u1258-\\u1258\\u125A-\\u125D\\u1260-\\u1288" + "\\u128A-\\u128D\\u1290-\\u12B0\\u12B2-\\u12B5\\u12B8-\\u12BE" + "\\u12C0-\\u12C0\\u12C2-\\u12C5\\u12C8-\\u12D6\\u12D8-\\u1310" + "\\u1312-\\u1315\\u1318-\\u135A\\u1380-\\u138F\\u13A0-\\u13F5" + "\\u13F8-\\u13FD\\u1401-\\u166C\\u166F-\\u167F\\u1681-\\u169A" + "\\u16A0-\\u16EA\\u16F1-\\u16F8\\u1700-\\u170C\\u170E-\\u1711" + "\\u1720-\\u1731\\u1740-\\u1751\\u1760-\\u176C\\u176E-\\u1770" + "\\u1780-\\u17B3\\u17D7-\\u17D7\\u17DC-\\u17DC\\u1820-\\u1878" + "\\u1880-\\u1884\\u1887-\\u18A8\\u18AA-\\u18AA\\u18B0-\\u18F5" + "\\u1900-\\u191E\\u1950-\\u196D\\u1970-\\u1974\\u1980-\\u19AB" + "\\u19B0-\\u19C9\\u1A00-\\u1A16\\u1A20-\\u1A54\\u1AA7-\\u1AA7" + "\\u1B05-\\u1B33\\u1B45-\\u1B4B\\u1B83-\\u1BA0\\u1BAE-\\u1BAF" + "\\u1BBA-\\u1BE5\\u1C00-\\u1C23\\u1C4D-\\u1C4F\\u1C5A-\\u1C7D" + "\\u1C80-\\u1C88\\u1C90-\\u1CBA\\u1CBD-\\u1CBF\\u1CE9-\\u1CEC" + "\\u1CEE-\\u1CF3\\u1CF5-\\u1CF6\\u1CFA-\\u1CFA\\u1D00-\\u1DBF" + "\\u1E00-\\u1F15\\u1F18-\\u1F1D\\u1F20-\\u1F45\\u1F48-\\u1F4D" + "\\u1F50-\\u1F57\\u1F59-\\u1F59\\u1F5B-\\u1F5B\\u1F5D-\\u1F5D" + "\\u1F5F-\\u1F7D\\u1F80-\\u1FB4\\u1FB6-\\u1FBC\\u1FBE-\\u1FBE" + "\\u1FC2-\\u1FC4\\u1FC6-\\u1FCC\\u1FD0-\\u1FD3\\u1FD6-\\u1FDB" + "\\u1FE0-\\u1FEC\\u1FF2-\\u1FF4\\u1FF6-\\u1FFC\\u2071-\\u2071" + "\\u207F-\\u207F\\u2090-\\u209C\\u2102-\\u2102\\u2107-\\u2107" + "\\u210A-\\u2113\\u2115-\\u2115\\u2119-\\u211D\\u2124-\\u2124" + "\\u2126-\\u2126\\u2128-\\u2128\\u212A-\\u212D\\u212F-\\u2139" + "\\u213C-\\u213F\\u2145-\\u2149\\u214E-\\u214E\\u2183-\\u2184" + "\\u2C00-\\u2C2E\\u2C30-\\u2C5E\\u2C60-\\u2CE4\\u2CEB-\\u2CEE" + "\\u2CF2-\\u2CF3\\u2D00-\\u2D25\\u2D27-\\u2D27\\u2D2D-\\u2D2D" + "\\u2D30-\\u2D67\\u2D6F-\\u2D6F\\u2D80-\\u2D96\\u2DA0-\\u2DA6" + "\\u2DA8-\\u2DAE\\u2DB0-\\u2DB6\\u2DB8-\\u2DBE\\u2DC0-\\u2DC6" + "\\u2DC8-\\u2DCE\\u2DD0-\\u2DD6\\u2DD8-\\u2DDE\\u2E2F-\\u2E2F" + "\\u3005-\\u3006\\u3031-\\u3035\\u303B-\\u303C\\u3041-\\u3096" + "\\u309D-\\u309F\\u30A1-\\u30FA\\u30FC-\\u30FF\\u3105-\\u312F" + "\\u3131-\\u318E\\u31A0-\\u31BF\\u31F0-\\u31FF\\u3400-\\u4DBF" + "\\u4E00-\\u9FFC\\uA000-\\uA48C\\uA4D0-\\uA4FD\\uA500-\\uA60C" + "\\uA610-\\uA61F\\uA62A-\\uA62B\\uA640-\\uA66E\\uA67F-\\uA69D" + "\\uA6A0-\\uA6E5\\uA717-\\uA71F\\uA722-\\uA788\\uA78B-\\uA7BF" + 
"\\uA7C2-\\uA7CA\\uA7F5-\\uA801\\uA803-\\uA805\\uA807-\\uA80A" + "\\uA80C-\\uA822\\uA840-\\uA873\\uA882-\\uA8B3\\uA8F2-\\uA8F7" + "\\uA8FB-\\uA8FB\\uA8FD-\\uA8FE\\uA90A-\\uA925\\uA930-\\uA946" + "\\uA960-\\uA97C\\uA984-\\uA9B2\\uA9CF-\\uA9CF\\uA9E0-\\uA9E4" + "\\uA9E6-\\uA9EF\\uA9FA-\\uA9FE\\uAA00-\\uAA28\\uAA40-\\uAA42" + "\\uAA44-\\uAA4B\\uAA60-\\uAA76\\uAA7A-\\uAA7A\\uAA7E-\\uAAAF" + "\\uAAB1-\\uAAB1\\uAAB5-\\uAAB6\\uAAB9-\\uAABD\\uAAC0-\\uAAC0" + "\\uAAC2-\\uAAC2\\uAADB-\\uAADD\\uAAE0-\\uAAEA\\uAAF2-\\uAAF4" + "\\uAB01-\\uAB06\\uAB09-\\uAB0E\\uAB11-\\uAB16\\uAB20-\\uAB26" + "\\uAB28-\\uAB2E\\uAB30-\\uAB5A\\uAB5C-\\uAB69\\uAB70-\\uABE2" + "\\uAC00-\\uD7A3\\uD7B0-\\uD7C6\\uD7CB-\\uD7FB\\uF900-\\uFA6D" + "\\uFA70-\\uFAD9\\uFB00-\\uFB06\\uFB13-\\uFB17\\uFB1D-\\uFB1D" + "\\uFB1F-\\uFB28\\uFB2A-\\uFB36\\uFB38-\\uFB3C\\uFB3E-\\uFB3E" + "\\uFB40-\\uFB41\\uFB43-\\uFB44\\uFB46-\\uFBB1\\uFBD3-\\uFD3D" + "\\uFD50-\\uFD8F\\uFD92-\\uFDC7\\uFDF0-\\uFDFB\\uFE70-\\uFE74" + "\\uFE76-\\uFEFC\\uFF21-\\uFF3A\\uFF41-\\uFF5A\\uFF66-\\uFFBE" + "\\uFFC2-\\uFFC7\\uFFCA-\\uFFCF\\uFFD2-\\uFFD7\\uFFDA-\\uFFDC"; + + std::string unicode_number_expr = + "\\u0030-\\u0039\\u00B2-\\u00B3\\u00B9-\\u00B9\\u00BC-\\u00BE" + "\\u0660-\\u0669\\u06F0-\\u06F9\\u07C0-\\u07C9\\u0966-\\u096F" + "\\u09E6-\\u09EF\\u09F4-\\u09F9\\u0A66-\\u0A6F\\u0AE6-\\u0AEF" + "\\u0B66-\\u0B6F\\u0B72-\\u0B77\\u0BE6-\\u0BF2\\u0C66-\\u0C6F" + "\\u0C78-\\u0C7E\\u0CE6-\\u0CEF\\u0D58-\\u0D5E\\u0D66-\\u0D78" + "\\u0DE6-\\u0DEF\\u0E50-\\u0E59\\u0ED0-\\u0ED9\\u0F20-\\u0F33" + "\\u1040-\\u1049\\u1090-\\u1099\\u1369-\\u137C\\u16EE-\\u16F0" + "\\u17E0-\\u17E9\\u17F0-\\u17F9\\u1810-\\u1819\\u1946-\\u194F" + "\\u19D0-\\u19DA\\u1A80-\\u1A89\\u1A90-\\u1A99\\u1B50-\\u1B59" + "\\u1BB0-\\u1BB9\\u1C40-\\u1C49\\u1C50-\\u1C59\\u2070-\\u2070" + "\\u2074-\\u2079\\u2080-\\u2089\\u2150-\\u2182\\u2185-\\u2189" + "\\u2460-\\u249B\\u24EA-\\u24FF\\u2776-\\u2793\\u2CFD-\\u2CFD" + "\\u3007-\\u3007\\u3021-\\u3029\\u3038-\\u303A\\u3192-\\u3195" + "\\u3220-\\u3229\\u3248-\\u324F\\u3251-\\u325F\\u3280-\\u3289" + "\\u32B1-\\u32BF\\uA620-\\uA629\\uA6E6-\\uA6EF\\uA830-\\uA835" + "\\uA8D0-\\uA8D9\\uA900-\\uA909\\uA9D0-\\uA9D9\\uA9F0-\\uA9F9" + "\\uAA50-\\uAA59\\uABF0-\\uABF9\\uFF10-\\uFF19"; + + std::wstring wpat_expr = utf8_to_wstring( + "'s|'t|'re|'ve|'m|'ll|'d| ?[" + unicode_letter_expr + "]+| ?[" + + unicode_number_expr + "]+| ?[^\\s" + unicode_letter_expr + + unicode_number_expr + "]+|\\s+(?!\\S)|\\s+"); + + const std::wregex pat = std::wregex(wpat_expr); + std::unordered_map> cache; + void load_vocab(std::string const &vocab_file); + void load_merge(std::string const &merge_file); + + std::unordered_set + get_pairs(std::vector word); + std::wstring utf8_to_wstring(std::string const &src); + std::u32string utf8_to_utf32(std::string const &src); + std::string wstring_to_utf8(std::wstring const &src); + std::string utf32_to_utf8(std::u32string const &src); + + std::vector split(std::string const &s, + std::regex rgx = std::regex("\\s+")); +}; diff --git a/include/flexflow/inference.h b/include/flexflow/inference.h new file mode 100644 index 0000000000..f24a797ffd --- /dev/null +++ b/include/flexflow/inference.h @@ -0,0 +1,50 @@ +/* Copyright 2022 CMU, Stanford, Facebook, LANL + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once +#include "flexflow/batch_config.h" +#include +#include + +namespace FlexFlow { + +struct GenerationConfig { + bool do_sample = false; + float temperature = 0.8; + float topp = 0.6; + GenerationConfig(bool _do_sample, float _temperature, float _topp) { + temperature = _temperature > 0 ? _temperature : temperature; + topp = _topp > 0 ? _topp : topp; + do_sample = _do_sample; + } + GenerationConfig() {} +}; + +struct GenerationResult { + using RequestGuid = BatchConfig::RequestGuid; + using TokenId = BatchConfig::TokenId; + RequestGuid guid; + std::string input_text; + std::string output_text; + std::vector input_tokens; + std::vector output_tokens; +}; + +#include +#include + +std::string join_path(std::vector const &paths); + +} // namespace FlexFlow diff --git a/include/flexflow/model.h b/include/flexflow/model.h index cb1b26d624..bc3c7e6545 100644 --- a/include/flexflow/model.h +++ b/include/flexflow/model.h @@ -17,6 +17,7 @@ #include "accessor.h" #include "config.h" #include "device.h" +#include "flexflow/inference.h" #include "flexflow/memory_optimization.h" #include "flexflow/node.h" #include "flexflow/operator_params.h" @@ -30,6 +31,7 @@ #include "optimizer.h" #include "parallel_tensor.h" #include "recompile.h" +#include "runtime.h" #include "simulator.h" #include "tensor.h" #include "tl/optional.hpp" @@ -55,6 +57,10 @@ enum TaskIDs { ELEMENTUNARY_INIT_TASK_ID, ELEMENTUNARY_FWD_TASK_ID, ELEMENTUNARY_BWD_TASK_ID, + EXPERTS_INIT_TASK_ID, + EXPERTS_FWD_TASK_ID, + EXPERTS_BWD_TASK_ID, + EXPERTS_INF_TASK_ID, CONV2D_INIT_TASK_ID, CONV2D_INIT_PARA_TASK_ID, CONV2D_FWD_TASK_ID, @@ -99,6 +105,7 @@ enum TaskIDs { LAYERNORM_BWD_TASK_ID, LINEAR_INIT_TASK_ID, LINEAR_INIT_PARA_TASK_ID, + LINEAR_INF_TASK_ID, LINEAR_FWD_TASK_ID, LINEAR_BWD_TASK_ID, LINEAR_BWD2_TASK_ID, @@ -109,6 +116,7 @@ enum TaskIDs { SOFTMAX_INIT_TASK_ID, SOFTMAX_FWD_TASK_ID, SOFTMAX_BWD_TASK_ID, + SOFTMAX_INF_TASK_ID, CONCAT_INIT_TASK_ID, CONCAT_FWD_TASK_ID, CONCAT_BWD_TASK_ID, @@ -127,16 +135,36 @@ enum TaskIDs { TOPK_INIT_TASK_ID, TOPK_FWD_TASK_ID, TOPK_BWD_TASK_ID, + ARG_TOPK_INIT_TASK_ID, + ARG_TOPK_INF_TASK_ID, + SAMPLING_INIT_TASK_ID, + SAMPLING_INF_TASK_ID, + ARGMAX_INIT_TASK_ID, + ARGMAX_BEAM_INF_TASK_ID, + ARGMAX_NORM_INF_TASK_ID, TRANSPOSE_INIT_TASK_ID, TRANSPOSE_FWD_TASK_ID, TRANSPOSE_BWD_TASK_ID, ATTENTION_INIT_TASK_ID, ATTENTION_FWD_TASK_ID, ATTENTION_BWD_TASK_ID, + RMSNROM_INIT_TASK_ID, + RMSNROM_FWD_TASK_ID, + BEAM_TOPK_INIT_TASK_ID, + BEAM_TOPK_INF_TASK_ID, + INC_MULTIHEAD_SELF_ATTENTION_INIT_TASK_ID, + INC_MULTIHEAD_SELF_ATTENTION_FWD_TASK_ID, + INC_MULTIHEAD_SELF_ATTENTION_BWD_TASK_ID, + INC_MULTIHEAD_SELF_ATTENTION_INF_TASK_ID, + SPEC_INC_MULTIHEAD_SELF_ATTENTION_INIT_TASK_ID, + SPEC_INC_MULTIHEAD_SELF_ATTENTION_INF_TASK_ID, + TREE_INC_MULTIHEAD_SELF_ATTENTION_INIT_TASK_ID, + TREE_INC_MULTIHEAD_SELF_ATTENTION_INF_TASK_ID, MSELOSS_BWD_TASK_ID, FUSEDOP_INIT_TASK_ID, FUSEDOP_FWD_TASK_ID, FUSEDOP_BWD_TASK_ID, + FUSEDOP_INF_TASK_ID, NOOP_INIT_TASK_ID, // Metrics tasks METRICS_COMP_TASK_ID, @@ -190,9 +218,20 @@ enum TaskIDs { PIPELINE_INIT_TASK_ID, PIPELINE_FWD_TASK_ID, 
PIPELINE_BWD_TASK_ID, + ALLREDUCE_INIT_TASK_ID, + ALLREDUCE_INF_TASK_ID, + ALLREDUCE_FWD_TASK_ID, + ALLREDUCE_BWD_TASK_ID, FUSED_PARALLELOP_INIT_TASK_ID, FUSED_PARALLELOP_FWD_TASK_ID, FUSED_PARALLELOP_BWD_TASK_ID, + // InferenceManager & RequestManager + RM_LOAD_TOKENS_TASK_ID, + RM_LOAD_POSITION_TASK_ID, + RM_PREPARE_NEXT_BATCH_TASK_ID, + RM_PREPARE_NEXT_BATCH_BEAM_TASK_ID, + RM_PREPARE_NEXT_BATCH_INIT_TASK_ID, + RM_PREPARE_NEXT_BATCH_VERIFY_TASK_ID, // Custom tasks CUSTOM_GPU_TASK_ID_FIRST, CUSTOM_GPU_TASK_ID_1, @@ -216,6 +255,8 @@ enum TaskIDs { // Make sure PYTHON_TOP_LEVEL_TASK_ID is // consistent with python/main.cc PYTHON_TOP_LEVEL_TASK_ID = 11111, + // Tensor Equal Task + TENSOR_EQUAL_TASK_ID, }; enum ShardingID { @@ -259,23 +300,33 @@ class Dropout; class ElementBinary; class ElementUnary; class Embedding; +class Experts; class Flat; class Gather; class Group_by; class LayerNorm; class Linear; class MultiHeadAttention; +class IncMultiHeadSelfAttention; +class TreeIncMultiHeadSelfAttention; class Pool2D; class Reduce; class Reshape; class Softmax; class Split; class TopK; +class ArgTopK; class Transpose; +class RMSNorm; +class BeamTopK; +class SpecIncMultiHeadSelfAttention; +class Sampling; +class ArgMax; class Combine; class Repartition; class Reduction; class Replicate; +class AllReduce; class FusedParallelOp; class ParallelOpInfo; @@ -325,12 +376,13 @@ std::vector class FFModel { public: - FFModel(FFConfig &config); + FFModel(FFConfig &config, bool cpu_offload = false); static constexpr float PROPAGATION_CHANCE = 0.25; static constexpr float CONTINUE_PROPAGATION_CHANCE = 0.75; static constexpr float PROPAGATION_SIZE_WEIGHT = 1.0; + bool cpu_offload; // C++ APIs for constructing models // Add an exp layer Tensor exp(const Tensor x, char const *name = NULL); @@ -422,7 +474,7 @@ class FFModel { char const *name = NULL); // Add an embedding layer Tensor embedding(const Tensor input, - int num_entires, + int num_entries, int outDim, AggrMode aggr, DataType dtype = DT_FLOAT, @@ -468,11 +520,12 @@ class FFModel { PoolType type = POOL_MAX, ActiMode activation = AC_MODE_NONE, char const *name = NULL); - // Add a batch_norm layer + // Add a layer_norm layer Tensor layer_norm(const Tensor input, std::vector const &axes, bool elementwise_affine, float eps, + DataType data_type = DT_NONE, char const *name = NULL); // Add a batch_norm layer Tensor @@ -483,12 +536,24 @@ class FFModel { int a_seq_length_dim = -1, int b_seq_length_dim = -1, char const *name = nullptr); + // Add a root mean square layer + Tensor rms_norm(const Tensor input, + float eps, + int dim, + DataType data_type = DT_NONE, + char const *name = NULL); + // Add a beam search top k layer + Tensor beam_top_k(const Tensor input, + int max_beam_size, + bool sorted, + char const *name = NULL); + // Add a dense layer Tensor dense(const Tensor input, int outDim, ActiMode activation = AC_MODE_NONE, bool use_bias = true, - DataType data_type = DT_FLOAT, + DataType data_type = DT_NONE, Layer const *shared_op = NULL, Initializer *kernel_initializer = NULL, Initializer *bias_initializer = NULL, @@ -500,6 +565,16 @@ class FFModel { // Add a concat layer Tensor concat(int n, Tensor const *tensors, int axis, char const *name = NULL); + // Add an experts layer + Tensor experts( + Tensor const *inputs, + int num_experts, + int experts_start_idx, + int experts_output_dim_size, + float alpha, + int experts_num_layers = 1, // number of linear layers per expert + int experts_internal_dim_size = 0, // hidden dimension for internal layers + char 
const *name = NULL); // Add a mean layer Tensor mean(const Tensor input, std::vector const &dims, @@ -521,7 +596,10 @@ class FFModel { // Add a flat layer Tensor flat(const Tensor input, char const *name = NULL); // Add a softmax layer - Tensor softmax(const Tensor input, int dim = -1, char const *name = NULL); + Tensor softmax(const Tensor input, + int dim = -1, + DataType data_type = DT_NONE, + char const *name = NULL); // Create input tensors and constants Tensor transpose(const Tensor input, std::vector const &perm, @@ -539,6 +617,13 @@ class FFModel { int k, bool sorted, char const *name = NULL); + Tensor arg_top_k(const Tensor input, + // Tensor *outputs, + int k, + bool sorted, + char const *name = NULL); + Tensor argmax(const Tensor input, bool beam_search, char const *name = NULL); + Tensor sampling(const Tensor input, float top_p, char const *name = NULL); Tensor multihead_attention(const Tensor query, const Tensor key, const Tensor value, @@ -550,8 +635,117 @@ class FFModel { bool bias = true, bool add_bias_kv = false, bool add_zero_attn = false, + DataType data_type = DT_NONE, Initializer *kernel_initializer = NULL, char const *name = NULL); + Tensor inc_multihead_self_attention(const Tensor input, + int embed_dim, + int num_heads, + int kdim = 0, + int vdim = 0, + float dropout = 0.0f, + bool bias = false, + bool add_bias_kv = false, + bool add_zero_attn = false, + DataType data_type = DT_NONE, + Initializer *kernel_initializer = NULL, + bool apply_rotary_embedding = false, + bool scaling_query = false, + float scaling_factor = 1.0f, + bool qk_prod_scaling = true, + char const *name = NULL); + Tensor + spec_inc_multihead_self_attention(const Tensor input, + int embed_dim, + int num_heads, + int kdim = 0, + int vdim = 0, + float dropout = 0.0f, + bool bias = false, + bool add_bias_kv = false, + bool add_zero_attn = false, + DataType data_type = DT_NONE, + Initializer *kernel_initializer = NULL, + bool apply_rotary_embedding = false, + bool scaling_query = false, + float scaling_factor = 1.0f, + bool qk_prod_scaling = true, + char const *name = NULL); + Tensor inc_multihead_self_attention_verify( + const Tensor input, + int embed_dim, + int num_heads, + int kdim = 0, + int vdim = 0, + float dropout = 0.0f, + bool bias = false, + bool add_bias_kv = false, + bool add_zero_attn = false, + DataType data_type = DT_NONE, + Initializer *kernel_initializer = NULL, + bool apply_rotary_embedding = false, + bool scaling_query = false, + float scaling_factor = 1.0f, + bool qk_prod_scaling = true, + char const *name = NULL); + Tensor inc_multiquery_self_attention(const Tensor input, + int embed_dim, + int num_q_heads, + int num_kv_heads, + int kdim = 0, + int vdim = 0, + float dropout = 0.0f, + bool bias = false, + bool add_bias_kv = false, + bool add_zero_attn = false, + DataType data_type = DT_NONE, + Initializer *kernel_initializer = NULL, + bool apply_rotary_embedding = false, + bool scaling_query = false, + float scaling_factor = 1.0f, + bool qk_prod_scaling = true, + char const *name = NULL); + Tensor + spec_inc_multiquery_self_attention(const Tensor input, + int embed_dim, + int num_q_heads, + int num_kv_heads, + int kdim = 0, + int vdim = 0, + float dropout = 0.0f, + bool bias = false, + bool add_bias_kv = false, + bool add_zero_attn = false, + DataType data_type = DT_NONE, + Initializer *kernel_initializer = NULL, + bool apply_rotary_embedding = false, + bool scaling_query = false, + float scaling_factor = 1.0f, + bool qk_prod_scaling = true, + char const *name = NULL); + Tensor 
inc_multiquery_self_attention_verify( + const Tensor input, + int embed_dim, + int num_q_heads, + int num_kv_heads, + int kdim = 0, + int vdim = 0, + float dropout = 0.0f, + bool bias = false, + bool add_bias_kv = false, + bool add_zero_attn = false, + DataType data_type = DT_NONE, + Initializer *kernel_initializer = NULL, + bool apply_rotary_embedding = false, + bool scaling_query = false, + float scaling_factor = 1.0f, + bool qk_prod_scaling = true, + char const *name = NULL); + // ======================================== + // Inference APIs + // ======================================== + GenerationResult generate(std::string const &text, int max_seq_length); + Tensor create_tensor_legion_ordering(int num_dim, int const dims[], DataType data_type, @@ -683,6 +877,7 @@ class FFModel { auto input_shapes = get_input_shape(input); if (!params.is_valid(input_shapes)) { + printf("!params.is_valid(input_shapes)\n"); return PCG::Node::INVALID_NODE; } @@ -690,7 +885,7 @@ class FFModel { std::pair::type, Params> key{ input_shapes, params}; - auto &cache = get::type, Params>, T *>>(this->cached_ops); auto const &it = cache.find(key); @@ -765,8 +960,14 @@ class FFModel { std::vector const ®ions, Legion::Context ctx, Legion::Runtime *runtime); + // ======================================== + // Internal APIs that should not be invoked from applications + // ======================================== void reset_metrics(); void init_operators(); + void init_operators_inference( + std::vector const &batch_inputs, + std::vector const &batch_outputs); void prefetch(); void forward(int seq_length = -1); void compute_metrics(); @@ -783,6 +984,9 @@ class FFModel { LossType loss_type, std::vector const &metrics, CompMode comp_mode = COMP_MODE_TRAINING); + void compile_inference(); + void set_transformer_layer_id(int id); + void set_position_offset(int offset); void graph_optimize(size_t budget, bool only_data_parallel, std::unique_ptr &best_graph, @@ -839,6 +1043,9 @@ class FFModel { public: size_t op_global_guid, layer_global_guid; size_t tensor_global_guid, parallel_tensor_global_guid, node_global_guid; + size_t current_transformer_layer_id; + // positional embedding start offset + int position_offset; FFConfig config; FFIterationConfig iter_config; Optimizer *optimizer; @@ -883,6 +1090,9 @@ class FFModel { ElementUnary *>, std::unordered_map, Embedding *>, + std::unordered_map< + std::pair, ExpertsParams>, + Experts *>, std::unordered_map, Flat *>, std::unordered_map< std::pair, @@ -903,6 +1113,21 @@ class FFModel { ParallelTensorShape>, MultiHeadAttentionParams>, MultiHeadAttention *>, + std::unordered_map< + std::pair, + IncMultiHeadSelfAttention *>, + std::unordered_map, + BeamTopK *>, + std::unordered_map, + Sampling *>, + std::unordered_map, + ArgMax *>, + std::unordered_map< + std::pair, + SpecIncMultiHeadSelfAttention *>, + std::unordered_map< + std::pair, + TreeIncMultiHeadSelfAttention *>, std::unordered_map, Reduce *>, std::unordered_map, @@ -911,8 +1136,12 @@ class FFModel { std::unordered_map, Softmax *>, std::unordered_map, TopK *>, + std::unordered_map, + ArgTopK *>, std::unordered_map, Transpose *>, + std::unordered_map, + RMSNorm *>, std::unordered_map, Repartition *>, std::unordered_map, @@ -921,6 +1150,8 @@ class FFModel { Reduction *>, std::unordered_map, Combine *>, + std::unordered_map, + AllReduce *>, std::unordered_map, FusedParallelOp *>> cached_ops; diff --git a/include/flexflow/operator.h b/include/flexflow/operator.h index 3fd84ce55b..1b2fc7bbfc 100644 --- 
a/include/flexflow/operator.h +++ b/include/flexflow/operator.h @@ -1,6 +1,7 @@ #ifndef _OPERATOR_H #define _OPERATOR_H +#include "flexflow/batch_config.h" #include "flexflow/fftype.h" #include "flexflow/machine_view.h" #include "flexflow/parallel_tensor.h" @@ -19,11 +20,33 @@ enum class MappingRecordType { INPUT_OUTPUT, INPUT_WEIGHT }; enum class MappingOperation { PARTITION, REPLICATE }; +/** @brief A class to keep track of a dimension relation between two tensors + * used by an operator. + * + * Dimension relations are one-to-one mappings between the dimensions of the + * input, weights, and output tensors of an operator. Introduced in the Unity + * paper, dimension relations allow FlexFlow to keep track of an operator's + * parallelization plans as part of the Parallel Computation Graph (PCG). + * + * Each ParallelDimMappingRecord only keeps track of a single dimension + * relation. + * + * ParallelDimMappingRecord objects must be initialized with a + * MappingRecordType, which can be INPUT_OUTPUT, if the ParallelDimMappingRecord + * is tracking a dimension relation between the input and the output tensor, or + * INPUT_WEIGHT, if the ParallelDimMappingRecord is tracking a dimension + * relation between the input tensor and the weights tensor. + * + */ class ParallelDimMappingRecord { private: ParallelDimMappingRecord(MappingRecordType); public: + /** + * @brief We disable this constructor because ParallelDimMappingRecord objects + * must specify the MappingRecordType upon creation. + */ ParallelDimMappingRecord() = delete; static ParallelDimMappingRecord input_output_record( @@ -185,8 +208,22 @@ class Op { virtual bool get_weight_parameter(TNParameter, DIMParameter, int *) const; // Pure virtual functions that must be implemented virtual void init(FFModel const &) = 0; + virtual void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) { + assert(false); + }; virtual void forward(FFModel const &) = 0; virtual void backward(FFModel const &) = 0; + // Pure virtual functions for inference + virtual Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) { + assert(false); + }; virtual void print_layer(FFModel const &model) = 0; virtual bool measure_operator_cost(Simulator *sim, MachineView const &mv, @@ -242,12 +279,21 @@ class Op { #endif protected: void set_argumentmap_for_init(FFModel const &ff, Legion::ArgumentMap &argmap); + void set_argumentmap_for_init_inference(FFModel const &ff, + Legion::ArgumentMap &argmap, + ParallelTensor const output0); void set_argumentmap_for_forward(FFModel const &ff, Legion::ArgumentMap &argmap); + void set_argumentmap_for_inference(FFModel const &ff, + Legion::ArgumentMap &argmap, + ParallelTensor const output0); void set_argumentmap_for_backward(FFModel const &ff, Legion::ArgumentMap &argmap); void set_opmeta_from_futuremap(FFModel const &ff, Legion::FutureMap const &fm); + void set_opmeta_from_futuremap_inference(FFModel const &ff, + Legion::FutureMap const &fm, + ParallelTensor const output0); void solve_parallel_dim_mappings( std::vector const &inputs, std::vector const &weights, @@ -267,8 +313,10 @@ class Op { ParallelParameter weights[MAX_NUM_WEIGHTS]; bool trainableInputs[MAX_NUM_INPUTS]; OpMeta *meta[MAX_NUM_WORKERS]; + std::map inference_meta; int numInputs, numWeights, numOutputs; bool profiling; + bool add_bias_only_once; #ifdef FF_USE_NCCL ncclUniqueId ncclId; #endif diff --git 
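// --- Editor's note (not part of the diff above): an illustrative sketch ---
// The ParallelDimMappingRecord comment added in operator.h above describes a
// "dimension relation": a one-to-one mapping between a dimension of an
// operator's input tensor and a dimension of its output (INPUT_OUTPUT) or
// weight (INPUT_WEIGHT) tensor, used to propagate parallelization choices
// through the Parallel Computation Graph. The self-contained snippet below
// models that idea with hypothetical names; it is not FlexFlow's actual API.
#include <cassert>
#include <vector>

enum class RelationKind { INPUT_OUTPUT, INPUT_WEIGHT };

struct DimRelationSketch {
  RelationKind kind; // which pair of tensors the relation connects
  int input_dim;     // dimension index on the input tensor
  int other_dim;     // matching dimension on the output or weight tensor
};

// If an input dimension is partitioned, the mapped dimension of the output
// (or weight) tensor must be partitioned the same way.
int mapped_dim(std::vector<DimRelationSketch> const &rels,
               int partitioned_input_dim) {
  for (auto const &r : rels) {
    if (r.input_dim == partitioned_input_dim) {
      return r.other_dim;
    }
  }
  return -1; // no relation recorded for this dimension
}

int main() {
  // Example: a dense layer keeps the input's batch dimension aligned with
  // the output's batch dimension, so splitting one splits the other.
  std::vector<DimRelationSketch> rels = {
      {RelationKind::INPUT_OUTPUT, /*input_dim=*/1, /*other_dim=*/1}};
  assert(mapped_dim(rels, 1) == 1);
  return 0;
}
// --- End editor's note; the diff resumes below. ---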
a/include/flexflow/operator_params.h b/include/flexflow/operator_params.h index 24c84a85ed..4f0432cb93 100644 --- a/include/flexflow/operator_params.h +++ b/include/flexflow/operator_params.h @@ -3,8 +3,11 @@ #include "flexflow/ops/aggregate_params.h" #include "flexflow/ops/aggregate_spec_params.h" +#include "flexflow/ops/arg_topk_params.h" +#include "flexflow/ops/argmax_params.h" #include "flexflow/ops/attention_params.h" #include "flexflow/ops/batch_matmul_params.h" +#include "flexflow/ops/beam_topk_params.h" #include "flexflow/ops/cast_params.h" #include "flexflow/ops/concat_params.h" #include "flexflow/ops/conv_2d_params.h" @@ -12,18 +15,25 @@ #include "flexflow/ops/element_binary_params.h" #include "flexflow/ops/element_unary_params.h" #include "flexflow/ops/embedding_params.h" +#include "flexflow/ops/experts_params.h" #include "flexflow/ops/flat_params.h" #include "flexflow/ops/gather_params.h" #include "flexflow/ops/groupby_params.h" +#include "flexflow/ops/inc_multihead_self_attention_params.h" #include "flexflow/ops/layer_norm_params.h" #include "flexflow/ops/linear_params.h" #include "flexflow/ops/pool_2d_params.h" #include "flexflow/ops/reduce_params.h" #include "flexflow/ops/reshape_params.h" +#include "flexflow/ops/rms_norm_params.h" +#include "flexflow/ops/sampling_params.h" #include "flexflow/ops/softmax_params.h" +#include "flexflow/ops/spec_inc_multihead_self_attention_params.h" #include "flexflow/ops/split_params.h" #include "flexflow/ops/topk_params.h" #include "flexflow/ops/transpose_params.h" +#include "flexflow/ops/tree_inc_multihead_self_attention_params.h" +#include "flexflow/parallel_ops/allreduce_params.h" #include "flexflow/parallel_ops/combine_params.h" #include "flexflow/parallel_ops/fused_parallel_op_params.h" #include "flexflow/parallel_ops/partition_params.h" @@ -51,17 +61,26 @@ using OperatorParameters = mp::variant; tl::optional get_op_parameters(Op const *op); diff --git a/include/flexflow/ops/aggregate.h b/include/flexflow/ops/aggregate.h index 4eeb695e92..3ba4f414d1 100644 --- a/include/flexflow/ops/aggregate.h +++ b/include/flexflow/ops/aggregate.h @@ -1,6 +1,7 @@ #ifndef _FLEXFLOW_AGGREGATE_H_ #define _FLEXFLOW_AGGREGATE_H_ +#include "flexflow/inference.h" #include "flexflow/model.h" #include "flexflow/ops/aggregate_params.h" @@ -8,7 +9,7 @@ namespace FlexFlow { #define AGGREGATE_MAX_K 4 #define AGGREGATE_MAX_BATCH_SIZE 64 -#define AGGREGATE_MAX_N 12 +#define AGGREGATE_MAX_N 128 class AggregateMeta : public OpMeta { public: @@ -26,7 +27,7 @@ class Aggregate : public Op { ParallelTensor const *inputs, int _n, float _lambda_bal, - char const *name); + char const *name = nullptr); Aggregate(FFModel &model, Aggregate const &other, std::vector const &inputs); @@ -35,7 +36,16 @@ class Aggregate : public Op { Input const &inputs, char const *name = nullptr); void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void backward(FFModel const &) override; void print_layer(FFModel const &model) override { assert(0); @@ -81,6 +91,10 @@ class Aggregate : public Op { int const batch_size, int out_dim); void serialize(Legion::Serializer &s) const override; + static PCG::Node deserialize(FFModel &ff, + Legion::Deserializer &d, + Input const &inputs, + 
int num_inputs); bool measure_operator_cost(Simulator *sim, MachineView const &mv, CostMetrics &cost_metrics) const override; diff --git a/include/flexflow/ops/aggregate_spec.h b/include/flexflow/ops/aggregate_spec.h index 8c1966e72a..4302dd0733 100644 --- a/include/flexflow/ops/aggregate_spec.h +++ b/include/flexflow/ops/aggregate_spec.h @@ -1,6 +1,7 @@ #ifndef _FLEXFLOW_AGGREGATE_SPEC_H_ #define _FLEXFLOW_AGGREGATE_SPEC_H_ +#include "flexflow/inference.h" #include "flexflow/model.h" #include "flexflow/ops/aggregate_spec_params.h" @@ -27,7 +28,16 @@ class AggregateSpec : public Op { float _lambda_bal, char const *name); void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void backward(FFModel const &) override; void print_layer(FFModel const &model) override { assert(0); diff --git a/include/flexflow/ops/arg_topk.h b/include/flexflow/ops/arg_topk.h new file mode 100644 index 0000000000..8b2d2aa11c --- /dev/null +++ b/include/flexflow/ops/arg_topk.h @@ -0,0 +1,98 @@ +#ifndef _FLEXFLOW_ARG_TOPK_H_ +#define _FLEXFLOW_ARG_TOPK_H_ + +#include "flexflow/inference.h" +#include "flexflow/model.h" +#include "flexflow/node.h" +#include "flexflow/ops/arg_topk_params.h" + +namespace FlexFlow { + +class ArgTopKMeta : public OpMeta { +public: + ArgTopKMeta(FFHandler handle, Op const *op); + bool sorted; +}; + +class ArgTopK : public Op { +public: + using Params = ArgTopKParams; + using Input = ParallelTensor; + ArgTopK(FFModel &model, + LayerID const &layer_guid, + const ParallelTensor input, + int k, + bool sorted, + char const *name); + ArgTopK(FFModel &model, + LayerID const &layer_guid, + ArgTopK const &other, + const ParallelTensor input); + ArgTopK(FFModel &model, + Params const ¶ms, + Input const input, + char const *name = nullptr); + void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void forward(FFModel const &) override; + void backward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void print_layer(FFModel const &model) override { + assert(0); + } + static Op * + create_operator_from_layer(FFModel &model, + Layer const *layer, + std::vector const &inputs); + + static OpMeta *init_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + static InferenceResult + inference_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + void serialize(Legion::Serializer &s) const override; + static PCG::Node deserialize(FFModel &ff, + Legion::Deserializer &d, + ParallelTensor inputs[], + int num_inputs); + Op *materialize(FFModel &ff, + ParallelTensor inputs[], + int num_inputs) const override; + bool measure_operator_cost(Simulator *sim, + MachineView const &pc, + CostMetrics &cost_metrics) const override; + template + static void forward_kernel(ArgTopKMeta const *m, + DT const *input_ptr, + // float *output_ptr, + int *indices_ptr, + size_t batch_size, + int length, + int k, + bool sorted, + ffStream_t stream); + 
static void forward_kernel_wrapper(ArgTopKMeta const *m, + GenericTensorAccessorR const &input, + GenericTensorAccessorW const &indices, + int batch_size); + Params get_params() const; + +public: + int k; + bool sorted; +}; + +}; // namespace FlexFlow + +#endif diff --git a/include/flexflow/ops/arg_topk_params.h b/include/flexflow/ops/arg_topk_params.h new file mode 100644 index 0000000000..9d2a21034f --- /dev/null +++ b/include/flexflow/ops/arg_topk_params.h @@ -0,0 +1,27 @@ +#ifndef _FLEXFLOW_ARG_TOPK_PARAMS_H +#define _FLEXFLOW_ARG_TOPK_PARAMS_H + +#include "flexflow/ffconst.h" +#include "flexflow/fftype.h" +#include "flexflow/parallel_tensor.h" + +namespace FlexFlow { + +struct ArgTopKParams { + LayerID layer_guid; + int k; + bool sorted; + bool is_valid(ParallelTensorShape const &) const; +}; +bool operator==(ArgTopKParams const &, ArgTopKParams const &); + +} // namespace FlexFlow + +namespace std { +template <> +struct hash { + size_t operator()(FlexFlow::ArgTopKParams const &) const; +}; +} // namespace std + +#endif // _FLEXFLOW_ARG_TOPK_PARAMS_H diff --git a/include/flexflow/ops/argmax.h b/include/flexflow/ops/argmax.h new file mode 100644 index 0000000000..298059e3ed --- /dev/null +++ b/include/flexflow/ops/argmax.h @@ -0,0 +1,112 @@ +#ifndef _FLEXFLOW_ARG_MAX_H_ +#define _FLEXFLOW_ARG_MAX_H_ + +#include "flexflow/inference.h" +#include "flexflow/model.h" +#include "flexflow/node.h" +#include "flexflow/ops/argmax_params.h" +#include "flexflow/utils/memory_allocator.h" + +namespace FlexFlow { + +class ArgMaxMeta : public OpMeta { +public: + bool beam_search; + float *probs; + void *d_temp_storage; + size_t temp_storage_bytes = 0; + int *d_offsets; + void *d_out; + Realm::RegionInstance reserveInst; + ArgMaxMeta(FFHandler handler, + Op const *op, + Legion::Domain const &input_domain, + Legion::Domain const &output_domain, + GenericTensorAccessorW input, + int batch_size, + int total_ele, + MemoryAllocator &gpu_mem_allocator); + ~ArgMaxMeta(void); +}; + +class ArgMax : public Op { +public: + using Params = ArgMaxParams; + using Input = ParallelTensor; + ArgMax(FFModel &model, + const ParallelTensor input, + bool beam_search, + char const *name); + ArgMax(FFModel &model, ArgMax const &other, const ParallelTensor input); + ArgMax(FFModel &model, + Params const ¶ms, + Input const input, + char const *name = nullptr); + void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void forward(FFModel const &) override; + void backward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void print_layer(FFModel const &model) override { + assert(0); + } + static Op * + create_operator_from_layer(FFModel &model, + Layer const *layer, + std::vector const &inputs); + + static OpMeta *init_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + static BeamInferenceResult + inference_task_beam(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + static InferenceResult + inference_task_norm(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + void serialize(Legion::Serializer &s) const override; + static PCG::Node deserialize(FFModel &ff, + Legion::Deserializer &d, + ParallelTensor 
inputs[], + int num_inputs); + Op *materialize(FFModel &ff, + ParallelTensor inputs[], + int num_inputs) const override; + bool measure_operator_cost(Simulator *sim, + MachineView const &pc, + CostMetrics &cost_metrics) const override; + template + static void forward_kernel(ArgMaxMeta const *m, + DT *input_ptr, + int *indices_ptr, + float *prob_ptr, + int *parent_ptr, + int length, + int batch_size, + ffStream_t stream); + static void forward_kernel_wrapper(ArgMaxMeta const *m, + GenericTensorAccessorW const &input, + GenericTensorAccessorW const &indices, + GenericTensorAccessorW const &parent, + int batch_size); + Params get_params() const; + +public: + bool beam_search; +}; + +}; // namespace FlexFlow + +#endif \ No newline at end of file diff --git a/include/flexflow/ops/argmax_params.h b/include/flexflow/ops/argmax_params.h new file mode 100644 index 0000000000..a8f629619f --- /dev/null +++ b/include/flexflow/ops/argmax_params.h @@ -0,0 +1,24 @@ +#ifndef _FLEXFLOW_ARGMAX_PARAMS_H +#define _FLEXFLOW_ARGMAX_PARAMS_H + +#include "flexflow/ffconst.h" +#include "flexflow/parallel_tensor.h" + +namespace FlexFlow { + +struct ArgMaxParams { + bool beam_search; + bool is_valid(ParallelTensorShape const &) const; +}; +bool operator==(ArgMaxParams const &, ArgMaxParams const &); + +} // namespace FlexFlow + +namespace std { +template <> +struct hash { + size_t operator()(FlexFlow::ArgMaxParams const &) const; +}; +} // namespace std + +#endif // _FLEXFLOW_ARGMAX_PARAMS_H \ No newline at end of file diff --git a/include/flexflow/ops/attention.h b/include/flexflow/ops/attention.h index 2903497af9..7f52e0dad4 100644 --- a/include/flexflow/ops/attention.h +++ b/include/flexflow/ops/attention.h @@ -3,6 +3,7 @@ #include "flexflow/device.h" #include "flexflow/fftype.h" +#include "flexflow/inference.h" #include "flexflow/layer.h" #include "flexflow/node.h" #include "flexflow/op_meta.h" @@ -64,8 +65,17 @@ class MultiHeadAttention : public Op { Layer const *layer, std::vector const &inputs); void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &) override; void backward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void print_layer(FFModel const &model) override { assert(0); } diff --git a/include/flexflow/ops/beam_topk.h b/include/flexflow/ops/beam_topk.h new file mode 100644 index 0000000000..9466ba2a3b --- /dev/null +++ b/include/flexflow/ops/beam_topk.h @@ -0,0 +1,112 @@ +#ifndef _FLEXFLOW_BEAM_TOPK_H_ +#define _FLEXFLOW_BEAM_TOPK_H_ + +#include "flexflow/inference.h" +#include "flexflow/model.h" +#include "flexflow/node.h" +#include "flexflow/ops/beam_topk_params.h" +#include "flexflow/utils/memory_allocator.h" + +namespace FlexFlow { + +class BeamTopKMeta : public OpMeta { +public: + BeamTopKMeta(FFHandler handle, + Op const *op, + MemoryAllocator &gpu_mem_allocator); + ~BeamTopKMeta(void); + bool sorted; + int max_beam_width; + int *parent_ids; + void *acc_probs; + int *block_start_index; + int *request_id; + int *tokens_per_request; + Realm::RegionInstance reserveInst; +}; + +class BeamTopK : public Op { +public: + using Params = BeamTopKParams; + using Input = ParallelTensor; + BeamTopK(FFModel &model, + const ParallelTensor input, + LayerID const &_layer_guid, + int max_beam_width, + bool sorted, + char const 
*name); + BeamTopK(FFModel &model, BeamTopK const &other, const ParallelTensor input); + BeamTopK(FFModel &model, + Params const ¶ms, + Input const input, + char const *name = nullptr); + void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void forward(FFModel const &) override; + void backward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void print_layer(FFModel const &model) override { + assert(0); + } + static Op * + create_operator_from_layer(FFModel &model, + Layer const *layer, + std::vector const &inputs); + + static OpMeta *init_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + static BeamInferenceResult + inference_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + void serialize(Legion::Serializer &s) const override; + static PCG::Node deserialize(FFModel &ff, + Legion::Deserializer &d, + ParallelTensor inputs[], + int num_inputs); + Op *materialize(FFModel &ff, + ParallelTensor inputs[], + int num_inputs) const override; + bool measure_operator_cost(Simulator *sim, + MachineView const &pc, + CostMetrics &cost_metrics) const override; + template + static void forward_kernel(BeamTopKMeta const *m, + BeamSearchBatchConfig const *bc, + DT const *input_ptr, + float *output_ptr, + int *indices_ptr, + int *parent_ptr, + int batch_size, + int length, + bool sorted, + ffStream_t stream); + static void forward_kernel_wrapper(BeamTopKMeta const *m, + BeamSearchBatchConfig const *bc, + GenericTensorAccessorR const &input, + float *output_ptr, + int *indices_ptr, + int *parent_ptr, + int batch_size, + int length, + bool sorted); + Params get_params() const; + +public: + bool sorted; + int max_beam_width; +}; + +}; // namespace FlexFlow + +#endif diff --git a/include/flexflow/ops/beam_topk_params.h b/include/flexflow/ops/beam_topk_params.h new file mode 100644 index 0000000000..c217b0f671 --- /dev/null +++ b/include/flexflow/ops/beam_topk_params.h @@ -0,0 +1,26 @@ +#ifndef _FLEXFLOW_BEAM_TOPK_PARAMS_H +#define _FLEXFLOW_BEAM_TOPK_PARAMS_H + +#include "flexflow/ffconst.h" +#include "flexflow/parallel_tensor.h" + +namespace FlexFlow { + +struct BeamTopKParams { + LayerID layer_guid; + bool sorted; + int max_beam_width; + bool is_valid(ParallelTensorShape const &) const; +}; +bool operator==(BeamTopKParams const &, BeamTopKParams const &); + +} // namespace FlexFlow + +namespace std { +template <> +struct hash { + size_t operator()(FlexFlow::BeamTopKParams const &) const; +}; +} // namespace std + +#endif // _FLEXFLOW_BEAM_TOPK_PARAMS_H diff --git a/include/flexflow/ops/cast.h b/include/flexflow/ops/cast.h index 2d69b9469e..a06f87b3c8 100644 --- a/include/flexflow/ops/cast.h +++ b/include/flexflow/ops/cast.h @@ -35,8 +35,17 @@ class Cast : public Op { Input const &input, char const *name = nullptr); void init(FFModel const &); + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &); void backward(FFModel const &); + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void print_layer(FFModel const 
&model) { assert(0); } diff --git a/include/flexflow/ops/element_binary.h b/include/flexflow/ops/element_binary.h index cfacec50f7..4aa41ed9e4 100644 --- a/include/flexflow/ops/element_binary.h +++ b/include/flexflow/ops/element_binary.h @@ -1,6 +1,7 @@ #ifndef _FLEXFLOW_ELEMENT_BINARY_H #define _FLEXFLOW_ELEMENT_BINARY_H +#include "flexflow/inference.h" #include "flexflow/layer.h" #include "flexflow/node.h" #include "flexflow/operator.h" @@ -14,6 +15,7 @@ class ElementBinary : public Op { using Input = std::pair; ElementBinary(FFModel &model, + LayerID const &layer_guid, OperatorType type, const ParallelTensor x, const ParallelTensor y, @@ -22,11 +24,19 @@ class ElementBinary : public Op { ElementBinary(FFModel &model, Params const ¶ms, Input const &inputs, - char const *name = nullptr, - bool inplace_a = false); + char const *name = nullptr); void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &) override; void backward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void print_layer(FFModel const &model) override { assert(0); } @@ -53,6 +63,12 @@ class ElementBinary : public Op { bool measure_operator_cost(Simulator *sim, MachineView const &pc, CostMetrics &cost_metrics) const override; + + void serialize(Legion::Serializer &) const override; + static PCG::Node deserialize(FFModel &ff, + Legion::Deserializer &d, + ParallelTensor inputs[], + int num_inputs); Params get_params() const; public: diff --git a/include/flexflow/ops/element_binary_params.h b/include/flexflow/ops/element_binary_params.h index 5aa20e25a5..8b26877af2 100644 --- a/include/flexflow/ops/element_binary_params.h +++ b/include/flexflow/ops/element_binary_params.h @@ -7,7 +7,9 @@ namespace FlexFlow { struct ElementBinaryParams { + LayerID layer_guid; OperatorType type; + bool inplace_a; bool is_valid( std::pair const &) const; diff --git a/include/flexflow/ops/element_unary.h b/include/flexflow/ops/element_unary.h index 5291159aac..2df9ea61bc 100644 --- a/include/flexflow/ops/element_unary.h +++ b/include/flexflow/ops/element_unary.h @@ -3,6 +3,7 @@ #include "flexflow/device.h" #include "flexflow/fftype.h" +#include "flexflow/inference.h" #include "flexflow/layer.h" #include "flexflow/node.h" #include "flexflow/op_meta.h" @@ -45,8 +46,17 @@ class ElementUnary : public Op { Input const x, char const *name = nullptr); void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &) override; void backward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void print_layer(FFModel const &model) override { assert(0); } diff --git a/include/flexflow/ops/embedding.h b/include/flexflow/ops/embedding.h index 91caf06af0..ae93ef4d1d 100644 --- a/include/flexflow/ops/embedding.h +++ b/include/flexflow/ops/embedding.h @@ -49,8 +49,17 @@ class Embedding : public Op { bool allocate_weights = false, char const *name = nullptr); void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) 
override; void forward(FFModel const &) override; void backward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; // void update(const FFModel&); void print_layer(FFModel const &model) override { assert(0); diff --git a/include/flexflow/ops/experts.h b/include/flexflow/ops/experts.h new file mode 100644 index 0000000000..d68957d890 --- /dev/null +++ b/include/flexflow/ops/experts.h @@ -0,0 +1,172 @@ +#pragma once + +#include "flexflow/inference.h" +#include "flexflow/model.h" +#include "flexflow/ops/experts_params.h" + +namespace FlexFlow { + +class ExpertsMeta : public OpMeta { +public: + ExpertsMeta(FFHandler handler, + int _num_experts, + int _experts_start_idx, + int _data_dim, + int _out_dim, + int _experts_num_layers, + int _experts_internal_dim_size, + int _effective_batch_size, + int _num_chosen_experts, + float _alpha, + bool _use_bias, + ActiMode _activation); + ~ExpertsMeta(void); + + // Thrust helper arrays + int *sorted_indices; + int *original_indices; + int *non_zero_expert_labels; + int *temp_sequence; + int *exp_local_label_to_index; + int *expert_start_indexes; + int *num_assignments_per_expert; // numbers of tokes assigned to each expert. + // Values may exceed the expert capacity + int *capped_num_assignments_per_expert; + int *destination_start_indices; + float const **token_idx_array; + float const **dev_weights; + float const **weight_idx_array1; + float const **weight_idx_array2; + float const **coefficient_idx_array; + float **output_idx_array; + float const **bias_idx_array1; + float const **bias_idx_array2; + float const *one_ptr; + float const **one_ptr_array; + + // array of arrays to store cublasGemmBatchedEx outputs before aggregation + float **batch_outputs1; + float **batch_outputs2; + float **dev_batch_outputs1; + float **dev_batch_outputs2; + + int num_experts; + int experts_start_idx; + int data_dim; + int out_dim; + int experts_num_layers; + int experts_internal_dim_size; + int effective_batch_size; + int num_chosen_experts; + int expert_capacity; + float alpha; + bool use_bias; + ActiMode activation; +#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) + cudnnActivationDescriptor_t actiDesc; + cudnnTensorDescriptor_t resultTensorDesc1; + cudnnTensorDescriptor_t resultTensorDesc2; +#else + miopenActivationDescriptor_t actiDesc; + miopenTensorDescriptor_t resultTensorDesc1; + miopenTensorDescriptor_t resultTensorDesc2; +#endif +}; + +// definitions for the CUDA kernel +#define MAX_BATCH_SIZE 1024 * 2 // 32 * 10 +#define MAX_EXPERTS_PER_BLOCK 32 + +class Experts : public Op { +public: + using Params = ExpertsParams; + using Input = std::vector; + Experts(FFModel &model, + Params const ¶ms, + Input const &inputs, + bool allocate_weights = false, + char const *name = nullptr); + Experts(FFModel &model, + LayerID const &layer_guid, + ParallelTensor const *inputs, + int _num_experts, + int _experts_start_idx, + int _experts_output_dim_size, + float _alpha, + int _experts_num_layers, + int _experts_internal_dim_size, + bool _use_bias, + ActiMode _activation, + bool allocate_weights, + char const *name = nullptr); + static Op * + create_operator_from_layer(FFModel &model, + Layer const *layer, + std::vector const &inputs); + + void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void forward(FFModel const 
&) override; + void backward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void print_layer(FFModel const &model) override; + void serialize(Legion::Serializer &) const override; + static PCG::Node deserialize(FFModel &ff, + Legion::Deserializer &d, + Input const &inputs, + int num_inputs); + Params get_params() const; + static OpMeta *init_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + static void forward_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + static void forward_kernel_wrapper(ExpertsMeta const *m, + float const *input, + int const *indices, + float const *topk_gate_preds, + float *output, + float const *weights, + float const *biases, + int num_active_tokens, + int chosen_experts, + int batch_size, + int out_dim); + static void backward_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + static void inference_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + bool measure_operator_cost(Simulator *sim, + MachineView const &pc, + CostMetrics &cost_metrics) const override; + +public: + int num_experts; + int experts_start_idx; + int experts_output_dim_size; + int data_dim; + int out_dim; + int effective_batch_size; + int num_chosen_experts; + float alpha; + int experts_num_layers; + int experts_internal_dim_size; + bool use_bias; + ActiMode activation; +}; + +}; // namespace FlexFlow diff --git a/include/flexflow/ops/experts_params.h b/include/flexflow/ops/experts_params.h new file mode 100644 index 0000000000..b6ba88a96e --- /dev/null +++ b/include/flexflow/ops/experts_params.h @@ -0,0 +1,31 @@ +#pragma once + +#include "flexflow/operator.h" +#include "flexflow/parallel_tensor.h" + +namespace FlexFlow { + +struct ExpertsParams { + LayerID layer_guid; + int num_experts; + int experts_start_idx; + int experts_output_dim_size; + float alpha; + int experts_num_layers; + int experts_internal_dim_size; + bool use_bias; + ActiMode activation; + + bool is_valid(std::vector const &) const; +}; + +bool operator==(ExpertsParams const &, ExpertsParams const &); + +} // namespace FlexFlow + +namespace std { +template <> +struct hash { + size_t operator()(FlexFlow::ExpertsParams const &) const; +}; +} // namespace std diff --git a/include/flexflow/ops/fused.h b/include/flexflow/ops/fused.h index 87d35da902..87c2201c28 100644 --- a/include/flexflow/ops/fused.h +++ b/include/flexflow/ops/fused.h @@ -29,8 +29,17 @@ class FusedOp : public Op { return ParallelTensor(); } void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &) override; void backward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void print_layer(FFModel const &model) override { assert(0); } @@ -38,6 +47,10 @@ class FusedOp : public Op { std::vector const ®ions, Legion::Context ctx, Legion::Runtime *runtime); + static void inference_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); static void 
forward_task(Legion::Task const *task, std::vector const ®ions, Legion::Context ctx, diff --git a/include/flexflow/ops/groupby.h b/include/flexflow/ops/groupby.h index 4a15f6f439..ec6cdfb9ab 100644 --- a/include/flexflow/ops/groupby.h +++ b/include/flexflow/ops/groupby.h @@ -1,6 +1,7 @@ #ifndef _FLEXFLOW_GROUPBY_H_ #define _FLEXFLOW_GROUPBY_H_ +#include "flexflow/inference.h" #include "flexflow/model.h" #include "flexflow/node.h" #include "flexflow/ops/groupby_params.h" @@ -9,8 +10,9 @@ namespace FlexFlow { class GroupByMeta : public OpMeta { public: - GroupByMeta(FFHandler handle, int n); + GroupByMeta(FFHandler handle, int n, float _alpha); ~GroupByMeta(void); + float alpha; float **dev_region_ptrs; }; @@ -33,8 +35,17 @@ class Group_by : public Op { Input const &inputs, char const *name = nullptr); void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &) override; void backward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void print_layer(FFModel const &model) override { assert(0); } @@ -62,26 +73,22 @@ class Group_by : public Op { Op *materialize(FFModel &ff, ParallelTensor inputs[], int num_inputs) const override; - static void - forward_kernel_wrapper(GroupByMeta const *m, - float const *input, - int const *exp_assign, - float **outputs, - int n, // num experts - int k, // chosen experts - float alpha, // factor additional memory assigned - int batch_size, - int data_dim); - static void - backward_kernel_wrapper(GroupByMeta const *m, - float *input_grad, - int const *exp_assign, - float **output_grads, - int n, // num experts - int k, // chosen experts - float alpha, // factor additional memory assigned - int batch_size, - int data_dim); + static void forward_kernel_wrapper(GroupByMeta const *m, + float const *input, + int const *exp_assign, + float **outputs, + int n, // num experts + int k, // chosen experts + int batch_size, + int data_dim); + static void backward_kernel_wrapper(GroupByMeta const *m, + float *input_grad, + int const *exp_assign, + float **output_grads, + int n, // num experts + int k, // chosen experts + int batch_size, + int data_dim); bool measure_operator_cost(Simulator *sim, MachineView const &pc, CostMetrics &cost_metrics) const override; diff --git a/include/flexflow/ops/inc_multihead_self_attention.h b/include/flexflow/ops/inc_multihead_self_attention.h new file mode 100644 index 0000000000..91621074b3 --- /dev/null +++ b/include/flexflow/ops/inc_multihead_self_attention.h @@ -0,0 +1,199 @@ +#ifndef _FLEXFLOW_INC_MULTIHEAD_SELF_ATTENTION_H +#define _FLEXFLOW_INC_MULTIHEAD_SELF_ATTENTION_H + +#include "flexflow/accessor.h" +#include "flexflow/device.h" +#include "flexflow/fftype.h" +#include "flexflow/inference.h" +#include "flexflow/layer.h" +#include "flexflow/node.h" +#include "flexflow/op_meta.h" +#include "flexflow/operator.h" +#include "flexflow/ops/inc_multihead_self_attention_params.h" +#include "flexflow/utils/memory_allocator.h" +#include "math.h" +#include +#include + +namespace FlexFlow { + +class IncMultiHeadSelfAttentionMeta; + +class IncMultiHeadSelfAttention : public Op { +public: + using Params = IncMultiHeadSelfAttentionParams; + using Input = ParallelTensor; + + IncMultiHeadSelfAttention(FFModel &model, + LayerID const &layer_guid, + const ParallelTensor 
_input, + int _embed_dim, + int _num_q_heads, + int _num_kv_heads, + int _kdim, + int _vdim, + float _dropout, + bool _bias, + bool _add_bias_kv, + bool _add_zero_attn, + bool _apply_rotary_embedding, + bool _scaling_query, + float _scaling_factor, + bool _qk_prod_scaling, + bool allocate_weights, + DataType _quantization_type, + bool _offload, + int _tensor_parallelism_degree, + char const *name); + IncMultiHeadSelfAttention(FFModel &model, + const ParallelTensor _input, + const ParallelTensor _weight, + int _embed_dim, + int _num_q_heads, + int _num_kv_heads, + int _kdim, + int _vdim, + float _dropout, + bool _bias, + bool _add_bias_kv, + bool _add_zero_attn, + bool _apply_rotary_embedding, + bool _scaling_query, + float _scaling_factor, + bool _qk_prod_scaling, + bool allocate_weights, + DataType _quantization_type, + bool _offload, + int _tensor_parallelism_degree, + char const *name); + IncMultiHeadSelfAttention(FFModel &model, + IncMultiHeadSelfAttention const &other, + const ParallelTensor input, + bool allocate_weights); + IncMultiHeadSelfAttention(FFModel &model, + Params const ¶ms, + Input const &inputs, + bool allocate_weights = false, + char const *name = nullptr); + static Op * + create_operator_from_layer(FFModel &model, + Layer const *layer, + std::vector const &inputs); + void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void forward(FFModel const &) override; + void backward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void print_layer(FFModel const &model) override { + assert(0); + } + bool get_int_parameter(PMParameter, int *) const override; + + static OpMeta *init_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + static void inference_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + bool measure_operator_cost(Simulator *sim, + MachineView const &mv, + CostMetrics &cost_metrics) const override; + + static void inference_kernel_wrapper(IncMultiHeadSelfAttentionMeta const *m, + BatchConfig const *bc, + int shard_id, + GenericTensorAccessorR const &input, + GenericTensorAccessorR const &weight, + GenericTensorAccessorW const &output, + GenericTensorAccessorR const &bias); + Params get_params() const; + +public: + int num_q_heads, num_kv_heads, tensor_parallelism_degree; + float dropout, scaling_factor; + bool bias; + bool add_bias_kv, add_zero_attn, apply_rotary_embedding, scaling_query, + qk_prod_scaling; + int qSize, kSize, vSize, qProjSize, kProjSize, vProjSize, oProjSize; + int qoSeqLength, kvSeqLength; + DataType quantization_type; + bool offload; +}; + +class IncMultiHeadSelfAttentionMeta : public OpMeta { +public: + IncMultiHeadSelfAttentionMeta(FFHandler handler, + IncMultiHeadSelfAttention const *attn, + GenericTensorAccessorR const &weight, + MemoryAllocator &gpu_mem_allocator, + int num_samples, + int _num_q_heads, + int _num_kv_heads); + IncMultiHeadSelfAttentionMeta(FFHandler handler, + InferenceMode infer_mode, + Op const *attn, + int _qSize, + int _kSize, + int _vSize, + int _qProjSize, + int _kProjSize, + int _vProjSize, + int _oProjSize, + bool _apply_rotary_embedding, + bool _bias, + bool _scaling_query, + bool _qk_prod_scaling, + bool _add_bias_kv, + float 
_scaling_factor, + GenericTensorAccessorR const &weight, + MemoryAllocator &gpu_mem_allocator, + int num_samples, + int _global_num_q_heads, + int _global_num_kv_heads, + int _num_q_heads, + int _num_kv_heads, + DataType _quantization_type, + bool _offload); + ~IncMultiHeadSelfAttentionMeta(void); + +public: + Realm::RegionInstance reserveInst; + size_t weights_params, weightSize, biasSize, reserveSpaceSize, + quantized_weightSize; + int qSize, kSize, vSize, qProjSize, kProjSize, vProjSize, oProjSize; + int global_num_q_heads, global_num_kv_heads, num_q_heads, num_kv_heads; + bool *has_load_weights; + bool *apply_rotary_embedding; + bool *bias; + bool *scaling_query; + bool *qk_prod_scaling; + float scaling_factor; +#ifdef INFERENCE_TESTS + float *kcache, *vcache; +#endif + void *weight_ptr, *bias_ptr; // for weight offload + void *devQKVProjArray, *keyCache, *valueCache; + void *qk_prods, *qk_prods_softmax; + void *attn_heads, *W_out_contiguous; + char *quantized_weight_ptr; + BatchConfig::PerTokenInfo *token_infos; + DataType quantization_type; + bool offload; +#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) + cudnnTensorDescriptor_t qk_tensor; + cuFloatComplex *complex_input; +#endif +}; + +}; // namespace FlexFlow + +#endif // _FLEXFLOW_ATTENTION_H diff --git a/include/flexflow/ops/inc_multihead_self_attention_params.h b/include/flexflow/ops/inc_multihead_self_attention_params.h new file mode 100644 index 0000000000..be38b9ab1b --- /dev/null +++ b/include/flexflow/ops/inc_multihead_self_attention_params.h @@ -0,0 +1,33 @@ +#ifndef _FLEXFLOW_INC_MULTIHEAD_SELF_ATTENTION_PARAMS_H +#define _FLEXFLOW_INC_MULTIHEAD_SELF_ATTENTION_PARAMS_H + +#include "flexflow/fftype.h" +#include "flexflow/parallel_tensor.h" + +namespace FlexFlow { + +struct IncMultiHeadSelfAttentionParams { + LayerID layer_guid; + int embed_dim, num_q_heads, kdim, vdim, num_kv_heads, + tensor_parallelism_degree; + float dropout, scaling_factor; + bool bias, add_bias_kv, add_zero_attn, apply_rotary_embedding, scaling_query, + qk_prod_scaling; + DataType quantization_type; + bool offload; + bool is_valid(ParallelTensorShape const &) const; +}; + +bool operator==(IncMultiHeadSelfAttentionParams const &, + IncMultiHeadSelfAttentionParams const &); + +} // namespace FlexFlow + +namespace std { +template <> +struct hash { + size_t operator()(FlexFlow::IncMultiHeadSelfAttentionParams const &) const; +}; +} // namespace std + +#endif // _FLEXFLOW_INC_MULTIHEAD_SELF_ATTENTION_PARAMS_H diff --git a/include/flexflow/ops/kernels/decompress_kernels.h b/include/flexflow/ops/kernels/decompress_kernels.h new file mode 100644 index 0000000000..7cfedd6265 --- /dev/null +++ b/include/flexflow/ops/kernels/decompress_kernels.h @@ -0,0 +1,43 @@ +#ifndef _FLEXFLOW_DECOMPRESS_KERNELS_H +#define _FLEXFLOW_DECOMPRESS_KERNELS_H + +#include "flexflow/device.h" + +namespace FlexFlow { +namespace Kernels { + +template +__global__ void decompress_int4_general_weights(char const *input_weight_ptr, + DT *weight_ptr, + int in_dim, + int valueSize); +template +__global__ void decompress_int8_general_weights(char const *input_weight_ptr, + DT *weight_ptr, + int in_dim, + int valueSize); + +template +__global__ void decompress_int4_attention_weights(char *input_weight_ptr, + DT *weight_ptr, + int qProjSize, + int qSize, + int num_heads); + +template +__global__ void decompress_int8_attention_weights(char *input_weight_ptr, + DT *weight_ptr, + int qProjSize, + int qSize, + int num_heads); +// template +// void decompress_weight_bias(T1 *input_weight_ptr, 
+// T2 *weight_ptr, +// T2 *params, +// int group_size, +// int tensor_size); + +} // namespace Kernels +} // namespace FlexFlow + +#endif // _FLEXFLOW_DECOMPRESS_KERNELS_H diff --git a/include/flexflow/ops/kernels/element_binary_kernels.h b/include/flexflow/ops/kernels/element_binary_kernels.h index 529859195e..b0c596301b 100644 --- a/include/flexflow/ops/kernels/element_binary_kernels.h +++ b/include/flexflow/ops/kernels/element_binary_kernels.h @@ -1,6 +1,7 @@ #ifndef _FLEXFLOW_OPS_KERNELS_ELEMENT_BINARY_KERNELS_H #define _FLEXFLOW_OPS_KERNELS_ELEMENT_BINARY_KERNELS_H +#include "flexflow/accessor.h" #include "flexflow/device.h" #include "flexflow/fftype.h" #include "flexflow/op_meta.h" @@ -9,7 +10,7 @@ namespace FlexFlow { class ElementBinaryMeta : public OpMeta { public: - ElementBinaryMeta(FFHandler handle); + ElementBinaryMeta(FFHandler handle, Op const *op); #if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) cudnnTensorDescriptor_t input1Tensor, input2Tensor, outputTensor; cudnnOpTensorDescriptor_t opDesc; @@ -34,9 +35,9 @@ void init_kernel(ElementBinaryMeta *m, Legion::Domain const &output_domain); void forward_kernel_wrapper(ElementBinaryMeta const *m, - float const *in1_ptr, - float const *in2_ptr, - float *out_ptr); + GenericTensorAccessorR const &in1, + GenericTensorAccessorR const &in2, + GenericTensorAccessorW const &out); void backward_kernel_wrapper(ElementBinaryMeta const *m, float const *out_grad_ptr, @@ -47,10 +48,11 @@ void backward_kernel_wrapper(ElementBinaryMeta const *m, namespace Internal { +template void forward_kernel(ElementBinaryMeta const *m, - float const *in1_ptr, - float const *in2_ptr, - float *out_ptr, + DT const *in1_ptr, + DT const *in2_ptr, + DT *out_ptr, ffStream_t stream); void backward_kernel(ElementBinaryMeta const *m, float const *out_grad_ptr, @@ -65,4 +67,4 @@ void backward_kernel(ElementBinaryMeta const *m, } // namespace Kernels } // namespace FlexFlow -#endif // _FLEXFLOW_OPS_KERNELS_ELEMENT_BINARY_KERNELS_H \ No newline at end of file +#endif // _FLEXFLOW_OPS_KERNELS_ELEMENT_BINARY_KERNELS_H diff --git a/include/flexflow/ops/kernels/inc_multihead_self_attention_kernels.h b/include/flexflow/ops/kernels/inc_multihead_self_attention_kernels.h new file mode 100644 index 0000000000..6b294bc211 --- /dev/null +++ b/include/flexflow/ops/kernels/inc_multihead_self_attention_kernels.h @@ -0,0 +1,68 @@ +#ifndef _FLEXFLOW_OPS_KERNELS_INC_MULTIHEAD_SELF_ATTENTION_KERNELS_H +#define _FLEXFLOW_OPS_KERNELS_INC_MULTIHEAD_SELF_ATTENTION_KERNELS_H + +#include "flexflow/batch_config.h" +#include "flexflow/device.h" +#include "flexflow/fftype.h" +#include "flexflow/op_meta.h" +#include "flexflow/ops/inc_multihead_self_attention.h" + +namespace FlexFlow { +namespace Kernels { +namespace IncMultiHeadAttention { + +template +__global__ void apply_proj_bias_w(DT *input_ptr, + DT const *bias_ptr, + int num_tokens, + int qkv_weight_size, + int oProjSize); + +template +__global__ void apply_proj_bias_qkv(DT *input_ptr, + DT const *bias_ptr, + int shard_id, + int num_tokens, + int qProjSize, + int kProjSize, + int vProjSize, + int num_heads, + int num_kv_heads, + bool scaling_query, + float scaling_factor); + +template +__global__ void + apply_rotary_embedding(DT *input_ptr, + cuFloatComplex *complex_input, + BatchConfig::PerTokenInfo const *tokenInfos, + int qProjSize, + int kProjSize, + int num_heads, + int num_tokens, + int num_kv_heads, + int q_block_size, + int k_block_size, + int q_array_size, + bool q_tensor); + +template +void 
compute_qkv_kernel(IncMultiHeadSelfAttentionMeta const *m, + BatchConfig const *bc, + int shard_id, + DT const *input_ptr, + DT const *weight_ptr, + DT *output_ptr, + DT const *bias_ptr, + cudaStream_t stream); + +template +void pre_build_weight_kernel(IncMultiHeadSelfAttentionMeta const *m, + GenericTensorAccessorR const weight, + DataType data_type, + cudaStream_t stream); +} // namespace IncMultiHeadAttention +} // namespace Kernels +} // namespace FlexFlow + +#endif // _FLEXFLOW_OPS_KERNELS_INC_MULTIHEAD_SELF_ATTENTION_KERNELS_H diff --git a/include/flexflow/ops/kernels/linear_kernels.h b/include/flexflow/ops/kernels/linear_kernels.h index 6ca9fb89ac..bbebe3c79b 100644 --- a/include/flexflow/ops/kernels/linear_kernels.h +++ b/include/flexflow/ops/kernels/linear_kernels.h @@ -4,12 +4,18 @@ #include "flexflow/device.h" #include "flexflow/fftype.h" #include "flexflow/op_meta.h" +#include "flexflow/ops/linear.h" namespace FlexFlow { class LinearMeta : public OpMeta { public: - LinearMeta(FFHandler handle, int batch_size); + LinearMeta(FFHandler handle, + int batch_size, + Linear const *li, + MemoryAllocator gpu_mem_allocator, + int weightSize); + ~LinearMeta(void); #if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) cudnnTensorDescriptor_t outputTensor; cudnnActivationDescriptor_t actiDesc; @@ -17,13 +23,19 @@ class LinearMeta : public OpMeta { miopenTensorDescriptor_t outputTensor; miopenActivationDescriptor_t actiDesc; #endif - float const *one_ptr; + void *one_ptr; + void *weight_ptr; + DataType weight_ptr_type; + DataType quantization_type; + bool offload; + char *quantized_weight_ptr; + size_t quantized_weightSize; ActiMode activation; RegularizerMode kernel_reg_type; float kernel_reg_lambda; - bool use_bias; - DataType input_type, weight_type, output_type; + bool use_bias, add_bias_only_once; char op_name[MAX_OPNAME]; + Realm::RegionInstance reserveInst; }; namespace Kernels { @@ -51,6 +63,7 @@ void backward_kernel_wrapper(LinearMeta const *m, bool use_activation(ActiMode mode); namespace Internal { +template void forward_kernel(LinearMeta const *m, void const *input_ptr, void *output_ptr, @@ -60,6 +73,7 @@ void forward_kernel(LinearMeta const *m, int out_dim, int batch_size, ffStream_t stream); +template void backward_kernel(LinearMeta const *m, void const *input_ptr, void *input_grad_ptr, @@ -72,6 +86,8 @@ void backward_kernel(LinearMeta const *m, int out_dim, int batch_size, ffStream_t stream); +template +__global__ void build_one_ptr(DT *one_ptr, int batch_size); } // namespace Internal } // namespace Linear } // namespace Kernels diff --git a/include/flexflow/ops/kernels/rms_norm_kernels.h b/include/flexflow/ops/kernels/rms_norm_kernels.h new file mode 100644 index 0000000000..2063777ef1 --- /dev/null +++ b/include/flexflow/ops/kernels/rms_norm_kernels.h @@ -0,0 +1,54 @@ +#ifndef _FLEXFLOW_OPS_KERNELS_RMSNORM_KERNELS_H +#define _FLEXFLOW_OPS_KERNELS_RMSNORM_KERNELS_H + +#include "flexflow/accessor.h" +#include "flexflow/device.h" +#include "flexflow/fftype.h" +#include "flexflow/op_meta.h" +#include "flexflow/utils/memory_allocator.h" + +namespace FlexFlow { +using Legion::coord_t; + +class RMSNorm; + +class RMSNormMeta : public OpMeta { +public: + RMSNormMeta(FFHandler handler, + RMSNorm const *rms, + MemoryAllocator &gpu_mem_allocator); + ~RMSNormMeta(void); +#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) + cudnnTensorDescriptor_t inputTensor, outputTensor; + cudnnReduceTensorDescriptor_t reduceDesc; +#else + miopenTensorDescriptor_t inputTensor, outputTensor; + 
miopenReduceTensorDescriptor_t reduceDesc; +#endif + +public: + float eps; + void *rms_ptr; + void *norm_ptr; + + float alpha; + float beta; + + int in_dim; + int batch_size; + int num_elements; + char op_name[MAX_OPNAME]; + Realm::RegionInstance reserveInst; +}; + +namespace Kernels { +namespace RMSNorm { +void forward_kernel_wrapper(RMSNormMeta const *m, + GenericTensorAccessorR const &input, + GenericTensorAccessorR const &weight, + GenericTensorAccessorW const &output); +} // namespace RMSNorm +} // namespace Kernels +} // namespace FlexFlow + +#endif // _FLEXFLOW_OPS_KERNELS_RMSNORM_KERNELS_H diff --git a/include/flexflow/ops/kernels/softmax_kernels.h b/include/flexflow/ops/kernels/softmax_kernels.h index 81b34d8558..14c07414e9 100644 --- a/include/flexflow/ops/kernels/softmax_kernels.h +++ b/include/flexflow/ops/kernels/softmax_kernels.h @@ -21,27 +21,31 @@ class SoftmaxMeta : public OpMeta { bool profiling; int dim; char op_name[MAX_OPNAME]; + DataType input_type, output_type; }; namespace Kernels { namespace Softmax { - +template void forward_kernel_wrapper(SoftmaxMeta const *m, - float const *input_ptr, - float *output_ptr); - + DT const *input_ptr, + DT *output_ptr); +template void backward_kernel_wrapper(SoftmaxMeta const *m, - float *input_grad_ptr, - float const *output_grad_ptr, + DT *input_grad_ptr, + DT const *output_grad_ptr, size_t num_elements); namespace Internal { +template void forward_kernel(SoftmaxMeta const *m, - float const *input_ptr, - float *output_ptr, + DT const *input_ptr, + DT *output_ptr, ffStream_t stream); -void backward_kernel(float *input_grad_ptr, - float const *output_grad_ptr, + +template +void backward_kernel(DT *input_grad_ptr, + DT const *output_grad_ptr, size_t num_elements, ffStream_t stream); } // namespace Internal diff --git a/include/flexflow/ops/layer_norm.h b/include/flexflow/ops/layer_norm.h index 8273b9ab52..cb977fc6a6 100644 --- a/include/flexflow/ops/layer_norm.h +++ b/include/flexflow/ops/layer_norm.h @@ -1,7 +1,8 @@ #pragma once +#include "flexflow/inference.h" #include "flexflow/model.h" - +#include "flexflow/utils/memory_allocator.h" namespace FlexFlow { class LayerNormMeta; @@ -24,8 +25,17 @@ class LayerNorm : public Op { bool allocate_weights, char const *name); void init(FFModel const &); + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &); void backward(FFModel const &); + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void print_layer(FFModel const &model) { assert(0); } @@ -63,15 +73,14 @@ class LayerNorm : public Op { static void forward_kernel(LayerNormMeta const *m, T const *input_ptr, T *output_ptr, - T *gamma_ptr, - T *beta_ptr, + T const *gamma_ptr, + T const *beta_ptr, ffStream_t stream); - template static void forward_kernel_wrapper(LayerNormMeta const *m, - T const *input_ptr, - T *output_ptr, - T *gamma_ptr, - T *beta_ptr); + GenericTensorAccessorR const &input, + GenericTensorAccessorW &output, + GenericTensorAccessorR const &gamma, + GenericTensorAccessorR const &beta); template static void backward_kernel(LayerNormMeta const *m, T const *output_grad_ptr, @@ -99,14 +108,18 @@ class LayerNorm : public Op { class LayerNormMeta : public OpMeta { public: - LayerNormMeta(FFHandler handle, LayerNorm const *ln); + LayerNormMeta(FFHandler handle, + LayerNorm const *ln, + MemoryAllocator 
&gpu_mem_allocator); + ~LayerNormMeta(void); public: bool elementwise_affine; int64_t effective_batch_size, effective_num_elements; float eps; - float *mean_ptr, *rstd_ptr, *ds_ptr, *db_ptr, *scale_ptr, *bias_ptr; + void *mean_ptr, *rstd_ptr, *ds_ptr, *db_ptr, *scale_ptr, *bias_ptr; char op_name[MAX_OPNAME]; + Realm::RegionInstance reserveInst; }; }; // namespace FlexFlow diff --git a/include/flexflow/ops/linear.h b/include/flexflow/ops/linear.h index 286bcdf717..025674c7ba 100644 --- a/include/flexflow/ops/linear.h +++ b/include/flexflow/ops/linear.h @@ -1,9 +1,11 @@ #ifndef _FLEXFLOW_LINEAR_H #define _FLEXFLOW_LINEAR_H +#include "flexflow/inference.h" #include "flexflow/node.h" #include "flexflow/operator.h" #include "flexflow/ops/linear_params.h" +#include "flexflow/utils/memory_allocator.h" namespace FlexFlow { @@ -24,6 +26,8 @@ class Linear : public Op { float kernel_reg_lambda, bool _use_bias, DataType _data_type, + DataType _quantization_type, + bool offload, bool allocate_weights, char const *name); Linear(FFModel &model, @@ -37,8 +41,17 @@ class Linear : public Op { bool allocate_weights = false); void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &) override; void backward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void print_layer(FFModel const &model) override; bool get_int_parameter(PMParameter, int *) const override; static Op * @@ -49,6 +62,10 @@ class Linear : public Op { std::vector const ®ions, Legion::Context ctx, Legion::Runtime *runtime); + static void inference_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); static void forward_task(Legion::Task const *task, std::vector const ®ions, Legion::Context ctx, @@ -86,19 +103,19 @@ class Linear : public Op { bool allocate_weights, char const *name); - template + template static OpMeta * init_task_with_dim(Legion::Task const *task, std::vector const ®ions, Legion::Context ctx, Legion::Runtime *runtime); - template + template static void forward_task_with_dim(Legion::Task const *task, std::vector const ®ions, Legion::Context ctx, Legion::Runtime *runtime); - template + template static void backward_task_with_dim(Legion::Task const *task, std::vector const ®ions, @@ -116,6 +133,8 @@ class Linear : public Op { float kernel_reg_lambda; bool use_bias; ParallelTensor replica; + DataType quantization_type; + bool offload; }; }; // namespace FlexFlow diff --git a/include/flexflow/ops/linear_params.h b/include/flexflow/ops/linear_params.h index 2c41694960..563304e89f 100644 --- a/include/flexflow/ops/linear_params.h +++ b/include/flexflow/ops/linear_params.h @@ -18,6 +18,8 @@ class LinearParams { ActiMode activation; RegularizerMode kernel_reg_type; float kernel_reg_lambda; + DataType quantization_type; + bool offload; bool is_valid(ParallelTensorShape const &input_shape) const; void solve_dims(const ParallelTensor input, diff --git a/include/flexflow/ops/noop.h b/include/flexflow/ops/noop.h index 5f39c999e6..e07d10a05e 100644 --- a/include/flexflow/ops/noop.h +++ b/include/flexflow/ops/noop.h @@ -1,6 +1,7 @@ #ifndef _FLEXFLOW_NOOP_H #define _FLEXFLOW_NOOP_H +#include "flexflow/inference.h" #include "flexflow/model.h" namespace FlexFlow { @@ -17,7 +18,16 @@ class NoOp : public Op { 
const ParallelTensor output, char const *name = NULL); void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void backward(FFModel const &) override; void print_layer(FFModel const &model) override { assert(0); diff --git a/include/flexflow/ops/rms_norm.h b/include/flexflow/ops/rms_norm.h new file mode 100644 index 0000000000..979a20976c --- /dev/null +++ b/include/flexflow/ops/rms_norm.h @@ -0,0 +1,83 @@ +#ifndef _FLEXFLOW_RMS_NORM_H +#define _FLEXFLOW_RMS_NORM_H + +#include "flexflow/inference.h" +#include "flexflow/model.h" +#include "flexflow/ops/rms_norm_params.h" +#include "flexflow/utils/memory_allocator.h" + +namespace FlexFlow { + +class RMSNormMeta; + +class RMSNorm : public Op { +public: + using Params = RMSNormParams; + using Input = ParallelTensor; + RMSNorm(FFModel &model, + LayerID const &_layer_guid, + const ParallelTensor _input, + float _eps, + int dim, + bool allocate_weights, + char const *name); + RMSNorm(FFModel &model, + RMSNormParams const ¶ms, + ParallelTensor input, + bool allocate_weights, + char const *name = nullptr); + + RMSNorm(FFModel &model, + RMSNorm const &other, + const ParallelTensor input, + bool allocate_weights); + void init(FFModel const &); + void forward(FFModel const &); + void backward(FFModel const &); + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void print_layer(FFModel const &model) { + assert(0); + } + + static Op * + create_operator_from_layer(FFModel &model, + Layer const *layer, + std::vector const &inputs); + void serialize(Legion::Serializer &) const override; + static PCG::Node deserialize(FFModel &ff, + Legion::Deserializer &d, + ParallelTensor inputs[], + int num_inputs); + Op *materialize(FFModel &ff, + ParallelTensor inputs[], + int num_inputs) const override; + RMSNormParams get_params() const; + + static OpMeta *init_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + static void forward_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + bool measure_operator_cost(Simulator *sim, + MachineView const &pc, + CostMetrics &cost_metrics) const; + +public: + float eps; + char op_name[MAX_OPNAME]; + int effective_batch_size; + int dim, data_dim; +}; +} // namespace FlexFlow +#endif // _FLEXFLOW_RMS_NORM_H diff --git a/include/flexflow/ops/rms_norm_params.h b/include/flexflow/ops/rms_norm_params.h new file mode 100644 index 0000000000..82a459009a --- /dev/null +++ b/include/flexflow/ops/rms_norm_params.h @@ -0,0 +1,26 @@ +#ifndef _FLEXFLOW_RMSNORM_PARAMS_H +#define _FLEXFLOW_RMSNORM_PARAMS_H + +#include "flexflow/parallel_tensor.h" + +namespace FlexFlow { + +struct RMSNormParams { + LayerID layer_guid; + float eps; + int dim; + bool is_valid(ParallelTensorShape const &) const; +}; + +bool operator==(RMSNormParams const &, RMSNormParams const &); + +} // namespace FlexFlow + +namespace std { +template <> +struct hash { + size_t 
operator()(FlexFlow::RMSNormParams const &) const; +}; +} // namespace std + +#endif // _FLEXFLOW_RMSNORM_PARAMS_H \ No newline at end of file diff --git a/include/flexflow/ops/sampling.h b/include/flexflow/ops/sampling.h new file mode 100644 index 0000000000..789904df32 --- /dev/null +++ b/include/flexflow/ops/sampling.h @@ -0,0 +1,112 @@ +#ifndef _FLEXFLOW_SAMPLING_TOPK_H_ +#define _FLEXFLOW_SAMPLING_TOPK_H_ + +#include "flexflow/inference.h" +#include "flexflow/model.h" +#include "flexflow/node.h" +#include "flexflow/ops/sampling_params.h" +#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) +#include +#include +#endif +#include "flexflow/utils/memory_allocator.h" + +namespace FlexFlow { + +class SamplingMeta : public OpMeta { +public: + float top_p; + void *sorted_logits; + int *sorted_idx; + int *begin_offset; + int *end_offset; + int *idx; + void *d_temp_storage; + size_t temp_storage_bytes; + Realm::RegionInstance reserveInst; +#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) + curandState *state; +#endif + SamplingMeta(FFHandler handle, + Op const *op, + int batch_size, + int total_ele, + GenericTensorAccessorW input, + MemoryAllocator &gpu_mem_allocator); + ~SamplingMeta(void); +}; + +class Sampling : public Op { +public: + using Params = SamplingParams; + using Input = ParallelTensor; + Sampling(FFModel &model, + const ParallelTensor input, + float top_p, + char const *name); + Sampling(FFModel &model, Sampling const &other, const ParallelTensor input); + Sampling(FFModel &model, + Params const ¶ms, + Input const input, + char const *name = nullptr); + void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void forward(FFModel const &) override; + void backward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void print_layer(FFModel const &model) override { + assert(0); + } + static Op * + create_operator_from_layer(FFModel &model, + Layer const *layer, + std::vector const &inputs); + + static OpMeta *init_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + static InferenceResult + inference_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + void serialize(Legion::Serializer &s) const override; + static PCG::Node deserialize(FFModel &ff, + Legion::Deserializer &d, + ParallelTensor inputs[], + int num_inputs); + Op *materialize(FFModel &ff, + ParallelTensor inputs[], + int num_inputs) const override; + bool measure_operator_cost(Simulator *sim, + MachineView const &pc, + CostMetrics &cost_metrics) const override; + template + static void forward_kernel(SamplingMeta const *m, + DT *input_ptr, + int *indices_ptr, + float top_p, + int length, + int batch_size, + ffStream_t stream); + static void forward_kernel_wrapper(SamplingMeta const *m, + GenericTensorAccessorW const &input, + GenericTensorAccessorW const &indices, + int batch_size); + Params get_params() const; + +public: + float top_p; +}; + +}; // namespace FlexFlow + +#endif \ No newline at end of file diff --git a/include/flexflow/ops/sampling_params.h b/include/flexflow/ops/sampling_params.h new file mode 100644 index 0000000000..1449ddbf54 --- /dev/null +++ b/include/flexflow/ops/sampling_params.h @@ -0,0 +1,24 @@ +#ifndef 
_FLEXFLOW_SAMPLING_PARAMS_H +#define _FLEXFLOW_SAMPLING_PARAMS_H + +#include "flexflow/ffconst.h" +#include "flexflow/parallel_tensor.h" + +namespace FlexFlow { + +struct SamplingParams { + float top_p; + bool is_valid(ParallelTensorShape const &) const; +}; +bool operator==(SamplingParams const &, SamplingParams const &); + +} // namespace FlexFlow + +namespace std { +template <> +struct hash { + size_t operator()(FlexFlow::SamplingParams const &) const; +}; +} // namespace std + +#endif // _FLEXFLOW_SAMPLING_PARAMS_H \ No newline at end of file diff --git a/include/flexflow/ops/softmax.h b/include/flexflow/ops/softmax.h index 25a20315bd..1d5191d7ee 100644 --- a/include/flexflow/ops/softmax.h +++ b/include/flexflow/ops/softmax.h @@ -1,6 +1,7 @@ #ifndef _FLEXFLOW_SOFTMAX_H #define _FLEXFLOW_SOFTMAX_H +#include "flexflow/inference.h" #include "flexflow/layer.h" #include "flexflow/node.h" #include "flexflow/operator.h" @@ -21,7 +22,16 @@ class Softmax : public Op { const Input input, char const *name = nullptr); void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void backward(FFModel const &) override; bool get_int_parameter(PMParameter, int *) const override; void print_layer(FFModel const &model) override { @@ -43,19 +53,24 @@ class Softmax : public Op { std::vector const ®ions, Legion::Context ctx, Legion::Runtime *runtime); + static InferenceResult + inference_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); bool measure_operator_cost(Simulator *sim, MachineView const &pc, CostMetrics &cost_metrics) const override; Params get_params() const; private: - template + template static void forward_task_with_dim(Legion::Task const *task, std::vector const ®ions, Legion::Context ctx, Legion::Runtime *runtime); - template + template static void backward_task_with_dim(Legion::Task const *task, std::vector const ®ions, diff --git a/include/flexflow/ops/spec_inc_multihead_self_attention.h b/include/flexflow/ops/spec_inc_multihead_self_attention.h new file mode 100644 index 0000000000..c6364805e3 --- /dev/null +++ b/include/flexflow/ops/spec_inc_multihead_self_attention.h @@ -0,0 +1,148 @@ +#ifndef _FLEXFLOW_SPEC_INC_MULTIHEAD_SELF_ATTENTION_H +#define _FLEXFLOW_SPEC_INC_MULTIHEAD_SELF_ATTENTION_H + +#include "flexflow/accessor.h" +#include "flexflow/device.h" +#include "flexflow/fftype.h" +#include "flexflow/inference.h" +#include "flexflow/layer.h" +#include "flexflow/node.h" +#include "flexflow/op_meta.h" +#include "flexflow/operator.h" +#include "flexflow/ops/inc_multihead_self_attention.h" +#include "flexflow/ops/spec_inc_multihead_self_attention_params.h" +#include "math.h" +#include +#include + +namespace FlexFlow { + +class SpecIncMultiHeadSelfAttentionMeta; + +class SpecIncMultiHeadSelfAttention : public Op { +public: + using Params = SpecIncMultiHeadSelfAttentionParams; + using Input = ParallelTensor; + + SpecIncMultiHeadSelfAttention(FFModel &model, + LayerID const &layer_guid, + const ParallelTensor _input, + int _embed_dim, + int _num_q_heads, + int _num_kv_heads, + int _kdim, + int _vdim, + float _dropout, + bool _bias, + bool _add_bias_kv, + bool _add_zero_attn, + bool _apply_rotary_embedding, + bool 
_scaling_query, + float _scaling_factor, + bool _qk_prod_scaling, + bool allocate_weights, + char const *name); + SpecIncMultiHeadSelfAttention(FFModel &model, + const ParallelTensor _input, + const ParallelTensor _weight, + int _embed_dim, + int _num_q_heads, + int _num_kv_heads, + int _kdim, + int _vdim, + float _dropout, + bool _bias, + bool _add_bias_kv, + bool _add_zero_attn, + bool _apply_rotary_embedding, + bool _scaling_query, + float _scaling_factor, + bool _qk_prod_scaling, + bool allocate_weights, + char const *name); + SpecIncMultiHeadSelfAttention(FFModel &model, + SpecIncMultiHeadSelfAttention const &other, + const ParallelTensor input, + bool allocate_weights); + SpecIncMultiHeadSelfAttention(FFModel &model, + Params const ¶ms, + Input const &inputs, + bool allocate_weights = false, + char const *name = nullptr); + static Op * + create_operator_from_layer(FFModel &model, + Layer const *layer, + std::vector const &inputs); + void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void forward(FFModel const &) override; + void backward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void print_layer(FFModel const &model) override { + assert(0); + } + bool get_int_parameter(PMParameter, int *) const override; + + static OpMeta *init_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + static void inference_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + Op *materialize(FFModel &ff, + ParallelTensor inputs[], + int num_inputs) const override; + bool measure_operator_cost(Simulator *sim, + MachineView const &mv, + CostMetrics &cost_metrics) const override; + + static void + inference_kernel_wrapper(SpecIncMultiHeadSelfAttentionMeta const *m, + BeamSearchBatchConfig const *bc, + int shard_id, + GenericTensorAccessorR const &input, + GenericTensorAccessorR const &weight, + GenericTensorAccessorW const &output, + GenericTensorAccessorR const &bias); + Params get_params() const; + +public: + int num_q_heads, num_kv_heads, tensor_parallelism_degree; + float dropout, scaling_factor; + bool bias; + bool add_bias_kv, add_zero_attn, apply_rotary_embedding, scaling_query, + qk_prod_scaling; + int qSize, kSize, vSize, qProjSize, kProjSize, vProjSize, oProjSize; + int qoSeqLength, kvSeqLength; +}; + +class SpecIncMultiHeadSelfAttentionMeta : public IncMultiHeadSelfAttentionMeta { +public: + SpecIncMultiHeadSelfAttentionMeta(FFHandler handler, + SpecIncMultiHeadSelfAttention const *attn, + GenericTensorAccessorR const &weight, + MemoryAllocator &gpu_mem_allocator, + int num_samples, + int _num_q_heads, + int _num_kv_heads); + ~SpecIncMultiHeadSelfAttentionMeta(void); + +public: + Realm::RegionInstance beam_search_reserve_inst; + BatchConfig::PerRequestInfo *request_infos; + BeamSearchBatchConfig::BeamSearchPerTokenInfo *beam_token_infos; + BeamSearchBatchConfig::BeamSearchPerRequestInfo *beam_request_infos; +}; + +}; // namespace FlexFlow + +#endif // _FLEXFLOW_SPEC_INC_MULTIHEAD_SELF_ATTENTION_H diff --git a/include/flexflow/ops/spec_inc_multihead_self_attention_params.h b/include/flexflow/ops/spec_inc_multihead_self_attention_params.h new file mode 100644 index 0000000000..d6f08dd9e6 --- /dev/null +++ 
b/include/flexflow/ops/spec_inc_multihead_self_attention_params.h @@ -0,0 +1,32 @@ +#ifndef _FLEXFLOW_SPEC_INC_MULTIHEAD_SELF_ATTENTION_PARAMS_H +#define _FLEXFLOW_SPEC_INC_MULTIHEAD_SELF_ATTENTION_PARAMS_H + +#include "flexflow/fftype.h" +#include "flexflow/parallel_tensor.h" + +namespace FlexFlow { + +struct SpecIncMultiHeadSelfAttentionParams { + LayerID layer_guid; + int embed_dim, num_q_heads, num_kv_heads, kdim, vdim; + float dropout, scaling_factor; + bool bias, add_bias_kv, add_zero_attn, apply_rotary_embedding, scaling_query, + qk_prod_scaling; + + bool is_valid(ParallelTensorShape const &) const; +}; + +bool operator==(SpecIncMultiHeadSelfAttentionParams const &, + SpecIncMultiHeadSelfAttentionParams const &); + +} // namespace FlexFlow + +namespace std { +template <> +struct hash { + size_t + operator()(FlexFlow::SpecIncMultiHeadSelfAttentionParams const &) const; +}; +} // namespace std + +#endif // _FLEXFLOW_SPEC_INC_MULTIHEAD_SELF_ATTENTION_PARAMS_H diff --git a/include/flexflow/ops/split.h b/include/flexflow/ops/split.h index 633268ffbf..cb9c6bdb57 100644 --- a/include/flexflow/ops/split.h +++ b/include/flexflow/ops/split.h @@ -22,6 +22,15 @@ class Split : public Op { const Input input, char const *name = nullptr); void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &) override; void backward(FFModel const &) override; void print_layer(FFModel const &model) override { diff --git a/include/flexflow/ops/topk.h b/include/flexflow/ops/topk.h index 6b1613c828..47144bf6d7 100644 --- a/include/flexflow/ops/topk.h +++ b/include/flexflow/ops/topk.h @@ -1,6 +1,7 @@ #ifndef _FLEXFLOW_TOPK_H_ #define _FLEXFLOW_TOPK_H_ +#include "flexflow/inference.h" #include "flexflow/model.h" #include "flexflow/node.h" #include "flexflow/ops/topk_params.h" @@ -28,8 +29,17 @@ class TopK : public Op { Input const input, char const *name = nullptr); void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &) override; void backward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void print_layer(FFModel const &model) override { assert(0); } diff --git a/include/flexflow/ops/tree_inc_multihead_self_attention.h b/include/flexflow/ops/tree_inc_multihead_self_attention.h new file mode 100644 index 0000000000..d5be344cca --- /dev/null +++ b/include/flexflow/ops/tree_inc_multihead_self_attention.h @@ -0,0 +1,152 @@ +#ifndef _FLEXFLOW_INC_MULTIHEAD_SELF_ATTENTION_VERIFY_H +#define _FLEXFLOW_INC_MULTIHEAD_SELF_ATTENTION_VERIFY_H + +#include "flexflow/accessor.h" +#include "flexflow/device.h" +#include "flexflow/fftype.h" +#include "flexflow/inference.h" +#include "flexflow/layer.h" +#include "flexflow/node.h" +#include "flexflow/op_meta.h" +#include "flexflow/operator.h" +#include "flexflow/ops/inc_multihead_self_attention.h" +#include "flexflow/ops/tree_inc_multihead_self_attention_params.h" +#include "math.h" +#include +#include + +namespace FlexFlow { + +class TreeIncMultiHeadSelfAttentionMeta; + +class 
TreeIncMultiHeadSelfAttention : public Op { +public: + using Params = TreeIncMultiHeadSelfAttentionParams; + using Input = ParallelTensor; + + TreeIncMultiHeadSelfAttention(FFModel &model, + LayerID const &layer_guid, + const ParallelTensor _input, + int _embed_dim, + int _num_q_heads, + int _num_kv_heads, + int _kdim, + int _vdim, + float _dropout, + bool _bias, + bool _add_bias_kv, + bool _add_zero_attn, + bool _apply_rotary_embedding, + bool _scaling_query, + float _scaling_factor, + bool _qk_prod_scaling, + bool allocate_weights, + DataType _quantization_type, + bool _offload, + int _tensor_parallelism_degree, + char const *name); + TreeIncMultiHeadSelfAttention(FFModel &model, + const ParallelTensor _input, + const ParallelTensor _weight, + int _embed_dim, + int _num_q_heads, + int _num_kv_heads, + int _kdim, + int _vdim, + float _dropout, + bool _bias, + bool _add_bias_kv, + bool _add_zero_attn, + bool _apply_rotary_embedding, + bool _scaling_query, + float _scaling_factor, + bool _qk_prod_scaling, + bool allocate_weights, + DataType _quantization_type, + bool _offload, + int _tensor_parallelism_degree, + char const *name); + TreeIncMultiHeadSelfAttention(FFModel &model, + TreeIncMultiHeadSelfAttention const &other, + const ParallelTensor input, + bool allocate_weights); + TreeIncMultiHeadSelfAttention(FFModel &model, + Params const ¶ms, + Input const &inputs, + bool allocate_weights = false, + char const *name = nullptr); + static Op * + create_operator_from_layer(FFModel &model, + Layer const *layer, + std::vector const &inputs); + void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void forward(FFModel const &) override; + void backward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void print_layer(FFModel const &model) override { + assert(0); + } + bool get_int_parameter(PMParameter, int *) const override; + + static OpMeta *init_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + static void inference_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + bool measure_operator_cost(Simulator *sim, + MachineView const &mv, + CostMetrics &cost_metrics) const override; + + static void inference_kernel_wrapper(TreeIncMultiHeadSelfAttentionMeta *m, + TreeVerifyBatchConfig const *bc, + int shard_id, + GenericTensorAccessorR const &input, + GenericTensorAccessorR const &weight, + GenericTensorAccessorW const &output, + GenericTensorAccessorR const &bias); + + Params get_params() const; + +public: + int num_q_heads, num_kv_heads, tensor_parallelism_degree; + float dropout, scaling_factor; + bool bias; + bool add_bias_kv, add_zero_attn, apply_rotary_embedding, scaling_query, + qk_prod_scaling; + int qSize, kSize, vSize, qProjSize, kProjSize, vProjSize, oProjSize; + int qoSeqLength, kvSeqLength; + DataType quantization_type; + bool offload; +}; + +class TreeIncMultiHeadSelfAttentionMeta : public IncMultiHeadSelfAttentionMeta { +public: + TreeIncMultiHeadSelfAttentionMeta(FFHandler handler, + TreeIncMultiHeadSelfAttention const *attn, + GenericTensorAccessorR const &weight, + MemoryAllocator &gpu_mem_allocator, + int num_samples, + int _num_q_heads, + int _num_kv_heads); + 
~TreeIncMultiHeadSelfAttentionMeta(void); + +public: + int num_active_tokens; + Realm::RegionInstance committed_token_reserve_inst; + TreeVerifyBatchConfig::CommittedTokensInfo *committed_token_infos; +}; + +}; // namespace FlexFlow + +#endif // _FLEXFLOW_INC_MULTIHEAD_SELF_ATTENTION_VERIFY_H diff --git a/include/flexflow/ops/tree_inc_multihead_self_attention_params.h b/include/flexflow/ops/tree_inc_multihead_self_attention_params.h new file mode 100644 index 0000000000..3ba49dcbad --- /dev/null +++ b/include/flexflow/ops/tree_inc_multihead_self_attention_params.h @@ -0,0 +1,34 @@ +#ifndef _FLEXFLOW_INC_MULTIHEAD_SELF_ATTENTION_VERIFY_PARAMS_H +#define _FLEXFLOW_INC_MULTIHEAD_SELF_ATTENTION_VERIFY_PARAMS_H + +#include "flexflow/fftype.h" +#include "flexflow/parallel_tensor.h" + +namespace FlexFlow { + +struct TreeIncMultiHeadSelfAttentionParams { + LayerID layer_guid; + int embed_dim, num_q_heads, kdim, vdim, num_kv_heads, + tensor_parallelism_degree; + float dropout, scaling_factor; + bool bias, add_bias_kv, add_zero_attn, apply_rotary_embedding, scaling_query, + qk_prod_scaling; + DataType quantization_type; + bool offload; + bool is_valid(ParallelTensorShape const &) const; +}; + +bool operator==(TreeIncMultiHeadSelfAttentionParams const &, + TreeIncMultiHeadSelfAttentionParams const &); + +} // namespace FlexFlow + +namespace std { +template <> +struct hash { + size_t + operator()(FlexFlow::TreeIncMultiHeadSelfAttentionParams const &) const; +}; +} // namespace std + +#endif // _FLEXFLOW_INC_MULTIHEAD_SELF_ATTENTION_VERIFY_PARAMS_H diff --git a/include/flexflow/parallel_ops/allreduce.h b/include/flexflow/parallel_ops/allreduce.h new file mode 100644 index 0000000000..045f9b36a0 --- /dev/null +++ b/include/flexflow/parallel_ops/allreduce.h @@ -0,0 +1,74 @@ +#ifndef _FLEXFLOW_ALLREDUCE_H +#define _FLEXFLOW_ALLREDUCE_H + +#include "flexflow/layer.h" +#include "flexflow/node.h" +#include "flexflow/op_meta.h" +#include "flexflow/operator.h" +#include "flexflow/parallel_ops/allreduce_params.h" +#include "parallel_op.h" + +namespace FlexFlow { + +class AllReduce : public ParallelOp { +public: + using Params = AllReduceParams; + using Input = ParallelTensor; + + AllReduce(FFModel &model, + const ParallelTensor input, + int allreduce_legion_dim, + char const *name = NULL); + AllReduce(FFModel &model, + Params const ¶ms, + Input const input, + char const *name = nullptr); + void create_input_partition(FFModel &model) override; + void create_input_partition_inference( + FFModel &model, + std::vector const &batch_inputs, + std::vector const &batch_outputs) override; + void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void forward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &bc, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; + void backward(FFModel const &) override; + bool get_int_parameter(PMParameter, int *) const override; + bool append_parallel_op_info( + std::vector ¶llel_ops) const override; + static OpMeta *init_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + static void inference_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + static void forward_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime 
*runtime); + static void backward_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + bool measure_operator_cost(Simulator *sim, + MachineView const &pc, + CostMetrics &cost_metrics) const override; + + Params get_params() const; + +public: + int allreduce_dim; +}; + +}; // namespace FlexFlow + +#endif // _FLEXFLOW_ALLREDUCE_H diff --git a/include/flexflow/parallel_ops/allreduce_params.h b/include/flexflow/parallel_ops/allreduce_params.h new file mode 100644 index 0000000000..c04676ffeb --- /dev/null +++ b/include/flexflow/parallel_ops/allreduce_params.h @@ -0,0 +1,21 @@ +#ifndef _FLEXFLOW_ALLREDUCE_PARAMS_H +#define _FLEXFLOW_ALLREDUCE_PARAMS_H + +namespace FlexFlow { + +struct AllReduceParams { + int allreduce_legion_dim; + bool is_valid(ParallelTensorShape const &) const; +}; +bool operator==(AllReduceParams const &, AllReduceParams const &); + +} // namespace FlexFlow + +namespace std { +template <> +struct hash { + size_t operator()(FlexFlow::AllReduceParams const &) const; +}; +} // namespace std + +#endif // _FLEXFLOW_ALLREDUCE_PARAMS_H diff --git a/include/flexflow/parallel_ops/combine.h b/include/flexflow/parallel_ops/combine.h index 310e599f54..2e4fdb86a9 100644 --- a/include/flexflow/parallel_ops/combine.h +++ b/include/flexflow/parallel_ops/combine.h @@ -3,6 +3,7 @@ #include "flexflow/layer.h" #include "flexflow/node.h" +#include "flexflow/op_meta.h" #include "flexflow/operator.h" #include "flexflow/parallel_ops/combine_params.h" #include "parallel_op.h" @@ -24,8 +25,21 @@ class Combine : public ParallelOp { Input const input, char const *name = nullptr); void create_input_partition(FFModel &model) override; + void create_input_partition_inference( + FFModel &model, + std::vector const &batch_inputs, + std::vector const &batch_outputs) override; void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &bc, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void backward(FFModel const &) override; bool get_int_parameter(PMParameter, int *) const override; bool append_parallel_op_info( diff --git a/include/flexflow/parallel_ops/kernels/allreduce_kernels.h b/include/flexflow/parallel_ops/kernels/allreduce_kernels.h new file mode 100644 index 0000000000..bdf7aae501 --- /dev/null +++ b/include/flexflow/parallel_ops/kernels/allreduce_kernels.h @@ -0,0 +1,37 @@ +#ifndef _FLEXFLOW_OPS_KERNELS_ALLREDUCE_KERNELS_H +#define _FLEXFLOW_OPS_KERNELS_ALLREDUCE_KERNELS_H + +#include "flexflow/batch_config.h" +#include "flexflow/device.h" +#include "flexflow/fftype.h" +#include "flexflow/op_meta.h" +#include "flexflow/parallel_ops/allreduce.h" + +namespace FlexFlow { + +class AllReduceMeta : public OpMeta { +public: + AllReduceMeta(FFHandler handle, AllReduce const *reduct); +}; + +namespace Kernels { +namespace AllReduce { + +void inference_kernel_wrapper(AllReduceMeta const *m, + BatchConfig const *bc, + GenericTensorAccessorR const &input, + GenericTensorAccessorW const &output); + +void forward_kernel_wrapper(AllReduceMeta const *m, + GenericTensorAccessorR const &input, + GenericTensorAccessorW const &output); + +void backward_kernel_wrapper(AllReduceMeta const *m, + GenericTensorAccessorW const &input_grad, + GenericTensorAccessorR const &output_grad); + +} // namespace 
AllReduce +} // namespace Kernels +} // namespace FlexFlow + +#endif // _FLEXFLOW_OPS_KERNELS_ALLREDUCE_KERNELS_H diff --git a/include/flexflow/parallel_ops/kernels/combine_kernels.h b/include/flexflow/parallel_ops/kernels/combine_kernels.h index 6f540679a2..456013cd81 100644 --- a/include/flexflow/parallel_ops/kernels/combine_kernels.h +++ b/include/flexflow/parallel_ops/kernels/combine_kernels.h @@ -4,6 +4,7 @@ #include "flexflow/device.h" #include "flexflow/fftype.h" #include "flexflow/op_meta.h" +#include "flexflow/parallel_ops/combine.h" namespace FlexFlow { diff --git a/include/flexflow/parallel_ops/kernels/reduction_kernels.h b/include/flexflow/parallel_ops/kernels/reduction_kernels.h index e9f6a9d070..51ddced227 100644 --- a/include/flexflow/parallel_ops/kernels/reduction_kernels.h +++ b/include/flexflow/parallel_ops/kernels/reduction_kernels.h @@ -3,8 +3,16 @@ #include "flexflow/device.h" #include "flexflow/fftype.h" +#include "flexflow/op_meta.h" +#include "flexflow/parallel_ops/reduction.h" namespace FlexFlow { + +class ReductionMeta : public OpMeta { +public: + ReductionMeta(FFHandler handle, Reduction const *reduct); +}; + namespace Kernels { namespace Reduction { diff --git a/include/flexflow/parallel_ops/kernels/replicate_kernels.h b/include/flexflow/parallel_ops/kernels/replicate_kernels.h index 619d06efef..d5d52797c3 100644 --- a/include/flexflow/parallel_ops/kernels/replicate_kernels.h +++ b/include/flexflow/parallel_ops/kernels/replicate_kernels.h @@ -3,8 +3,16 @@ #include "flexflow/device.h" #include "flexflow/fftype.h" +#include "flexflow/op_meta.h" +#include "flexflow/parallel_ops/replicate.h" namespace FlexFlow { + +class ReplicateMeta : public OpMeta { +public: + ReplicateMeta(FFHandler handle, Replicate const *repl); +}; + namespace Kernels { namespace Replicate { diff --git a/include/flexflow/parallel_ops/parallel_op.h b/include/flexflow/parallel_ops/parallel_op.h index a374b7ab40..0bf573996c 100644 --- a/include/flexflow/parallel_ops/parallel_op.h +++ b/include/flexflow/parallel_ops/parallel_op.h @@ -24,6 +24,12 @@ class ParallelOp : public Op { virtual void forward(FFModel const &) = 0; virtual void backward(FFModel const &) = 0; virtual void create_input_partition(FFModel &model) = 0; + virtual void create_input_partition_inference( + FFModel &model, + std::vector const &batch_inputs, + std::vector const &batch_outputs) { + assert(false); + } void print_layer(FFModel const &model){}; virtual bool measure_operator_cost(Simulator *sim, MachineView const &pc, @@ -34,6 +40,8 @@ class ParallelOp : public Op { public: Legion::LogicalPartition input_lp, output_grad_lp; + std::unordered_map + inference_input_lps; }; }; // namespace FlexFlow diff --git a/include/flexflow/parallel_ops/partition.h b/include/flexflow/parallel_ops/partition.h index 5c2fa9c228..4b0013b11d 100644 --- a/include/flexflow/parallel_ops/partition.h +++ b/include/flexflow/parallel_ops/partition.h @@ -1,6 +1,7 @@ #ifndef _FLEXFLOW_PARTITION_H #define _FLEXFLOW_PARTITION_H +#include "flexflow/inference.h" #include "flexflow/layer.h" #include "flexflow/node.h" #include "flexflow/operator.h" @@ -24,8 +25,21 @@ class Repartition : public ParallelOp { Input const input, char const *name = nullptr); void create_input_partition(FFModel &model) override; + void create_input_partition_inference( + FFModel &model, + std::vector const &batch_inputs, + std::vector const &batch_outputs) override; void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector 
const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &bc, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void backward(FFModel const &) override; bool get_int_parameter(PMParameter, int *) const override; bool append_parallel_op_info( diff --git a/include/flexflow/parallel_ops/reduction.h b/include/flexflow/parallel_ops/reduction.h index fed5f049c7..89f8bfbee0 100644 --- a/include/flexflow/parallel_ops/reduction.h +++ b/include/flexflow/parallel_ops/reduction.h @@ -25,12 +25,29 @@ class Reduction : public ParallelOp { Input const input, char const *name = nullptr); void create_input_partition(FFModel &model) override; + void create_input_partition_inference( + FFModel &model, + std::vector const &batch_inputs, + std::vector const &batch_outputs) override; void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &bc, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void backward(FFModel const &) override; bool get_int_parameter(PMParameter, int *) const override; bool append_parallel_op_info( std::vector ¶llel_ops) const override; + static OpMeta *init_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); static void forward_task(Legion::Task const *task, std::vector const ®ions, Legion::Context ctx, diff --git a/include/flexflow/parallel_ops/replicate.h b/include/flexflow/parallel_ops/replicate.h index 381f690cdc..65d69d8564 100644 --- a/include/flexflow/parallel_ops/replicate.h +++ b/include/flexflow/parallel_ops/replicate.h @@ -10,6 +10,8 @@ namespace FlexFlow { +class ReplicateMeta; + class Replicate : public ParallelOp { public: using Params = ReplicateParams; @@ -25,12 +27,29 @@ class Replicate : public ParallelOp { Input const input, char const *name = nullptr); void create_input_partition(FFModel &model) override; + void create_input_partition_inference( + FFModel &model, + std::vector const &batch_inputs, + std::vector const &batch_outputs) override; void init(FFModel const &) override; + void init_inference(FFModel const &, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void forward(FFModel const &) override; + Legion::FutureMap inference(FFModel const &, + BatchConfigFuture const &bc, + std::vector const &, + std::vector const &, + MachineView const *mv = nullptr) override; void backward(FFModel const &) override; bool get_int_parameter(PMParameter, int *) const override; bool append_parallel_op_info( std::vector ¶llel_ops) const override; + static OpMeta *init_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); static void forward_task(Legion::Task const *task, std::vector const ®ions, Legion::Context ctx, @@ -39,6 +58,11 @@ class Replicate : public ParallelOp { std::vector const ®ions, Legion::Context ctx, Legion::Runtime *runtime); + static void forward_kernel_wrapper(ReplicateMeta const *m, + GenericTensorAccessorR const &input, + GenericTensorAccessorW const &output, + size_t num_elements, + size_t num_replicas); bool measure_operator_cost(Simulator *sim, MachineView const &pc, CostMetrics &cost_metrics) 
const override; diff --git a/include/flexflow/parallel_tensor.h b/include/flexflow/parallel_tensor.h index db77b49030..d06ecd7bac 100644 --- a/include/flexflow/parallel_tensor.h +++ b/include/flexflow/parallel_tensor.h @@ -169,6 +169,20 @@ struct ParallelTensorBase { bool get_tensor(FFModel const *model, T *data, bool get_parameters); ParallelTensorShape get_shape() const; + template + bool tensor_equal(FFConfig &config, ParallelTensorBase &tensor); + static bool + tensor_equal_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + template + static bool tensor_equal_task_with_dim( + Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + private: template bool get_input_sub_tensor_via_mappings(ParallelConfig const &pc, diff --git a/include/flexflow/request_manager.h b/include/flexflow/request_manager.h new file mode 100644 index 0000000000..e444402dd0 --- /dev/null +++ b/include/flexflow/request_manager.h @@ -0,0 +1,239 @@ +/* Copyright 2023 CMU, Stanford, Facebook, LANL + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include "flexflow/batch_config.h" +#include "flexflow/inference.h" +#include "flexflow/model.h" +#include +#include + +namespace FlexFlow { + +class FFModel; +class BeamTree; +class RequestManager; +using tokenizers::Tokenizer; + +class InferenceManager { +public: + InferenceManager(FFConfig const &config, int max_num_tokens_per_batch); + static InferenceManager *get_inference_manager(); + void compile_model_and_allocate_buffer(FFModel *model); + void init_operators_inference(FFModel *model); + Legion::FutureMap inference(FFModel *model, int index, BatchConfig const &bc); + Legion::FutureMap + inference(FFModel *model, int index, BatchConfigFuture const &bc); + void load_input_tokens_from_batch_config(BatchConfigFuture const &bc, + ParallelTensor const input); + void load_positions(BatchConfigFuture const &bc, + ParallelTensor position_input, + int offset); + +public: + FFConfig ff_config; + std::unordered_map> tensor_buffer; + int max_num_tokens_per_batch; + int num_devices; +}; + +struct Request { + enum Status { + PENDING = 101, + RUNNING = 102, + COMPLETED = 103, + }; + BatchConfig::RequestGuid guid; + int max_sequence_length; + int initial_len; + Status status = PENDING; + std::vector tokens; + + std::vector beam_trees; +}; + +// store the result of beam search +struct BeamTree { + struct treeLayer { + BeamSearchBatchConfig::TokenId + tokens[BeamSearchBatchConfig::MAX_BEAM_WIDTH]; + int parent_ids[BeamSearchBatchConfig::MAX_BEAM_WIDTH]; + float probs[BeamSearchBatchConfig::MAX_BEAM_WIDTH]; + }; + treeLayer treeLayers[BeamSearchBatchConfig::MAX_BEAM_DEPTH + 1]; +}; + +// struct BeamTree_v2 { +// std::vector tokens; +// std::vector parent_ids; +// std::vector probs; +// }; + +class RequestManager { +public: + using RequestGuid = BatchConfig::RequestGuid; + using TokenId = BatchConfig::TokenId; + + RequestManager(); + static 
RequestManager *get_request_manager(); + size_t get_num_processed_requests(); + size_t get_num_ssms(); + + int register_ssm_model(FFModel *model); + void register_tokenizer(ModelType model_type, + int bos_token_id, + int eos_token_id, + std::string const &path); + void register_output_filepath(std::string const &); + + FFModel *get_model(int model_id); + + GenerationResult generate_incr_decoding(FFModel *model, + std::string const &text, + int max_seq_length); + GenerationResult generate_spec_infer(FFModel *model, + std::string const &text, + int max_seq_length); + GenerationResult get_generation_result(RequestGuid const &guid); + RequestGuid register_new_request(std::string const &prompt, + int max_sequence_length); + RequestGuid register_new_request(std::vector const &prompt, + int max_sequence_length); + bool is_request_completed(RequestGuid const &guid); + BatchConfig prepare_next_batch(BatchConfig const &bc, + InferenceResult const &result); + BatchConfigFuture prepare_next_batch(BatchConfigFuture const &bc, + InferenceResultFuture const &result); + BeamSearchBatchConfig + prepare_next_batch_beam(BeamSearchBatchConfig const &old_bc, + BeamInferenceResult const &result); + BeamSearchBatchConfigFuture + prepare_next_batch_beam(BeamSearchBatchConfigFuture const &old_bc, + BeamInferenceResultFuture const &result); + BeamSearchBatchConfig + prepare_next_batch_init(TreeVerifyBatchConfig const &old_bc, + InferenceResult const &result, + int model_id); + BeamSearchBatchConfigFuture + prepare_next_batch_init(TreeVerifyBatchConfigFuture const &old_bc, + InferenceResultFuture const &result, + int model_id); + TreeVerifyBatchConfig prepare_next_batch_verify( + std::vector const &old_batches); + TreeVerifyBatchConfigFuture prepare_next_batch_verify( + std::vector const &old_batches); + + void store_beam_metadata(BeamSearchBatchConfig const &old_bc, + BeamInferenceResult const &result); + void update_beam_metadata(BeamSearchBatchConfig &new_bc, + BeamTree &tree, + int request_index); + + std::vector> + traverse_beam_tree(BeamSearchBatchConfig const &old_bc, + int request_index, + int token_start_offset); + + // remove guid after put the cached tree in request + std::vector> merge_dfs_trees( + std::vector>> + input_trees, + int root_depth, + RequestGuid guid); + + std::vector> traverse_verify_tree( + size_t guid, + std::vector> const + &inputSerializedTree, + std::vector> const + &outputSerializedTree); + + static void + load_tokens_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + static void + load_positions_task(Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + + static BatchConfig prepare_next_batch_task( + Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + + static BeamSearchBatchConfig prepare_next_batch_beam_task( + Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + + static BeamSearchBatchConfig prepare_next_batch_init_task( + Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + + static TreeVerifyBatchConfig prepare_next_batch_verify_task( + Legion::Task const *task, + std::vector const ®ions, + Legion::Context ctx, + Legion::Runtime *runtime); + +private: + std::unique_ptr tokenizer_; + bool verbose; + ModelType model_type; + int bos_token_id; + int eos_token_id; + std::string output_filepath; + std::queue 
pending_request_queue; + std::unordered_map all_requests; + std::unordered_map request_generation_results; + std::mutex request_queue_mutex; + RequestGuid next_available_guid; + // Legion futures for inc_decoding and spec_infer + BatchConfigFuture last_bcf; + InferenceResultFuture last_irf; + TreeVerifyBatchConfigFuture last_tree_bcf; + InferenceResultFuture last_tree_irf; + + // TODO: Move this two vector to request struct + std::unordered_map>> + dfs_tree_inputs; + std::unordered_map>> + committed_tokens; + + // Multi-model support + std::vector models; + + // Performance profiling + size_t num_processed_requests; + +private: + struct ProfileInfo { + int decoding_steps; + double start_time, finish_time; + }; + std::unordered_map profiling_requests; + double total_request_run_time; +}; + +}; // namespace FlexFlow diff --git a/include/flexflow/runtime.h b/include/flexflow/runtime.h new file mode 100644 index 0000000000..e1371300ec --- /dev/null +++ b/include/flexflow/runtime.h @@ -0,0 +1,31 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#ifndef _FLEXFLOW_RUNTIME_H_ +#define _FLEXFLOW_RUNTIME_H_ + +#include "config.h" + +namespace FlexFlow { + +class FFRuntime { +public: + FFRuntime(FFConfig &config); + FFHandler handlers[MAX_NUM_WORKERS]; +}; + +} // namespace FlexFlow + +#endif // _FLEXFLOW_RUNTIME_H_ diff --git a/include/flexflow/simulator.h b/include/flexflow/simulator.h index 9ee1b1eb09..e410f66325 100644 --- a/include/flexflow/simulator.h +++ b/include/flexflow/simulator.h @@ -38,6 +38,7 @@ class LinearMeta; class Pool2DMeta; class ElementUnaryMeta; class ElementBinaryMeta; +class LayerNormMeta; // class EmbeddingMeta; // class SoftmaxMeta; class BatchMatmulMeta; @@ -684,8 +685,6 @@ class TaskManager { std::map hash_to_forward_task, hash_to_backward_task; }; -size_t data_type_size(DataType); - using ProfilingRecordKey = std::tuple; class Simulator { @@ -756,7 +755,8 @@ class Simulator { LinearMeta *linear_meta; Pool2DMeta *pool2d_meta; ElementUnaryMeta *ele_unary_meta; - ElementBinaryMeta *ele_binary_meta; + LayerNormMeta *layernorm_meta; + // ElementBinaryMeta *ele_binary_meta; // EmbeddingMeta *embedding_meta; // SoftmaxMeta *softmax_meta; BatchMatmulMeta *batch_matmul_meta; diff --git a/include/flexflow/substitution_loader.h b/include/flexflow/substitution_loader.h index 9f9db223f2..776fe2c78e 100644 --- a/include/flexflow/substitution_loader.h +++ b/include/flexflow/substitution_loader.h @@ -41,95 +41,97 @@ NLOHMANN_JSON_SERIALIZE_ENUM(PMParameter, {PM_PARALLEL_DEGREE, "PM_PARALLEL_DEGREE"}, {PM_PAD, "PM_PAD"}}) -NLOHMANN_JSON_SERIALIZE_ENUM(OperatorType, - {{OP_INVALID, nullptr}, - {OP_NOOP, "OP_NOOP"}, - {OP_CONV2D, "OP_CONV2D"}, - {OP_DROPOUT, "OP_DROPOUT"}, - {OP_LINEAR, "OP_LINEAR"}, - {OP_BATCHMATMUL, "OP_BATCHMATMUL"}, - {OP_POOL2D, "OP_POOL2D_MAX"}, - {OP_SCALAR_MULTIPLY, "OP_SCALAR_MULTIPLY"}, - {OP_SCALAR_ADD, "OP_SCALAR_ADD"}, - {OP_SCALAR_FLOOR_DIV, 
"OP_SCALAR_FLOOR_DIV"}, - {OP_SCALAR_TRUE_DIV, "OP_SCALAR_TRUE_DIV"}, - {OP_SCALAR_SUB, "OP_SCALAR_SUB"}, - {OP_RELU, "OP_RELU"}, - {OP_IDENTITY, "OP_IDENTITY"}, - {OP_SIGMOID, "OP_SIGMOID"}, - {OP_TANH, "OP_TANH"}, - {OP_ELU, "OP_ELU"}, - {OP_FLAT, "OP_FLAT"}, - {OP_SOFTMAX, "OP_SOFTMAX"}, - {OP_BATCHNORM, "OP_BATCHNORM"}, - {OP_CONCAT, "OP_CONCAT"}, - {OP_SPLIT, "OP_SPLIT"}, - {OP_EMBEDDING, "OP_EMBEDDING"}, - {OP_GROUP_BY, "OP_GROUP_BY"}, - {OP_CACHE, "OP_CACHE"}, - {OP_AGGREGATE, "OP_AGGREGATE"}, - {OP_AGG_SPEC, "OP_AGG_SPEC"}, - {OP_RESHAPE, "OP_RESHAPE"}, - {OP_REVERSE, "OP_REVERSE"}, - {OP_TRANSPOSE, "OP_TRANSPOSE"}, - {OP_EW_ADD, "OP_EW_ADD"}, - {OP_EW_MUL, "OP_EW_MUL"}, - {OP_MATMUL, "OP_MATMUL"}, - {OP_MUL, "OP_MUL"}, - {OP_ENLARGE, "OP_ENLARGE"}, - {OP_MERGE_GCONV, "OP_MERGE_GCONV"}, - {OP_CONSTANT_IMM, "OP_CONSTANT_IMM"}, - {OP_CONSTANT_ICONV, "OP_CONSTANT_ICONV"}, - {OP_CONSTANT_ONE, "OP_CONSTANT_ONE"}, - {OP_CONSTANT_POOL, "OP_CONSTANT_POOL"}, - {OP_SQUEEZE, "OP_SQUEEZE"}, - {OP_UNSQUEEZE, "OP_UNSQUEEZE"}, - {OP_EW_SUB, "OP_EW_SUB"}, - {OP_EW_DIV, "OP_EW_DIV"}, - {OP_EW_EQUAL, "OP_EW_EQUAL"}, - {OP_EW_GREATER, "OP_EW_GREATER"}, - {OP_EW_LESS, "OP_EW_LESS"}, - {OP_EW_MAX, "OP_EW_MAX"}, - {OP_EW_MIN, "OP_EW_MIN"}, - {OP_REDUCE_ARGMAX, "OP_REDUCE_ARGMAX"}, - {OP_REDUCE_ARGMIN, "OP_REDUCE_ARGMIN"}, - {OP_REDUCE_MAX, "OP_REDUCE_MAX"}, - {OP_REDUCE_MEAN, "OP_REDUCE_MEAN"}, - {OP_REDUCE_MIN, "OP_REDUCE_MIN"}, - {OP_REDUCE_PROD, "OP_REDUCE_PROD"}, - {OP_REDUCE_SUM, "OP_REDUCE_SUM"}, - {OP_PAD, "OP_PAD"}, - {OP_SHAPE, "OP_SHAPE"}, - {OP_SIZE, "OP_SIZE"}, - {OP_TOPK, "OP_TOPK"}, - {OP_WHERE, "OP_WHERE"}, - {OP_CEIL, "OP_CEIL"}, - {OP_CAST, "OP_CAST"}, - {OP_EXP, "OP_EXP"}, - {OP_ROUND, "OP_ROUND"}, - {OP_LOG, "OP_LOG"}, - {OP_LOGICAL_NOT, "OP_LOGICAL_NOT"}, - {OP_SQRT, "OP_SQRT"}, - {OP_SIN, "OP_SIN"}, - {OP_COS, "OP_COS"}, - {OP_LEAKYRELU, "OP_LEAKYRELU"}, - {OP_SLICE, "OP_SLICE"}, - {OP_RESIZE, "OP_RESIZE"}, - {OP_PRELU, "OP_PRELU"}, - {OP_GELU, "OP_GELU"}, - {OP_MULTIHEAD_ATTENTION, - "OP_MULTIHEAD_ATTENTION"}, - {OP_FUSED, "OP_FUSED"}, - {OP_RSQRT, "OP_RSQRT"}, - {OP_POW, "OP_POW"}, - {OP_MEAN, "OP_MEAN"}, - {OP_LAYERNORM, "OP_LAYERNORM"}, - {OP_REPARTITION, "OP_PARTITION"}, - {OP_COMBINE, "OP_COMBINE"}, - {OP_REPLICATE, "OP_REPLICATE"}, - {OP_REDUCTION, "OP_REDUCE"}, - {OP_PIPELINE, "OP_PIPELINE"}, - {OP_FUSED_PARALLEL, "OP_FUSED_PARALLEL"}}) +NLOHMANN_JSON_SERIALIZE_ENUM( + OperatorType, + {{OP_INVALID, nullptr}, + {OP_NOOP, "OP_NOOP"}, + {OP_CONV2D, "OP_CONV2D"}, + {OP_DROPOUT, "OP_DROPOUT"}, + {OP_LINEAR, "OP_LINEAR"}, + {OP_BATCHMATMUL, "OP_BATCHMATMUL"}, + {OP_POOL2D, "OP_POOL2D_MAX"}, + {OP_SCALAR_MULTIPLY, "OP_SCALAR_MULTIPLY"}, + {OP_SCALAR_ADD, "OP_SCALAR_ADD"}, + {OP_SCALAR_FLOOR_DIV, "OP_SCALAR_FLOOR_DIV"}, + {OP_SCALAR_TRUE_DIV, "OP_SCALAR_TRUE_DIV"}, + {OP_SCALAR_SUB, "OP_SCALAR_SUB"}, + {OP_RELU, "OP_RELU"}, + {OP_IDENTITY, "OP_IDENTITY"}, + {OP_SIGMOID, "OP_SIGMOID"}, + {OP_TANH, "OP_TANH"}, + {OP_ELU, "OP_ELU"}, + {OP_FLAT, "OP_FLAT"}, + {OP_SOFTMAX, "OP_SOFTMAX"}, + {OP_BATCHNORM, "OP_BATCHNORM"}, + {OP_CONCAT, "OP_CONCAT"}, + {OP_SPLIT, "OP_SPLIT"}, + {OP_EMBEDDING, "OP_EMBEDDING"}, + {OP_GROUP_BY, "OP_GROUP_BY"}, + {OP_CACHE, "OP_CACHE"}, + {OP_AGGREGATE, "OP_AGGREGATE"}, + {OP_AGG_SPEC, "OP_AGG_SPEC"}, + {OP_RESHAPE, "OP_RESHAPE"}, + {OP_REVERSE, "OP_REVERSE"}, + {OP_TRANSPOSE, "OP_TRANSPOSE"}, + {OP_EW_ADD, "OP_EW_ADD"}, + {OP_EW_MUL, "OP_EW_MUL"}, + {OP_MATMUL, "OP_MATMUL"}, + {OP_MUL, "OP_MUL"}, + {OP_ENLARGE, "OP_ENLARGE"}, + {OP_MERGE_GCONV, 
"OP_MERGE_GCONV"}, + {OP_CONSTANT_IMM, "OP_CONSTANT_IMM"}, + {OP_CONSTANT_ICONV, "OP_CONSTANT_ICONV"}, + {OP_CONSTANT_ONE, "OP_CONSTANT_ONE"}, + {OP_CONSTANT_POOL, "OP_CONSTANT_POOL"}, + {OP_SQUEEZE, "OP_SQUEEZE"}, + {OP_UNSQUEEZE, "OP_UNSQUEEZE"}, + {OP_EW_SUB, "OP_EW_SUB"}, + {OP_EW_DIV, "OP_EW_DIV"}, + {OP_EW_EQUAL, "OP_EW_EQUAL"}, + {OP_EW_GREATER, "OP_EW_GREATER"}, + {OP_EW_LESS, "OP_EW_LESS"}, + {OP_EW_MAX, "OP_EW_MAX"}, + {OP_EW_MIN, "OP_EW_MIN"}, + {OP_REDUCE_ARGMAX, "OP_REDUCE_ARGMAX"}, + {OP_REDUCE_ARGMIN, "OP_REDUCE_ARGMIN"}, + {OP_REDUCE_MAX, "OP_REDUCE_MAX"}, + {OP_REDUCE_MEAN, "OP_REDUCE_MEAN"}, + {OP_REDUCE_MIN, "OP_REDUCE_MIN"}, + {OP_REDUCE_PROD, "OP_REDUCE_PROD"}, + {OP_REDUCE_SUM, "OP_REDUCE_SUM"}, + {OP_PAD, "OP_PAD"}, + {OP_SHAPE, "OP_SHAPE"}, + {OP_SIZE, "OP_SIZE"}, + {OP_TOPK, "OP_TOPK"}, + {OP_WHERE, "OP_WHERE"}, + {OP_CEIL, "OP_CEIL"}, + {OP_CAST, "OP_CAST"}, + {OP_EXP, "OP_EXP"}, + {OP_ROUND, "OP_ROUND"}, + {OP_LOG, "OP_LOG"}, + {OP_LOGICAL_NOT, "OP_LOGICAL_NOT"}, + {OP_SQRT, "OP_SQRT"}, + {OP_SIN, "OP_SIN"}, + {OP_COS, "OP_COS"}, + {OP_LEAKYRELU, "OP_LEAKYRELU"}, + {OP_SLICE, "OP_SLICE"}, + {OP_RESIZE, "OP_RESIZE"}, + {OP_PRELU, "OP_PRELU"}, + {OP_GELU, "OP_GELU"}, + {OP_MULTIHEAD_ATTENTION, "OP_MULTIHEAD_ATTENTION"}, + {OP_INC_MULTIHEAD_SELF_ATTENTION, "OP_INC_MULTIHEAD_SELF_ATTENTION"}, + {OP_FUSED, "OP_FUSED"}, + {OP_RSQRT, "OP_RSQRT"}, + {OP_POW, "OP_POW"}, + {OP_MEAN, "OP_MEAN"}, + {OP_LAYERNORM, "OP_LAYERNORM"}, + {OP_RMS_NORM, "OP_RMS_NORM"}, + {OP_REPARTITION, "OP_PARTITION"}, + {OP_COMBINE, "OP_COMBINE"}, + {OP_REPLICATE, "OP_REPLICATE"}, + {OP_REDUCTION, "OP_REDUCE"}, + {OP_PIPELINE, "OP_PIPELINE"}, + {OP_FUSED_PARALLEL, "OP_FUSED_PARALLEL"}}) namespace FlexFlow { namespace substitution_loader { diff --git a/include/flexflow/utils/cuda_helper.h b/include/flexflow/utils/cuda_helper.h index 46e323b186..f8bf67b3e1 100644 --- a/include/flexflow/utils/cuda_helper.h +++ b/include/flexflow/utils/cuda_helper.h @@ -1,9 +1,13 @@ #ifndef _FLEXFLOW_CUDA_HELPER_H_ #define _FLEXFLOW_CUDA_HELPER_H_ +#include "flexflow/accessor.h" #include "flexflow/ffconst.h" #include "legion.h" #include #include +#ifdef FF_USE_NCCL +#include +#endif #define FatalError(s) \ do { \ @@ -82,6 +86,12 @@ __global__ void assign_kernel(DT *ptr, Legion::coord_t size, DT value); template __global__ void copy_kernel(DT *dst, const DT *src, Legion::coord_t size); +template +__global__ void copy_kernel_discrete(DT *dst, + const DT *src, + Legion::coord_t size, + size_t *index); + template __global__ void add_kernel(T *data_ptr, T const *grad_ptr, size_t size); @@ -131,12 +141,41 @@ __host__ void updateGAS(float *para_ptr, float learning_rate); template -void print_tensor(T const *ptr, size_t num_elements, char const *prefix); +void print_tensor(T const *ptr, + size_t num_elements, + char const *prefix, + int shard_id = 0); +template +void print_beam_tensor(T const *ptr, + size_t num_elements, + int skip, + int channel, + char const *prefix); + +template +void save_tensor(T const *ptr, size_t num_elements, char const *file_name); + +template +T *download_tensor(T const *ptr, size_t num_elements); + +template +bool download_tensor(T const *ptr, T *dst, size_t num_elements); cudnnStatus_t cudnnSetTensorDescriptorFromDomain(cudnnTensorDescriptor_t tensor, - Legion::Domain domain); + Legion::Domain domain, + DataType data_type = DT_FLOAT); -cudaDataType_t ff_to_cuda_datatype(DataType type); +cudnnStatus_t + cudnnSetTensorDescriptorFromDomain4SoftMax(cudnnTensorDescriptor_t tensor, + Legion::Domain 
domain, + DataType data_type = DT_FLOAT); +cudaDataType_t ff_to_cuda_datatype(DataType type); cudnnDataType_t ff_to_cudnn_datatype(DataType type); -#endif \ No newline at end of file +#ifdef FF_USE_NCCL +ncclDataType_t ff_to_nccl_datatype(DataType type); +#endif + +cudaDataType_t cudnn_to_cuda_datatype(cudnnDataType_t type); +cudnnDataType_t cuda_to_cudnn_datatype(cudaDataType_t type); +#endif diff --git a/include/flexflow/utils/hip_helper.h b/include/flexflow/utils/hip_helper.h index 6970832231..d16f353ade 100644 --- a/include/flexflow/utils/hip_helper.h +++ b/include/flexflow/utils/hip_helper.h @@ -1,5 +1,6 @@ #ifndef _FLEXFLOW_HIP_HELPER_H_ #define _FLEXFLOW_HIP_HELPER_H_ +#include "flexflow/accessor.h" #include "flexflow/ffconst.h" #include "legion.h" #include @@ -133,9 +134,16 @@ __host__ void updateGAS(float *para_ptr, template void print_tensor(T const *ptr, size_t num_elements, char const *prefix); +template +T *download_tensor(T const *ptr, size_t num_elements); + +template +bool download_tensor(T const *ptr, T *dst, size_t num_elements); + miopenStatus_t cudnnSetTensorDescriptorFromDomain(miopenTensorDescriptor_t tensor, - Legion::Domain domain); + Legion::Domain domain, + DataType data_type = DT_FLOAT); hipblasDatatype_t ff_to_cuda_datatype(DataType type); diff --git a/include/flexflow/utils/memory_allocator.h b/include/flexflow/utils/memory_allocator.h new file mode 100644 index 0000000000..8e50a4c3b3 --- /dev/null +++ b/include/flexflow/utils/memory_allocator.h @@ -0,0 +1,67 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#ifndef _FLEXFLOW_UTILS_MEMORY_ALLOCATOR_H_ +#define _FLEXFLOW_UTILS_MEMORY_ALLOCATOR_H_ + +#include "flexflow/config.h" + +namespace FlexFlow { + +class MemoryAllocator { +public: + MemoryAllocator(Legion::Memory memory); + void create_legion_instance(Realm::RegionInstance &inst, size_t size); + void register_reserved_work_space(void *base, size_t size); + inline void *allocate_reserved_untyped(size_t datalen) { + void *ptr = static_cast(reserved_ptr) + reserved_allocated_size; + reserved_allocated_size += datalen; + assert(reserved_allocated_size <= reserved_total_size); + return ptr; + } + template + inline DT *allocate_reserved(size_t count) { + void *ptr = static_cast(reserved_ptr) + reserved_allocated_size; + reserved_allocated_size += sizeof(DT) * count; + assert(reserved_allocated_size <= reserved_total_size); + return static_cast
<DT *>(ptr); + } + + inline void *allocate_instance_untyped(size_t datalen) { + void *ptr = static_cast<char *>(instance_ptr) + instance_allocated_size; + instance_allocated_size += datalen; + assert(instance_allocated_size <= instance_total_size); + return ptr; + } + + template <typename DT> + inline DT *allocate_instance(size_t count) { + void *ptr = static_cast<char *>(instance_ptr) + instance_allocated_size; + instance_allocated_size += sizeof(DT) * count; + assert(instance_allocated_size <= instance_total_size); + return static_cast<DT *>
(ptr); + } + +public: + Legion::Memory memory; + void *reserved_ptr; + void *instance_ptr; + size_t reserved_total_size, reserved_allocated_size; + size_t instance_total_size, instance_allocated_size; +}; + +}; // namespace FlexFlow + +#endif // _FLEXFLOW_UTILS_MEMORY_ALLOCATOR_H_ diff --git a/inference/.gitignore b/inference/.gitignore new file mode 100644 index 0000000000..8ab99cb1eb --- /dev/null +++ b/inference/.gitignore @@ -0,0 +1,5 @@ +configs +weights +tokenizers +prompt +output diff --git a/inference/MODEL_WEIGHTS.md b/inference/MODEL_WEIGHTS.md new file mode 100644 index 0000000000..e46e6b45d1 --- /dev/null +++ b/inference/MODEL_WEIGHTS.md @@ -0,0 +1,27 @@ +To convert the weights of a HuggingFace LLM to SpecInfer's weight format, we first load the model and modify the tensor names to match SpecInfer's convention, and then convert these tensors to numpy arrays and store them in binary files. + +```python +from transformers import AutoModelForCausalLM +model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf") + +for name, params in model.named_parameters(): + name = ( + name.replace(".", "_") + .replace("self_attn", "attention") + .replace("q_proj", "wq") + .replace("k_proj", "wk") + .replace("v_proj", "wv") + .replace("o_proj", "wo") + .replace("mlp", "feed_forward") + .replace("gate_proj", "w1") + .replace("down_proj", "w2") + .replace("up_proj", "w3") + .replace("input_layernorm", "attention_norm") + .replace("post_attention_layernorm", "ffn_norm") + .replace("embed_tokens", "tok_embeddings") + .replace("lm_head", "output") + .replace("model_", "") + ) + params.detach().cpu().numpy().tofile('weights/llama_7B_weights/' + name) +``` + diff --git a/inference/file_loader.cc b/inference/file_loader.cc new file mode 100644 index 0000000000..78f190dad6 --- /dev/null +++ b/inference/file_loader.cc @@ -0,0 +1,752 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */ + +#include "file_loader.h" +#include "flexflow/ffconst_utils.h" +#include "flexflow/inference.h" + +#include +using namespace std; + +using namespace Legion; + +FileDataLoader::FileDataLoader(std::string _input_path, + std::string _weight_file_path, + int _num_heads, + int _num_kv_heads, + size_t _hidden_dim, + size_t _qkv_inner_dim, + int _tensor_parallelism_degree) + : input_path(_input_path), weight_file_path(_weight_file_path), + num_heads(_num_heads), num_kv_heads(_num_kv_heads), + hidden_dim(_hidden_dim), qkv_inner_dim(_qkv_inner_dim), + tensor_parallelism_degree(_tensor_parallelism_degree){}; + +BatchConfig::TokenId *FileDataLoader::generate_requests(int num, int length) { + + BatchConfig::TokenId *prompts = + (BatchConfig::TokenId *)malloc(sizeof(BatchConfig::TokenId) * 40); + std::ifstream in(input_path, std::ios::in | std::ios::binary); + int size = num * length; + std::vector host_array(size); + size_t loaded_data_size = sizeof(long) * size; + + in.seekg(0, in.end); + in.seekg(0, in.beg); + in.read((char *)host_array.data(), loaded_data_size); + + size_t in_get_size = in.gcount(); + if (in_get_size != loaded_data_size) { + std::cout << "load data error" << std::endl; + return prompts; + } + + assert(size == host_array.size()); + int index = 0; + int data_index = 0; + + for (auto v : host_array) { + prompts[data_index++] = v; + } + in.close(); + return prompts; +}; + +template +void load_attention_weights_multi_query(DT *ptr, + std::string layer_name, + std::string weight_path, + size_t hidden_dim, + int num_heads) { + + std::string qkv_file = weight_path + + layer_name.substr(0, layer_name.find("attention")) + + "attention_query_key_value_weight"; + std::string o_file = weight_path + + layer_name.substr(0, layer_name.find("attention")) + + "attention_dense_weight"; + + // q has n_heads heads, k and v only have one head, o have n_head heads + std::vector weight_files = {qkv_file, o_file}; + int file_index = 0; + int data_index = 0; + for (auto file : weight_files) { + size_t partial_size = + file_index == 0 ? (hidden_dim + 2 * hidden_dim / num_heads) * hidden_dim + : hidden_dim * hidden_dim; + + std::ifstream in(file, std::ios::in | std::ios::binary); + // std::cout << "Loading filename: " << file << std::endl; + if (!in.good()) { + std::cout << "Could not open file: " << file << std::endl; + } + assert(in.good() && "incorrect weight file path"); + std::vector
host_array(partial_size); + size_t loaded_data_size = sizeof(DT) * partial_size; + in.seekg(0, in.end); + in.seekg(0, in.beg); + in.read((char *)host_array.data(), loaded_data_size); + size_t in_get_size = in.gcount(); + + if (in_get_size != loaded_data_size) { + std::cout << "load data error " << in_get_size << ", " + << loaded_data_size; + assert(false && "data size mismatch"); + } + for (int i = 0; i < partial_size; i++) { + ptr[data_index++] = host_array.at(i); + } + file_index++; + } +} + +template +void load_attention_bias_v2(DT *ptr, + int num_heads, + int num_kv_heads, + size_t hidden_dim, + size_t qkv_inner_dim, + std::string layer_name, + std::string weight_path) { + std::string q_file = weight_path + + layer_name.substr(0, layer_name.find("attention")) + + "attention_wq_bias"; + std::string k_file = weight_path + + layer_name.substr(0, layer_name.find("attention")) + + "attention_wk_bias"; + std::string v_file = weight_path + + layer_name.substr(0, layer_name.find("attention")) + + "attention_wv_bias"; + std::string o_file = weight_path + + layer_name.substr(0, layer_name.find("attention")) + + "attention_wo_bias"; + std::vector bias_files = {q_file, k_file, v_file, o_file}; + + int file_index = 0; + + // now only opt use this. + // assert(num_heads == num_kv_heads); + int idx = 0; + + for (auto file : bias_files) { + int n_heads = file_index == 0 ? num_heads : num_kv_heads; + size_t qkv_partial_size = qkv_inner_dim * n_heads; + size_t out_partial_size = hidden_dim; + size_t partial_size = + (file_index < 3) ? qkv_partial_size : out_partial_size; + std::ifstream in(file, std::ios::in | std::ios::binary); + assert(in.good() && "incorrect bias file path"); + std::vector
host_array(partial_size); + size_t loaded_data_size = sizeof(DT) * partial_size; + in.seekg(0, in.end); + in.seekg(0, in.beg); + in.read((char *)host_array.data(), loaded_data_size); + size_t in_get_size = in.gcount(); + + if (in_get_size != loaded_data_size) { + printf( + "load bias data error: in_get_size (%lu) != loaded_data_size (%lu)\n", + in_get_size, + loaded_data_size); + assert(false); + } + assert(partial_size == host_array.size()); + + size_t data_index = 0; + + for (int i = 0; i < partial_size; i++) { + ptr[idx + i] = host_array.at(data_index); + data_index++; + } + + file_index++; + idx += qkv_partial_size; + + in.close(); + } +} + +template +void load_attention_weights_v2(DT *ptr, + int num_heads, + int num_kv_heads, + size_t hidden_dim, + size_t qkv_inner_dim, + std::string layer_name, + std::string weight_path, + size_t volume, + int tensor_parallelism_degree) { + // layers_0_attention_wq_weight + // layers_0_self_attn_q_proj_weight + std::string q_file = weight_path + + layer_name.substr(0, layer_name.find("attention")) + + "attention_wq_weight"; + std::string k_file = weight_path + + layer_name.substr(0, layer_name.find("attention")) + + "attention_wk_weight"; + std::string v_file = weight_path + + layer_name.substr(0, layer_name.find("attention")) + + "attention_wv_weight"; + std::string o_file = weight_path + + layer_name.substr(0, layer_name.find("attention")) + + "attention_wo_weight"; + std::vector weight_files = {q_file, k_file, v_file}; + int file_index = 0; + + int base_index = 0; + size_t single_proj_size = + hidden_dim * + qkv_inner_dim; // size of each of Q,K,V,O weights for a single head + size_t one_weight_file_size = + num_heads * single_proj_size; // size of each of Q/K/V/O for all heads + + size_t q_size = one_weight_file_size, o_size = one_weight_file_size; + size_t k_size = single_proj_size * num_kv_heads, + v_size = single_proj_size * num_kv_heads; + + // stride for q, k, v, o + size_t stride_size = + (q_size + v_size + k_size + o_size) / tensor_parallelism_degree; + for (auto file : weight_files) { + int data_index = 0; + size_t partial_size = (file_index == 0 || file_index == 3) + ? one_weight_file_size + : single_proj_size * num_kv_heads; + size_t one_partition_size = partial_size / tensor_parallelism_degree; + + std::ifstream in(file, std::ios::in | std::ios::binary); + if (!in.good()) { + std::cout << "Could not open file: " << file << std::endl; + } + assert(in.good() && "incorrect weight file path"); + std::vector
host_array(partial_size); + size_t loaded_data_size = sizeof(DT) * partial_size; + in.seekg(0, in.end); + in.seekg(0, in.beg); + in.read((char *)host_array.data(), loaded_data_size); + size_t in_get_size = in.gcount(); + + if (in_get_size != loaded_data_size) { + std::cout << "load attention data error " << in_get_size << ", " + << loaded_data_size << ", " << file_index << ", " << file + << "\n"; + assert(false && "data size mismatch"); + } + // wq, wk, wo + for (int i = 0; i < tensor_parallelism_degree; i++) { + for (int j = 0; j < one_partition_size; j++) { + ptr[base_index + i * stride_size + j] = host_array.at(data_index++); + } + } + assert(data_index == partial_size); + base_index += one_partition_size; + file_index++; + } + assert(base_index == (q_size + k_size + v_size) / tensor_parallelism_degree); + + { + std::ifstream in(o_file, std::ios::in | std::ios::binary); + if (!in.good()) { + std::cout << "Could not open file: " << o_file << std::endl; + } + assert(in.good() && "incorrect weight file path"); + std::vector
host_array(one_weight_file_size); + size_t loaded_data_size = sizeof(DT) * one_weight_file_size; + in.seekg(0, in.end); + in.seekg(0, in.beg); + in.read((char *)host_array.data(), loaded_data_size); + size_t in_get_size = in.gcount(); + + if (in_get_size != loaded_data_size) { + std::cout << "load data error" << std::endl; + assert(false); + } + assert(one_weight_file_size == host_array.size()); + int data_index = 0; + + int one_partition_size = + qkv_inner_dim * (num_heads / tensor_parallelism_degree); + for (int i = 0; i < one_weight_file_size; i++) { + int part_idx = (i / one_partition_size) % tensor_parallelism_degree; + int block_num = (i / one_partition_size); + int offset = block_num / tensor_parallelism_degree * one_partition_size + + (i % one_partition_size); + ptr[base_index + part_idx * stride_size + offset] = + host_array.at(data_index++); + } + + in.close(); + + assert(data_index == one_weight_file_size); + } +} + +template +void load_from_file(DT *ptr, size_t size, std::string filename) { + std::ifstream in(filename, std::ios::in | std::ios::binary); + if (!in.good()) { + std::cout << "Could not open file: " << filename << std::endl; + } + assert(in.good() && "incorrect weight file path"); + std::vector
host_array(size); + size_t loaded_data_size = sizeof(DT) * size; + in.seekg(0, in.end); + in.seekg(0, in.beg); + in.read((char *)host_array.data(), loaded_data_size); + + size_t in_get_size = in.gcount(); + if (in_get_size != loaded_data_size) { + std::cout << "load weight data error " << in_get_size << ", " + << loaded_data_size << ", " << sizeof(DT) << std::endl; + assert(false); + } + assert(size == host_array.size()); + + // normal + long data_index = 0; + for (auto v : host_array) { + ptr[data_index++] = v; + } + in.close(); +} + +void FileDataLoader::load_positions(FFModel *ff, + Tensor pt, + ParallelTensor position_pt, + int max_seq_length, + int offset) { + size_t volume = 1; + std::vector dims_vec; + for (int i = 0; i < pt->num_dims; i++) { + volume *= pt->dims[i]; + dims_vec.push_back(pt->dims[i]); + } + + // load data; + int *data = (int *)malloc(sizeof(int) * volume); + for (int i = 0; i < volume; i++) { + data[i] = i % max_seq_length + offset; + } + // set tensor + + // ParallelTensor position_pt; + + // ff->get_parallel_tensor_from_tensor(pt, position_pt); + position_pt->set_tensor(ff, dims_vec, data); +} + +//--------------------- quantization functions ---------------------- +// the data layout is 32 * quantized data + 1 scaling factor + 1 offset factor +// in the decompression mode, the real data = quantized data * scaling factor + +// offset + +void load_attention_weights_quantized(char *ptr, + int num_heads, + size_t hidden_dim, + size_t qkv_inner_dim, + std::string layer_name, + std::string weight_path, + DataType data_type, + bool use_full_precision) { + // layers_0_attention_wq_weight + // layers_0_self_attn_q_proj_weight + std::string q_file = weight_path + + layer_name.substr(0, layer_name.find("attention")) + + "attention_wq_weight"; + std::string k_file = weight_path + + layer_name.substr(0, layer_name.find("attention")) + + "attention_wk_weight"; + std::string v_file = weight_path + + layer_name.substr(0, layer_name.find("attention")) + + "attention_wv_weight"; + std::string o_file = weight_path + + layer_name.substr(0, layer_name.find("attention")) + + "attention_wo_weight"; + std::vector weight_files = {q_file, k_file, v_file, o_file}; + + int file_index = 0; + + size_t single_proj_size = + hidden_dim * + qkv_inner_dim; // size of each of Q,K,V,O weights for a single head + size_t one_weight_file_size = + num_heads * single_proj_size; // size of each of Q/K/V/O for all heads + + // q, k, v, o -> 0, 1, 2, 3 + for (auto file : weight_files) { + size_t partial_size = one_weight_file_size; + std::ifstream in(file, std::ios::in | std::ios::binary); + if (!in.good()) { + std::cout << "Could not open file: " << file << std::endl; + } + assert(in.good() && "incorrect weight file path"); + std::vector host_array(partial_size); + size_t loaded_data_size = sizeof(char) * partial_size; + in.seekg(0, in.end); + in.seekg(0, in.beg); + in.read((char *)host_array.data(), loaded_data_size); + size_t in_get_size = in.gcount(); + + if (in_get_size != loaded_data_size) { + std::cout << "load data error"; + return; + } + assert(partial_size == host_array.size()); + + size_t one_head_size = data_type == DT_INT8 + ? 
hidden_dim * (hidden_dim / num_heads) + : hidden_dim * (hidden_dim / num_heads) / 2; + + size_t data_index = 0; + for (int i = 0; i < num_heads; i++) { + size_t start_index = i * one_head_size * 4 + file_index * one_head_size; + for (size_t j = start_index; j < start_index + one_head_size; j++) { + if (data_type == DT_INT4) { + char v1 = host_array.at(data_index); + char v2 = host_array.at(data_index + 1); + ptr[j] = (v2 & 0XF) | (v1 << 4); + data_index += 2; + } else { + ptr[j] = host_array.at(data_index); + data_index += 1; + } + } + } + file_index++; + in.close(); + } + + // load scale and offset to the end of weight tensor + // the layout is like |values * 32 heads|offset|scale| + size_t offset = data_type == DT_INT8 ? one_weight_file_size * 4 + : (one_weight_file_size * 4) / 2; + for (auto file : weight_files) { + for (int i = 0; i < 2; i++) { + std::string meta_file = i == 0 ? (file + "_offset") : (file + "_scale"); + size_t partial_size = + one_weight_file_size / INT4_NUM_OF_ELEMENTS_PER_GROUP; + std::ifstream in(meta_file, std::ios::in | std::ios::binary); + if (!in.good()) { + std::cout << "Could not open file: " << meta_file << std::endl; + } + assert(in.good() && "incorrect weight file path"); + + if (use_full_precision) { + // float + std::vector host_array(partial_size); + size_t loaded_data_size = sizeof(float) * partial_size; + in.seekg(0, in.end); + in.seekg(0, in.beg); + in.read((char *)host_array.data(), loaded_data_size); + size_t in_get_size = in.gcount(); + + if (in_get_size != loaded_data_size) { + std::cout << "load data error"; + return; + } + assert(partial_size == host_array.size()); + + for (auto v : host_array) { + *(float *)(ptr + offset) = v; + offset += sizeof(float); + } + } else { + // half + std::vector host_array(partial_size); + size_t loaded_data_size = sizeof(half) * partial_size; + in.seekg(0, in.end); + in.seekg(0, in.beg); + in.read((char *)host_array.data(), loaded_data_size); + size_t in_get_size = in.gcount(); + + if (in_get_size != loaded_data_size) { + std::cout << "load data error"; + return; + } + assert(partial_size == host_array.size()); + for (auto v : host_array) { + *(half *)(ptr + offset) = v; + offset += sizeof(half); + } + } + } + } +} + +void load_from_quantized_file(char *ptr, + size_t size, + std::string filename, + DataType data_type, + bool use_full_precision) { + assert(data_type == DT_INT4 || data_type == DT_INT8); + + std::string value_file = filename; + std::string offset_file = filename + "_offset"; + std::string scaling_file = filename + "_scale"; + size_t value_size = 0, offset_size = 0, scaling_size = 0; + + if (data_type == DT_INT4) { + // float/half + 4bit quantization + // size1 = volume / 2, size2 = volume / 32 * (sizeof(DT)), size3 = size2 + value_size = 2 * (use_full_precision ? (size * 2 / 3) : (size * 4 / 5)); + offset_size = use_full_precision ? (size / 6) : (size / 10); + scaling_size = use_full_precision ? (size / 6) : (size / 10); + } else if (data_type == DT_INT8) { + // float/half + 8bit quantization + // size1 = volume * 1, size2 = volume / 32 * (sizeof(DT)), size3 = size2 + value_size = use_full_precision ? (size * 4 / 5) : (size * 8 / 9); + offset_size = use_full_precision ? (size / 10) : (size / 18); + scaling_size = use_full_precision ? 
(size / 10) : (size / 18); + } + + std::vector quantized_files = { + value_file, offset_file, scaling_file}; + std::vector quantized_sizes = {value_size, offset_size, scaling_size}; + + int file_idx = 0; + long data_index = 0; + for (auto file : quantized_files) { + std::ifstream in(file, std::ios::in | std::ios::binary); + if (!in.good()) { + std::cout << "Could not open file: " << file << std::endl; + } + assert(in.good() && "incorrect weight file path"); + + // value file, every element is in one byte + if (file_idx == 0) { + size = quantized_sizes.at(file_idx); + std::vector host_array(size); + size_t loaded_data_size = size; + in.seekg(0, in.end); + in.seekg(0, in.beg); + in.read((char *)host_array.data(), loaded_data_size); + + size_t in_get_size = in.gcount(); + if (in_get_size != loaded_data_size) { + std::cout << "load weight data error quantized" << in_get_size << ", " + << loaded_data_size << ", " << sizeof(char) << std::endl; + return; + } + assert(size == host_array.size()); + + // normal + size_t idx = 0; + while (idx < host_array.size()) { + if (data_type == DT_INT4) { + // pack 2 elements into one byte + char v1 = host_array.at(idx); + char v2 = host_array.at(idx + 1); + // v1 in first 4 bit and v2 in last 4 bit; + ptr[data_index++] = (v2 & 0XF) | (v1 << 4); + idx += 2; + } else { + ptr[data_index++] = host_array.at(idx++); + } + } + } else if (use_full_precision) { + // load offset/scale in float type; + size = quantized_sizes.at(file_idx); + std::vector host_array(size / sizeof(float)); + size_t loaded_data_size = size; + in.seekg(0, in.end); + in.seekg(0, in.beg); + in.read((char *)host_array.data(), loaded_data_size); + + size_t in_get_size = in.gcount(); + if (in_get_size != loaded_data_size) { + std::cout << "load weight data error scale/offset" << in_get_size + << ", " << loaded_data_size << ", " << sizeof(float) << ", " + << file << ", " << size << std::endl; + return; + } + assert(size / sizeof(float) == host_array.size()); + for (auto v : host_array) { + *(float *)(ptr + data_index) = v; + data_index += sizeof(float); + } + + } else { + // load offset/scale in half type; + size = quantized_sizes.at(file_idx); + std::vector host_array(size / sizeof(half)); + size_t loaded_data_size = size; + in.seekg(0, in.end); + in.seekg(0, in.beg); + in.read((char *)host_array.data(), loaded_data_size); + + size_t in_get_size = in.gcount(); + if (in_get_size != loaded_data_size) { + std::cout << "load weight data error " << in_get_size << ", " + << loaded_data_size << ", " << sizeof(half) << std::endl; + return; + } + assert(size / sizeof(half) == host_array.size()); + // normal + for (auto v : host_array) { + *(half *)(ptr + data_index) = v; + data_index += sizeof(half); + } + } + in.close(); + file_idx++; + } +} + +void FileDataLoader::load_quantization_weight(FFModel *ff, + Tensor weight, + int weight_idx, + std::string const &layername, + bool use_full_precision) { + size_t volume = 1; + std::vector dims_vec; + for (int i = 0; i < weight->num_dims; i++) { + dims_vec.push_back(weight->dims[i]); + volume *= weight->dims[i]; + } + + char *data = (char *)malloc(sizeof(char) * volume); + + std::string file_path = + (layername.back() == '/') ? 
layername : "/" + layername; + + if (file_path.find("attention_w") != std::string::npos) { + if (weight_idx == 0) { + load_attention_weights_quantized(data, + num_heads, + hidden_dim, + qkv_inner_dim, + file_path, + weight_file_path, + weight->data_type, + use_full_precision); + } + // else { + // load_attention_bias_quantized(data, + // num_heads, + // hidden_dim, + // qkv_inner_dim, + // file_path, + // weight_file_path); + // } + + } else { + if (weight_idx > 0) { + int index = file_path.find("_weight"); + assert(index != std::string::npos); + file_path = file_path.substr(0, index) + "_bias"; + } + load_from_quantized_file(data, + volume, + weight_file_path + file_path, + weight->data_type, + use_full_precision); + } + + ParallelTensor weight_pt; + ff->get_parallel_tensor_from_tensor(weight, weight_pt); + weight_pt->set_tensor(ff, dims_vec, data); + + delete data; +} + +template +void FileDataLoader::load_single_weight_tensor(FFModel *ff, + Tensor weight, + int weight_idx, + std::string const &layername) { + size_t volume = 1; + std::vector dims_vec; + for (int i = 0; i < weight->num_dims; i++) { + dims_vec.push_back(weight->dims[i]); + volume *= weight->dims[i]; + } + + std::cout << "load weights: " << layername << "\n"; + + assert(data_type_size(weight->data_type) == sizeof(DT)); + DT *data = (DT *)malloc(sizeof(DT) * volume); + + std::string file_path = + (layername.back() == '/') ? layername : "/" + layername; + + if (file_path.find("attention_w") != std::string::npos) { + if (weight_idx == 0) { + load_attention_weights_v2(data, + num_heads, + num_kv_heads, + hidden_dim, + qkv_inner_dim, + file_path, + weight_file_path, + volume, + tensor_parallelism_degree); + } else { + load_attention_bias_v2(data, + num_heads, + num_kv_heads, + hidden_dim, + qkv_inner_dim, + file_path, + weight_file_path); + } + + } else if (file_path.find("self_attention") != std::string::npos) { + load_attention_weights_multi_query( + data, file_path, weight_file_path, hidden_dim, num_heads); + } else { + if (weight_idx > 0) { + int index = file_path.find("_weight"); + assert(index != std::string::npos); + file_path = file_path.substr(0, index) + "_bias"; + } + load_from_file(data, volume, weight_file_path + file_path); + } + + ParallelTensor weight_pt; + ff->get_parallel_tensor_from_tensor(weight, weight_pt); + weight_pt->set_tensor
(ff, dims_vec, data); + + delete data; +} + +void FileDataLoader::load_weights( + FFModel *ff, + std::unordered_map weights_layers, + bool use_full_precision) { + for (auto &v : weights_layers) { + int weights_num = v.second->numWeights; + for (int i = 0; i < weights_num; i++) { + Tensor weight = v.second->weights[i]; + if (weight == NULL) { + continue; + } + switch (weight->data_type) { + case DT_HALF: + load_single_weight_tensor(ff, weight, i, v.first); + break; + case DT_FLOAT: + load_single_weight_tensor(ff, weight, i, v.first); + break; + case DT_INT4: + case DT_INT8: + // load weights in quantization + load_quantization_weight(ff, weight, i, v.first, use_full_precision); + break; + default: + assert(false && "Unsupported data type"); + } + } + } +} diff --git a/inference/file_loader.h b/inference/file_loader.h new file mode 100644 index 0000000000..aaef861d09 --- /dev/null +++ b/inference/file_loader.h @@ -0,0 +1,63 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include "flexflow/batch_config.h" +#include "flexflow/inference.h" +#include "flexflow/model.h" + +using namespace std; +using namespace FlexFlow; + +class FileDataLoader { +public: + FileDataLoader(std::string _input_path, + std::string _weight_file_path, + int _num_heads, + int _num_kv_heads, + size_t _hidden_dim, + size_t _qkv_inner_dim, + int _tensor_parallelism_degree); + + BatchConfig::TokenId *generate_requests(int num, int length); + + template + void load_single_weight_tensor(FFModel *ff, + Tensor weight, + int weight_idx, + std::string const &layername); + + void load_quantization_weight(FFModel *ff, + Tensor weight, + int weight_idx, + std::string const &layername, + bool use_full_precision); + void load_weights(FFModel *ff, + std::unordered_map weights_layers, + bool use_full_precision); + + void load_positions(FFModel *ff, + Tensor pt, + ParallelTensor position_pt, + int max_seq_length, + int offset); + +private: + int num_heads, num_kv_heads, tensor_parallelism_degree; + size_t hidden_dim, qkv_inner_dim; + std::string input_path; + std::string weight_file_path; +}; diff --git a/inference/incr_decoding/CMakeLists.txt b/inference/incr_decoding/CMakeLists.txt new file mode 100644 index 0000000000..c3b97d094a --- /dev/null +++ b/inference/incr_decoding/CMakeLists.txt @@ -0,0 +1,37 @@ +cmake_minimum_required(VERSION 3.10) + +project(FlexFlow_IncrDecoding) +set(project_target incr_decoding) + + +set(CPU_SRC + ${FLEXFLOW_CPP_DRV_SRC} + incr_decoding.cc + ../file_loader.cc + ../models/llama.cc + ../models/opt.cc + ../models/falcon.cc + ../models/starcoder.cc) + +if (FF_GPU_BACKEND STREQUAL "cuda" OR FF_GPU_BACKEND STREQUAL "hip_cuda") + cuda_add_executable(${project_target} ${CPU_SRC}) + if (FF_GPU_BACKEND STREQUAL "hip_cuda") + target_compile_definitions(${project_target} PRIVATE __HIP_PLATFORM_NVIDIA__) + endif() +elseif(FF_GPU_BACKEND STREQUAL "hip_rocm") + hip_add_executable(${project_target} 
${CPU_SRC}) + if (FF_HIP_ARCH STREQUAL "") + message(FATAL_ERROR "FF_HIP_ARCH is empty!") + endif() + set_property(TARGET ${project_target} PROPERTY HIP_ARCHITECTURES "${FF_HIP_ARCH}") + target_compile_definitions(${project_target} PRIVATE __HIP_PLATFORM_AMD__) +else() + message(FATAL_ERROR "Compilation of ${project_target} for ${FF_GPU_BACKEND} backend not yet supported") +endif() + +target_include_directories(${project_target} PRIVATE ${FLEXFLOW_INCLUDE_DIRS} ${CMAKE_INSTALL_INCLUDEDIR}) +target_include_directories(${project_target} PRIVATE ${CMAKE_SOURCE_DIR}/inference) +target_link_libraries(${project_target} -Wl,--whole-archive flexflow -Wl,--no-whole-archive ${FLEXFLOW_EXT_LIBRARIES}) + +set(BIN_DEST "bin") +install(TARGETS ${project_target} DESTINATION ${BIN_DEST}) diff --git a/inference/incr_decoding/Makefile b/inference/incr_decoding/Makefile new file mode 100644 index 0000000000..0e4b79f51f --- /dev/null +++ b/inference/incr_decoding/Makefile @@ -0,0 +1,37 @@ +# Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# Flags for directing the runtime makefile what to include +DEBUG ?= 0 # Include debugging symbols +MAX_DIM ?= 4 # Maximum number of dimensions +OUTPUT_LEVEL ?= LEVEL_DEBUG # Compile time logging level +USE_CUDA ?= 1 # Include CUDA support (requires CUDA) +USE_GASNET ?= 0 # Include GASNet support (requires GASNet) +USE_HDF ?= 1 # Include HDF5 support (requires HDF5) +ALT_MAPPERS ?= 0 # Include alternative mappers (not recommended) + +# Put the binary file name here +OUTFILE ?= llama_pipeline +# List all the application source files here +ifndef CUDA_HOME +CUDA_HOME = $(patsubst %/bin/nvcc,%,$(shell which nvcc | head -1)) +endif + + +ifndef FF_HOME +$(error FF_HOME variable is not defined, aborting build) +endif + +include $(FF_HOME)/FlexFlow.mk diff --git a/inference/incr_decoding/incr_decoding.cc b/inference/incr_decoding/incr_decoding.cc new file mode 100644 index 0000000000..10b4744195 --- /dev/null +++ b/inference/incr_decoding/incr_decoding.cc @@ -0,0 +1,252 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/inference.h" +#include "flexflow/request_manager.h" +#include "models/falcon.h" +#include "models/llama.h" +#include "models/opt.h" +#include "models/starcoder.h" +#include + +#include + +using namespace Legion; +using json = nlohmann::json; + +LegionRuntime::Logger::Category log_app("llama"); + +struct FilePaths { + std::string cache_folder_path; + std::string prompt_file_path; + std::string output_file_path; +}; + +void parse_input_args(char **argv, + int argc, + FilePaths &paths, + std::string &llm_model_name, + bool &use_full_precision, + bool &verbose, + bool &do_sample, + float &temperature, + float &topp) { + for (int i = 1; i < argc; i++) { + // llm model type + if (!strcmp(argv[i], "-llm-model")) { + llm_model_name = std::string(argv[++i]); + for (char &c : llm_model_name) { + c = std::tolower(c); + } + continue; + } + // cache folder + if (!strcmp(argv[i], "-cache-folder")) { + paths.cache_folder_path = std::string(argv[++i]); + continue; + } + // prompts + if (!strcmp(argv[i], "-prompt")) { + paths.prompt_file_path = std::string(argv[++i]); + continue; + } + // output file + if (!strcmp(argv[i], "-output-file")) { + paths.output_file_path = std::string(argv[++i]); + continue; + } + if (!strcmp(argv[i], "--use-full-precision")) { + use_full_precision = true; + continue; + } + // verbose logging to stdout + if (!strcmp(argv[i], "--verbose")) { + verbose = true; + continue; + } + if (!strcmp(argv[i], "--do-sample")) { + do_sample = true; + continue; + } + if (!strcmp(argv[i], "--temperature")) { + temperature = std::stof(argv[++i]); + continue; + } + if (!strcmp(argv[i], "--topp")) { + topp = std::stof(argv[++i]); + continue; + } + } + if (paths.cache_folder_path.empty()) { + paths.cache_folder_path = "~/.cache/flexflow"; + } + // Expand ~ to the home directory if needed + wordexp_t p; + wordexp(paths.cache_folder_path.c_str(), &p, 0); + paths.cache_folder_path = p.we_wordv[0]; + wordfree(&p); +} + +void FlexFlow::top_level_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + FFConfig ffconfig; + if (ffconfig.cpu_offload == false && ffconfig.quantization_type != DT_NONE) { + assert(false && "Doesn't support quantization in non-offload mode"); + } + FilePaths file_paths; + std::string llm_model_name; + bool use_full_precision = false; + bool verbose = false; + bool do_sample = false; + float temperature = 0.0f; + float topp = 0.0f; + + InputArgs const &command_args = HighLevelRuntime::get_input_args(); + char **argv = command_args.argv; + int argc = command_args.argc; + parse_input_args(argv, + argc, + file_paths, + llm_model_name, + use_full_precision, + verbose, + do_sample, + temperature, + topp); + + assert(ffconfig.data_parallelism_degree * ffconfig.tensor_parallelism_degree * + ffconfig.pipeline_parallelism_degree == + ffconfig.numNodes * ffconfig.workersPerNode); + + std::string config_filepath = join_path( + {file_paths.cache_folder_path, "configs", llm_model_name, "config.json"}); + std::string tokenizer_filepath = + join_path({file_paths.cache_folder_path, "tokenizers", llm_model_name}); + std::string weights_filepath = + join_path({file_paths.cache_folder_path, + "weights", + llm_model_name, + use_full_precision ? "full-precision" : "half-precision"}); + std::ifstream config_file_handle(config_filepath); + if (!config_file_handle.good()) { + std::cout << "Model config file " << config_filepath << " not found." 
+ << std::endl; + assert(false); + } + json model_config = json::parse(config_file_handle, + /*parser_callback_t */ nullptr, + /*allow_exceptions */ true, + /*ignore_comments */ true); + + ModelType model_type = ModelType::UNKNOWN; + auto architectures = model_config["architectures"]; + for (auto const &str : architectures) { + if (str == "LlamaForCausalLM" || str == "LLaMAForCausalLM") { + std::string nameOrPath = model_config["_name_or_path"]; + // TODO: support LLAMA-2 models not from Meta + bool llama2 = nameOrPath.find("meta-llama/Llama-2") == 0; + if (llama2) { + model_type = ModelType::LLAMA2; + } else { + model_type = ModelType::LLAMA; + } + break; + } else if (str == "OPTForCausalLM") { + model_type = ModelType::OPT; + break; + } else if (str == "RWForCausalLM") { + model_type = ModelType::FALCON; + break; + } else if (str == "GPTBigCodeForCausalLM") { + model_type = ModelType::STARCODER; + break; + } + } + int bos_token_id = model_config["bos_token_id"]; + int eos_token_id = model_config["eos_token_id"]; + + assert(model_type != ModelType::UNKNOWN && + "Invalid LLM model type passed (or no type was passed)."); + + GenerationConfig generationConfig(do_sample, temperature, topp); + RequestManager *rm = RequestManager::get_request_manager(); + rm->register_tokenizer( + model_type, bos_token_id, eos_token_id, tokenizer_filepath); + rm->register_output_filepath(file_paths.output_file_path); + + FFModel model(ffconfig, ffconfig.cpu_offload); + if (model_type == ModelType::LLAMA || model_type == ModelType::LLAMA2) { + LLAMA::create_llama_model(model, + config_filepath, + weights_filepath, + INC_DECODING_MODE, + generationConfig, + use_full_precision); + } else if (model_type == ModelType::OPT) { + OPT::create_opt_model(model, + config_filepath, + weights_filepath, + INC_DECODING_MODE, + use_full_precision); + } else if (model_type == ModelType::FALCON) { + FALCON::create_falcon_model(model, + config_filepath, + weights_filepath, + INC_DECODING_MODE, + use_full_precision); + } else if (model_type == ModelType::STARCODER) { + STARCODER::create_starcoder_model(model, + config_filepath, + weights_filepath, + INC_DECODING_MODE, + generationConfig, + use_full_precision); + } else { + assert(false && "unknow model type"); + } + + int total_num_requests = 0; + { + using json = nlohmann::json; + std::ifstream file_handle(file_paths.prompt_file_path); + assert(file_handle.good() && "Prompt file does not exist."); + json prompt_json = json::parse(file_handle, + /*parser_callback_t */ nullptr, + /*allow_exceptions */ true, + /*ignore_comments */ true); + for (auto &prompt : prompt_json) { + std::string text = prompt.get(); + printf("Prompt[%d]: %s\n", total_num_requests, text.c_str()); + total_num_requests++; + GenerationResult result = + model.generate(text, 128 /*max_sequence_length*/); + } + } + + // Execution fence + { + Future future = runtime->issue_execution_fence(ctx); + future.get_void_result(); + } + + // float* data + std::cout << "----------inference finished--------------" << std::endl; + + // free tokenizer space in memory +} + +void FlexFlow::register_custom_tasks() {} diff --git a/inference/models/falcon.cc b/inference/models/falcon.cc new file mode 100644 index 0000000000..d57504b8cf --- /dev/null +++ b/inference/models/falcon.cc @@ -0,0 +1,210 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "falcon.h" + +namespace FlexFlow { + +using namespace Legion; +using json = nlohmann::json; + +void FALCON::create_falcon_model(FFModel &ff, + std::string const &model_config_file_path, + std::string const &weight_file_path, + InferenceMode mode, + bool use_full_precision) { + FalconConfig falcon_config(model_config_file_path); + falcon_config.print(); + + if (ff.config.tensor_parallelism_degree > falcon_config.n_head || + falcon_config.n_head % ff.config.tensor_parallelism_degree != 0 || + ff.config.tensor_parallelism_degree > falcon_config.n_head_kv || + falcon_config.n_head_kv % ff.config.tensor_parallelism_degree != 0) { + assert(false && "The number of attention heads is smaller, or it is not " + "divisible by the tensor parallelism degree"); + } + + std::unordered_map weights_layers; + + Tensor input; + { + assert(falcon_config.max_num_tokens <= BatchConfig::MAX_NUM_TOKENS); + int const token_dims[] = {BatchConfig::MAX_NUM_TOKENS, 1}; + input = ff.create_tensor<2>(token_dims, DT_INT32); + } + + Initializer *embed_init = new UniformInitializer(std::rand(), 0, 0); + + Tensor token; + std::vector axes = {0}; + + if (use_full_precision) { + token = ff.embedding(input, + falcon_config.vocab_size, + falcon_config.hidden_size, + AGGR_MODE_NONE, + DT_FLOAT, + NULL, + embed_init); + } else { + token = ff.embedding(input, + falcon_config.vocab_size, + falcon_config.hidden_size, + AGGR_MODE_NONE, + DT_HALF, + NULL, + embed_init); + } + + Layer *embedding = ff.layers.back(); + weights_layers.emplace("word_embeddings_weight", embedding); + + for (int i = 0; i < falcon_config.n_layer; i++) { + // set transformer layer id + ff.set_transformer_layer_id(i); + // step 1: attention + Tensor att_norm = + ff.layer_norm(token, axes, true, falcon_config.layer_norm_epsilon); + Layer *attention_norm = ff.layers.back(); + + weights_layers.emplace("layers_" + std::to_string(i) + + "_input_layernorm_weight", + attention_norm); + Tensor mha; + switch (mode) { + case BEAM_SEARCH_MODE: { + mha = ff.spec_inc_multiquery_self_attention( + att_norm, + falcon_config.hidden_size, + falcon_config.n_head, + falcon_config.n_head_kv, + falcon_config.hidden_size / falcon_config.n_head, + falcon_config.hidden_size / falcon_config.n_head, + 0.0f, + false, + false, + false, + DT_NONE, + NULL, + true); + break; + } + + case TREE_VERIFY_MODE: { + mha = ff.inc_multiquery_self_attention_verify( + att_norm, + falcon_config.hidden_size, + falcon_config.n_head, + falcon_config.n_head_kv, + falcon_config.hidden_size / falcon_config.n_head, + falcon_config.hidden_size / falcon_config.n_head, + 0.0f, /*dropout*/ + false, /*bias*/ + false, /*add_bias_kv*/ + false, /*add_zero_attn*/ + DT_NONE, /*data_type*/ + nullptr, /*kernel_initializer*/ + true /*apply_rotary_embedding*/ + ); + break; + } + + case INC_DECODING_MODE: { + mha = ff.inc_multiquery_self_attention( + att_norm, + falcon_config.hidden_size, + falcon_config.n_head, + falcon_config.n_head_kv, + falcon_config.hidden_size / falcon_config.n_head, + falcon_config.hidden_size / falcon_config.n_head, + 0.0f, /*dropout*/ + false, /*bias*/ + false, 
/*add_bias_kv*/ + false, /*add_zero_attn*/ + DT_NONE, /*data_type*/ + nullptr, /*kernel_initializer*/ + true /*apply_rotary_embedding*/ + ); + break; + } + default: { + assert(false); + } + } + Layer *attention_layer = ff.layers.back(); + + // multi query + // weights_layers.emplace("layers_" + std::to_string(i) + + // "_self_attention_dense_weight", + // attention_layer); + + weights_layers.emplace("layers_" + std::to_string(i) + "_attention_weight", + attention_layer); + Tensor dense_h_to_4h = + ff.dense(att_norm, falcon_config.hidden_size * 4, AC_MODE_NONE, false); + Layer *dense_h_to_4h_layer = ff.layers.back(); + weights_layers.emplace("layers_" + std::to_string(i) + + "_mlp_dense_h_to_4h_weight", + dense_h_to_4h_layer); + dense_h_to_4h = ff.gelu(dense_h_to_4h); + Tensor mlp_output = + ff.dense(dense_h_to_4h, falcon_config.hidden_size, AC_MODE_NONE, false); + Layer *dense_4h_to_h_layer = ff.layers.back(); + weights_layers.emplace("layers_" + std::to_string(i) + + "_mlp_dense_4h_to_h_weight", + dense_4h_to_h_layer); + + token = ff.add(token, mha); + token = ff.add(token, mlp_output); + } + // final normalization and linear + Tensor ln_f = + ff.layer_norm(token, axes, true, falcon_config.layer_norm_epsilon); + Layer *ln_f_layer = ff.layers.back(); + weights_layers.emplace("ln_f_weight", ln_f_layer); + + Tensor lm_head = + ff.dense(ln_f, falcon_config.vocab_size, AC_MODE_NONE, false); + Layer *lm_head_layer = ff.layers.back(); + weights_layers.emplace("lm_head_weight", lm_head_layer); + + Tensor output; + if (mode == BEAM_SEARCH_MODE) { + Tensor softmax = ff.softmax(lm_head, -1); + output = ff.beam_top_k(softmax, falcon_config.max_beam_width, false); + } else { + output = ff.arg_top_k(lm_head, /*k=*/1, false); + } + + // Compile the model + std::cout << "------start compile ----------" << std::endl; + InferenceManager *im = InferenceManager::get_inference_manager(); + im->compile_model_and_allocate_buffer(&ff); + FileDataLoader fileloader("", + weight_file_path, + falcon_config.n_head, + falcon_config.n_head_kv, + falcon_config.hidden_size, + falcon_config.hidden_size / falcon_config.n_head, + ff.config.tensor_parallelism_degree); + std::cout << "------load weights ----------" << std::endl; + fileloader.load_weights(&ff, weights_layers, use_full_precision); + std::cout << "------load weight finished----------" << std::endl; + + // init operators + im->init_operators_inference(&ff); +} + +}; // namespace FlexFlow diff --git a/inference/models/falcon.h b/inference/models/falcon.h new file mode 100644 index 0000000000..a822f9be34 --- /dev/null +++ b/inference/models/falcon.h @@ -0,0 +1,95 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */ +#pragma once + +#include "file_loader.h" +#include "flexflow/batch_config.h" +#include "flexflow/inference.h" +#include "flexflow/request_manager.h" +#include +#include +using json = nlohmann::json; + +namespace FlexFlow { + +class FALCON { +public: + struct FalconConfig { + FalconConfig(std::string const &model_config_file_path) { + std::ifstream config_file(model_config_file_path); + if (config_file.is_open()) { + try { + json model_config; + config_file >> model_config; + bias = model_config["bias"]; + hidden_size = model_config["hidden_size"]; + layer_norm_epsilon = model_config["layer_norm_epsilon"]; + multi_query = model_config["multi_query"]; + n_head = model_config["n_head"]; + if (model_config.contains("n_head_kv")) { + n_head_kv = model_config["n_head_kv"]; + } else { + n_head_kv = 1; + } + n_layer = model_config["n_layer"]; + parallel_attn = model_config["parallel_attn"]; + vocab_size = model_config["vocab_size"]; + } catch (json::exception const &e) { + std::cerr << "Error parsing JSON file: " << e.what() << std::endl; + assert(false); + } + } else { + std::cerr << "Error opening JSON file " << model_config_file_path + << std::endl; + assert(false); + } + max_seq_len = BatchConfig::MAX_SEQ_LENGTH; + max_num_tokens = BatchConfig::MAX_NUM_TOKENS; + max_beam_width = BeamSearchBatchConfig::MAX_BEAM_WIDTH; + max_beam_depth = BeamSearchBatchConfig::MAX_BEAM_DEPTH; + } + + void print() const { + std::cout << "Falcon Config:" << std::endl; + std::cout << "\tbias: " << bias << std::endl; + std::cout << "\thidden_size: " << hidden_size << std::endl; + std::cout << "\tlayer_norm_epsilon: " << layer_norm_epsilon << std::endl; + std::cout << "\tmulti_query: " << multi_query << std::endl; + std::cout << "\tn_head: " << n_head << std::endl; + std::cout << "\tn_head_kv: " << n_head_kv << std::endl; + std::cout << "\tn_layer: " << n_layer << std::endl; + std::cout << "\tparallel_attn: " << parallel_attn << std::endl; + std::cout << "\tvocab_size: " << vocab_size << std::endl; + + std::cout << "\tmax_seq_len: " << max_seq_len << std::endl; + std::cout << "\tmax_num_tokens: " << max_num_tokens << std::endl; + std::cout << "\tmax_beam_width: " << max_beam_width << std::endl; + std::cout << "\tmax_beam_depth: " << max_beam_depth << std::endl; + } + + bool bias, multi_query, parallel_attn; + int hidden_size, n_head, n_head_kv, n_layer, vocab_size; + float layer_norm_epsilon; + int max_seq_len, max_num_tokens, max_beam_width, max_beam_depth; + }; + + static void create_falcon_model(FFModel &ff, + std::string const &model_config_file_path, + std::string const &weight_file_path, + InferenceMode mode, + bool use_full_precision = false); +}; + +}; // namespace FlexFlow diff --git a/inference/models/llama.cc b/inference/models/llama.cc new file mode 100644 index 0000000000..e2eabec341 --- /dev/null +++ b/inference/models/llama.cc @@ -0,0 +1,222 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */ + +#include "llama.h" + +namespace FlexFlow { + +using namespace Legion; +using json = nlohmann::json; + +void LLAMA::create_llama_model(FFModel &ff, + std::string const &model_config_file_path, + std::string const &weight_file_path, + InferenceMode mode, + GenerationConfig generation_config, + bool use_full_precision) { + // do not apply cpu offload in beam search model. + LLAMAConfig llama_config(model_config_file_path); + llama_config.print(); + + if (ff.config.tensor_parallelism_degree > llama_config.num_attention_heads || + llama_config.num_attention_heads % ff.config.tensor_parallelism_degree != + 0) { + assert(false && "The number of attention heads is smaller, or it is not " + "divisible by the tensor parallelism degree"); + } + + std::unordered_map weights_layers; + + Tensor input; + { + assert(llama_config.max_num_tokens <= BatchConfig::MAX_NUM_TOKENS); + int const token_dims[] = {BatchConfig::MAX_NUM_TOKENS, 1}; + input = ff.create_tensor<2>(token_dims, DT_INT32); + } + + Initializer *embed_init = new UniformInitializer(std::rand(), 0, 0); + + Tensor token; + + if (use_full_precision) { + token = ff.embedding(input, + llama_config.vocab_size, + llama_config.hidden_size, + AGGR_MODE_NONE, + DT_FLOAT, + NULL, + embed_init); + } else { + token = ff.embedding(input, + llama_config.vocab_size, + llama_config.hidden_size, + AGGR_MODE_NONE, + DT_HALF, + NULL, + embed_init); + } + + Layer *embedding = ff.layers.back(); + weights_layers.emplace("tok_embeddings_weight", embedding); + + for (int i = 0; i < llama_config.num_hidden_layers; i++) { + // set transformer layer id + ff.set_transformer_layer_id(i); + // step 1: attention + Tensor att_norm = + ff.rms_norm(token, llama_config.rms_norm_eps, llama_config.hidden_size); + Layer *attention_norm = ff.layers.back(); + weights_layers.emplace("layers_" + std::to_string(i) + + "_attention_norm_weight", + attention_norm); + + Tensor mha; + switch (mode) { + case BEAM_SEARCH_MODE: { + mha = ff.spec_inc_multihead_self_attention( + att_norm, + llama_config.hidden_size, + llama_config.num_attention_heads, + llama_config.hidden_size / llama_config.num_attention_heads, + llama_config.hidden_size / llama_config.num_attention_heads, + 0.0f, + false, + false, + false, + DT_NONE, + NULL, + true); + break; + } + case TREE_VERIFY_MODE: { + mha = ff.inc_multihead_self_attention_verify( + att_norm, + llama_config.hidden_size, + llama_config.num_attention_heads, + llama_config.hidden_size / llama_config.num_attention_heads, + llama_config.hidden_size / llama_config.num_attention_heads, + 0.0f, /*dropout*/ + false, /*bias*/ + false, /*add_bias_kv*/ + false, /*add_zero_attn*/ + DT_NONE, /*data_type*/ + nullptr, /*kernel_initializer*/ + true /*apply_rotary_embedding*/ + ); + break; + } + case INC_DECODING_MODE: { + mha = ff.inc_multihead_self_attention( + att_norm, + llama_config.hidden_size, + llama_config.num_attention_heads, + llama_config.hidden_size / llama_config.num_attention_heads, + llama_config.hidden_size / llama_config.num_attention_heads, + 0.0f, /*dropout*/ + false, /*bias*/ + false, /*add_bias_kv*/ + false, /*add_zero_attn*/ + DT_NONE, /*data_type*/ + nullptr, /*kernel_initializer*/ + true /*apply_rotary_embedding*/ + ); + break; + } + default: { + assert(false); + } + } + Layer *attention_layer = ff.layers.back(); + weights_layers.emplace("layers_" + std::to_string(i) + "_attention_weight", + attention_layer); + token = ff.add(token, mha); + + // step 2: SILU activaion + Tensor ff_norm = + ff.rms_norm(token, llama_config.rms_norm_eps, 
llama_config.hidden_size); + Layer *ffn_layer = ff.layers.back(); + weights_layers.emplace("layers_" + std::to_string(i) + "_ffn_norm_weight", + ffn_layer); + + Tensor w1 = + ff.dense(ff_norm, llama_config.intermediate_size, AC_MODE_NONE, false); + Layer *w1_layer = ff.layers.back(); + weights_layers.emplace( + "layers_" + std::to_string(i) + "_feed_forward_w1_weight", w1_layer); + + Tensor w3 = + ff.dense(ff_norm, llama_config.intermediate_size, AC_MODE_NONE, false); + Layer *w3_layer = ff.layers.back(); + weights_layers.emplace( + "layers_" + std::to_string(i) + "_feed_forward_w3_weight", w3_layer); + + Tensor sigmoid = ff.sigmoid(w1); + Tensor silu = ff.multiply(w1, sigmoid); + Tensor multi = ff.multiply(silu, w3); + + Tensor w2 = ff.dense(multi, llama_config.hidden_size, AC_MODE_NONE, false); + Layer *w2_layer = ff.layers.back(); + weights_layers.emplace( + "layers_" + std::to_string(i) + "_feed_forward_w2_weight", w2_layer); + token = ff.add(token, w2); + } + // final normalization and linear + std::vector axes = {2}; + token = + ff.rms_norm(token, llama_config.rms_norm_eps, llama_config.hidden_size); + Layer *final_norm = ff.layers.back(); + weights_layers.emplace("norm_weight", final_norm); + + Tensor dense = ff.dense(token, llama_config.vocab_size, AC_MODE_NONE, false); + Layer *final_linear = ff.layers.back(); + weights_layers.emplace("output_weight", final_linear); + + Tensor output; + if (mode == BEAM_SEARCH_MODE) { + Tensor softmax = ff.softmax(dense, -1); + // output = ff.beam_top_k(softmax, llama_config.max_beam_width, false); + output = ff.argmax(softmax, /*beam_Search*/ true); + } else { + // Tensor softmax = ff.softmax(dense, -1); + if (generation_config.do_sample) { + dense = ff.scalar_truediv(dense, generation_config.temperature, false); + Tensor softmax = ff.softmax(dense, -1); + output = ff.sampling(softmax, generation_config.topp); + } else { + // output = ff.arg_top_k(dense, /*k=*/1, false); + output = ff.argmax(dense, /*beam_Search*/ false); + } + } + + InferenceManager *im = InferenceManager::get_inference_manager(); + // Compile the model + std::cout << "------start compile ----------" << std::endl; + im->compile_model_and_allocate_buffer(&ff); + FileDataLoader fileloader("", + weight_file_path, + llama_config.num_attention_heads, + llama_config.num_attention_heads, + llama_config.hidden_size, + llama_config.hidden_size / + llama_config.num_attention_heads, + ff.config.tensor_parallelism_degree); + fileloader.load_weights(&ff, weights_layers, use_full_precision); + std::cout << "------load weight finished----------" << std::endl; + + // init operators + im->init_operators_inference(&ff); +} + +}; // namespace FlexFlow diff --git a/inference/models/llama.h b/inference/models/llama.h new file mode 100644 index 0000000000..f01a7dbd52 --- /dev/null +++ b/inference/models/llama.h @@ -0,0 +1,88 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +#pragma once + +#include "file_loader.h" +#include "flexflow/batch_config.h" +#include "flexflow/inference.h" +#include "flexflow/request_manager.h" +#include +#include +using json = nlohmann::json; + +namespace FlexFlow { + +class LLAMA { +public: + struct LLAMAConfig { + LLAMAConfig(std::string const &model_config_file_path) { + std::ifstream config_file(model_config_file_path); + if (config_file.is_open()) { + try { + json model_config; + config_file >> model_config; + num_hidden_layers = model_config["num_hidden_layers"]; + vocab_size = model_config["vocab_size"]; + num_attention_heads = model_config["num_attention_heads"]; + hidden_size = model_config["hidden_size"]; + rms_norm_eps = model_config["rms_norm_eps"]; + intermediate_size = model_config["intermediate_size"]; + } catch (json::exception const &e) { + std::cerr << "Error parsing LLAMA config from JSON file: " << e.what() + << std::endl; + assert(false); + } + } else { + std::cerr << "Error opening JSON file " << model_config_file_path + << std::endl; + assert(false); + } + max_seq_len = BatchConfig::MAX_SEQ_LENGTH; + max_num_tokens = BatchConfig::MAX_NUM_TOKENS; + max_beam_width = BeamSearchBatchConfig::MAX_BEAM_WIDTH; + max_beam_depth = BeamSearchBatchConfig::MAX_BEAM_DEPTH; + } + + void print() const { + std::cout << "LLAMA Config:" << std::endl; + std::cout << "\tnum_hidden_layers: " << num_hidden_layers << std::endl; + std::cout << "\tvocab_size: " << vocab_size << std::endl; + std::cout << "\tnum_attention_heads: " << num_attention_heads + << std::endl; + std::cout << "\thidden_size: " << hidden_size << std::endl; + std::cout << "\trms_norm_eps: " << rms_norm_eps << std::endl; + std::cout << "\tintermediate_size: " << intermediate_size << std::endl; + + std::cout << "\tmax_seq_len: " << max_seq_len << std::endl; + std::cout << "\tmax_num_tokens: " << max_num_tokens << std::endl; + std::cout << "\tmax_beam_width: " << max_beam_width << std::endl; + std::cout << "\tmax_beam_depth: " << max_beam_depth << std::endl; + } + + int max_seq_len, max_num_tokens, max_beam_width, max_beam_depth; + int num_hidden_layers, vocab_size, num_attention_heads, hidden_size, + intermediate_size; + float rms_norm_eps; + }; + + static void create_llama_model(FFModel &ff, + std::string const &model_config_file_path, + std::string const &weight_file_path, + InferenceMode mode, + GenerationConfig generation_config, + bool use_full_precision = false); +}; + +}; // namespace FlexFlow diff --git a/inference/models/opt.cc b/inference/models/opt.cc new file mode 100644 index 0000000000..9b3670ed89 --- /dev/null +++ b/inference/models/opt.cc @@ -0,0 +1,250 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "opt.h" + +namespace FlexFlow { + +using namespace Legion; +using json = nlohmann::json; + +void OPT::create_opt_model(FFModel &ff, + std::string const &model_config_file_path, + std::string const &weight_file_path, + InferenceMode mode, + bool use_full_precision) { + OPTConfig opt_config(model_config_file_path); + opt_config.print(); + + if (ff.config.tensor_parallelism_degree > opt_config.num_attention_heads || + opt_config.num_attention_heads % ff.config.tensor_parallelism_degree != + 0) { + assert(false && "The number of attention heads is smaller, or it is not " + "divisible by the tensor parallelism degree"); + } + + std::unordered_map weights_layers; + + //------------------------------ build the model -------------------------- + Tensor input; + Tensor position_input; + ff.set_position_offset(2); + { + int const token_dims[] = {BatchConfig::MAX_NUM_TOKENS, 1}; + input = ff.create_tensor<2>(token_dims, DT_INT32); + position_input = ff.create_tensor<2>(token_dims, DT_INT32); + } + + Initializer *embed_init = new UniformInitializer(std::rand(), 0, 0); + std::vector axes = {0}; + + Tensor token; + if (use_full_precision) { + token = ff.embedding(input, + opt_config.vocab_size, + opt_config.word_embed_proj_dim, + AGGR_MODE_NONE, + DT_FLOAT, + NULL, + embed_init); + } else { + token = ff.embedding(input, + opt_config.vocab_size, + opt_config.word_embed_proj_dim, + AGGR_MODE_NONE, + DT_HALF, + NULL, + embed_init); + } + + Layer *embedding = ff.layers.back(); + weights_layers.emplace("embed_tokens_weight", embedding); + + Tensor positional_embedding; + if (use_full_precision) { + positional_embedding = ff.embedding(position_input, + opt_config.max_position_embeddings, + opt_config.hidden_size, + AGGR_MODE_NONE, + DT_FLOAT, + NULL, + embed_init); + } else { + positional_embedding = ff.embedding(position_input, + opt_config.max_position_embeddings, + opt_config.hidden_size, + AGGR_MODE_NONE, + DT_HALF, + NULL, + embed_init); + } + Layer *pos_embedding = ff.layers.back(); + weights_layers.emplace("embed_positions_weight", pos_embedding); + + Tensor residual = ff.add(token, positional_embedding); + + for (int i = 0; i < opt_config.num_hidden_layers; i++) { + // set transformer layer id + ff.set_transformer_layer_id(i); + + // 125m, 1.7B, ..., 175B applies layer norm BEFORE attention, + // 350m applies layer norm AFTER attention + // https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py#LL324C1-L325C1 + // this version is before normalization + + Tensor hidden_states = ff.layer_norm( + residual, axes, opt_config.layer_norm_elementwise_affine, 1e-05); + Layer *self_attn_layer_norm = ff.layers.back(); + weights_layers.emplace("layers_" + std::to_string(i) + + "_attention_layer_norm_weight", + self_attn_layer_norm); + + Tensor mha; + switch (mode) { + case BEAM_SEARCH_MODE: { + mha = ff.spec_inc_multihead_self_attention( + hidden_states, + opt_config.hidden_size, + opt_config.num_attention_heads, + opt_config.hidden_size / opt_config.num_attention_heads, + opt_config.hidden_size / opt_config.num_attention_heads, + 0.0f, + true, + false, + false, + DT_NONE, /*data_type*/ + NULL, + false, + /*scaling query*/ true, + /*scaling factor*/ + pow((opt_config.hidden_size / opt_config.num_attention_heads), + -0.5), + /*qk_prod_scaling*/ false); + break; + } + case TREE_VERIFY_MODE: { + mha = ff.inc_multihead_self_attention_verify( + hidden_states, + opt_config.hidden_size, + opt_config.num_attention_heads, + opt_config.hidden_size / 
opt_config.num_attention_heads, + opt_config.hidden_size / opt_config.num_attention_heads, + 0.0f, + true, + false, + false, + DT_NONE, /*data_type*/ + NULL, + false, + /*scaling query*/ true, + /*scaling factor*/ + pow((opt_config.hidden_size / opt_config.num_attention_heads), + -0.5), + /*qk_prod_scaling*/ false); + break; + } + case INC_DECODING_MODE: { + mha = ff.inc_multihead_self_attention( + hidden_states, + opt_config.hidden_size, + opt_config.num_attention_heads, + opt_config.hidden_size / opt_config.num_attention_heads, + opt_config.hidden_size / opt_config.num_attention_heads, + 0.0f, + true, + false, + false, + DT_NONE, /*data_type*/ + NULL, + false, + /*scaling query*/ true, + /*scaling factor*/ + pow((opt_config.hidden_size / opt_config.num_attention_heads), + -0.5), + /*qk_prod_scaling*/ false); + break; + } + default: { + assert(false); + } + } + + Layer *attention_layer = ff.layers.back(); + weights_layers.emplace("layers_" + std::to_string(i) + "_attention_weight", + attention_layer); + + Tensor added = ff.add(mha, residual); + + Tensor final_norm = ff.layer_norm( + added, axes, opt_config.layer_norm_elementwise_affine, 1e-05); + Layer *final_layer_norm = ff.layers.back(); + weights_layers.emplace("layers_" + std::to_string(i) + + "_final_layer_norm_weight", + final_layer_norm); + + //--------linear fc1 fc2 ---------- + Tensor fc1 = ff.dense(final_norm, opt_config.ffn_dim, AC_MODE_NONE, true); + Layer *fc1_linear = ff.layers.back(); + weights_layers.emplace("layers_" + std::to_string(i) + "_fc1_weight", + fc1_linear); + Tensor activation = ff.relu(fc1, false); + + Tensor fc2 = + ff.dense(activation, opt_config.hidden_size, AC_MODE_NONE, true); + Layer *fc2_linear = ff.layers.back(); + weights_layers.emplace("layers_" + std::to_string(i) + "_fc2_weight", + fc2_linear); + residual = ff.add(added, fc2); + } + + // final + Tensor all_final_norm = ff.layer_norm( + residual, axes, opt_config.layer_norm_elementwise_affine, 1e-05); + Layer *all_final_norm_layer = ff.layers.back(); + weights_layers.emplace("final_layer_norm_weight", all_final_norm_layer); + + Tensor lm_head = + ff.dense(all_final_norm, opt_config.vocab_size, AC_MODE_NONE, false); + Layer *lm_head_layer = ff.layers.back(); + weights_layers.emplace("embed_tokens_weight_lm_head", lm_head_layer); + + Tensor output; + if (mode == BEAM_SEARCH_MODE) { + Tensor softmax = ff.softmax(lm_head, -1); + // output = ff.beam_top_k(softmax, opt_config.max_beam_width, false); + output = ff.argmax(softmax, /*beam_Search*/ true); + } else { + // output = ff.arg_top_k(lm_head, /*k=*/1, false); + output = ff.argmax(lm_head, /*beam_Search*/ false); + } + + //------------------- compile the model -------------------------------- + std::cout << "------start compile ----------" << std::endl; + InferenceManager *im = InferenceManager::get_inference_manager(); + im->compile_model_and_allocate_buffer(&ff); + FileDataLoader fileloader("", + weight_file_path, + opt_config.num_attention_heads, + opt_config.num_attention_heads, + opt_config.hidden_size, + opt_config.hidden_size / + opt_config.num_attention_heads, + ff.config.tensor_parallelism_degree); + fileloader.load_weights(&ff, weights_layers, use_full_precision); + std::cout << "------finished loading weights----------" << std::endl; + im->init_operators_inference(&ff); +} + +}; // namespace FlexFlow diff --git a/inference/models/opt.h b/inference/models/opt.h new file mode 100644 index 0000000000..ab972ae10c --- /dev/null +++ b/inference/models/opt.h @@ -0,0 +1,102 @@ +/* Copyright 2023 
CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +#pragma once + +#include "file_loader.h" +#include "flexflow/batch_config.h" +#include "flexflow/inference.h" +#include "flexflow/request_manager.h" +#include +#include +using json = nlohmann::json; + +namespace FlexFlow { + +class OPT { +public: + struct OPTConfig { + OPTConfig(std::string const &model_config_file_path) { + std::ifstream config_file(model_config_file_path); + if (config_file.is_open()) { + try { + json model_config; + config_file >> model_config; + do_layer_norm_before = model_config["do_layer_norm_before"]; + dropout = model_config["dropout"]; + enable_bias = model_config["enable_bias"]; + ffn_dim = model_config["ffn_dim"]; + hidden_size = model_config["hidden_size"]; + layer_norm_elementwise_affine = + model_config["layer_norm_elementwise_affine"]; + max_position_embeddings = model_config["max_position_embeddings"]; + num_attention_heads = model_config["num_attention_heads"]; + num_hidden_layers = model_config["num_hidden_layers"]; + vocab_size = model_config["vocab_size"]; + word_embed_proj_dim = model_config["word_embed_proj_dim"]; + } catch (json::exception const &e) { + std::cerr << "Error parsing JSON file: " << e.what() << std::endl; + assert(false); + } + } else { + std::cerr << "Error opening JSON file " << model_config_file_path + << std::endl; + assert(false); + } + max_seq_len = BatchConfig::MAX_SEQ_LENGTH; + max_num_tokens = BatchConfig::MAX_NUM_TOKENS; + max_beam_width = BeamSearchBatchConfig::MAX_BEAM_WIDTH; + max_beam_depth = BeamSearchBatchConfig::MAX_BEAM_DEPTH; + } + + void print() const { + std::cout << "OPT Config:" << std::endl; + std::cout << "\tdo_layer_norm_before: " << do_layer_norm_before + << std::endl; + std::cout << "\tdropout: " << dropout << std::endl; + std::cout << "\tenable_bias: " << enable_bias << std::endl; + std::cout << "\tffn_dim: " << ffn_dim << std::endl; + std::cout << "\thidden_size: " << hidden_size << std::endl; + std::cout << "\tlayer_norm_elementwise_affine: " + << layer_norm_elementwise_affine << std::endl; + std::cout << "\tmax_position_embeddings: " << max_position_embeddings + << std::endl; + std::cout << "\tnum_attention_heads: " << num_attention_heads + << std::endl; + std::cout << "\tnum_hidden_layers: " << num_hidden_layers << std::endl; + std::cout << "\tvocab_size: " << vocab_size << std::endl; + std::cout << "\tword_embed_proj_dim: " << word_embed_proj_dim + << std::endl; + + std::cout << "\tmax_seq_len: " << max_seq_len << std::endl; + std::cout << "\tmax_num_tokens: " << max_num_tokens << std::endl; + std::cout << "\tmax_beam_width: " << max_beam_width << std::endl; + std::cout << "\tmax_beam_depth: " << max_beam_depth << std::endl; + } + + int max_seq_len, max_num_tokens, max_beam_width, max_beam_depth; + bool do_layer_norm_before, enable_bias, layer_norm_elementwise_affine; + float dropout; + int ffn_dim, hidden_size, max_position_embeddings, num_attention_heads, + num_hidden_layers, 
vocab_size, word_embed_proj_dim; + }; + + static void create_opt_model(FFModel &ff, + std::string const &model_config_file_path, + std::string const &weight_file_path, + InferenceMode mode, + bool use_full_precision = false); +}; + +}; // namespace FlexFlow diff --git a/inference/models/starcoder.cc b/inference/models/starcoder.cc new file mode 100644 index 0000000000..4b27498cfd --- /dev/null +++ b/inference/models/starcoder.cc @@ -0,0 +1,214 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "starcoder.h" + +namespace FlexFlow { + +using namespace Legion; +using json = nlohmann::json; + +void STARCODER::create_starcoder_model( + FFModel &ff, + std::string const &model_config_file_path, + std::string const &weight_file_path, + InferenceMode mode, + GenerationConfig generationConfig, + bool use_full_precision) { + // do not apply cpu offload in beam search model. + STARCODERConfig startcoder_config(model_config_file_path); + startcoder_config.print(); + + if (ff.config.tensor_parallelism_degree > + startcoder_config.num_attention_heads || + startcoder_config.num_attention_heads % + ff.config.tensor_parallelism_degree != + 0) { + assert(false && "The number of attention heads is smaller, or it is not " + "divisible by the tensor parallelism degree"); + } + + std::unordered_map weights_layers; + std::vector axes = {0}; + + Tensor input; + Tensor position_input; + ff.set_position_offset(0); + { + assert(startcoder_config.max_num_tokens <= BatchConfig::MAX_NUM_TOKENS); + int const token_dims[] = {BatchConfig::MAX_NUM_TOKENS, 1}; + input = ff.create_tensor<2>(token_dims, DT_INT32); + position_input = ff.create_tensor<2>(token_dims, DT_INT32); + } + + Initializer *embed_init = new UniformInitializer(std::rand(), 0, 0); + + Tensor token; + + if (use_full_precision) { + token = ff.embedding(input, + startcoder_config.vocab_size, + startcoder_config.hidden_size, + AGGR_MODE_NONE, + DT_FLOAT, + NULL, + embed_init); + } else { + token = ff.embedding(input, + startcoder_config.vocab_size, + startcoder_config.hidden_size, + AGGR_MODE_NONE, + DT_HALF, + NULL, + embed_init); + } + + Layer *embedding = ff.layers.back(); + weights_layers.emplace("transformer_wte_weight", embedding); + + Tensor positional_embedding; + if (use_full_precision) { + positional_embedding = + ff.embedding(position_input, + startcoder_config.max_position_embeddings, + startcoder_config.hidden_size, + AGGR_MODE_NONE, + DT_FLOAT, + NULL, + embed_init); + } else { + positional_embedding = + ff.embedding(position_input, + startcoder_config.max_position_embeddings, + startcoder_config.hidden_size, + AGGR_MODE_NONE, + DT_HALF, + NULL, + embed_init); + } + Layer *pos_embedding = ff.layers.back(); + weights_layers.emplace("transformer_wpe_weight", pos_embedding); + + Tensor hidden_states = ff.add(token, positional_embedding); + + for (int i = 0; i < startcoder_config.num_hidden_layers; i++) { + // set transformer layer id + 
ff.set_transformer_layer_id(i); + // step 1: attention + Tensor ln_1 = ff.layer_norm( + hidden_states, axes, true, startcoder_config.layer_norm_epsilon); + Layer *layer_norm = ff.layers.back(); + weights_layers.emplace("layers_" + std::to_string(i) + "_ln_1_weight", + layer_norm); + + Tensor mha; + switch (mode) { + case INC_DECODING_MODE: { + mha = ff.inc_multiquery_self_attention( + ln_1, + startcoder_config.hidden_size, + startcoder_config.num_attention_heads, + 1, + startcoder_config.hidden_size / + startcoder_config.num_attention_heads, + startcoder_config.hidden_size / + startcoder_config.num_attention_heads, + startcoder_config.dropout_p, /*dropout*/ + true, /*bias*/ + false, /*add_bias_kv*/ + false, /*add_zero_attn*/ + DT_NONE, /*data_type*/ + nullptr, /*kernel_initializer*/ + false /*apply_rotary_embedding*/ + ); + break; + } + default: { + assert(false); + } + } + Layer *attention_layer = ff.layers.back(); + weights_layers.emplace("layers_" + std::to_string(i) + "_attention_weight", + attention_layer); + Tensor residual = ff.add(hidden_states, mha); + + Tensor l2_norm = ff.layer_norm( + residual, axes, true, startcoder_config.layer_norm_epsilon); + Layer *l2_layer = ff.layers.back(); + weights_layers.emplace("layers_" + std::to_string(i) + "_ln_2_weight", + l2_layer); + + // mlp + Tensor c_fc = ff.dense( + l2_norm, startcoder_config.intermediate_size, AC_MODE_NONE, true); + Layer *c_fc_layer = ff.layers.back(); + weights_layers.emplace("layers_" + std::to_string(i) + "_mlp_c_fc_weight", + c_fc_layer); + c_fc = ff.gelu(c_fc); + + Tensor c_proj = + ff.dense(c_fc, startcoder_config.hidden_size, AC_MODE_NONE, true); + Layer *c_proj_layer = ff.layers.back(); + weights_layers.emplace("layers_" + std::to_string(i) + "_mlp_c_proj_weight", + c_proj_layer); + + hidden_states = ff.add(residual, c_proj); + } + // final normalization and linear + Tensor ln_f = ff.layer_norm( + hidden_states, axes, true, startcoder_config.layer_norm_epsilon); + Layer *final_norm = ff.layers.back(); + weights_layers.emplace("transformer_ln_f_weight", final_norm); + + Tensor lm_head = + ff.dense(ln_f, startcoder_config.vocab_size, AC_MODE_NONE, false); + Layer *final_linear = ff.layers.back(); + weights_layers.emplace("lm_head_weight", final_linear); + + Tensor output; + if (mode == BEAM_SEARCH_MODE) { + Tensor softmax = ff.softmax(lm_head, -1); + output = ff.argmax(softmax, /*beam_Search*/ true); + } else { + // Tensor softmax = ff.softmax(dense, -1); + if (generationConfig.do_sample) { + lm_head = ff.scalar_truediv(lm_head, generationConfig.temperature, false); + Tensor softmax = ff.softmax(lm_head, -1); + output = ff.sampling(softmax, generationConfig.topp); + } else { + output = ff.argmax(lm_head, /*beam_Search*/ false); + } + } + + InferenceManager *im = InferenceManager::get_inference_manager(); + // Compile the model + std::cout << "------start compile ----------" << std::endl; + im->compile_model_and_allocate_buffer(&ff); + FileDataLoader fileloader("", + weight_file_path, + startcoder_config.num_attention_heads, + 1, + startcoder_config.hidden_size, + startcoder_config.hidden_size / + startcoder_config.num_attention_heads, + ff.config.tensor_parallelism_degree); + fileloader.load_weights(&ff, weights_layers, use_full_precision); + std::cout << "------load weight finished----------" << std::endl; + + // init operators + im->init_operators_inference(&ff); +} + +}; // namespace FlexFlow diff --git a/inference/models/starcoder.h b/inference/models/starcoder.h new file mode 100644 index 
0000000000..9789a1c36e --- /dev/null +++ b/inference/models/starcoder.h @@ -0,0 +1,76 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +#pragma once + +#include "file_loader.h" +#include "flexflow/batch_config.h" +#include "flexflow/inference.h" +#include "flexflow/request_manager.h" +#include +#include +using json = nlohmann::json; + +namespace FlexFlow { + +class STARCODER { +public: + struct STARCODERConfig { + STARCODERConfig(std::string const &model_config_file_path) { + std::ifstream config_file(model_config_file_path); + if (config_file.is_open()) { + try { + json model_config; + config_file >> model_config; + num_hidden_layers = model_config["n_layer"]; + vocab_size = model_config["vocab_size"]; + num_attention_heads = model_config["n_head"]; + hidden_size = model_config["n_embd"]; + layer_norm_epsilon = model_config["layer_norm_epsilon"]; + intermediate_size = model_config["n_inner"]; + dropout_p = model_config["attn_pdrop"]; + max_position_embeddings = model_config["n_positions"]; + } catch (json::exception const &e) { + std::cerr << "Error parsing STARCODER config from JSON file: " + << e.what() << std::endl; + assert(false); + } + } else { + std::cerr << "Error opening JSON file " << model_config_file_path + << std::endl; + assert(false); + } + max_seq_len = BatchConfig::MAX_SEQ_LENGTH; + max_num_tokens = BatchConfig::MAX_NUM_TOKENS; + max_beam_width = BeamSearchBatchConfig::MAX_BEAM_WIDTH; + max_beam_depth = BeamSearchBatchConfig::MAX_BEAM_DEPTH; + } + + void print() const {} + + int max_seq_len, max_num_tokens, max_beam_width, max_beam_depth; + int num_hidden_layers, vocab_size, num_attention_heads, hidden_size, + intermediate_size, max_position_embeddings; + float layer_norm_epsilon, dropout_p; + }; + + static void create_starcoder_model(FFModel &ff, + std::string const &model_config_file_path, + std::string const &weight_file_path, + InferenceMode mode, + GenerationConfig generationConfig, + bool use_full_precision = false); +}; + +}; // namespace FlexFlow diff --git a/inference/python/incr_decoding.py b/inference/python/incr_decoding.py new file mode 100644 index 0000000000..1ed7791143 --- /dev/null +++ b/inference/python/incr_decoding.py @@ -0,0 +1,115 @@ +# Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
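+#
+# Incremental-decoding example script. A hypothetical invocation (the
+# -config-file flag is parsed in get_configs() below; the path is illustrative):
+#
+#   python incr_decoding.py -config-file /path/to/incr_decoding_config.json
+#
+# If no config file is given, the sample configs hard-coded in get_configs()
+# are used instead.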
+ +import flexflow.serve as ff +import argparse, json, os +from types import SimpleNamespace + + +def get_configs(): + parser = argparse.ArgumentParser() + parser.add_argument( + "-config-file", + help="The path to a JSON file with the configs. If omitted, a sample model and configs will be used instead.", + type=str, + default="", + ) + args = parser.parse_args() + + # Load configs from JSON file (if specified) + if len(args.config_file) > 0: + if not os.path.isfile(args.config_file): + raise FileNotFoundError(f"Config file {args.config_file} not found.") + try: + with open(args.config_file) as f: + return json.load(f) + except json.JSONDecodeError as e: + print("JSON format error:") + print(e) + else: + # Define sample configs + ff_init_configs = { + # required parameters + "num_gpus": 4, + "memory_per_gpu": 14000, + "zero_copy_memory_per_node": 30000, + # optional parameters + "num_cpus": 4, + "legion_utility_processors": 4, + "data_parallelism_degree": 1, + "tensor_parallelism_degree": 1, + "pipeline_parallelism_degree": 4, + "offload": False, + "offload_reserve_space_size": 1024**2, + "use_4bit_quantization": False, + "use_8bit_quantization": False, + "profiling": False, + "fusion": True, + } + llm_configs = { + # required parameters + "llm_model": "tiiuae/falcon-7b", + # optional parameters + "cache_path": "", + "refresh_cache": False, + "full_precision": True, + "prompt": "", + "output_file": "", + } + # Merge dictionaries + ff_init_configs.update(llm_configs) + return ff_init_configs + + +def main(): + configs_dict = get_configs() + configs = SimpleNamespace(**configs_dict) + + # Initialize the FlexFlow runtime. ff.init() takes a dictionary or the path to a JSON file with the configs + ff.init(configs_dict) + + # Create the FlexFlow LLM + ff_data_type = ( + ff.DataType.DT_FLOAT if configs.full_precision else ff.DataType.DT_HALF + ) + llm = ff.LLM( + configs.llm_model, + data_type=ff_data_type, + cache_path=configs.cache_path, + refresh_cache=configs.refresh_cache, + output_file=configs.output_file, + ) + + # Compile the LLM for inference and load the weights into memory + generation_config = ff.GenerationConfig( + do_sample=False, temperature=0.9, topp=0.8, topk=1 + ) + llm.compile( + generation_config, + max_batch_size=1, + max_seq_length=256, + max_tokens_per_batch=64, + ) + + # Generation begins! + if len(configs.prompt) > 0: + prompts = [s for s in json.load(open(configs.prompt))] + results = llm.generate(prompts) + else: + result = llm.generate("Here are some travel tips for Tokyo:\n") + + +if __name__ == "__main__": + print("flexflow inference example (incremental decoding)") + main() diff --git a/inference/python/spec_infer.py b/inference/python/spec_infer.py new file mode 100644 index 0000000000..192960b533 --- /dev/null +++ b/inference/python/spec_infer.py @@ -0,0 +1,161 @@ +# Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
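+#
+# Speculative-inference example script: serves one LLM together with one or
+# more SSMs. A hypothetical invocation (the -config-file flag is parsed in
+# get_configs() below; the path is illustrative):
+#
+#   python spec_infer.py -config-file /path/to/spec_infer_config.json
+#
+# Without a config file, the sample LLM/SSM configs hard-coded in get_configs()
+# are used instead.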
+ +import flexflow.serve as ff +import argparse, json, os +from types import SimpleNamespace + + +def get_configs(): + parser = argparse.ArgumentParser() + parser.add_argument( + "-config-file", + help="The path to a JSON file with the configs. If omitted, a sample model and configs will be used instead.", + type=str, + default="", + ) + args = parser.parse_args() + + # Load configs from JSON file (if specified) + if len(args.config_file) > 0: + if not os.path.isfile(args.config_file): + raise FileNotFoundError(f"Config file {args.config_file} not found.") + try: + with open(args.config_file) as f: + return json.load(f) + except json.JSONDecodeError as e: + print("JSON format error:") + print(e) + else: + # Define sample configs + ff_init_configs = { + # required parameters + "num_gpus": 4, + "memory_per_gpu": 14000, + "zero_copy_memory_per_node": 30000, + # optional parameters + "num_cpus": 4, + "legion_utility_processors": 4, + "data_parallelism_degree": 1, + "tensor_parallelism_degree": 2, + "pipeline_parallelism_degree": 2, + "offload": False, + "offload_reserve_space_size": 1024**2, + "use_4bit_quantization": False, + "use_8bit_quantization": False, + "profiling": False, + "fusion": True, + } + llm_configs = { + # required llm arguments + "llm_model": "decapoda-research/llama-7b-hf", + # optional llm parameters + "cache_path": "", + "refresh_cache": False, + "full_precision": False, + "ssms": [ + { + # required ssm parameter + "ssm_model": "JackFram/llama-160m", + # optional ssm parameters + "cache_path": "", + "refresh_cache": False, + "full_precision": False, + }, + { + # required ssm parameter + "ssm_model": "facebook/opt-125m", + # optional ssm parameters + "cache_path": "", + "refresh_cache": False, + "full_precision": False, + }, + ], + "prompt": "../prompt/test.json", + "output_file": "", + } + # Merge dictionaries + ff_init_configs.update(llm_configs) + return ff_init_configs + + +def main(): + configs_dict = get_configs() + configs = SimpleNamespace(**configs_dict) + + # Initialize the FlexFlow runtime. ff.init() takes a dictionary or the path to a JSON file with the configs + ff.init(configs_dict) + + # Create the FlexFlow LLM + ff_data_type = ( + ff.DataType.DT_FLOAT if configs.full_precision else ff.DataType.DT_HALF + ) + llm = ff.LLM( + configs.llm_model, + data_type=ff_data_type, + cache_path=configs.cache_path, + refresh_cache=configs.refresh_cache, + output_file=configs.output_file, + ) + + # Create the SSMs + ssms = [] + for ssm_config in configs.ssms: + ssm_config = SimpleNamespace(**ssm_config) + ff_data_type = ( + ff.DataType.DT_FLOAT if ssm_config.full_precision else ff.DataType.DT_HALF + ) + ssm = ff.SSM( + ssm_config.ssm_model, + data_type=ff_data_type, + cache_path=ssm_config.cache_path, + refresh_cache=ssm_config.refresh_cache, + output_file=configs.output_file, + ) + ssms.append(ssm) + + # Create the sampling configs + generation_config = ff.GenerationConfig( + do_sample=False, temperature=0.9, topp=0.8, topk=1 + ) + + # Compile the SSMs for inference and load the weights into memory + for ssm in ssms: + ssm.compile( + generation_config, + max_batch_size=1, + max_seq_length=256, + max_tokens_per_batch=64, + ) + + # Compile the LLM for inference and load the weights into memory + llm.compile( + generation_config, + max_batch_size=1, + max_seq_length=256, + max_tokens_per_batch=64, + ssms=ssms, + ) + + # Generation begins! 
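+    # configs.prompt (e.g. ../prompt/test.json above) is expected to be a JSON
+    # file holding a list of prompt strings; llm.generate takes either that
+    # list or a single prompt string, as in the fallback branch below.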
+ if len(configs.prompt) > 0: + prompts = [s for s in json.load(open(configs.prompt))] + results = llm.generate(prompts) + else: + result = llm.generate("Here are some travel tips for Tokyo:\n") + + +if __name__ == "__main__": + print("flexflow inference example (speculative inference)") + main() diff --git a/inference/spec_infer/CMakeLists.txt b/inference/spec_infer/CMakeLists.txt new file mode 100644 index 0000000000..3d6b48b802 --- /dev/null +++ b/inference/spec_infer/CMakeLists.txt @@ -0,0 +1,36 @@ +cmake_minimum_required(VERSION 3.10) + +project(FlexFlow_SpecInfer) +set(project_target spec_infer) + + +set(CPU_SRC + ${FLEXFLOW_CPP_DRV_SRC} + spec_infer.cc + ../file_loader.cc + ../models/llama.cc + ../models/opt.cc + ../models/falcon.cc) + +if (FF_GPU_BACKEND STREQUAL "cuda" OR FF_GPU_BACKEND STREQUAL "hip_cuda") + cuda_add_executable(${project_target} ${CPU_SRC}) + if (FF_GPU_BACKEND STREQUAL "hip_cuda") + target_compile_definitions(${project_target} PRIVATE __HIP_PLATFORM_NVIDIA__) + endif() +elseif(FF_GPU_BACKEND STREQUAL "hip_rocm") + hip_add_executable(${project_target} ${CPU_SRC}) + if (FF_HIP_ARCH STREQUAL "") + message(FATAL_ERROR "FF_HIP_ARCH is empty!") + endif() + set_property(TARGET ${project_target} PROPERTY HIP_ARCHITECTURES "${FF_HIP_ARCH}") + target_compile_definitions(${project_target} PRIVATE __HIP_PLATFORM_AMD__) +else() + message(FATAL_ERROR "Compilation of ${project_target} for ${FF_GPU_BACKEND} backend not yet supported") +endif() + +target_include_directories(${project_target} PRIVATE ${FLEXFLOW_INCLUDE_DIRS} ${CMAKE_INSTALL_INCLUDEDIR}) +target_include_directories(${project_target} PRIVATE ${CMAKE_SOURCE_DIR}/inference) +target_link_libraries(${project_target} -Wl,--whole-archive flexflow -Wl,--no-whole-archive ${FLEXFLOW_EXT_LIBRARIES}) + +set(BIN_DEST "bin") +install(TARGETS ${project_target} DESTINATION ${BIN_DEST}) diff --git a/inference/spec_infer/Makefile b/inference/spec_infer/Makefile new file mode 100644 index 0000000000..0e4b79f51f --- /dev/null +++ b/inference/spec_infer/Makefile @@ -0,0 +1,37 @@ +# Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
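+#
+# Hypothetical build invocation (FF_HOME must point at a FlexFlow checkout, as
+# checked below; the path and job count are illustrative):
+#
+#   FF_HOME=/path/to/FlexFlow make -j 8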
+# + +# Flags for directing the runtime makefile what to include +DEBUG ?= 0 # Include debugging symbols +MAX_DIM ?= 4 # Maximum number of dimensions +OUTPUT_LEVEL ?= LEVEL_DEBUG # Compile time logging level +USE_CUDA ?= 1 # Include CUDA support (requires CUDA) +USE_GASNET ?= 0 # Include GASNet support (requires GASNet) +USE_HDF ?= 1 # Include HDF5 support (requires HDF5) +ALT_MAPPERS ?= 0 # Include alternative mappers (not recommended) + +# Put the binary file name here +OUTFILE ?= llama_pipeline +# List all the application source files here +ifndef CUDA_HOME +CUDA_HOME = $(patsubst %/bin/nvcc,%,$(shell which nvcc | head -1)) +endif + + +ifndef FF_HOME +$(error FF_HOME variable is not defined, aborting build) +endif + +include $(FF_HOME)/FlexFlow.mk diff --git a/inference/spec_infer/spec_infer.cc b/inference/spec_infer/spec_infer.cc new file mode 100644 index 0000000000..16eab8d077 --- /dev/null +++ b/inference/spec_infer/spec_infer.cc @@ -0,0 +1,370 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "flexflow/inference.h" +#include "models/falcon.h" +#include "models/llama.h" +#include "models/opt.h" +#include +#include +#include + +using namespace Legion; +using json = nlohmann::json; + +LegionRuntime::Logger::Category log_app("llama"); + +struct FilePaths { + std::string cache_folder_path; + std::string prompt_file_path; + std::string output_file_path; +}; + +struct ModelNames { + std::string llm_model_name; + std::vector ssm_model_names; +}; + +struct ModelMeta { + ModelNames model_names; + + ModelType llm_model_type; + std::string llm_tokenizer_path; + std::string llm_weights_path; + std::string llm_model_config_path; + + int bos_token_id, eos_token_id; + + std::vector ssm_model_types; + std::vector ssm_model_config_paths; + std::vector ssm_model_weights_paths; +}; + +void parse_input_args(char **argv, + int argc, + FilePaths &paths, + ModelNames &model_names, + bool &use_full_precision, + bool &verbose) { + for (int i = 1; i < argc; i++) { + // llm model name + if (!strcmp(argv[i], "-llm-model")) { + model_names.llm_model_name = std::string(argv[++i]); + for (char &c : model_names.llm_model_name) { + c = std::tolower(c); + } + continue; + } + // ssm models names + if (!strcmp(argv[i], "-ssm-model")) { + std::string ssm_model_name = std::string(argv[++i]); + for (char &c : ssm_model_name) { + c = std::tolower(c); + } + model_names.ssm_model_names.push_back(ssm_model_name); + continue; + } + // cache folder + if (!strcmp(argv[i], "-cache-folder")) { + paths.cache_folder_path = std::string(argv[++i]); + continue; + } + // prompts + if (!strcmp(argv[i], "-prompt")) { + paths.prompt_file_path = std::string(argv[++i]); + continue; + } + // output file + if (!strcmp(argv[i], "-output-file")) { + paths.output_file_path = std::string(argv[++i]); + continue; + } + if (!strcmp(argv[i], "--use-full-precision")) { + use_full_precision = true; + continue; + } + // verbose logging to stdout + 
if (!strcmp(argv[i], "--verbose")) { + verbose = true; + continue; + } + } + if (paths.cache_folder_path.empty()) { + paths.cache_folder_path = "~/.cache/flexflow"; + } + // Expand ~ to the home directory if needed + wordexp_t p; + wordexp(paths.cache_folder_path.c_str(), &p, 0); + paths.cache_folder_path = p.we_wordv[0]; + wordfree(&p); +} + +void get_model_meta(FilePaths &file_paths, + ModelMeta &model_metadata, + bool use_full_precision) { + if (model_metadata.model_names.llm_model_name.empty() || + model_metadata.model_names.ssm_model_names.size() == 0) { + assert(false && "SpecInfer needs at least one LLM and one SSM for " + "speculative inference"); + } + model_metadata.llm_model_config_path = + join_path({file_paths.cache_folder_path, + "configs", + model_metadata.model_names.llm_model_name, + "config.json"}); + model_metadata.llm_tokenizer_path = + join_path({file_paths.cache_folder_path, + "tokenizers", + model_metadata.model_names.llm_model_name}); + model_metadata.llm_weights_path = + join_path({file_paths.cache_folder_path, + "weights", + model_metadata.model_names.llm_model_name, + use_full_precision ? "full-precision" : "half-precision"}); + + std::ifstream llm_config_file_handle(model_metadata.llm_model_config_path); + if (!llm_config_file_handle.good()) { + std::cout << "LLM Model config file " + << model_metadata.llm_model_config_path << " not found." + << std::endl; + assert(false); + } + json llm_model_config = json::parse(llm_config_file_handle, + /*parser_callback_t */ nullptr, + /*allow_exceptions */ true, + /*ignore_comments */ true); + + model_metadata.llm_model_type = ModelType::UNKNOWN; + auto architectures = llm_model_config["architectures"]; + for (auto const &str : architectures) { + if (str == "LlamaForCausalLM" || str == "LLaMAForCausalLM") { + std::string nameOrPath = llm_model_config["_name_or_path"]; + // TODO: support LLAMA-2 models not from Meta + bool llama2 = nameOrPath.find("meta-llama/Llama-2") == 0; + if (llama2) { + model_metadata.llm_model_type = ModelType::LLAMA2; + } else { + model_metadata.llm_model_type = ModelType::LLAMA; + } + break; + } else if (str == "OPTForCausalLM") { + model_metadata.llm_model_type = ModelType::OPT; + break; + } else if (str == "RWForCausalLM") { + model_metadata.llm_model_type = ModelType::FALCON; + break; + } + } + model_metadata.bos_token_id = llm_model_config["bos_token_id"]; + model_metadata.eos_token_id = llm_model_config["eos_token_id"]; + + for (auto ssm_model_name : model_metadata.model_names.ssm_model_names) { + std::string ssm_config_path = join_path({file_paths.cache_folder_path, + "configs", + ssm_model_name, + "config.json"}); + std::string ssm_tokenizer_path = + join_path({file_paths.cache_folder_path, "tokenizers", ssm_model_name}); + std::string ssm_weights_path = + join_path({file_paths.cache_folder_path, + "weights", + ssm_model_name, + use_full_precision ? "full-precision" : "half-precision"}); + + std::ifstream ssm_config_file_handle(ssm_config_path); + if (!ssm_config_file_handle.good()) { + std::cout << "SSM Model config file " << ssm_config_path << " not found." 
+ << std::endl; + assert(false); + } + json ssm_model_config = json::parse(ssm_config_file_handle, + /*parser_callback_t */ nullptr, + /*allow_exceptions */ true, + /*ignore_comments */ true); + + ModelType ssm_model_type = ModelType::UNKNOWN; + auto architectures = ssm_model_config["architectures"]; + for (auto const &str : architectures) { + if (str == "LlamaForCausalLM" || str == "LLaMAForCausalLM") { + std::string nameOrPath = ssm_model_config["_name_or_path"]; + // TODO: support LLAMA-2 models not from Meta + bool llama2 = nameOrPath.find("meta-llama/Llama-2") == 0; + if (llama2) { + ssm_model_type = ModelType::LLAMA2; + } else { + ssm_model_type = ModelType::LLAMA; + } + break; + } else if (str == "OPTForCausalLM") { + ssm_model_type = ModelType::OPT; + break; + } else if (str == "RWForCausalLM") { + ssm_model_type = ModelType::FALCON; + break; + } + } + if (ssm_model_config["bos_token_id"] != model_metadata.bos_token_id || + ssm_model_config["eos_token_id"] != model_metadata.eos_token_id) { + printf("Warning: bos/eos token id mismatch between LLM and one of the " + "SSMs!\n"); + } + model_metadata.ssm_model_types.push_back(ssm_model_type); + model_metadata.ssm_model_config_paths.push_back(ssm_config_path); + model_metadata.ssm_model_weights_paths.push_back(ssm_weights_path); + } + + assert(model_metadata.llm_model_type != ModelType::UNKNOWN && + "Invalid LLM model type passed (or no type was passed)."); + + for (auto mt : model_metadata.ssm_model_types) { + if (mt == ModelType::UNKNOWN) { + assert(false && "One of the SSM model types passed is invalid."); + } + } +} + +void FlexFlow::top_level_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + FFConfig ffconfig; + FilePaths file_paths; + ModelMeta model_metadata; + bool use_full_precision = false; + bool verbose = false; + + InputArgs const &command_args = HighLevelRuntime::get_input_args(); + char **argv = command_args.argv; + int argc = command_args.argc; + parse_input_args(argv, + argc, + file_paths, + model_metadata.model_names, + use_full_precision, + verbose); + + get_model_meta(file_paths, model_metadata, use_full_precision); + + assert(ffconfig.data_parallelism_degree * ffconfig.tensor_parallelism_degree * + ffconfig.pipeline_parallelism_degree == + ffconfig.numNodes * ffconfig.workersPerNode); + + // Create SentencePiece tokenizer or OPT tokenizer + GenerationConfig generationConfig; + InferenceManager *im = InferenceManager::get_inference_manager(); + RequestManager *rm = RequestManager::get_request_manager(); + rm->register_tokenizer(model_metadata.llm_model_type, + model_metadata.bos_token_id, + model_metadata.eos_token_id, + model_metadata.llm_tokenizer_path); + rm->register_output_filepath(file_paths.output_file_path); + + // Create LLM model + FFModel tree_model(ffconfig, ffconfig.cpu_offload); + if (model_metadata.llm_model_type == ModelType::LLAMA || + model_metadata.llm_model_type == ModelType::LLAMA2) { + LLAMA::create_llama_model(tree_model, + model_metadata.llm_model_config_path, + model_metadata.llm_weights_path, + TREE_VERIFY_MODE, + generationConfig, + use_full_precision); + } else if (model_metadata.llm_model_type == ModelType::OPT) { + OPT::create_opt_model(tree_model, + model_metadata.llm_model_config_path, + model_metadata.llm_weights_path, + TREE_VERIFY_MODE, + use_full_precision); + } else if (model_metadata.llm_model_type == ModelType::FALCON) { + FALCON::create_falcon_model(tree_model, + model_metadata.llm_model_config_path, + model_metadata.llm_weights_path, + 
TREE_VERIFY_MODE, + use_full_precision); + } else { + assert(false && "Invalid LLM model type passed (or no type was passed)."); + } + + // Create SSM models + int num_ssms = model_metadata.ssm_model_types.size(); + std::vector ssm_model_ids; + std::vector ssm_models; + FFConfig bm_config = ffconfig; + bm_config.data_parallelism_degree = bm_config.tensor_parallelism_degree = + bm_config.pipeline_parallelism_degree = 1; + for (int ssm_id = 0; ssm_id < num_ssms; ssm_id++) { + FFModel beam_model(bm_config); + ssm_models.push_back(beam_model); + } + + for (int ssm_id = 0; ssm_id < num_ssms; ssm_id++) { + FFModel &beam_model = ssm_models[ssm_id]; + if (model_metadata.ssm_model_types[ssm_id] == ModelType::LLAMA || + model_metadata.ssm_model_types[ssm_id] == ModelType::LLAMA2) { + LLAMA::create_llama_model(beam_model, + model_metadata.ssm_model_config_paths[ssm_id], + model_metadata.ssm_model_weights_paths[ssm_id], + BEAM_SEARCH_MODE, + generationConfig, + use_full_precision); + } else if (model_metadata.ssm_model_types[ssm_id] == ModelType::OPT) { + OPT::create_opt_model(beam_model, + model_metadata.ssm_model_config_paths[ssm_id], + model_metadata.ssm_model_weights_paths[ssm_id], + BEAM_SEARCH_MODE, + use_full_precision); + } else if (model_metadata.ssm_model_types[ssm_id] == ModelType::FALCON) { + FALCON::create_falcon_model( + beam_model, + model_metadata.ssm_model_config_paths[ssm_id], + model_metadata.ssm_model_weights_paths[ssm_id], + BEAM_SEARCH_MODE, + use_full_precision); + } else { + assert(false && "Invalid SSM model type passed."); + } + + rm->register_ssm_model(&beam_model); + } + + // Register requests from prompt file + int total_num_requests = 0; + { + using json = nlohmann::json; + std::ifstream file_handle(file_paths.prompt_file_path); + assert(file_handle.good() && "Prompt file does not exist."); + json prompt_json = json::parse(file_handle, + /*parser_callback_t */ nullptr, + /*allow_exceptions */ true, + /*ignore_comments */ true); + for (auto &prompt : prompt_json) { + std::string text = prompt.get(); + printf("Prompt[%d]: %s\n", total_num_requests, text.c_str()); + total_num_requests++; + tree_model.generate(text, 128 /*max_sequence_length*/); + } + } + + // Execution fence + { + Future future = runtime->issue_execution_fence(ctx); + future.get_void_result(); + } + + // float* data + std::cout << "----------inference finished--------------" << std::endl; +} + +void FlexFlow::register_custom_tasks() {} diff --git a/inference/utils/compress_llama_weights.py b/inference/utils/compress_llama_weights.py new file mode 100644 index 0000000000..c92ae6aca9 --- /dev/null +++ b/inference/utils/compress_llama_weights.py @@ -0,0 +1,117 @@ +import torch +import numpy as np +from transformers import AutoModelForCausalLM +import dataclasses + +@dataclasses.dataclass +class CompressionConfig: + """Group-wise quantization.""" + num_bits: int + group_size: int + group_dim: int + symmetric: bool + enabled: bool = True + +def compress(tensor, config): + """Simulate group-wise quantization.""" + if not config.enabled: + return tensor + + group_size, num_bits, group_dim, symmetric = ( + config.group_size, config.num_bits, config.group_dim, config.symmetric) + assert num_bits <= 8 + + original_shape = tensor.shape + num_groups = (original_shape[group_dim] + group_size - 1) // group_size + new_shape = (original_shape[:group_dim] + (num_groups, group_size) + + original_shape[group_dim+1:]) + + # Pad + pad_len = (group_size - original_shape[group_dim] % group_size) % group_size + if pad_len != 0: + 
pad_shape = original_shape[:group_dim] + (pad_len,) + original_shape[group_dim+1:] + tensor = torch.cat([ + tensor, + torch.zeros(pad_shape, dtype=tensor.dtype, device=tensor.device)], + dim=group_dim) + data = tensor.view(new_shape) + + # Quantize + if symmetric: + B = 2 ** (num_bits - 1) - 1 + scale = B / torch.max(data.abs(), dim=group_dim + 1, keepdim=True)[0] + data = data * scale + data = data.clamp_(-B, B).round_().to(torch.int8) + return data, scale, original_shape + else: + B = 2 ** num_bits - 1 + # print('max value') + # print(B) + mn = torch.min(data, dim=group_dim + 1, keepdim=True)[0] + mx = torch.max(data, dim=group_dim + 1, keepdim=True)[0] + + scale = B / (mx - mn) + data = data - mn + data.mul_(scale) + + data = data.clamp_(0, B).round_().to(torch.uint8) + return data, mn, scale, original_shape + + +def decompress(packed_data, config): + """Simulate group-wise dequantization.""" + if not config.enabled: + return packed_data + + group_size, num_bits, group_dim, symmetric = ( + config.group_size, config.num_bits, config.group_dim, config.symmetric) + + # Dequantize + if symmetric: + data, scale, original_shape = packed_data + data = data / scale + else: + data, mn, scale, original_shape = packed_data + data = data / scale + data.add_(mn) + + # Unpad + pad_len = (group_size - original_shape[group_dim] % group_size) % group_size + if pad_len: + padded_original_shape = ( + original_shape[:group_dim] + + (original_shape[group_dim] + pad_len,) + + original_shape[group_dim+1:]) + data = data.reshape(padded_original_shape) + indices = [slice(0, x) for x in original_shape] + return data[indices].contiguous() + else: + return data.view(original_shape) + +if __name__ == "__main__": + # torch.set_default_tensor_type(torch.HalfTensor) + # torch.set_default_tensor_type(torch.cuda.HalfTensor) + model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf") + config = CompressionConfig( + num_bits=8, group_size=32, group_dim=0, symmetric=False) + for name, params in model.named_parameters(): + name = ( + name.replace(".", "_") + .replace("self_attn", "attention") + .replace("q_proj", "wq") + .replace("k_proj", "wk") + .replace("v_proj", "wv") + .replace("o_proj", "wo") + .replace("mlp", "feed_forward") + .replace("gate_proj", "w1") + .replace("down_proj", "w2") + .replace("up_proj", "w3") + .replace("input_layernorm", "attention_norm") + .replace("post_attention_layernorm", "ffn_norm") + .replace("embed_tokens", "tok_embeddings") + .replace("lm_head", "output") + .replace("model_", "") + ) + if "feed_forward" in name or "output" in name or "attention_w" in name: + data, mn, scale, original_shape = compress(params, config) + \ No newline at end of file diff --git a/inference/utils/download_hf_model.py b/inference/utils/download_hf_model.py new file mode 100644 index 0000000000..689730f32b --- /dev/null +++ b/inference/utils/download_hf_model.py @@ -0,0 +1,63 @@ +#!/usr/bin/env python +import flexflow.serve as ff +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "model_names", type=str, nargs="+", help="Name of the model(s) to download" + ) + parser.add_argument( + "--cache-folder", + type=str, + help="Folder to use to store the model(s) assets in FlexFlow format", + default="", + ) + parser.add_argument( + "--refresh-cache", + action="store_true", + help="Use this flag to force the refresh of the model(s) weights/tokenizer cache", + ) + group = parser.add_mutually_exclusive_group() + group.add_argument( + 
"--full-precision-only", + action="store_true", + help="Only download the full precision version of the weights", + ) + group.add_argument( + "--half-precision-only", + action="store_true", + help="Only download the half precision version of the weights", + ) + args = parser.parse_args() + return args + + +def main(args): + # Initialize FF serve to gain access to its utils + ff.init_cpu() + + if args.full_precision_only: + data_types = ff.DataType.DT_FLOAT + elif args.half_precision_only: + data_types = ff.DataType.DT_HALF + else: + data_types = (ff.DataType.DT_FLOAT, ff.DataType.DT_HALF) + + for model_name in args.model_names: + for data_type in data_types: + llm = ff.LLM( + model_name, + data_type=data_type, + cache_path=args.cache_folder, + refresh_cache=args.refresh_cache, + ) + llm.download_hf_weights_if_needed() + llm.download_hf_tokenizer_if_needed() + llm.download_hf_config() + + +if __name__ == "__main__": + args = parse_args() + main(args) diff --git a/python/Makefile b/python/Makefile index 07beab86f3..2edee4f0d1 100644 --- a/python/Makefile +++ b/python/Makefile @@ -30,6 +30,8 @@ FF_USE_PYTHON := 1 SHARED_OBJECTS := 1 # we build the shared lib for legion # FF_PYTHON_USE_INDEX_LOADER = 1 +INSTALL_TOKENIZERS := $(shell $(FF_HOME)/scripts/install_tokenizer.sh) + ifeq ($(shell uname -s), Darwin) PYTHON_EXT := dylib else diff --git a/python/flexflow/core/__init__.py b/python/flexflow/core/__init__.py index 1ad4746cca..5b421a74ed 100644 --- a/python/flexflow/core/__init__.py +++ b/python/flexflow/core/__init__.py @@ -62,12 +62,12 @@ flexflow_library.initialize() # check which python binding to use - if flexflow_python_binding() == 'pybind11': - print("Using pybind11 flexflow bindings.") - from .flexflow_pybind11 import * + if flexflow_python_binding() == "pybind11": + print("Using pybind11 flexflow bindings.") + from .flexflow_pybind11 import * else: - print("Using cffi flexflow bindings.") - from .flexflow_cffi import * + print("Using cffi flexflow bindings.") + from .flexflow_cffi import * else: - pass \ No newline at end of file + pass diff --git a/python/flexflow/core/flexflow_cffi.py b/python/flexflow/core/flexflow_cffi.py index 4eab2155cf..1508371ae7 100644 --- a/python/flexflow/core/flexflow_cffi.py +++ b/python/flexflow/core/flexflow_cffi.py @@ -22,7 +22,8 @@ import warnings import numpy as np from .flexflow_logger import fflogger -from flexflow.type import ActiMode, RegularizerMode, AggrMode, PoolType, DataType, LossType, CompMode, MetricsType, OpType, ParameterSyncType, enum_to_int, int_to_enum +from flexflow.type import ActiMode, RegularizerMode, AggrMode, PoolType, DataType, LossType, CompMode, MetricsType, InferenceMode, ModelType, OpType, ParameterSyncType, enum_to_int, int_to_enum + _FF_BUILD_DOCS = bool(os.environ.get('READTHEDOCS') or os.environ.get("FF_BUILD_DOCS")) if not _FF_BUILD_DOCS: from .flexflowlib import ffi, flexflow_library @@ -39,6 +40,8 @@ def get_c_name(name): return ffi.new("char[]", name.encode('ascii')) def get_datatype_size(datatype): + if (datatype == DataType.DT_HALF): + return 2 if (datatype == DataType.DT_FLOAT): return 4 elif (datatype == DataType.DT_DOUBLE): @@ -428,6 +431,62 @@ class MultiHeadAttention(Op): def __init__(self, handle, idx=None, name=None): super(MultiHeadAttention, self).__init__(handle, idx, name) +# ----------------------------------------------------------------------- +# Incremental MultiHeadAttention +# ----------------------------------------------------------------------- +class IncMultiHeadAttention(Op): + def 
__init__(self, handle, idx=None, name=None): + super(IncMultiHeadAttention, self).__init__(handle, idx, name) + +# ----------------------------------------------------------------------- +# Speculative Incremental MultiHeadAttention +# ----------------------------------------------------------------------- +class SpecIncMultiHeadSelfAttention(Op): + def __init__(self, handle, idx=None, name=None): + super(SpecIncMultiHeadSelfAttention, self).__init__(handle, idx, name) + +# ----------------------------------------------------------------------- +# TreeVerify Incremental MultiHeadAttention +# ----------------------------------------------------------------------- +class TreeIncMultiHeadSelfAttention(Op): + def __init__(self, handle, idx=None, name=None): + super(TreeIncMultiHeadSelfAttention, self).__init__(handle, idx, name) + +# ----------------------------------------------------------------------- +# RMS Norm +# ----------------------------------------------------------------------- +class RMSNorm(Op): + def __init__(self, handle, idx=None, name=None): + super(RMSNorm, self).__init__(handle, idx, name) + +# ----------------------------------------------------------------------- +# ArgTopK +# ----------------------------------------------------------------------- +class ArgTopK(Op): + def __init__(self, handle, idx=None, name=None): + super(ArgTopK, self).__init__(handle, idx, name) + +# ----------------------------------------------------------------------- +# BeamTopK +# ----------------------------------------------------------------------- +class BeamTopK(Op): + def __init__(self, handle, idx=None, name=None): + super(BeamTopK, self).__init__(handle, idx, name) + +# ----------------------------------------------------------------------- +# Sampling +# ----------------------------------------------------------------------- +class Sampling(Op): + def __init__(self, handle, idx=None, name=None): + super(Sampling, self).__init__(handle, idx, name) + +# ----------------------------------------------------------------------- +# ArgMax +# ----------------------------------------------------------------------- +class ArgMax(Op): + def __init__(self, handle, idx=None, name=None): + super(ArgMax, self).__init__(handle, idx, name) + # ----------------------------------------------------------------------- # flexflow_op_t handle to Op # ----------------------------------------------------------------------- @@ -507,7 +566,23 @@ def convert_op_handle_to_op(op_type, handle, idx=None, name=None): elif op_type == OpType.REVERSE: return Reverse(handle, idx, name) elif op_type == OpType.MULTIHEAD_ATTENTION: - return Reverse(handle, idx, name) + return MultiHeadAttention(handle, idx, name) + elif op_type == OpType.INC_MULTIHEAD_ATTENTION: + return IncMultiHeadAttention(handle, idx, name) + elif op_type == OpType.SPEC_INC_MULTIHEAD_SELF_ATTENTION: + return SpecIncMultiHeadSelfAttention(handle, idx, name) + elif op_type == OpType.TREE_INC_MULTIHEAD_SELF_ATTENTION: + return TreeIncMultiHeadSelfAttention(handle, idx, name) + elif op_type == OpType.RMS_NORM: + return RMSNorm(handle, idx, name) + elif op_type == OpType.ARG_TOPK: + return ArgTopK(handle, idx, name) + elif op_type == OpType.BEAM_TOPK: + return BeamTopK(handle, idx, name) + elif op_type == OpType.SAMPLING: + return Sampling(handle, idx, name) + elif op_type == OpType.ARGMAX: + return ArgMax(handle, idx, name) elif op_type == OpType.RSQRT: return Rsqrt(handle, idx, name) elif op_type == OpType.POW: @@ -553,10 +628,50 @@ def epochs(self): 
@property def enable_control_replication(self): return ffc.flexflow_config_get_enable_control_replication(self.handle) + + @property + def data_parallelism_degree(self): + return ffc.flexflow_config_get_data_parallelism_degree(self.handle) + + @data_parallelism_degree.setter + def data_parallelism_degree(self, value): + if type(value) is not int: + raise ValueError("The data parallelism degree must be specified as an integer number") + elif value < 1: + raise ValueError("The data parallelism degree cannot be lower than 1") + ffc.flexflow_config_set_data_parallelism_degree(self.handle, value) + + @property + def tensor_parallelism_degree(self): + return ffc.flexflow_config_get_tensor_parallelism_degree(self.handle) + + @tensor_parallelism_degree.setter + def tensor_parallelism_degree(self, value): + if type(value) is not int: + raise ValueError("The tensor parallelism degree must be specified as an integer number") + elif value < 1: + raise ValueError("The tensor parallelism degree cannot be lower than 1") + ffc.flexflow_config_set_tensor_parallelism_degree(self.handle, value) + + @property + def pipeline_parallelism_degree(self): + return ffc.flexflow_config_get_pipeline_parallelism_degree(self.handle) + + @pipeline_parallelism_degree.setter + def pipeline_parallelism_degree(self, value): + if type(value) is not int: + raise ValueError("The pipeline parallelism degree must be specified as an integer number") + elif value < 1: + raise ValueError("The pipeline parallelism degree cannot be lower than 1") + ffc.flexflow_config_set_pipeline_parallelism_degree(self.handle, value) @property def python_data_loader_type(self): return ffc.flexflow_config_get_python_data_loader_type(self.handle) + + @property + def cpu_offload(self): + return ffc.flexflow_config_get_offload(self.handle) def get_current_time(self): return ffc.flexflow_get_current_time(self.handle) @@ -670,7 +785,11 @@ def set_tensor(self, ffmodel, np_array): assert np_shape[i] == self.dims[i], "please check shape dim %d (%d == %d)" %(i, np_shape[i], self.dims[i]) c_dims = ffi.new("int[]", self.dims) np_raw_ptr = np_array.__array_interface__['data'] - if np_array.dtype == np.float32: + if np_array.dtype == np.float16: + assert self.data_type == DataType.DT_HALF, "Wrong datatype" + raw_ptr = ffi.cast("half*", np_raw_ptr[0]) + ret_val = ffc.flexflow_tensor_set_tensor_float(self.handle, ffmodel.handle, num_dims, c_dims, raw_ptr) + elif np_array.dtype == np.float32: assert self.data_type == DataType.DT_FLOAT, "Wrong datatype" raw_ptr = ffi.cast("float*", np_raw_ptr[0]) ret_val = ffc.flexflow_tensor_set_tensor_float(self.handle, ffmodel.handle, num_dims, c_dims, raw_ptr) @@ -685,7 +804,9 @@ def set_tensor(self, ffmodel, np_array): def get_tensor(self, ffmodel): shape = self.dims - if self.data_type == DataType.DT_FLOAT: + if self.data_type == DataType.DT_HALF: + np_array = np.empty(shape, dtype=np.float16) + elif self.data_type == DataType.DT_FLOAT: np_array = np.empty(shape, dtype=np.float32) elif self.data_type == DataType.DT_INT32: np_array = np.empty(shape, dtype=np.int32) @@ -709,7 +830,9 @@ def get_tensor(self, ffmodel): def get_gradients(self, ffmodel, comm_type): shape = self.dims - if self.data_type == DataType.DT_FLOAT: + if self.data_type == DataType.DT_HALF: + np_array = np.empty(shape, dtype=np.float16) + elif self.data_type == DataType.DT_FLOAT: np_array = np.empty(shape, dtype=np.float32) elif self.data_type == DataType.DT_INT32: np_array = np.empty(shape, dtype=np.int32) @@ -734,7 +857,9 @@ def get_gradients(self, ffmodel, 
comm_type): def get_model_output_gradients(self, ffmodel, comm_type): shape = self.dims - if self.data_type == DataType.DT_FLOAT: + if self.data_type == DataType.DT_HALF: + np_array = np.empty(shape, dtype=np.float16) + elif self.data_type == DataType.DT_FLOAT: np_array = np.empty(shape, dtype=np.float32) elif self.data_type == DataType.DT_INT32: np_array = np.empty(shape, dtype=np.int32) @@ -755,7 +880,9 @@ def get_model_output_gradients(self, ffmodel, comm_type): def get_model_output_tensor(self, ffmodel): shape = self.dims - if self.data_type == DataType.DT_FLOAT: + if self.data_type == DataType.DT_HALF: + np_array = np.empty(shape, dtype=np.float16) + elif self.data_type == DataType.DT_FLOAT: np_array = np.empty(shape, dtype=np.float32) elif self.data_type == DataType.DT_INT32: np_array = np.empty(shape, dtype=np.int32) @@ -775,7 +902,9 @@ def get_model_output_tensor(self, ffmodel): def __get_raw_ptr(self, ffmodel, ffconfig, data_type): assert data_type == self.data_type, "Tensor check data type" - if (data_type == DataType.DT_FLOAT): + if (data_type == DataType.DT_HALF): + return ffc.flexflow_tensor_get_raw_ptr_float(self.handle, ffmodel.handle, ffconfig.handle) + elif (data_type == DataType.DT_FLOAT): return ffc.flexflow_tensor_get_raw_ptr_float(self.handle, ffmodel.handle, ffconfig.handle) elif (data_type == DataType.DT_INT32): return ffc.flexflow_tensor_get_raw_ptr_int32(self.handle, ffmodel.handle, ffconfig.handle) @@ -896,7 +1025,7 @@ def __init__(self, ffconfig): :returns: FFModel -- the model. """ - self.handle = ffc.flexflow_model_create(ffconfig.handle) + self.handle = ffc.flexflow_model_create(ffconfig.handle, ffconfig.cpu_offload) self._handle = ffi.gc(self.handle, ffc.flexflow_model_destroy) self._layers = dict() self._nb_layers = 0 @@ -1290,7 +1419,7 @@ def conv2d(self, input, out_channels, return Tensor(handle, owner_op_type=OpType.CONV2D) def embedding(self, input, num_embeddings, embedding_dim, - aggr, shared_op=None, kernel_initializer=None, name=None): + aggr, dtype=DataType.DT_FLOAT, shared_op=None, kernel_initializer=None, name=None): """Layer that turns positive integers into dense vectors of fixed size :param input: the input Tensor. @@ -1304,6 +1433,9 @@ def embedding(self, input, num_embeddings, embedding_dim, :param aggr: aggregation mode. Options are AGGR_MODE_NONE, AGGR_MODE_SUM and AGGR_MODE_AVG. :type aggr: AggrMode + + :param dtype: the tensor data type. Options are DT_BOOLEAN, DT_INT32, DT_INT64, DT_HALF, DT_FLOAT, DT_DOUBLE, DT_INT4, DT_INT8, DT_NONE + :type dtype: DataType :param shared_op: the layer whose parameters are shared with. Default is None. 
:type shared_op: Op @@ -1319,6 +1451,7 @@ def embedding(self, input, num_embeddings, embedding_dim, c_name = get_c_name(name) shared_op_handle = self.__get_op_handle(shared_op) c_aggr = enum_to_int(AggrMode, aggr) + c_dtype = enum_to_int(DataType, dtype) if kernel_initializer is None: kernel_initializer = GlorotUniformInitializer(42) assert (type(kernel_initializer) is GlorotUniformInitializer) or \ @@ -1327,7 +1460,7 @@ def embedding(self, input, num_embeddings, embedding_dim, (type(kernel_initializer) is NormInitializer), \ f"Unknown initializer type: {kernel_initializer}" handle = ffc.flexflow_model_add_embedding( - self.handle, input.handle, num_embeddings, embedding_dim, c_aggr, + self.handle, input.handle, num_embeddings, embedding_dim, c_aggr, c_dtype, shared_op_handle, kernel_initializer.handle, c_name, ) # NOTE: We must keep a reference to the initializer or else it will be @@ -1471,7 +1604,7 @@ def batch_matmul(self, A, B, a_seq_length_dim=None, b_seq_length_dim=None, name= def dense(self, input, out_dim, activation=ActiMode.AC_MODE_NONE, use_bias=True, - datatype=DataType.DT_FLOAT, + datatype=DataType.DT_NONE, shared_op=None, kernel_initializer=None, bias_initializer=None, kernel_regularizer=None, name=None): @@ -1968,6 +2101,527 @@ def multihead_attention(self, query, key, value, handle = ffc.flexflow_model_add_multihead_attention(self.handle, query.handle, key.handle, value.handle, embed_dim, num_heads, kdim, vdim, dropout, bias, add_bias_kv, add_zero_attn, kernel_init_handle, c_name) self.add_layer(OpType.MULTIHEAD_ATTENTION, name) return Tensor(handle, owner_op_type=OpType.MULTIHEAD_ATTENTION) + + def inc_multihead_self_attention(self, input, + embed_dim, num_heads, + kdim=0, vdim=0, dropout=0.0, + bias=True, add_bias_kv=False, add_zero_attn=False, + data_type=DataType.DT_NONE, kernel_initializer=None, + apply_rotary_embedding=False, scaling_query=False, scaling_factor=1.0, + qk_prod_scaling=True, name=None): + """Defines the MultiHead Attention operation as described in Attention Is All You Need + which takes in the tensors :attr:`input`, and uses it for all three of query, key and values. + In inference mode, the attention is computed using incremental decoding. + + :param input: the input Tensor. + :type input: Tensor + + :param embed_dim: total dimension of the model + :type embed_dim: int + + :param num_heads: Number of attention heads. + :type num_heads: int + + :param kdim: total number of features in key. Default is 0 + :type kdim: int + + :param vdim: total number of features in value. Default is 0 + :type vdim: int + + :param dropout: a Dropout layer on attn_output_weights. Default is 0.0 + :type dropout: float(0-1) + + :param bias: Whether the dense layers use bias vectors. Default is True. + :type bias: bool + + :param add_bias_kv: add bias to the key and value sequences at dim=0. Default is False. + :type add_bias_kv: bool + + :param add_zero_attn: add a new batch of zeros to the key and value sequences at dim=1. Default is False. + :type add_zero_attn: bool + + :param data_type: the data type of the tensors. Default is DataType.DT_NONE, which means using the data type of the input tensors. + :type data_type: DataType + + :param kernel_initializer: Initializer for dense layer kernels. If it is set to None, the GlorotUniformInitializer is applied. + :type kernel_initializer: Initializer + + :param apply_rotary_embedding: Whether to apply rotary embeddings. Default is False. 
+ :type apply_rotary_embedding: bool + + :param scaling_query: Whether to apply scaling query. Default is False. + :type scaling_query: bool + + :param scaling_factor: The scaling factor to use for scaling. Default is 1.0. + :type scaling_factor: float + + :param qk_prod_scaling: Whether to apply scaling to the QK product. Default is True. + :type qk_prod_scaling: bool + + :param name: the name of the layer. Default is None. + :type name: string + + :returns: Tensor -- the output tensor. + """ + c_name = get_c_name(name) + kernel_init_handle = self.__get_initializer_handle(kernel_initializer) + c_data_type = enum_to_int(DataType, data_type) + handle = ffc.flexflow_model_add_inc_multihead_self_attention(self.handle, input.handle, embed_dim, num_heads, kdim, vdim, dropout, bias, add_bias_kv, add_zero_attn, c_data_type, kernel_init_handle, apply_rotary_embedding, scaling_query, scaling_factor, qk_prod_scaling, c_name) + self.add_layer(OpType.INC_MULTIHEAD_ATTENTION, name) + return Tensor(handle, owner_op_type=OpType.INC_MULTIHEAD_ATTENTION) + + def spec_inc_multihead_self_attention(self, input, + embed_dim, num_heads, + kdim=0, vdim=0, dropout=0.0, + bias=True, add_bias_kv=False, add_zero_attn=False, + data_type=DataType.DT_NONE, kernel_initializer=None, + apply_rotary_embedding=False, scaling_query=False, scaling_factor=1.0, + qk_prod_scaling=True, name=None): + """Defines the MultiHead Attention operation as described in Attention Is All You Need + which takes in the tensors :attr:`input`, and uses it for all three of query, key and values. + This operator only supports computing the attention in inference (beam search) mode. + + :param input: the input Tensor. + :type input: Tensor + + :param embed_dim: total dimension of the model + :type embed_dim: int + + :param num_heads: Number of attention heads. + :type num_heads: int + + :param kdim: total number of features in key. Default is 0 + :type kdim: int + + :param vdim: total number of features in value. Default is 0 + :type vdim: int + + :param dropout: a Dropout layer on attn_output_weights. Default is 0.0 + :type dropout: float(0-1) + + :param bias: Whether the dense layers use bias vectors. Default is True. + :type bias: bool + + :param add_bias_kv: add bias to the key and value sequences at dim=0. Default is False. + :type add_bias_kv: bool + + :param add_zero_attn: add a new batch of zeros to the key and value sequences at dim=1. Default is False. + :type add_zero_attn: bool + + :param data_type: the data type of the tensors. Default is DataType.DT_NONE, which means using the data type of the input tensors. + :type data_type: DataType + + :param kernel_initializer: Initializer for dense layer kernels. If it is set to None, the GlorotUniformInitializer is applied. + :type kernel_initializer: Initializer + + :param apply_rotary_embedding: Whether to apply rotary embeddings. Default is False. + :type apply_rotary_embedding: bool + + :param scaling_query: Whether to apply scaling query. Default is False. + :type scaling_query: bool + + :param scaling_factor: The scaling factor to use for scaling. Default is 1.0. + :type scaling_factor: float + + :param qk_prod_scaling: Whether to apply scaling to the QK product. Default is True. + :type qk_prod_scaling: bool + + :param name: the name of the layer. Default is None. + :type name: string + + :returns: Tensor -- the output tensor. 
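+
+    Example (a minimal sketch with hypothetical sizes; assumes ``ffmodel`` is this FFModel
+    instance and ``hidden`` is a Tensor previously added to it)::
+
+        attn_out = ffmodel.inc_multihead_self_attention(
+            hidden, embed_dim=4096, num_heads=32, apply_rotary_embedding=True
+        )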
+ """ + c_name = get_c_name(name) + kernel_init_handle = self.__get_initializer_handle(kernel_initializer) + c_data_type = enum_to_int(DataType, data_type) + handle = ffc.flexflow_model_add_spec_inc_multihead_self_attention(self.handle, input.handle, embed_dim, num_heads, kdim, vdim, dropout, bias, add_bias_kv, add_zero_attn, c_data_type, kernel_init_handle, apply_rotary_embedding, scaling_query, scaling_factor, qk_prod_scaling, c_name) + self.add_layer(OpType.SPEC_INC_MULTIHEAD_SELF_ATTENTION, name) + return Tensor(handle, owner_op_type=OpType.SPEC_INC_MULTIHEAD_SELF_ATTENTION) + + def inc_multihead_self_attention_verify(self, input, + embed_dim, num_heads, + kdim=0, vdim=0, dropout=0.0, + bias=True, add_bias_kv=False, add_zero_attn=False, + data_type=DataType.DT_NONE, kernel_initializer=None, + apply_rotary_embedding=False, scaling_query=False, scaling_factor=1.0, + qk_prod_scaling=True, name=None): + """Defines the MultiHead Attention operation as described in Attention Is All You Need + which takes in the tensors :attr:`input`, and uses it for all three of query, key and values. + This operator only supports computing the attention in inference (tree verify) mode. + + :param input: the input Tensor. + :type input: Tensor + + :param embed_dim: total dimension of the model + :type embed_dim: int + + :param num_heads: Number of attention heads. + :type num_heads: int + + :param kdim: total number of features in key. Default is 0 + :type kdim: int + + :param vdim: total number of features in value. Default is 0 + :type vdim: int + + :param dropout: a Dropout layer on attn_output_weights. Default is 0.0 + :type dropout: float(0-1) + + :param bias: Whether the dense layers use bias vectors. Default is True. + :type bias: bool + + :param add_bias_kv: add bias to the key and value sequences at dim=0. Default is False. + :type add_bias_kv: bool + + :param add_zero_attn: add a new batch of zeros to the key and value sequences at dim=1. Default is False. + :type add_zero_attn: bool + + :param data_type: the data type of the tensors. Default is DataType.DT_NONE, which means using the data type of the input tensors. + :type data_type: DataType + + :param kernel_initializer: Initializer for dense layer kernels. If it is set to None, the GlorotUniformInitializer is applied. + :type kernel_initializer: Initializer + + :param apply_rotary_embedding: Whether to apply rotary embeddings. Default is False. + :type apply_rotary_embedding: bool + + :param scaling_query: Whether to apply scaling query. Default is False. + :type scaling_query: bool + + :param scaling_factor: The scaling factor to use for scaling. Default is 1.0. + :type scaling_factor: float + + :param qk_prod_scaling: Whether to apply scaling to the QK product. Default is True. + :type qk_prod_scaling: bool + + :param name: the name of the layer. Default is None. + :type name: string + + :returns: Tensor -- the output tensor. 
+ """ + c_name = get_c_name(name) + kernel_init_handle = self.__get_initializer_handle(kernel_initializer) + c_data_type = enum_to_int(DataType, data_type) + handle = ffc.flexflow_model_add_inc_multihead_self_attention_verify(self.handle, input.handle, embed_dim, num_heads, kdim, vdim, dropout, bias, add_bias_kv, add_zero_attn, c_data_type, kernel_init_handle, apply_rotary_embedding, scaling_query, scaling_factor, qk_prod_scaling, c_name) + self.add_layer(OpType.TREE_INC_MULTIHEAD_SELF_ATTENTION, name) + return Tensor(handle, owner_op_type=OpType.TREE_INC_MULTIHEAD_SELF_ATTENTION) + + def inc_multiquery_self_attention(self, input, + embed_dim, num_q_heads, num_kv_heads, + kdim=0, vdim=0, dropout=0.0, + bias=True, add_bias_kv=False, add_zero_attn=False, + data_type=DataType.DT_NONE, kernel_initializer=None, + apply_rotary_embedding=False, scaling_query=False, scaling_factor=1.0, + qk_prod_scaling=True, name=None): + """Defines the multi-query head attention, which allows a different number of Q and KV heads, + and takes in the tensors :attr:`input`, and uses it for all three of query, key and values. + In inference mode, the attention is computed using incremental decoding. + + :param input: the input Tensor. + :type input: Tensor + + :param embed_dim: total dimension of the model + :type embed_dim: int + + :param num_q_heads: Number of query attention heads. + :type num_q_heads: int + + :param num_kv_heads: Number of key/value attention heads. + :type num_kv_heads: int + + :param kdim: total number of features in key. Default is 0 + :type kdim: int + + :param vdim: total number of features in value. Default is 0 + :type vdim: int + + :param dropout: a Dropout layer on attn_output_weights. Default is 0.0 + :type dropout: float(0-1) + + :param bias: Whether the dense layers use bias vectors. Default is True. + :type bias: bool + + :param add_bias_kv: add bias to the key and value sequences at dim=0. Default is False. + :type add_bias_kv: bool + + :param add_zero_attn: add a new batch of zeros to the key and value sequences at dim=1. Default is False. + :type add_zero_attn: bool + + :param data_type: the data type of the tensors. Default is DataType.DT_NONE, which means using the data type of the input tensors. + :type data_type: DataType + + :param kernel_initializer: Initializer for dense layer kernels. If it is set to None, the GlorotUniformInitializer is applied. + :type kernel_initializer: Initializer + + :param apply_rotary_embedding: Whether to apply rotary embeddings. Default is False. + :type apply_rotary_embedding: bool + + :param scaling_query: Whether to apply scaling query. Default is False. + :type scaling_query: bool + + :param scaling_factor: The scaling factor to use for scaling. Default is 1.0. + :type scaling_factor: float + + :param qk_prod_scaling: Whether to apply scaling to the QK product. Default is True. + :type qk_prod_scaling: bool + + :param name: the name of the layer. Default is None. + :type name: string + + :returns: Tensor -- the output tensor. 
+ """ + c_name = get_c_name(name) + kernel_init_handle = self.__get_initializer_handle(kernel_initializer) + c_data_type = enum_to_int(DataType, data_type) + handle = ffc.flexflow_model_add_inc_multiquery_self_attention(self.handle, input.handle, embed_dim, num_q_heads, num_kv_heads, kdim, vdim, dropout, bias, add_bias_kv, add_zero_attn, c_data_type, kernel_init_handle, apply_rotary_embedding, scaling_query, scaling_factor, qk_prod_scaling, c_name) + self.add_layer(OpType.INC_MULTIHEAD_ATTENTION, name) + return Tensor(handle, owner_op_type=OpType.INC_MULTIHEAD_ATTENTION) + + def spec_inc_multiquery_self_attention(self, input, + embed_dim, num_q_heads, num_kv_heads, + kdim=0, vdim=0, dropout=0.0, + bias=True, add_bias_kv=False, add_zero_attn=False, + data_type=DataType.DT_NONE, kernel_initializer=None, + apply_rotary_embedding=False, scaling_query=False, scaling_factor=1.0, + qk_prod_scaling=True, name=None): + """Defines the multi-query head attention, which allows a different number of Q and KV heads, + and takes in the tensors :attr:`input`, and uses it for all three of query, key and values. + This operator only supports computing the attention in inference (beam search) mode. + + :param input: the input Tensor. + :type input: Tensor + + :param embed_dim: total dimension of the model + :type embed_dim: int + + :param num_q_heads: Number of query attention heads. + :type num_q_heads: int + + :param num_kv_heads: Number of key/value attention heads. + :type num_kv_heads: int + + :param kdim: total number of features in key. Default is 0 + :type kdim: int + + :param vdim: total number of features in value. Default is 0 + :type vdim: int + + :param dropout: a Dropout layer on attn_output_weights. Default is 0.0 + :type dropout: float(0-1) + + :param bias: Whether the dense layers use bias vectors. Default is True. + :type bias: bool + + :param add_bias_kv: add bias to the key and value sequences at dim=0. Default is False. + :type add_bias_kv: bool + + :param add_zero_attn: add a new batch of zeros to the key and value sequences at dim=1. Default is False. + :type add_zero_attn: bool + + :param data_type: the data type of the tensors. Default is DataType.DT_NONE, which means using the data type of the input tensors. + :type data_type: DataType + + :param kernel_initializer: Initializer for dense layer kernels. If it is set to None, the GlorotUniformInitializer is applied. + :type kernel_initializer: Initializer + + :param apply_rotary_embedding: Whether to apply rotary embeddings. Default is False. + :type apply_rotary_embedding: bool + + :param scaling_query: Whether to apply scaling query. Default is False. + :type scaling_query: bool + + :param scaling_factor: The scaling factor to use for scaling. Default is 1.0. + :type scaling_factor: float + + :param qk_prod_scaling: Whether to apply scaling to the QK product. Default is True. + :type qk_prod_scaling: bool + + :param name: the name of the layer. Default is None. + :type name: string + + :returns: Tensor -- the output tensor. 
+ """ + c_name = get_c_name(name) + kernel_init_handle = self.__get_initializer_handle(kernel_initializer) + c_data_type = enum_to_int(DataType, data_type) + handle = ffc.flexflow_model_add_spec_inc_multiquery_self_attention(self.handle, input.handle, embed_dim, num_q_heads, num_kv_heads, kdim, vdim, dropout, bias, add_bias_kv, add_zero_attn, c_data_type, kernel_init_handle, apply_rotary_embedding, scaling_query, scaling_factor, qk_prod_scaling, c_name) + self.add_layer(OpType.SPEC_INC_MULTIHEAD_SELF_ATTENTION, name) + return Tensor(handle, owner_op_type=OpType.SPEC_INC_MULTIHEAD_SELF_ATTENTION) + + def inc_multiquery_self_attention_verify(self, input, + embed_dim, num_q_heads, num_kv_heads, + kdim=0, vdim=0, dropout=0.0, + bias=True, add_bias_kv=False, add_zero_attn=False, + data_type=DataType.DT_NONE, kernel_initializer=None, + apply_rotary_embedding=False, scaling_query=False, scaling_factor=1.0, + qk_prod_scaling=True, name=None): + """Defines the multi-query head attention, which allows a different number of Q and KV heads, + and takes in the tensors :attr:`input`, and uses it for all three of query, key and values. + This operator only supports computing the attention in inference (tree verify) mode. + + :param input: the input Tensor. + :type input: Tensor + + :param embed_dim: total dimension of the model + :type embed_dim: int + + :param num_q_heads: Number of query attention heads. + :type num_q_heads: int + + :param num_kv_heads: Number of key/value attention heads. + :type num_kv_heads: int + + :param kdim: total number of features in key. Default is 0 + :type kdim: int + + :param vdim: total number of features in value. Default is 0 + :type vdim: int + + :param dropout: a Dropout layer on attn_output_weights. Default is 0.0 + :type dropout: float(0-1) + + :param bias: Whether the dense layers use bias vectors. Default is True. + :type bias: bool + + :param add_bias_kv: add bias to the key and value sequences at dim=0. Default is False. + :type add_bias_kv: bool + + :param add_zero_attn: add a new batch of zeros to the key and value sequences at dim=1. Default is False. + :type add_zero_attn: bool + + :param data_type: the data type of the tensors. Default is DataType.DT_NONE, which means using the data type of the input tensors. + :type data_type: DataType + + :param kernel_initializer: Initializer for dense layer kernels. If it is set to None, the GlorotUniformInitializer is applied. + :type kernel_initializer: Initializer + + :param apply_rotary_embedding: Whether to apply rotary embeddings. Default is False. + :type apply_rotary_embedding: bool + + :param scaling_query: Whether to apply scaling query. Default is False. + :type scaling_query: bool + + :param scaling_factor: The scaling factor to use for scaling. Default is 1.0. + :type scaling_factor: float + + :param qk_prod_scaling: Whether to apply scaling to the QK product. Default is True. + :type qk_prod_scaling: bool + + :param name: the name of the layer. Default is None. + :type name: string + + :returns: Tensor -- the output tensor. 
+ """ + c_name = get_c_name(name) + kernel_init_handle = self.__get_initializer_handle(kernel_initializer) + c_data_type = enum_to_int(DataType, data_type) + handle = ffc.flexflow_model_add_inc_multiquery_self_attention_verify(self.handle, input.handle, embed_dim, num_q_heads, num_kv_heads, kdim, vdim, dropout, bias, add_bias_kv, add_zero_attn, c_data_type, kernel_init_handle, apply_rotary_embedding, scaling_query, scaling_factor, qk_prod_scaling, c_name) + self.add_layer(OpType.TREE_INC_MULTIHEAD_SELF_ATTENTION, name) + return Tensor(handle, owner_op_type=OpType.TREE_INC_MULTIHEAD_SELF_ATTENTION) + + def rms_norm(self, input, eps, dim, name=None): + """Defines the RMS Norm layer. + + :param input: the input Tensor. + :type input: Tensor + + :param eps: a value added to the denominator for numerical stability + :type eps: float + + :param dim: The dimension with respect to which to take the norm + :type dim: int + + :param name: the name of the layer. Default is None. + :type name: string + + :returns: Tensor -- the output tensor. + """ + c_name = get_c_name(name) + handle = ffc.flexflow_model_add_rms_norm(self.handle, input.handle, eps, dim, c_name) + self.add_layer(OpType.RMS_NORM, name) + return Tensor(handle, owner_op_type=OpType.RMS_NORM) + + def arg_top_k(self, input, k, sorted, name=None): + """Defines the Arg TopK layer. + + :param input: the input Tensor. + :type input: Tensor + + :param k: the top k indices to select + :type k: int + + :param sorted: Whether the entries should be sorted + :type sorted: bool + + :param name: the name of the layer. Default is None. + :type name: string + + :returns: Tensor -- the output tensor. + """ + c_name = get_c_name(name) + handle = ffc.flexflow_model_add_arg_top_k(self.handle, input.handle, k, sorted, c_name) + self.add_layer(OpType.ARG_TOPK, name) + return Tensor(handle, owner_op_type=OpType.ARG_TOPK) + + def beam_top_k(self, input, max_beam_size, sorted, name=None): + """Defines the Beam TopK layer. + + :param input: the input Tensor. + :type input: Tensor + + :param max_beam_size: the top max_beam_size indices to select + :type max_beam_size: int + + :param sorted: Whether the entries should be sorted + :type sorted: bool + + :param name: the name of the layer. Default is None. + :type name: string + + :returns: Tensor -- the output tensor. + """ + c_name = get_c_name(name) + handle = ffc.flexflow_model_add_beam_top_k(self.handle, input.handle, max_beam_size, sorted, c_name) + self.add_layer(OpType.BEAM_TOPK, name) + return Tensor(handle, owner_op_type=OpType.BEAM_TOPK) + + def sampling(self, input, top_p, name=None): + """Defines the Sampling layer. + + :param input: the input Tensor. + :type input: Tensor + + :param top_p: The top_p parameter of the sampling + :type top_p: float + + :param name: the name of the layer. Default is None. + :type name: string + + :returns: Tensor -- the output tensor. + """ + c_name = get_c_name(name) + handle = ffc.flexflow_model_add_sampling(self.handle, input.handle, top_p, c_name) + self.add_layer(OpType.SAMPLING, name) + return Tensor(handle, owner_op_type=OpType.SAMPLING) + + def argmax(self, input, beam_search, name=None): + """Defines the Sampling layer. + + :param input: the input Tensor. + :type input: Tensor + + :param beam_search: Whether you need to perform beam search + :type beam_search: bool + + :param name: the name of the layer. Default is None. + :type name: string + + :returns: Tensor -- the output tensor. 
+ """ + c_name = get_c_name(name) + handle = ffc.flexflow_model_add_argmax(self.handle, input.handle, beam_search, c_name) + self.add_layer(OpType.ARGMAX, name) + return Tensor(handle, owner_op_type=OpType.ARGMAX) def reset_metrics(self): """Reset performance metrics. @@ -2192,6 +2846,9 @@ def label_tensor(self): def get_perf_metrics(self): handle = ffc.flexflow_model_get_perf_metrics(self.handle) return PerfMetrics(handle) + + def set_transformer_layer_id(self, id): + ffc.flexflow_model_set_transformer_layer_id(self.handle, id) def create_data_loader(self, batch_tensor, full_array): """Create a SingleDataloader instance. @@ -2218,7 +2875,9 @@ def __create_data_loader_attach(self, batch_tensor, full_array): full_array_shape = full_array.shape num_samples = full_array_shape[0] num_dim = len(full_array_shape) - if (full_array.dtype == "float32"): + if (full_array.dtype == "float16"): + datatype = DataType.DT_HALF + elif (full_array.dtype == "float32"): datatype = DataType.DT_FLOAT elif (full_array.dtype == "int32"): datatype = DataType.DT_INT32 @@ -2245,7 +2904,9 @@ def __create_data_loader_attach(self, batch_tensor, full_array): def __create_data_loader_ptr(self, batch_tensor, full_array): full_array_shape = full_array.shape num_samples = full_array_shape[0] - if (full_array.dtype == "float32"): + if (full_array.dtype == "float16"): + datatype = DataType.DT_HALF + elif (full_array.dtype == "float32"): datatype = DataType.DT_FLOAT elif (full_array.dtype == "int32"): datatype = DataType.DT_INT32 @@ -2278,7 +2939,9 @@ def __get_op_handle(self, shared_op): def get_output_tensor(self, ffmodel, data_type): shape = self.dims - if data_type == DataType.DT_FLOAT: + if data_type == DataType.DT_HALF: + np_array = np.empty(shape, dtype=np.float16) + elif data_type == DataType.DT_FLOAT: np_array = np.empty(shape, dtype=np.float32) elif self.data_type == DataType.DT_INT32: np_array = np.empty(shape, dtype=np.int32) @@ -2299,6 +2962,22 @@ def get_output_tensor(self, ffmodel, data_type): fflogger.debug("get weights raw_ptr: %s, %s, %s, %s" %( str(raw_ptr), str(np_raw_ptr[0]), hex(np_raw_ptr[0]), str(shape))) assert ret_val == True return np_array + + def generate(self, prompt, max_sequence_length): + c_input_text = get_c_name(prompt) + max_num_chars = 36000 + c_output_text = ffi.new("char[]", max_num_chars) + c_output_length_and_tokens = ffi.new("int[]", max_sequence_length + 100) + ffc.flexflow_model_generate(self.handle, c_input_text, max_num_chars, c_output_text, max_sequence_length, c_output_length_and_tokens) + output_length = c_output_length_and_tokens[0] + output_tokens = [] + for i in range(output_length): + output_tokens.append(c_output_length_and_tokens[i+1]) + from flexflow.serve import GenerationResult + return GenerationResult(ffi.string(c_output_text), output_tokens) + + def set_position_offset(self, offset): + ffc.flexflow_model_set_position_offset(self.handle, offset) # ----------------------------------------------------------------------- # SGDOptimizer @@ -2495,7 +3174,9 @@ class RegionNdarray(object): __slots__ = ['__array_interface__'] def __init__(self, shape, data_type, base_ptr, strides, read_only): # See: https://docs.scipy.org/doc/numpy/reference/arrays.interface.html - if (data_type == DataType.DT_FLOAT): + if (data_type == DataType.DT_HALF): + field_type = " 0 else "~/.cache/flexflow" + self.refresh_cache = refresh_cache + self.output_file = output_file + + def __get_ff_model_type(self): + architectures = getattr(self.hf_config, "architectures", []) + ff_arch = None + if 
next(iter(architectures), None) is not None: + ff_arch = self.supported_models.get(architectures[0]) + if ff_arch is None: + print( + f"Huggingface model of type {architectures} is not yet supported by FlexFlow" + ) + sys.exit(1) + return ff_arch + + def download_hf_config(self): + """Save the HuggingFace model configs to a json file. Useful mainly to run the C++ inference code.""" + self.config_dir = os.path.join( + os.path.expanduser(self.cache_path), "configs", self.model_name.lower() + ) + self.config_path = os.path.join(self.config_dir, "config.json") + os.makedirs(self.config_dir, exist_ok=True) + print(f"Creating directory {self.config_dir} (if it doesn't exist)...") + print(f"Saving {self.model_name} configs to file {self.config_path}...") + self.hf_config.to_json_file(self.config_path) + + def __get_revision_hashes(self, model_name: str, weights: bool): + ff_revision = None + ff_revision_file = os.path.join(self.weights_path, "rev_sha.txt") if weights else os.path.join(self.tokenizer_path, "rev_sha.txt") + if os.path.exists(ff_revision_file): + ff_revision = "".join(open(ff_revision_file).read().split()) + + if os.path.exists(model_name) and os.path.isdir(model_name): + # Local model + files = os.listdir(model_name) + state = files + [os.path.getmtime(os.path.join(model_name, f)) for f in files] + latest_revision = hashlib.md5(str(state).encode('utf-8')).hexdigest() + else: + # Remote HuggingFace model + hf_api = HfApi() + latest_revision = hf_api.model_info(self.model_name).sha + return ff_revision, ff_revision_file, latest_revision + + def download_hf_weights_if_needed(self): + """Check in the folder specified by the cache_path whether the LLM's model weights are available and up to date. + If not, or if the refresh_cache parameter is set to True, download new weights. + """ + if self.data_type == DataType.DT_HALF: + torch.set_default_tensor_type(torch.HalfTensor) + elif self.data_type == DataType.DT_FLOAT: + torch.set_default_tensor_type(torch.FloatTensor) + else: + assert False, "Data type not yet supported -- cannot download weights!" + + # Use local cache, or download new version + self.weights_path = os.path.join( + os.path.expanduser(self.cache_path), + "weights", + self.model_name.lower(), + "full-precision" + if self.data_type == DataType.DT_FLOAT + else "half-precision", + ) + if self.refresh_cache: + print( + f"Refreshing weights in cache for model {self.model_name} at path {self.weights_path} ..." + ) + if os.path.exists(self.weights_path): + shutil.rmtree(self.weights_path) + os.makedirs(self.weights_path, exist_ok=True) + print(f"Creating directory {self.weights_path} (if it doesn't exist)...") + + ff_revision, ff_revision_file, latest_revision = self.__get_revision_hashes(self.model_name, weights=True) + + # Download if needed + if ff_revision != latest_revision: + if not os.path.exists(self.model_name) or os.path.isdir(self.model_name): + # Local model + print( + f"'{self.model_name}' model weights not found in cache or outdated. Downloading from huggingface.co ..." + ) + else: + # Remote model + print(f"'{self.model_name}' local model weights were updated! Converting new weights now...") + # Download model from HuggingFace, or load it from the local folder + hf_model = AutoModelForCausalLM.from_pretrained(self.model_name, trust_remote_code=True) + # Print log message to notify user download of model has finished + if not os.path.exists(self.model_name) or os.path.isdir(self.model_name): + print("Done downloading HF weights. 
Converting them now...") + # Convert the model to FlexFlow format + self.model_class.convert_hf_model(hf_model, self.weights_path) + # Save new revision hash to file + with open(ff_revision_file, "w+") as f: + f.write(latest_revision) + print("Done converting the weights...") + else: + print(f"Loading '{self.model_name}' model weights from the cache...") + + def download_hf_tokenizer_if_needed(self): + """Check in the folder specified by the cache_path whether the LLM's tokenizer files are available and up to date. + If not, or if the refresh_cache parameter is set to True, download new tokenizer files. + """ + print("Loading tokenizer...") + + # Use local cache, or download new version + self.tokenizer_path = os.path.join( + os.path.expanduser(self.cache_path), + "tokenizers", + self.model_name.lower(), + ) + if self.refresh_cache: + print( + f"Discarding cached tokenizer files (if they exist) for model {self.model_name}..." + ) + if os.path.exists(self.tokenizer_path): + shutil.rmtree(self.tokenizer_path) + if not os.path.exists(self.tokenizer_path): + print(f"Creating directory {self.tokenizer_path} (if it doesn't exist)...") + os.makedirs(self.tokenizer_path, exist_ok=True) + + # Get local revision SHA, check if it matches latest one on huggingface + ff_revision, ff_revision_file, latest_revision = self.__get_revision_hashes(self.model_name, weights=False) + + if ff_revision != latest_revision: + if not os.path.exists(self.model_name) or os.path.isdir(self.model_name): + # Local model + print(f"'{self.model_name}' tokenizer not found in cache or outdated. Downloading from huggingface.co ...") + else: + # Remote model + print(f"'{self.model_name}' local tokenizer was updated! Saving new tokenizer now...") + # Download tokenizer from HuggingFace, or load it from the local folder + if self.model_type == ModelType.LLAMA: + hf_tokenizer = LlamaTokenizer.from_pretrained( + self.model_name, use_fast=True + ) + else: + hf_tokenizer = AutoTokenizer.from_pretrained(self.model_name) + # Print log message to notify user download of tokenizer has finished + if not os.path.exists(self.model_name) or os.path.isdir(self.model_name): + print("Done downloading tokenizer. 
Saving it now...") + # Save tokenizer + hf_tokenizer.save_pretrained(self.tokenizer_path) + print("Done saving HF tokenizer.") + # Save new revision hash to file + with open(ff_revision_file, "w+") as f: + f.write(latest_revision) + + else: + print(f"Loading '{self.model_name}' tokenizer from the cache...") + + def __load_hf_weights(self): + print("Loading hf weights...") + + self.download_hf_weights_if_needed() + + # Create file data loader, load weights into tensors + if ( + self.model_type == ModelType.FALCON + or self.model_type == ModelType.STARCODER + ): + n_q_heads = self.hf_config.num_attention_heads + if "n_head_kv" in self.hf_config.__dict__: + n_kv_heads = self.hf_config.n_head_kv + else: + n_kv_heads = 1 + else: + n_q_heads = n_kv_heads = self.hf_config.num_attention_heads + self.fileloader = FileDataLoader( + self.weights_path, + n_q_heads, + n_kv_heads, + self.hf_config.hidden_size, + self.hf_config.hidden_size // n_q_heads, + self.ffconfig.tensor_parallelism_degree, + ) + + model_layers_with_weights = self.model.get_layers_with_weights() + self.fileloader.load_weights( + self.model.ffmodel, model_layers_with_weights, self.data_type + ) + + def compile( + self, + generation_config: GenerationConfig = GenerationConfig(), + max_batch_size: int = 1, + max_seq_length: int = 256, + max_tokens_per_batch: int = 64, + model_specific_data_parallelism_degree: int = None, + model_specific_tensor_parallelism_degree: int = None, + model_specific_pipeline_parallelism_degree: int = None, + ssms: list = [], + ): + """Compile the LLM for inference and load the weights into memory + + :param mode: The LLM inference mode (InferenceMode.INC_DECODING_MODE for incremental decoding, InferenceMode.BEAM_SEARCH_MODE for beam search, or InferenceMode.TREE_VERIFY_MODE for token tree verification), defaults to InferenceMode.INC_DECODING_MODE + :type mode: InferenceMode, optional + :param generation_config: The GenerationConfig object with the configurations to use for sampling, defaults to GenerationConfig() + :type generation_config: GenerationConfig, optional + :param max_batch_size: The maximum batch size to allow, defaults to 1 + :type max_batch_size: int, optional + :param max_seq_length: The maximum sequence length to allow per batch, defaults to 256 + :type max_seq_length: int, optional + :param max_tokens_per_batch: The maximum number of tokens (across requests) to allow per batch, defaults to 64 + :type max_tokens_per_batch: int, optional + :param model_specific_data_parallelism_degree: Use this parameter if you want to give the LLM a different data parallelism degree than the one used to initialize the runtime, defaults to None + :type model_specific_data_parallelism_degree: int, optional + :param model_specific_tensor_parallelism_degree: Use this parameter if you want to give the LLM a different tensor parallelism degree than the one used to initialize the runtime, defaults to None + :type model_specific_tensor_parallelism_degree: int, optional + :param model_specific_pipeline_parallelism_degree: Use this parameter if you want to give the LLM a different pipeline parallelism degree than the one used to initialize the runtime, defaults to None + :type model_specific_pipeline_parallelism_degree: int, optional + :param ssms: The SSMs to use when operating in speculative inference mode, defaults to [] + :type ssms: list, optional + """ + self.max_batch_size = max_batch_size + self.max_seq_length = max_seq_length + self.max_tokens_per_batch = max_tokens_per_batch + self.ssms = ssms + 
self.generation_config = GenerationConfig() + self.ffconfig = FFConfig() + if len(ssms) > 0: + assert type(self) == LLM + mode = InferenceMode.TREE_VERIFY_MODE + elif type(self) == SSM: + mode = InferenceMode.BEAM_SEARCH_MODE + else: + assert type(self) == LLM + mode = InferenceMode.INC_DECODING_MODE + + # Apply model-specific parallelism degrees, if needed + if model_specific_data_parallelism_degree: + self.ffconfig.data_parallelism_degree = ( + model_specific_data_parallelism_degree + ) + if model_specific_tensor_parallelism_degree: + self.ffconfig.tensor_parallelism_degree = ( + model_specific_tensor_parallelism_degree + ) + if model_specific_pipeline_parallelism_degree: + self.ffconfig.pipeline_parallelism_degree = ( + model_specific_pipeline_parallelism_degree + ) + + # Instantiate the relevant model + self.model = self.model_class( + mode, + generation_config, + self.ffconfig, + self.hf_config, + self.data_type, + max_batch_size, + max_seq_length, + max_tokens_per_batch, + ) + + # Create inference manager + self.im = InferenceManager() + self.im.compile_model_and_allocate_buffer(self.model.ffmodel) + + # Download the weights and tokenizer from huggingface (if needed) and load them + self.__load_hf_weights() + self.download_hf_tokenizer_if_needed() + + # Create request manager + self.rm = RequestManager() + self.rm.register_tokenizer(self.model_type, self.hf_config.bos_token_id, self.hf_config.eos_token_id, self.tokenizer_path) + self.rm.register_output_filepath(self.output_file) + + self.im.init_operators_inference(self.model.ffmodel) + + for ssm in self.ssms: + self.rm.register_ssm_model(ssm.model.ffmodel) + + def generate(self, prompts: Union[str, List[str]], max_length: int = 128): + """Generate tokens based on the input prompt(s) + + :param prompts: The generation prompt(s) in the form of a string, or list of strings + :type prompts: Union[str, List[str]] + :return: the generation results + :rtype: GenerationResult + """ + if type(prompts) == str: + if len(prompts) == 0: + return None + return self.model.ffmodel.generate(prompts, max_length) + elif type(prompts) == list: + if len(prompts) == 0: + return [] + return [self.model.ffmodel.generate(prompt, max_length) for prompt in prompts] + else: + assert False, "Please pass a non-empty string or list of strings" + + +class SSM(LLM): + """This class creates a SSM (Small-Speculative Model) object based on a model from HuggingFace""" + + def __init__( + self, + model_name: str, + data_type: DataType = DataType.DT_HALF, + cache_path: str = "~/.cache/flexflow", + refresh_cache: bool = False, + output_file: str = "", + ): + """Create the SSM object + + :param model_name: The name of the HuggingFace model to use. E.g. 'decapoda-research/llama-7b-hf' + :type model_name: str + :param data_type: The data type to use for the tensors (e.g. DataType.DT_FLOAT for full precision, or DataType.DT_HALF for half precision), defaults to DataType.DT_HALF + :type data_type: DataType, optional + :param cache_path: Path to the folder (which will be created if it does not yet exist) to use for the FlexFlow weights/tokenizers cache, defaults to "~/.cache/flexflow" + :type tokenizer_path: str, optional + :param refresh_cache: Use this flag to force the refresh of the model's weights/tokenizer cache, defaults to False + :type refresh_cache: bool, optional + :param output_file: Path to the output file. 
If left blank, the output will not be written to file, defaults to "" + :type output_file: str, optional + """ + super().__init__( + model_name, + data_type, + cache_path, + refresh_cache, + output_file, + ) + + def compile( + self, + generation_config: GenerationConfig = GenerationConfig(), + max_batch_size: int = 1, + max_seq_length: int = 256, + max_tokens_per_batch: int = 64, + model_specific_data_parallelism_degree: int = 1, + model_specific_tensor_parallelism_degree: int = 1, + model_specific_pipeline_parallelism_degree: int = 1, + ssms: list = [], + ): + """Compile the SSM for inference and load the weights into memory + + :param mode: The SSM inference mode (InferenceMode.INC_DECODING_MODE for incremental decoding, InferenceMode.BEAM_SEARCH_MODE for beam search, or InferenceMode.TREE_VERIFY_MODE for token tree verification), defaults to InferenceMode.INC_DECODING_MODE + :type mode: InferenceMode, optional + :param generation_config: The GenerationConfig object with the configurations to use for sampling, defaults to GenerationConfig() + :type generation_config: GenerationConfig, optional + :param max_batch_size: The maximum batch size to allow, defaults to 1 + :type max_batch_size: int, optional + :param max_seq_length: The maximum sequence length to allow per batch, defaults to 256 + :type max_seq_length: int, optional + :param max_tokens_per_batch: The maximum number of tokens (across requests) to allow per batch, defaults to 64 + :type max_tokens_per_batch: int, optional + :param model_specific_data_parallelism_degree: Use this parameter if you want to give the SSM a different data parallelism degree than the default one, defaults to 1 + :type model_specific_data_parallelism_degree: int, optional + :param model_specific_tensor_parallelism_degree: Use this parameter if you want to give the SSM a different tensor parallelism degree than the default one, defaults to 1 + :type model_specific_tensor_parallelism_degree: int, optional + :param model_specific_pipeline_parallelism_degree: Use this parameter if you want to give the SSM a different pipeline parallelism degree than the default one, defaults to 1 + :type model_specific_pipeline_parallelism_degree: int, optional + :param ssms: The SSMs to use when operating in speculative inference mode, defaults to [] + :type ssms: list, optional + """ + super().compile( + generation_config, + max_batch_size, + max_seq_length, + max_tokens_per_batch, + model_specific_data_parallelism_degree, + model_specific_tensor_parallelism_degree, + model_specific_pipeline_parallelism_degree, + ssms, + ) diff --git a/python/flexflow/type.py b/python/flexflow/type.py index 0412e9d0cd..5232ddd431 100644 --- a/python/flexflow/type.py +++ b/python/flexflow/type.py @@ -2,142 +2,180 @@ from enum import Enum + class ActiMode(Enum): - AC_MODE_NONE = 10 - AC_MODE_RELU = 11 - AC_MODE_SIGMOID = 12 - AC_MODE_TANH = 13 - AC_MODE_GELU = 14 + AC_MODE_NONE = 10 + AC_MODE_RELU = 11 + AC_MODE_SIGMOID = 12 + AC_MODE_TANH = 13 + AC_MODE_GELU = 14 + class RegularizerMode(Enum): - REG_MODE_NONE = 17 - REG_MODE_L1 = 18 - REG_MODE_L2 = 19 + REG_MODE_NONE = 17 + REG_MODE_L1 = 18 + REG_MODE_L2 = 19 + class AggrMode(Enum): - AGGR_MODE_NONE = 20 - AGGR_MODE_SUM = 21 - AGGR_MODE_AVG = 22 + AGGR_MODE_NONE = 20 + AGGR_MODE_SUM = 21 + AGGR_MODE_AVG = 22 + class PoolType(Enum): - POOL_MAX = 30 - POOL_AVG = 31 + POOL_MAX = 30 + POOL_AVG = 31 + class DataType(Enum): - DT_BOOLEAN = 40 - DT_INT32 = 41 - DT_INT64 = 42 - DT_HALF = 43 - DT_FLOAT = 44 - DT_DOUBLE = 45 - DT_NONE = 49 + 
DT_BOOLEAN = 40 + DT_INT32 = 41 + DT_INT64 = 42 + DT_HALF = 43 + DT_FLOAT = 44 + DT_DOUBLE = 45 + DT_NONE = 49 + class LossType(Enum): - LOSS_CATEGORICAL_CROSSENTROPY = 50 - LOSS_SPARSE_CATEGORICAL_CROSSENTROPY = 51 - LOSS_MEAN_SQUARED_ERROR_AVG_REDUCE = 52 - LOSS_MEAN_SQUARED_ERROR_SUM_REDUCE = 53 - LOSS_IDENTITY = 54 + LOSS_CATEGORICAL_CROSSENTROPY = 50 + LOSS_SPARSE_CATEGORICAL_CROSSENTROPY = 51 + LOSS_MEAN_SQUARED_ERROR_AVG_REDUCE = 52 + LOSS_MEAN_SQUARED_ERROR_SUM_REDUCE = 53 + LOSS_IDENTITY = 54 + class CompMode(Enum): - TRAINING = 70 - INFERENCE = 71 - + TRAINING = 70 + INFERENCE = 71 + + class ParameterSyncType(Enum): - NONE = 80 - PS = 81 - NCCL = 82 - + NONE = 80 + PS = 81 + NCCL = 82 + + class MetricsType(Enum): - METRICS_ACCURACY = 1001 - METRICS_CATEGORICAL_CROSSENTROPY = 1002 - METRICS_SPARSE_CATEGORICAL_CROSSENTROPY = 1004 - METRICS_MEAN_SQUARED_ERROR = 1008 - METRICS_ROOT_MEAN_SQUARED_ERROR = 1016 - METRICS_MEAN_ABSOLUTE_ERROR=1032 + METRICS_ACCURACY = 1001 + METRICS_CATEGORICAL_CROSSENTROPY = 1002 + METRICS_SPARSE_CATEGORICAL_CROSSENTROPY = 1004 + METRICS_MEAN_SQUARED_ERROR = 1008 + METRICS_ROOT_MEAN_SQUARED_ERROR = 1016 + METRICS_MEAN_ABSOLUTE_ERROR = 1032 + + +class InferenceMode(Enum): + INC_DECODING_MODE = 2001 + BEAM_SEARCH_MODE = 2002 + TREE_VERIFY_MODE = 2003 + + +class ModelType(Enum): + UNKNOWN = 3001 + LLAMA = 3002 + LLAMA2 = 3003 + OPT = 3004 + FALCON = 3005 + STARCODER = 3006 + class OpType(Enum): - CONV2D = 2011 - EMBEDDING = 2012 - POOL2D = 2013 - LINEAR = 2014 - SOFTMAX = 2015 - CONCAT = 2016 - FLAT = 2017 - MSELOSS = 2020 - BATCH_NORM = 2021 - RELU = 2022 - SIGMOID = 2023 - TANH = 2024 - ELU = 2025 - DROPOUT = 2026 - BATCH_MATMUL = 2027 - SPLIT = 2028 - RESHAPE = 2029 - TRANSPOSE = 2030 - REVERSE = 2031 - EXP = 2040 - ADD = 2041 - SUBTRACT = 2042 - MULTIPLY = 2043 - DIVIDE = 2044 - POW = 2045 - MEAN = 2046 - RSQRT = 2047 - SIN = 2048 - COS = 2049 - INPUT = 2050 - OUTPUT = 2051 - REDUCE_SUM = 2052 - MAX = 2053 - MIN = 2054 - MULTIHEAD_ATTENTION = 2060 - GETITEM = 2070 - GETATTR = 2080 - EXPAND = 2081 - LAYER_NORM = 2082 - FLOOR_DIVIDE = 2083 - IDENTITY = 2084 - GELU = 2085 - PERMUTE = 2086 - SCALAR_MULTIPLY = 2087 - SCALAR_FLOORDIV = 2088 - SCALAR_ADD = 2089 - SCALAR_SUB = 2090 - SCALAR_TRUEDIV = 2091 - INIT_PARAM = 2092 - FLOAT = 2100 - CONTIGUOUS = 2101 - TO = 2102 - UNSQUEEZE = 2103 - TYPE_AS = 2104 - VIEW = 2105 - GATHER = 2106 - ATTRIBUTE = 2200 + CONV2D = 2011 + EMBEDDING = 2012 + POOL2D = 2013 + LINEAR = 2014 + SOFTMAX = 2015 + CONCAT = 2016 + FLAT = 2017 + MSELOSS = 2020 + BATCH_NORM = 2021 + RELU = 2022 + SIGMOID = 2023 + TANH = 2024 + ELU = 2025 + DROPOUT = 2026 + BATCH_MATMUL = 2027 + SPLIT = 2028 + RESHAPE = 2029 + TRANSPOSE = 2030 + REVERSE = 2031 + EXP = 2040 + ADD = 2041 + SUBTRACT = 2042 + MULTIPLY = 2043 + DIVIDE = 2044 + POW = 2045 + MEAN = 2046 + RSQRT = 2047 + SIN = 2048 + COS = 2049 + INPUT = 2050 + OUTPUT = 2051 + REDUCE_SUM = 2052 + MAX = 2053 + MIN = 2054 + MULTIHEAD_ATTENTION = 2060 + INC_MULTIHEAD_ATTENTION = 2061 + SPEC_INC_MULTIHEAD_SELF_ATTENTION = 2062 + TREE_INC_MULTIHEAD_SELF_ATTENTION = 2063 + SAMPLING = 2065 + ARGMAX = 2066 + GETITEM = 2070 + GETATTR = 2080 + EXPAND = 2081 + LAYER_NORM = 2082 + FLOOR_DIVIDE = 2083 + IDENTITY = 2084 + GELU = 2085 + PERMUTE = 2086 + SCALAR_MULTIPLY = 2087 + SCALAR_FLOORDIV = 2088 + SCALAR_ADD = 2089 + SCALAR_SUB = 2090 + SCALAR_TRUEDIV = 2091 + INIT_PARAM = 2092 + FLOAT = 2100 + CONTIGUOUS = 2101 + TO = 2102 + UNSQUEEZE = 2103 + TYPE_AS = 2104 + VIEW = 2105 + GATHER = 2106 + ATTRIBUTE = 2200 + 
RMS_NORM = 2300 + ARG_TOPK = 2301 + BEAM_TOPK = 2302 + + def enum_to_int(enum, enum_item): - for item in enum: - if (enum_item == item): - return item.value + for item in enum: + if enum_item == item: + return item.value + + print(enum_item) + print(enum) + assert 0, "unknown enum type " + str(enum_item) + " " + str(enum) + return -1 - print(enum_item) - print(enum) - assert 0, "unknown enum type " + str(enum_item) + " " + str(enum) - return -1 def int_to_enum(enum, value): - for item in enum: - if (item.value == value): - return item + for item in enum: + if item.value == value: + return item + + assert 0, "unknown enum value " + str(value) + " " + str(enum) + - assert 0, "unknown enum value " + str(value) + " " + str(enum) - def enum_to_str(enum, enum_item): - name = enum(enum_item).name - return name - + name = enum(enum_item).name + return name + + def str_to_enum(enum, value): - for item in enum: - if (item.name == value): - return item + for item in enum: + if item.name == value: + return item - assert 0, "unknown enum value " + value + " " + str(enum) + assert 0, "unknown enum value " + value + " " + str(enum) diff --git a/python/flexflow_python_build.py b/python/flexflow_python_build.py index 4ca26d8ab3..a9d8e8983e 100755 --- a/python/flexflow_python_build.py +++ b/python/flexflow_python_build.py @@ -29,14 +29,15 @@ sys.exit(1) build_dir = os.path.abspath(build_dir) script_dir = os.path.abspath(os.path.dirname(__file__)) -script_path = os.path.join(build_dir, "flexflow_python") if not os.path.isdir(build_dir): print(f"Folder {build_dir} does not exist") sys.exit(1) if not os.path.isdir(script_dir): print(f"Folder {script_dir} does not exist") sys.exit(1) -script_path = os.path.abspath(script_path) +# Build flexflow_python script +flexflow_python_path = os.path.join(build_dir, "flexflow_python") +flexflow_python_path = os.path.abspath(flexflow_python_path) lines = [ '#! /usr/bin/env bash', f'BUILD_FOLDER="{build_dir}"', @@ -52,10 +53,26 @@ '\tlegion_python "$@"', 'fi' ] - -with open(script_path, "w+") as script_file: +with open(flexflow_python_path, "w+") as flexflow_python_file: for line in lines: - script_file.write(line + "\n") + flexflow_python_file.write(line + "\n") +cur_stat = os.stat(flexflow_python_path) +os.chmod(flexflow_python_path, cur_stat.st_mode | stat.S_IEXEC) -cur_stat = os.stat(script_path) -os.chmod(script_path, cur_stat.st_mode | stat.S_IEXEC) +# Build set_python_envs.sh +python_envs_path = os.path.join(build_dir, "set_python_envs.sh") +python_envs_path = os.path.abspath(python_envs_path) +lines = [ + '#! 
/usr/bin/env bash', + f'BUILD_FOLDER="{build_dir}"', + f'PYTHON_FOLDER="{script_dir}"', + 'PYLIB_PATH="$("$PYTHON_FOLDER"/flexflow/findpylib.py)"', + 'PYLIB_DIR="$(dirname "$PYLIB_PATH")"', + 'export LD_LIBRARY_PATH="$BUILD_FOLDER:$BUILD_FOLDER/deps/legion/lib:$PYLIB_DIR:$LD_LIBRARY_PATH"', + 'export PYTHONPATH="$PYTHON_FOLDER:$BUILD_FOLDER/deps/legion/bindings/python:$PYTHONPATH"', +] +with open(python_envs_path, "w+") as python_envs_file: + for line in lines: + python_envs_file.write(line + "\n") +cur_stat = os.stat(python_envs_path) +os.chmod(python_envs_path, cur_stat.st_mode | stat.S_IEXEC) diff --git a/requirements.txt b/requirements.txt index 4ac0a8a047..1037661337 100644 --- a/requirements.txt +++ b/requirements.txt @@ -7,3 +7,11 @@ pybind11 cmake-build-extension ninja requests +regex +torch>=1.13.1 +torchaudio>=0.13.1 +torchvision>=0.14.1 +onnx +transformers>=4.31.0 +sentencepiece +einops diff --git a/scripts/install_tokenizer.sh b/scripts/install_tokenizer.sh new file mode 100755 index 0000000000..4632b7e818 --- /dev/null +++ b/scripts/install_tokenizer.sh @@ -0,0 +1,9 @@ +#! /usr/bin/env bash +set -x +set -e + +# Cd into directory holding this script +cd "${BASH_SOURCE[0]%/*}" +cd ../deps/tokenizers-cpp/example +cmake -D CMAKE_CXX_FLAGS=-fPIC +make -j diff --git a/src/c/flexflow_c.cc b/src/c/flexflow_c.cc index 22ad739dd9..96ff84c85f 100644 --- a/src/c/flexflow_c.cc +++ b/src/c/flexflow_c.cc @@ -16,6 +16,8 @@ #include "flexflow/flexflow_c.h" #include "flexflow/dataloader.h" #include "flexflow/mapper.h" +#include "flexflow/request_manager.h" +#include "inference/file_loader.h" using namespace Legion; using namespace FlexFlow; @@ -55,6 +57,16 @@ class FFCObjectWrapper { FF_NEW_OPAQUE_WRAPPER(flexflow_net_config_t, NetConfig *); FF_NEW_OPAQUE_WRAPPER(flexflow_dlrm_config_t, DLRMConfig *); FF_NEW_OPAQUE_WRAPPER(flexflow_single_dataloader_t, SingleDataLoader *); + // inference + FF_NEW_OPAQUE_WRAPPER(flexflow_batch_config_t, BatchConfig *); + FF_NEW_OPAQUE_WRAPPER(flexflow_tree_verify_batch_config_t, + TreeVerifyBatchConfig *); + FF_NEW_OPAQUE_WRAPPER(flexflow_beam_search_batch_config_t, + BeamSearchBatchConfig *); + FF_NEW_OPAQUE_WRAPPER(flexflow_inference_manager_t, InferenceManager *); + FF_NEW_OPAQUE_WRAPPER(flexflow_request_manager_t, RequestManager *); + FF_NEW_OPAQUE_WRAPPER(flexflow_file_data_loader_t, FileDataLoader *); + FF_NEW_OPAQUE_WRAPPER(flexflow_generation_result_t, GenerationResult *); }; Logger ffc_log("flexflow_c"); @@ -121,18 +133,56 @@ bool flexflow_config_get_enable_control_replication(flexflow_config_t handle_) { return handle->enable_control_replication; } +int flexflow_config_get_data_parallelism_degree(flexflow_config_t handle_) { + FFConfig *handle = FFCObjectWrapper::unwrap(handle_); + return handle->data_parallelism_degree; +} + +int flexflow_config_get_tensor_parallelism_degree(flexflow_config_t handle_) { + FFConfig *handle = FFCObjectWrapper::unwrap(handle_); + return handle->tensor_parallelism_degree; +} + +int flexflow_config_get_pipeline_parallelism_degree(flexflow_config_t handle_) { + FFConfig *handle = FFCObjectWrapper::unwrap(handle_); + return handle->pipeline_parallelism_degree; +} + +void flexflow_config_set_data_parallelism_degree(flexflow_config_t handle_, + int value) { + FFConfig *handle = FFCObjectWrapper::unwrap(handle_); + handle->data_parallelism_degree = value; +} + +void flexflow_config_set_tensor_parallelism_degree(flexflow_config_t handle_, + int value) { + FFConfig *handle = FFCObjectWrapper::unwrap(handle_); + 
handle->tensor_parallelism_degree = value; +} + +void flexflow_config_set_pipeline_parallelism_degree(flexflow_config_t handle_, + int value) { + FFConfig *handle = FFCObjectWrapper::unwrap(handle_); + handle->pipeline_parallelism_degree = value; +} + int flexflow_config_get_python_data_loader_type(flexflow_config_t handle_) { FFConfig *handle = FFCObjectWrapper::unwrap(handle_); return handle->python_data_loader_type; } +bool flexflow_config_get_offload(flexflow_config_t handle_) { + FFConfig *handle = FFCObjectWrapper::unwrap(handle_); + return handle->cpu_offload; +} // ----------------------------------------------------------------------- // FFModel // ----------------------------------------------------------------------- -flexflow_model_t flexflow_model_create(flexflow_config_t config_) { +flexflow_model_t flexflow_model_create(flexflow_config_t config_, + bool cpu_offload) { FFConfig *config = FFCObjectWrapper::unwrap(config_); - FFModel *model = new FFModel(*config); + FFModel *model = new FFModel(*config, cpu_offload); DEBUG_PRINT("[FFModel] new %p", model); return FFCObjectWrapper::wrap(model); } @@ -456,9 +506,10 @@ flexflow_tensor_t flexflow_tensor_t flexflow_model_add_embedding(flexflow_model_t handle_, const flexflow_tensor_t input_, - int num_entires, + int num_entries, int out_dim, enum AggrMode aggr, + DataType dtype, flexflow_op_t shared_op_, flexflow_initializer_t kernel_initializer_, char const *name) { @@ -470,20 +521,21 @@ flexflow_tensor_t // TODO: update the flexflow_c and Python API to support other data types // Currently we assume it's float Tensor tensor = handle->embedding(input, - num_entires, + num_entries, out_dim, aggr, - DT_FLOAT, + dtype, shared_op, kernel_initializer, name); - DEBUG_PRINT("[Embedding] new Tensor %p, input %p, num_entires %d, out_dim " - "%d, aggr %d, shared_op %p, kernel_init %p, name %s", + DEBUG_PRINT("[Embedding] new Tensor %p, input %p, num_entries %d, out_dim " + "%d, aggr %d, dtype %d, shared_op %p, kernel_init %p, name %s", tensor, input, - num_entires, + num_entries, out_dim, aggr, + dtype, shared_op, kernel_initializer, name); @@ -568,8 +620,8 @@ flexflow_tensor_t flexflow_model_add_layer_norm(flexflow_model_t handle_, for (int i = 0; i < n; i++) { axes_vec.push_back(axes[i]); } - Tensor tensor = - handle->layer_norm(input, axes_vec, elementwise_affine, eps, name); + Tensor tensor = handle->layer_norm( + input, axes_vec, elementwise_affine, eps, input->data_type, name); DEBUG_PRINT("[LayerNorm] new Tensor %p, input %p, elementwise_affine %d, eps " "%f, name %s", tensor, @@ -737,7 +789,7 @@ flexflow_tensor_t flexflow_model_add_softmax(flexflow_model_t handle_, char const *name) { FFModel *handle = FFCObjectWrapper::unwrap(handle_); Tensor input = FFCObjectWrapper::unwrap(input_); - Tensor tensor = handle->softmax(input, dim, name); + Tensor tensor = handle->softmax(input, dim, input->data_type, name); DEBUG_PRINT( "[Softmax] new Tensor %p, input %p, name %s", tensor, input, name); return FFCObjectWrapper::wrap(tensor); @@ -979,6 +1031,7 @@ flexflow_tensor_t flexflow_model_add_multihead_attention( bias, add_bias_kv, add_zero_attn, + query->data_type, kernel_initializer, name); DEBUG_PRINT("[MultiHeadAttention] new Tensor %p, query %p, key %p, value %p, " @@ -1001,6 +1054,315 @@ flexflow_tensor_t flexflow_model_add_multihead_attention( return FFCObjectWrapper::wrap(tensor); } +flexflow_tensor_t flexflow_model_add_inc_multihead_self_attention( + flexflow_model_t handle_, + const flexflow_tensor_t input_, + int embed_dim, + int 
num_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + enum DataType data_type, + flexflow_initializer_t kernel_initializer_, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name) { + FFModel *handle = FFCObjectWrapper::unwrap(handle_); + Tensor input = FFCObjectWrapper::unwrap(input_); + Initializer *kernel_initializer = + FFCObjectWrapper::unwrap(kernel_initializer_); + Tensor tensor = handle->inc_multihead_self_attention(input, + embed_dim, + num_heads, + kdim, + vdim, + dropout, + bias, + add_bias_kv, + add_zero_attn, + data_type, + kernel_initializer, + apply_rotary_embedding, + scaling_query, + scaling_factor, + qk_prod_scaling, + name); + return FFCObjectWrapper::wrap(tensor); +} + +flexflow_tensor_t flexflow_model_add_spec_inc_multihead_self_attention( + flexflow_model_t handle_, + const flexflow_tensor_t input_, + int embed_dim, + int num_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + enum DataType data_type, + flexflow_initializer_t kernel_initializer_, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name) { + FFModel *handle = FFCObjectWrapper::unwrap(handle_); + Tensor input = FFCObjectWrapper::unwrap(input_); + Initializer *kernel_initializer = + FFCObjectWrapper::unwrap(kernel_initializer_); + Tensor tensor = + handle->spec_inc_multihead_self_attention(input, + embed_dim, + num_heads, + kdim, + vdim, + dropout, + bias, + add_bias_kv, + add_zero_attn, + data_type, + kernel_initializer, + apply_rotary_embedding, + scaling_query, + scaling_factor, + qk_prod_scaling, + name); + return FFCObjectWrapper::wrap(tensor); +} + +flexflow_tensor_t flexflow_model_add_inc_multihead_self_attention_verify( + flexflow_model_t handle_, + const flexflow_tensor_t input_, + int embed_dim, + int num_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + enum DataType data_type, + flexflow_initializer_t kernel_initializer_, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name) { + FFModel *handle = FFCObjectWrapper::unwrap(handle_); + Tensor input = FFCObjectWrapper::unwrap(input_); + Initializer *kernel_initializer = + FFCObjectWrapper::unwrap(kernel_initializer_); + Tensor tensor = + handle->inc_multihead_self_attention_verify(input, + embed_dim, + num_heads, + kdim, + vdim, + dropout, + bias, + add_bias_kv, + add_zero_attn, + data_type, + kernel_initializer, + apply_rotary_embedding, + scaling_query, + scaling_factor, + qk_prod_scaling, + name); + return FFCObjectWrapper::wrap(tensor); +} + +flexflow_tensor_t flexflow_model_add_inc_multiquery_self_attention( + flexflow_model_t handle_, + const flexflow_tensor_t input_, + int embed_dim, + int num_q_heads, + int num_kv_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + enum DataType data_type, + flexflow_initializer_t kernel_initializer_, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name) { + FFModel *handle = FFCObjectWrapper::unwrap(handle_); + Tensor input = FFCObjectWrapper::unwrap(input_); + Initializer *kernel_initializer = + FFCObjectWrapper::unwrap(kernel_initializer_); + Tensor tensor = handle->inc_multiquery_self_attention(input, + embed_dim, + num_q_heads, + 
num_kv_heads, + kdim, + vdim, + dropout, + bias, + add_bias_kv, + add_zero_attn, + data_type, + kernel_initializer, + apply_rotary_embedding, + scaling_query, + scaling_factor, + qk_prod_scaling, + name); + return FFCObjectWrapper::wrap(tensor); +} + +flexflow_tensor_t flexflow_model_add_spec_inc_multiquery_self_attention( + flexflow_model_t handle_, + const flexflow_tensor_t input_, + int embed_dim, + int num_q_heads, + int num_kv_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + enum DataType data_type, + flexflow_initializer_t kernel_initializer_, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name) { + FFModel *handle = FFCObjectWrapper::unwrap(handle_); + Tensor input = FFCObjectWrapper::unwrap(input_); + Initializer *kernel_initializer = + FFCObjectWrapper::unwrap(kernel_initializer_); + Tensor tensor = + handle->spec_inc_multiquery_self_attention(input, + embed_dim, + num_q_heads, + num_kv_heads, + kdim, + vdim, + dropout, + bias, + add_bias_kv, + add_zero_attn, + data_type, + kernel_initializer, + apply_rotary_embedding, + scaling_query, + scaling_factor, + qk_prod_scaling, + name); + return FFCObjectWrapper::wrap(tensor); +} + +flexflow_tensor_t flexflow_model_add_inc_multiquery_self_attention_verify( + flexflow_model_t handle_, + const flexflow_tensor_t input_, + int embed_dim, + int num_q_heads, + int num_kv_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + enum DataType data_type, + flexflow_initializer_t kernel_initializer_, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name) { + FFModel *handle = FFCObjectWrapper::unwrap(handle_); + Tensor input = FFCObjectWrapper::unwrap(input_); + Initializer *kernel_initializer = + FFCObjectWrapper::unwrap(kernel_initializer_); + Tensor tensor = + handle->inc_multiquery_self_attention_verify(input, + embed_dim, + num_q_heads, + num_kv_heads, + kdim, + vdim, + dropout, + bias, + add_bias_kv, + add_zero_attn, + data_type, + kernel_initializer, + apply_rotary_embedding, + scaling_query, + scaling_factor, + qk_prod_scaling, + name); + return FFCObjectWrapper::wrap(tensor); +} + +flexflow_tensor_t flexflow_model_add_rms_norm(flexflow_model_t handle_, + const flexflow_tensor_t input_, + float eps, + int dim, + char const *name) { + FFModel *handle = FFCObjectWrapper::unwrap(handle_); + Tensor input = FFCObjectWrapper::unwrap(input_); + Tensor tensor = handle->rms_norm(input, eps, dim, input->data_type, name); + return FFCObjectWrapper::wrap(tensor); +} + +flexflow_tensor_t flexflow_model_add_arg_top_k(flexflow_model_t handle_, + const flexflow_tensor_t input_, + int k, + bool sorted, + char const *name) { + FFModel *handle = FFCObjectWrapper::unwrap(handle_); + Tensor input = FFCObjectWrapper::unwrap(input_); + Tensor tensor = handle->arg_top_k(input, k, sorted, name); + return FFCObjectWrapper::wrap(tensor); +} + +flexflow_tensor_t flexflow_model_add_beam_top_k(flexflow_model_t handle_, + const flexflow_tensor_t input_, + int max_beam_size, + bool sorted, + char const *name) { + FFModel *handle = FFCObjectWrapper::unwrap(handle_); + Tensor input = FFCObjectWrapper::unwrap(input_); + Tensor tensor = handle->beam_top_k(input, max_beam_size, sorted, name); + return FFCObjectWrapper::wrap(tensor); +} + +flexflow_tensor_t flexflow_model_add_sampling(flexflow_model_t handle_, + const flexflow_tensor_t 
input_, + float top_p, + char const *name) { + FFModel *handle = FFCObjectWrapper::unwrap(handle_); + Tensor input = FFCObjectWrapper::unwrap(input_); + Tensor tensor = handle->sampling(input, top_p, name); + return FFCObjectWrapper::wrap(tensor); +} + +flexflow_tensor_t flexflow_model_add_argmax(flexflow_model_t handle_, + const flexflow_tensor_t input_, + bool beam_search, + char const *name) { + FFModel *handle = FFCObjectWrapper::unwrap(handle_); + Tensor input = FFCObjectWrapper::unwrap(input_); + Tensor tensor = handle->argmax(input, beam_search, name); + return FFCObjectWrapper::wrap(tensor); +} + void flexflow_model_set_sgd_optimizer(flexflow_model_t handle_, flexflow_sgd_optimizer_t optimizer_) { FFModel *handle = FFCObjectWrapper::unwrap(handle_); @@ -1049,6 +1411,38 @@ flexflow_perf_metrics_t return FFCObjectWrapper::wrap(perf_metrics); } +void flexflow_model_set_transformer_layer_id(flexflow_model_t handle_, int id) { + FFModel *handle = FFCObjectWrapper::unwrap(handle_); + handle->set_transformer_layer_id(id); +} + +flexflow_generation_result_t + flexflow_model_generate(flexflow_model_t handle_, + char const *input_text, + int max_num_chars, + char *output_text, + int max_seq_length, + int *output_length_and_tokens) { + FFModel *handle = FFCObjectWrapper::unwrap(handle_); + std::string const text_str(input_text); + GenerationResult result = handle->generate(text_str, max_seq_length); + DEBUG_PRINT("[Model] generate %p %s %i", handle, text, max_seq_length); + assert(result.output_tokens.size() <= max_seq_length); + output_length_and_tokens[0] = result.output_tokens.size(); + std::copy(result.output_tokens.begin(), + result.output_tokens.end(), + output_length_and_tokens + 1); + std::memcpy( + output_text, result.output_text.c_str(), result.output_text.length()); + return FFCObjectWrapper::wrap(&result); +} + +void flexflow_model_set_position_offset(flexflow_model_t handle_, + int const offset) { + FFModel *handle = FFCObjectWrapper::unwrap(handle_); + handle->set_position_offset(offset); +} + // ----------------------------------------------------------------------- // Tensor // ----------------------------------------------------------------------- @@ -1928,3 +2322,177 @@ void flexflow_perform_registration(void) { Runtime::perform_registration_callback(FFMapper::update_mappers, true /*global*/); } + +// ----------------------------------------------------------------------- +// BatchConfig +// ----------------------------------------------------------------------- + +flexflow_batch_config_t flexflow_batch_config_create(void) { + BatchConfig *config = new BatchConfig(); + DEBUG_PRINT("[BatchConfig] new %p", config); + return FFCObjectWrapper::wrap(config); +} + +void flexflow_batch_config_destroy(flexflow_batch_config_t handle_) { + BatchConfig *handle = FFCObjectWrapper::unwrap(handle_); + DEBUG_PRINT("[BatchConfig] delete %p", handle); + delete handle; +} + +// ----------------------------------------------------------------------- +// TreeVerifyBatchConfig +// ----------------------------------------------------------------------- + +flexflow_tree_verify_batch_config_t + flexflow_tree_verify_batch_config_create(void) { + TreeVerifyBatchConfig *config = new TreeVerifyBatchConfig(); + DEBUG_PRINT("[TreeVerifyBatchConfig] new %p", config); + return FFCObjectWrapper::wrap(config); +} + +void flexflow_tree_verify_batch_config_destroy( + flexflow_tree_verify_batch_config_t handle_) { + TreeVerifyBatchConfig *handle = FFCObjectWrapper::unwrap(handle_); + 
DEBUG_PRINT("[TreeVerifyBatchConfig] delete %p", handle); + delete handle; +} + +// ----------------------------------------------------------------------- +// BeamSearchBatchConfig +// ----------------------------------------------------------------------- + +flexflow_beam_search_batch_config_t + flexflow_beam_search_batch_config_create(void) { + BeamSearchBatchConfig *config = new BeamSearchBatchConfig(); + DEBUG_PRINT("[BeamSearchBatchConfig] new %p", config); + return FFCObjectWrapper::wrap(config); +} + +void flexflow_beam_search_batch_config_destroy( + flexflow_beam_search_batch_config_t handle_) { + BeamSearchBatchConfig *handle = FFCObjectWrapper::unwrap(handle_); + DEBUG_PRINT("[BeamSearchBatchConfig] delete %p", handle); + delete handle; +} + +// ----------------------------------------------------------------------- +// RequestManager +// ----------------------------------------------------------------------- + +flexflow_request_manager_t flexflow_request_manager_get_request_manager(void) { + RequestManager *rm = RequestManager::get_request_manager(); + DEBUG_PRINT("[RequestManager] get %p", rm); + return FFCObjectWrapper::wrap(rm); +} + +void flexflow_request_manager_register_tokenizer( + flexflow_request_manager_t handle_, + enum ModelType model_type, + int bos_token_id, + int eos_token_id, + char const *tokenizer_filepath) { + RequestManager *handle = FFCObjectWrapper::unwrap(handle_); + assert(tokenizer_filepath != nullptr && + "Cannot convert nullptr char * to std::string"); + std::string const tokenizer_filepath_str(tokenizer_filepath); + handle->register_tokenizer( + model_type, bos_token_id, eos_token_id, tokenizer_filepath_str); + DEBUG_PRINT( + "[RequestManager] register tokenizer %p %s", handle, tokenizer_filepath); +} + +void flexflow_request_manager_register_output_filepath( + flexflow_request_manager_t handle_, char const *output_filepath) { + RequestManager *handle = FFCObjectWrapper::unwrap(handle_); + assert(output_filepath != nullptr && + "Cannot convert nullptr char * to std::string"); + std::string const output_filepath_str(output_filepath); + handle->register_output_filepath(output_filepath_str); + DEBUG_PRINT("[RequestManager] register output filepath %p %s", + handle, + output_filepath); +} + +int flexflow_request_manager_register_ssm_model( + flexflow_request_manager_t handle_, flexflow_model_t model_handle_) { + RequestManager *handle = FFCObjectWrapper::unwrap(handle_); + FFModel *model_handle = FFCObjectWrapper::unwrap(model_handle_); + DEBUG_PRINT("[RequestManager] register ssm %p %p", handle, model_handle); + return handle->register_ssm_model(model_handle); +} + +// ----------------------------------------------------------------------- +// InferenceManager +// ----------------------------------------------------------------------- + +flexflow_inference_manager_t + flexflow_inference_manager_get_inference_manager() { + InferenceManager *im = InferenceManager::get_inference_manager(); + DEBUG_PRINT("[InferenceManager] get %p", im); + return FFCObjectWrapper::wrap(im); +} + +void flexflow_inference_manager_compile_model_and_allocate_buffer( + flexflow_inference_manager_t handle_, flexflow_model_t model_handle) { + InferenceManager *handle = FFCObjectWrapper::unwrap(handle_); + FFModel *model = FFCObjectWrapper::unwrap(model_handle); + DEBUG_PRINT("[InferenceManager] compile_model_and_allocate_buffer %p", + handle); + handle->compile_model_and_allocate_buffer(model); +} + +void flexflow_inference_manager_init_operators_inference( + 
flexflow_inference_manager_t handle_, flexflow_model_t model_handle) { + InferenceManager *handle = FFCObjectWrapper::unwrap(handle_); + FFModel *model = FFCObjectWrapper::unwrap(model_handle); + DEBUG_PRINT("[InferenceManager] init_operators_inference %p", handle); + handle->init_operators_inference(model); +} + +// ----------------------------------------------------------------------- +// FileDataLoader +// ----------------------------------------------------------------------- + +flexflow_file_data_loader_t + flexflow_file_data_loader_create(char const *weight_file_path, + int num_q_heads, + int num_kv_heads, + int hidden_dim, + int qkv_inner_dim, + int tensor_parallelism_degree) { + assert(weight_file_path != nullptr && + "Cannot convert nullptr char * to std::string"); + std::string const weight_file_path_str(weight_file_path); + FileDataLoader *handle = new FileDataLoader("", + weight_file_path_str, + num_q_heads, + num_kv_heads, + hidden_dim, + qkv_inner_dim, + tensor_parallelism_degree); + DEBUG_PRINT("[FileDataLoader] new %p", handle); + return FFCObjectWrapper::wrap(handle); +} + +void flexflow_file_data_loader_destroy(flexflow_file_data_loader_t handle_) { + FileDataLoader *handle = FFCObjectWrapper::unwrap(handle_); + DEBUG_PRINT("[FileDataLoader] delete %p", handle); + delete handle; +} + +void flexflow_file_data_loader_load_weights(flexflow_file_data_loader_t handle_, + flexflow_model_t model_handle_, + int num_layers, + char const **layer_names, + flexflow_op_t *layers, + bool use_full_precision) { + FileDataLoader *handle = FFCObjectWrapper::unwrap(handle_); + FFModel *model = FFCObjectWrapper::unwrap(model_handle_); + std::unordered_map weights_layers; + for (int i = 0; i < num_layers; i++) { + std::string const layer_name(layer_names[i]); + Layer *layer_ptr = FFCObjectWrapper::unwrap(layers[i]); + weights_layers.emplace(layer_name, layer_ptr); + } + handle->load_weights(model, weights_layers, use_full_precision); +} diff --git a/src/mapper/mapper.cc b/src/mapper/mapper.cc index 643435f207..3d08eb0bcc 100644 --- a/src/mapper/mapper.cc +++ b/src/mapper/mapper.cc @@ -283,6 +283,13 @@ void FFMapper::select_task_options(const MapperContext ctx, output.initial_proc = all_cpus[0]; return; } + if ((task.task_id == RM_PREPARE_NEXT_BATCH_TASK_ID) || + (task.task_id == RM_PREPARE_NEXT_BATCH_BEAM_TASK_ID) || + (task.task_id == RM_PREPARE_NEXT_BATCH_INIT_TASK_ID) || + (task.task_id == RM_PREPARE_NEXT_BATCH_VERIFY_TASK_ID)) { + output.initial_proc = all_cpus[0]; + return; + } if (task.task_id == TOP_LEVEL_TASK_ID) { output.initial_proc = all_cpus[0]; // control replicate top level task @@ -349,6 +356,11 @@ void FFMapper::select_task_options(const MapperContext ctx, } } + if (task.task_id == TENSOR_EQUAL_TASK_ID) { + output.initial_proc = all_cpus[0]; + return; + } + // Assert that all single tasks should be handled and returned before // So task must be an indextask if (!task.is_index_space) { diff --git a/src/ops/aggregate.cc b/src/ops/aggregate.cc index 0ad9d91d62..c7217bb700 100644 --- a/src/ops/aggregate.cc +++ b/src/ops/aggregate.cc @@ -166,6 +166,47 @@ Aggregate::Aggregate(FFModel &model, char const *name) : Aggregate(model, inputs.data(), params.n, params.lambda_bal, name) {} +using PCG::Node; +Node Aggregate::deserialize(FFModel &ff, + Legion::Deserializer &dez, + std::vector const &inputs, + int num_inputs) { + int n; + float lambda_bal; + dez.deserialize(n); + dez.deserialize(lambda_bal); + assert(num_inputs == n + 4); + AggregateParams params; + params.n = n; + 
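  // Note (expository, inferred from the launcher code further below): the
  // `num_inputs == n + 4` check above encodes the Aggregate operator's input
  // layout -- gate_preds, gate_assign, two training-only inputs (assumed to be
  // the true assignments and full gate gradients), then the n expert
  // predictions -- which is why inference() reads exp_preds from
  // batch_inputs[i + 4].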
params.lambda_bal = lambda_bal; + return ff.get_or_create_node(inputs, params); +} + +void Aggregate::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(AGGREGATE_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(Aggregate)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + void Aggregate::init(FFModel const &ff) { assert(check_output_input_weight_same_parallel_is()); parallel_is = outputs[0]->parallel_is; @@ -204,7 +245,7 @@ void Aggregate::forward(FFModel const &ff) { set_argumentmap_for_forward(ff, argmap); IndexLauncher launcher(AGGREGATE_FWD_TASK_ID, parallel_is, - TaskArgument(this, sizeof(Aggregate)), + TaskArgument(nullptr, 0), argmap, Predicate::TRUE_PRED, false /*must*/, @@ -243,14 +284,68 @@ void Aggregate::forward(FFModel const &ff) { runtime->execute_index_space(ctx, launcher); } +FutureMap Aggregate::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + /* std::cout << "Aggregate op machine_view: " << *(MachineView const *)mv + << std::endl; */ + IndexLauncher launcher(AGGREGATE_FWD_TASK_ID, + parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + // gate_preds + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + // gate_assign + launcher.add_region_requirement(RegionRequirement(batch_inputs[1]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + batch_inputs[1]->region)); + launcher.add_field(1, FID_DATA); + // exp_preds + for (int i = 0; i < n; i++) { + launcher.add_region_requirement( + RegionRequirement(batch_inputs[i + 4]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + batch_inputs[i + 4]->region)); + launcher.add_field(i + 2, FID_DATA); + } + // output + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(n + 2, FID_DATA); + return runtime->execute_index_space(ctx, launcher); +} + void Aggregate::forward_task(Task const *task, std::vector const ®ions, Context ctx, Runtime *runtime) { - int n = ((Aggregate *)task->args)->n; - - assert((int)regions.size() == n + 3); - assert((int)task->regions.size() == n + 3); + assert(regions.size() == task->regions.size()); + int n = regions.size() - 3; AggregateMeta const *m = *((AggregateMeta **)task->local_args); diff --git a/src/ops/aggregate_spec.cc b/src/ops/aggregate_spec.cc index 749d071310..5190983148 100644 --- a/src/ops/aggregate_spec.cc +++ b/src/ops/aggregate_spec.cc @@ -155,6 +155,32 @@ AggregateSpec::AggregateSpec(FFModel &model, numWeights = 0; } +void AggregateSpec::init_inference( + FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(AGG_SPEC_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(AggregateSpec)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + void AggregateSpec::init(FFModel const &ff) { assert(check_output_input_weight_same_parallel_is()); parallel_is = outputs[0]->parallel_is; @@ -193,7 +219,7 @@ void AggregateSpec::forward(FFModel const &ff) { set_argumentmap_for_forward(ff, argmap); IndexLauncher launcher(AGG_SPEC_FWD_TASK_ID, parallel_is, - TaskArgument(this, sizeof(AggregateSpec)), + TaskArgument(NULL, 0), argmap, Predicate::TRUE_PRED, false /*must*/, @@ -232,13 +258,70 @@ void AggregateSpec::forward(FFModel const &ff) { runtime->execute_index_space(ctx, launcher); } +FutureMap + AggregateSpec::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + /* std::cout << "AggregateSpec op machine_view: " << *(MachineView const *)mv + << std::endl; */ + IndexLauncher launcher(AGG_SPEC_FWD_TASK_ID, + parallel_is, + TaskArgument(NULL, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + // gate_preds + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + // gate_assign + launcher.add_region_requirement(RegionRequirement(batch_inputs[1]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + batch_inputs[1]->region)); + launcher.add_field(1, FID_DATA); + // exp_preds + for (int i = 0; i < n; i++) { + launcher.add_region_requirement( + RegionRequirement(batch_inputs[i + 4]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + batch_inputs[i + 4]->region)); + launcher.add_field(i + 2, FID_DATA); + } + // output + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(n + 2, FID_DATA); + return runtime->execute_index_space(ctx, launcher); +} + void AggregateSpec::forward_task(Task const *task, std::vector const ®ions, Context ctx, Runtime *runtime) { - int n = ((AggregateSpec *)task->args)->n; + assert(regions.size() == task->regions.size()); + int n = regions.size() - 3; - assert((int)regions.size() == n + 3); assert((int)task->regions.size() == n + 3); AggregateSpecMeta const *m = *((AggregateSpecMeta **)task->local_args); diff --git a/src/ops/arg_topk.cc b/src/ops/arg_topk.cc new file mode 100644 index 0000000000..b877a9f96d --- /dev/null +++ b/src/ops/arg_topk.cc @@ -0,0 +1,383 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "flexflow/ops/arg_topk.h" +#include "flexflow/model.h" +#include "flexflow/utils/hash_utils.h" +#include "legion/legion_utilities.h" +#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) +#include "flexflow/utils/cuda_helper.h" +#else +#include "flexflow/utils/hip_helper.h" +#endif + +namespace FlexFlow { +// declare Legion names +using Legion::ArgumentMap; +using Legion::Context; +using Legion::coord_t; +using Legion::Domain; +using Legion::Future; +using Legion::FutureMap; +using Legion::IndexLauncher; +using Legion::InlineLauncher; +using Legion::Machine; +using Legion::Memory; +using Legion::PhysicalRegion; +using Legion::Predicate; +using Legion::Rect; +using Legion::RegionRequirement; +using Legion::Runtime; +using Legion::Task; +using Legion::TaskArgument; +using Legion::TaskLauncher; +using PCG::Node; + +// For an input tensor, computes the top k entries in each row +// (resp. vector along the last dimension). Thus, +// values.shape = indices.shape = input.shape[:-1] + [k] +Tensor FFModel::arg_top_k(const Tensor input, + int k, + bool sorted, + char const *name) { + Layer *li = new Layer(this, + OP_ARG_TOPK, + input->data_type, + name, + 1 /*inputs*/, + 0 /*weights*/, + 1 /*outputs*/, + input); + { + int numdims = input->num_dims; + int dims[MAX_TENSOR_DIM]; + for (int i = 0; i < numdims; i++) { + dims[i] = input->dims[i]; + } + dims[0] = k; + // li->outputs[0] = create_tensor_legion_ordering( + // numdims, dims, input->data_type, li, 0, true /*create_grad*/); + li->outputs[0] = create_tensor_legion_ordering( + numdims, dims, DT_INT32, li, 0, false /*create_grad*/); + } + li->add_int_property("k", k); + li->add_int_property("sorted", sorted); + layers.push_back(li); + // outputs[0] = li->outputs[0]; + // outputs[1] = li->outputs[1]; + return li->outputs[0]; +} + +Op *ArgTopK::create_operator_from_layer( + FFModel &model, + Layer const *layer, + std::vector const &inputs) { + long long value; + layer->get_int_property("k", value); + int k = value; + layer->get_int_property("sorted", value); + bool sorted = (bool)value; + return new ArgTopK( + model, layer->layer_guid, inputs[0], k, sorted, layer->name); +} + +ArgTopKParams ArgTopK::get_params() const { + ArgTopKParams params; + params.k = this->k; + params.sorted = this->sorted; + return params; +} + +bool ArgTopKParams::is_valid(ParallelTensorShape const &) const { + // topk is always valid + return true; +} + +bool operator==(ArgTopKParams const &lhs, ArgTopKParams const &rhs) { + return lhs.k == rhs.k && lhs.sorted == rhs.sorted; +} + +ArgTopK::ArgTopK(FFModel &model, + LayerID const &_layer_guid, + const ParallelTensor _input, + int _k, + bool _sorted, + char const *name) + : Op(model, + OP_ARG_TOPK, + _input->data_type, + name, + 1 /*inputs*/, + 0 /*weights*/, + 1 /*outputs*/, + _input), + k(_k), sorted(_sorted) { + // overwrite layer_guid + layer_guid = _layer_guid; + int numdim = inputs[0]->num_dims; + ParallelDim dims[MAX_TENSOR_DIM]; + for (int i = 0; i < numdim; i++) { + dims[i] = inputs[0]->dims[i]; + } + dims[0].size = k; + assert(inputs[0]->dims[0].degree == 1); + 
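  // The score dimension cannot be sharded: top-k must see every entry of a row
  // on a single device, so the last dim is required to have degree 1 (checked
  // above) and no parallel index (checked below).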
assert(inputs[0]->dims[0].parallel_idx == -1); + // outputs[0] = model.create_parallel_tensor_legion_ordering( + // numdim, dims, _input->data_type, this, 0 /*owner_idx*/); + outputs[0] = model.create_parallel_tensor_legion_ordering( + numdim, dims, DT_INT32, this, 0 /*owner_idx*/); +} + +ArgTopK::ArgTopK(FFModel &model, + LayerID const &layer_guid, + ArgTopK const &other, + const ParallelTensor input) + : ArgTopK(model, layer_guid, input, other.k, other.sorted, other.name) {} + +ArgTopK::ArgTopK(FFModel &model, + ArgTopKParams const ¶ms, + const ParallelTensor input, + char const *name) + : ArgTopK(model, params.layer_guid, input, params.k, params.sorted, name) {} + +void ArgTopK::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(ARG_TOPK_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(ArgTopK)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + // launcher.add_region_requirement(RegionRequirement(batch_outputs[1]->part, + // 0 /*projection id*/, + // WRITE_ONLY, + // EXCLUSIVE, + // batch_outputs[1]->region)); + // launcher.add_field(2, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + +void ArgTopK::init(FFModel const &ff) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + set_argumentmap_for_init(ff, argmap); + IndexLauncher launcher(ARG_TOPK_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(ArgTopK)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + outputs[0]->machine_view.hash()); + launcher.add_region_requirement(RegionRequirement(inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + outputs[0]->region)); + launcher.add_field(1, FID_DATA); + // launcher.add_region_requirement(RegionRequirement(outputs[1]->part, + // 0 /*projection id*/, + // WRITE_ONLY, + // EXCLUSIVE, + // outputs[1]->region)); + // launcher.add_field(2, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap(ff, fm); +} + +OpMeta *ArgTopK::init_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + ArgTopK *topk = (ArgTopK *)task->args; + FFHandler handle = *((FFHandler *)task->local_args); + ArgTopKMeta *m = new ArgTopKMeta(handle, 
topk); + m->profiling = topk->profiling; + m->sorted = topk->sorted; + return m; +} + +void ArgTopK::forward(FFModel const &ff) { + // ArgTopK does not support forward + assert(false); +} + +FutureMap ArgTopK::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + /* std::cout << "ArgTopK op machine_view: " << *(MachineView const *)mv + << std::endl; */ + IndexLauncher launcher(ARG_TOPK_INF_TASK_ID, + parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_future(bc); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + // launcher.add_region_requirement(RegionRequirement(batch_outputs[1]->part, + // 0 /*projection id*/, + // WRITE_ONLY, + // EXCLUSIVE, + // batch_outputs[1]->region)); + // launcher.add_field(2, FID_DATA); + return runtime->execute_index_space(ctx, launcher); +} + +InferenceResult + ArgTopK::inference_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(regions.size() == 2); + assert(task->regions.size() == 2); + // const ArgTopK* topk = (const ArgTopK*) task->args; + BatchConfig const *bc = BatchConfig::from_future(task->futures[0]); + if (bc->num_tokens == 0) { + // Directly return for empty batch config + InferenceResult ir; + return ir; + } + ArgTopKMeta const *m = *((ArgTopKMeta **)task->local_args); + + GenericTensorAccessorR input = helperGetGenericTensorAccessorRO( + m->input_type[0], regions[0], task->regions[0], FID_DATA, ctx, runtime); + GenericTensorAccessorW indices = helperGetGenericTensorAccessorWO( + DT_INT32, regions[1], task->regions[1], FID_DATA, ctx, runtime); + + int batch_size = bc->num_active_tokens(); + ArgTopK::forward_kernel_wrapper(m, input, indices, batch_size); + + InferenceResult ir; + download_tensor( + indices.get_int32_ptr(), ir.token_ids, batch_size); + return ir; +} + +void ArgTopK::backward(FFModel const &ff) { + // ArgTopK does not support backward + assert(false); +} + +void ArgTopK::serialize(Legion::Serializer &sez) const { + sez.serialize(this->layer_guid.id); + sez.serialize(this->layer_guid.transformer_layer_id); + sez.serialize(this->k); + sez.serialize(this->sorted); +} + +Node ArgTopK::deserialize(FFModel &ff, + Legion::Deserializer &dez, + ParallelTensor inputs[], + int num_inputs) { + assert(num_inputs == 1); + size_t id, transformer_layer_id; + dez.deserialize(id); + dez.deserialize(transformer_layer_id); + LayerID layer_guid(id, transformer_layer_id); + int k; + bool sorted; + dez.deserialize(k); + dez.deserialize(sorted); + ArgTopKParams params; + params.layer_guid = layer_guid; + params.k = k; + params.sorted = sorted; + return ff.get_or_create_node(inputs[0], params); +} + +Op *ArgTopK::materialize(FFModel &ff, + ParallelTensor inputs[], + int 
num_inputs) const { + ArgTopKParams params = get_params(); + return new ArgTopK(ff, params, inputs[0], this->name); +} + +bool ArgTopK::measure_operator_cost(Simulator *sim, + MachineView const &mv, + CostMetrics &cost_metrics) const { + return false; +} + +}; // namespace FlexFlow + +namespace std { +size_t hash::operator()( + FlexFlow::ArgTopKParams const ¶ms) const { + size_t key = 0; + hash_combine(key, params.layer_guid.id); + hash_combine(key, params.k); + hash_combine(key, params.sorted); + return key; +} +}; // namespace std diff --git a/src/ops/arg_topk.cpp b/src/ops/arg_topk.cpp new file mode 100644 index 0000000000..4937166b66 --- /dev/null +++ b/src/ops/arg_topk.cpp @@ -0,0 +1,492 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "flexflow/ops/arg_topk.h" +#include "flexflow/utils/hip_helper.h" +#include + +namespace FlexFlow { +// declare Legion names +using Legion::coord_t; + +enum class HeapType { kMinHeap, kMaxHeap }; +enum class PreferIndices { kLower, kHigher }; + +template +struct Entry { + int index; + T value; +}; + +template +struct LinearData { + typedef Entry Entry; + + __device__ Entry &operator[](std::size_t index) const { + return data[index]; + } + + __device__ int get_index(int i) const { + return data[i].index; + } + __device__ T get_value(int i) const { + return data[i].value; + } + + Entry *const data; +}; + +template +struct IndirectLinearData { + typedef Entry Entry; + + __device__ Entry &operator[](std::size_t index) const { + return data[index]; + } + + __device__ int get_index(int i) const { + return backing_data[data[i].index].index; + } + __device__ T get_value(int i) const { + return data[i].value; + } + + Entry *const data; + Entry *const backing_data; +}; + +template +struct StridedData { + typedef Entry Entry; + + __device__ Entry &operator[](std::size_t index) const { + return data[index * blockDim.x + threadIdx.x]; + } + + __device__ int get_index(int i) const { + return (*this)[i].index; + } + __device__ T get_value(int i) const { + return (*this)[i].value; + } + + Entry *const data; +}; + +// A heap of Entry that can either work as a min-heap or as a max-heap. 
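// Illustrative host-side reference (an expository addition, not used by the
// device code): the contract implemented below is "select the k largest values
// of a row, preferring lower indices on ties, and emit their indices in
// descending value order". A deliberately simple quadratic version of that
// contract, handy as a unit-test oracle:
template <typename T>
void cpu_arg_topk_reference(T const *input, int length, int k, int *indices) {
  // Assumes k <= length; indices[0..k-1] receive the selected positions.
  for (int slot = 0; slot < k; slot++) {
    int best = -1;
    for (int i = 0; i < length; i++) {
      bool already_taken = false;
      for (int s = 0; s < slot; s++) {
        already_taken = already_taken || (indices[s] == i);
      }
      if (already_taken) {
        continue;
      }
      // Strict '>' keeps the earlier (lower) index when values tie,
      // matching PreferIndices::kLower.
      if (best == -1 || input[i] > input[best]) {
        best = i;
      }
    }
    indices[slot] = best;
  }
}
// The IndexedHeap below realizes the same selection co-operatively on the GPU:
// each thread keeps a k-entry min-heap over a strided slice of the row, and the
// per-thread results are merged afterwards (see mergeShards).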
+template + class Data, + typename T> +struct IndexedHeap { + typedef typename Data::Entry Entry; + Data const data; + __device__ IndexedHeap(Data const &d) : data(d) {} + + __device__ bool is_above(int left, int right) { + T left_value = data.get_value(left); + T right_value = data.get_value(right); + if (left_value == right_value) { + if (preferIndices == PreferIndices::kLower) { + return data.get_index(left) < data.get_index(right); + } else { + return data.get_index(left) > data.get_index(right); + } + } + if (heapType == HeapType::kMinHeap) { + return left_value < right_value; + } else { + return left_value > right_value; + } + } + + __device__ void assign(int i, Entry const &entry) { + data[i] = entry; + } + + __device__ void push_up(int i) { + int child = i; + int parent; + for (; child > 0; child = parent) { + parent = (child - 1) / 2; + if (!is_above(child, parent)) { + // Heap property satisfied. + break; + } + swap(child, parent); + } + } + + __device__ void swap(int a, int b) { + auto tmp = data[b]; + data[b] = data[a]; + data[a] = tmp; + } + + __device__ void push_root_down(int k) { + push_down(0, k); + } + + // MAX-HEAPIFY in Cormen + __device__ void push_down(int node, int k) { + while (true) { + int const left = 2 * node + 1; + int const right = left + 1; + int smallest = node; + if (left < k && is_above(left, smallest)) { + smallest = left; + } + if (right < k && is_above(right, smallest)) { + smallest = right; + } + if (smallest == node) { + break; + } + swap(smallest, node); + node = smallest; + } + } + + // BUILD-MAX-HEAPIFY in Cormen + __device__ void build(int k) { + for (int node = (k - 1) / 2; node >= 0; node--) { + push_down(node, k); + } + } + + // HEAP-EXTRACT-MAX in Cormen + __device__ void remove_root(int k) { + data[0] = data[k - 1]; + push_root_down(k - 1); + } + + // in-place HEAPSORT in Cormen + // This method destroys the heap property. + __device__ void sort(int k) { + for (int slot = k - 1; slot > 0; slot--) { + // This is like remove_root but we insert the element at the end. + swap(slot, 0); + // Heap is now an element smaller. + push_root_down(/*k=*/slot); + } + } + + __device__ void replace_root(Entry const &entry, int k) { + data[0] = entry; + push_root_down(k); + } + + __device__ Entry const &root() { + return data[0]; + } +}; + +template + class Data, + typename T> +__device__ IndexedHeap + make_indexed_heap(typename Data::Entry *data) { + return IndexedHeap{Data{data}}; +} + +// heapArgTopK walks over [input, input+length) with `step_size` stride starting +// at `start_index`. It builds a top-`k` heap that is stored in `heap_entries` +// using `Accessor` to access elements in `heap_entries`. If sorted=true, the +// elements will be sorted at the end. +template class Data = LinearData> +__device__ void heapArgTopK(T const *__restrict__ input, + int length, + int k, + Entry *__restrict__ heap_entries, + bool sorted = false, + int start_index = 0, + int step_size = 1) { + assert(k <= length); + + auto heap = + make_indexed_heap( + heap_entries); + + int heap_end_index = start_index + k * step_size; + if (heap_end_index > length) { + heap_end_index = length; + } + // Initialize the min-heap. + for (int index = start_index, slot = 0; index < heap_end_index; + index += step_size, slot++) { + heap.assign(slot, {index, input[index]}); + } + + heap.build(k); + + // Now iterate over the remaining items. + // If an item is smaller than the min element, it is not amongst the top k. + // Otherwise, replace the min element with it and push upwards. 
+ for (int index = heap_end_index; index < length; index += step_size) { + // We prefer elements with lower indices. This is given here. + // Later elements automatically have higher indices, so can be discarded. + if (input[index] > heap.root().value) { + // This element should replace the min. + heap.replace_root({index, input[index]}, k); + } + } + + // Sort if wanted. + if (sorted) { + heap.sort(k); + } +} + +// mergeShards performs a top-k merge on `num_shards` many sorted streams that +// are sorted and stored in `entries` in a strided way: +// |s_1 1st|s_2 1st|...s_{num_shards} 1st|s_1 2nd|s_2 2nd|... +// The overall top k elements are written to `top_k_values` and their indices +// to top_k_indices. +// `top_k_heap` is used as temporary storage for the merge heap. +template +__device__ void mergeShards(int num_shards, + int k, + Entry *__restrict__ entries, + Entry *__restrict__ top_k_heap, + // T *top_k_values, + int *top_k_indices) { + // If k < num_shards, we can use a min-heap with k elements to get the top k + // of the sorted blocks. + // If k > num_shards, we can initialize a min-heap with the top element from + // each sorted block. + int const heap_size = k < num_shards ? k : num_shards; + + // Min-heap part. + { + auto min_heap = IndexedHeap{IndirectLinearData{top_k_heap, entries}}; + // Initialize the heap as a min-heap. + for (int slot = 0; slot < heap_size; slot++) { + min_heap.assign(slot, {slot, entries[slot].value}); + } + min_heap.build(heap_size); + + // Now perform top k with the remaining shards (if num_shards > heap_size). + for (int shard = heap_size; shard < num_shards; shard++) { + auto const entry = entries[shard]; + auto const root = min_heap.root(); + if (entry.value < root.value) { + continue; + } + if (entry.value == root.value && + entry.index > entries[root.index].index) { + continue; + } + // This element should replace the min. + min_heap.replace_root({shard, entry.value}, heap_size); + } + } + + // Max-part. + { + // Turn the min-heap into a max-heap in-place. + auto max_heap = IndexedHeap{IndirectLinearData{top_k_heap, entries}}; + // Heapify into a max heap. + max_heap.build(heap_size); + + // Now extract the minimum k-1 times. + // k is treated specially. + int const last_k = k - 1; + for (int rank = 0; rank < last_k; rank++) { + Entry const &max_element = max_heap.root(); + // top_k_values[rank] = max_element.value; + int shard_index = max_element.index; + top_k_indices[rank] = entries[shard_index].index; + int next_shard_index = shard_index + num_shards; + // For rank < k-1, each top k heap still contains at least 1 element, + // so we can draw a replacement. + max_heap.replace_root({next_shard_index, entries[next_shard_index].value}, + heap_size); + } + + // rank == last_k. 
+ Entry const &max_element = max_heap.root(); + // top_k_values[last_k] = max_element.value; + int shard_index = max_element.index; + top_k_indices[last_k] = entries[shard_index].index; + } +} + +template +__global__ void arg_topk_forward_kernel(T const *__restrict__ input, + size_t shared_memory_size, + int length, + int k, + bool sorted, + // T *__restrict__ output, + int *__restrict__ indices) { + __shared__ char shared_memory[48 << 10]; + int const batch_index = blockIdx.x; + T const *batch_input = input + batch_index * length; + int const thread_index = threadIdx.x; + int const thread_count = blockDim.x; + Entry *shared_entries = (Entry *)shared_memory; + heapArgTopK( + batch_input, length, k, shared_entries, true, thread_index, thread_count); + __syncthreads(); + if (thread_index == 0) { + int const offset = batch_index * k; + // auto batch_output = output + offset; + auto batch_indices = indices + offset; + Entry *top_k_heap = shared_entries + thread_count * k; + mergeShards(thread_count, + k, + shared_entries, + top_k_heap, + // batch_output, + batch_indices); + } +} + +/*static*/ +template +void ArgTopK::forward_kernel(ArgTopKMeta const *m, + DT const *input_ptr, + // float *output_ptr, + int *indices_ptr, + size_t batch_size, + int length, + int k, + bool sorted, + hipStream_t stream) { + // Adopted from TensorFlow's ArgTopK implementation + // https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/topk_op_gpu.h + int num_shards = 0; + { + constexpr auto shared_memory_size = 48 << 10; + auto const heap_size = k * sizeof(Entry
); + // shared_memory_size = (num_shards + 1) * heap_size <=> + num_shards = shared_memory_size / heap_size - 1; + assert(num_shards > 0); + if (num_shards > CUDA_NUM_THREADS) { + num_shards = CUDA_NUM_THREADS; + } + } + // We are limited by the amount of shared memory we have per block. + size_t shared_memory_size = (num_shards + 1) * k * sizeof(Entry
); + // size_t num_blocks = (batch_size + num_shards - 1) / num_shards; + size_t num_blocks = batch_size; + assert(num_shards >= (size_t)k); + num_shards = k; + hipLaunchKernelGGL(arg_topk_forward_kernel, + num_blocks, + num_shards, + 0, + stream, + input_ptr, + shared_memory_size, + length, + k, + sorted, + // output_ptr, + indices_ptr); +} + +/*static*/ +void ArgTopK::forward_kernel_wrapper(ArgTopKMeta const *m, + GenericTensorAccessorR const &input, + // float *output_ptr, + GenericTensorAccessorW const &indices, + int batch_size) { + hipStream_t stream; + checkCUDA(get_legion_stream(&stream)); + // Domain in1_domain = runtime->get_index_space_domain( + // ctx, task->regions[0].region.get_index_space()); + // Domain out1_domain = runtime->get_index_space_domain( + // ctx, task->regions[1].region.get_index_space()); + // Domain out2_domain = runtime->get_index_space_domain( + // ctx, task->regions[1].region.get_index_space()); + int numdims = input.domain.get_dim(); + assert(indices.domain.get_dim() == numdims); + + int in_cols = input.domain.hi()[0] - input.domain.lo()[0] + 1; + // int out1_cols = out1_domain.hi()[0] - out1_domain.lo()[0] + 1; + int out2_cols = indices.domain.hi()[0] - indices.domain.lo()[0] + 1; + + // assert(out1_domain == out2_domain); + for (int i = 1; i < input.domain.get_dim(); i++) { + assert(input.domain.lo()[i] == indices.domain.lo()[i]); + assert(input.domain.hi()[i] == indices.domain.hi()[i]); + } + // float const *in_ptr = helperGetTensorPointerRO( + // regions[0], task->regions[0], FID_DATA, ctx, runtime); + // float *value_ptr = helperGetTensorPointerWO( + // regions[1], task->regions[1], FID_DATA, ctx, runtime); + // int *index_ptr = helperGetTensorPointerWO( + // regions[1], task->regions[1], FID_DATA, ctx, runtime); + + int length = input.domain.hi()[0] - input.domain.lo()[0] + 1; + int k = indices.domain.hi()[0] - indices.domain.lo()[0] + + 1; /*TODO: This prints to 5*/ + // size_t batch_size = input.domain.get_volume() / length; + // assert(indices.domain.get_volume() / k == batch_size); + + hipEvent_t t_start, t_end; + if (m->profiling) { + hipEventCreate(&t_start); + hipEventCreate(&t_end); + hipEventRecord(t_start, stream); + } + + if (input.data_type == DT_HALF) { + ArgTopK::forward_kernel(m, + input.get_half_ptr(), + // output_ptr, + indices.get_int32_ptr(), + batch_size, + length, + k, + m->sorted, + stream); + } else if (input.data_type == DT_FLOAT) { + ArgTopK::forward_kernel(m, + input.get_float_ptr(), + // output_ptr, + indices.get_int32_ptr(), + batch_size, + length, + k, + m->sorted, + stream); + } else { + assert(false && "Unsupported data type"); + } + if (m->profiling) { + hipEventRecord(t_end, stream); + checkCUDA(hipEventSynchronize(t_end)); + float elapsed = 0; + checkCUDA(hipEventElapsedTime(&elapsed, t_start, t_end)); + hipEventDestroy(t_start); + hipEventDestroy(t_end); + } +} + +ArgTopKMeta::ArgTopKMeta(FFHandler handler, Op const *op) + : OpMeta(handler, op) {} + +}; // namespace FlexFlow diff --git a/src/ops/arg_topk.cu b/src/ops/arg_topk.cu new file mode 100644 index 0000000000..575e0183b4 --- /dev/null +++ b/src/ops/arg_topk.cu @@ -0,0 +1,489 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "flexflow/ops/arg_topk.h" +#include "flexflow/utils/cuda_helper.h" + +namespace FlexFlow { +// declare Legion names +using Legion::coord_t; + +enum class HeapType { kMinHeap, kMaxHeap }; +enum class PreferIndices { kLower, kHigher }; + +template +struct Entry { + int index; + T value; +}; + +template +struct LinearData { + typedef Entry Entry; + + __device__ Entry &operator[](std::size_t index) const { + return data[index]; + } + + __device__ int get_index(int i) const { + return data[i].index; + } + __device__ T get_value(int i) const { + return data[i].value; + } + + Entry *const data; +}; + +template +struct IndirectLinearData { + typedef Entry Entry; + + __device__ Entry &operator[](std::size_t index) const { + return data[index]; + } + + __device__ int get_index(int i) const { + return backing_data[data[i].index].index; + } + __device__ T get_value(int i) const { + return data[i].value; + } + + Entry *const data; + Entry *const backing_data; +}; + +template +struct StridedData { + typedef Entry Entry; + + __device__ Entry &operator[](std::size_t index) const { + return data[index * blockDim.x + threadIdx.x]; + } + + __device__ int get_index(int i) const { + return (*this)[i].index; + } + __device__ T get_value(int i) const { + return (*this)[i].value; + } + + Entry *const data; +}; + +// A heap of Entry that can either work as a min-heap or as a max-heap. +template + class Data, + typename T> +struct IndexedHeap { + typedef typename Data::Entry Entry; + Data const data; + __device__ IndexedHeap(Data const &d) : data(d) {} + + __device__ bool is_above(int left, int right) { + T left_value = data.get_value(left); + T right_value = data.get_value(right); + if (left_value == right_value) { + if (preferIndices == PreferIndices::kLower) { + return data.get_index(left) < data.get_index(right); + } else { + return data.get_index(left) > data.get_index(right); + } + } + if (heapType == HeapType::kMinHeap) { + return left_value < right_value; + } else { + return left_value > right_value; + } + } + + __device__ void assign(int i, Entry const &entry) { + data[i] = entry; + } + + __device__ void push_up(int i) { + int child = i; + int parent; + for (; child > 0; child = parent) { + parent = (child - 1) / 2; + if (!is_above(child, parent)) { + // Heap property satisfied. 
+ break; + } + swap(child, parent); + } + } + + __device__ void swap(int a, int b) { + auto tmp = data[b]; + data[b] = data[a]; + data[a] = tmp; + } + + __device__ void push_root_down(int k) { + push_down(0, k); + } + + // MAX-HEAPIFY in Cormen + __device__ void push_down(int node, int k) { + while (true) { + int const left = 2 * node + 1; + int const right = left + 1; + int smallest = node; + if (left < k && is_above(left, smallest)) { + smallest = left; + } + if (right < k && is_above(right, smallest)) { + smallest = right; + } + if (smallest == node) { + break; + } + swap(smallest, node); + node = smallest; + } + } + + // BUILD-MAX-HEAPIFY in Cormen + __device__ void build(int k) { + for (int node = (k - 1) / 2; node >= 0; node--) { + push_down(node, k); + } + } + + // HEAP-EXTRACT-MAX in Cormen + __device__ void remove_root(int k) { + data[0] = data[k - 1]; + push_root_down(k - 1); + } + + // in-place HEAPSORT in Cormen + // This method destroys the heap property. + __device__ void sort(int k) { + for (int slot = k - 1; slot > 0; slot--) { + // This is like remove_root but we insert the element at the end. + swap(slot, 0); + // Heap is now an element smaller. + push_root_down(/*k=*/slot); + } + } + + __device__ void replace_root(Entry const &entry, int k) { + data[0] = entry; + push_root_down(k); + } + + __device__ Entry const &root() { + return data[0]; + } +}; + +template + class Data, + typename T> +__device__ IndexedHeap + make_indexed_heap(typename Data::Entry *data) { + return IndexedHeap{Data{data}}; +} + +// heapArgTopK walks over [input, input+length) with `step_size` stride starting +// at `start_index`. It builds a top-`k` heap that is stored in `heap_entries` +// using `Accessor` to access elements in `heap_entries`. If sorted=true, the +// elements will be sorted at the end. +template class Data = LinearData> +__device__ void heapArgTopK(T const *__restrict__ input, + int length, + int k, + Entry *__restrict__ heap_entries, + bool sorted = false, + int start_index = 0, + int step_size = 1) { + assert(k <= length); + + auto heap = + make_indexed_heap( + heap_entries); + + int heap_end_index = start_index + k * step_size; + if (heap_end_index > length) { + heap_end_index = length; + } + // Initialize the min-heap. + for (int index = start_index, slot = 0; index < heap_end_index; + index += step_size, slot++) { + heap.assign(slot, {index, input[index]}); + } + + heap.build(k); + + // Now iterate over the remaining items. + // If an item is smaller than the min element, it is not amongst the top k. + // Otherwise, replace the min element with it and push upwards. + for (int index = heap_end_index; index < length; index += step_size) { + // We prefer elements with lower indices. This is given here. + // Later elements automatically have higher indices, so can be discarded. + if (input[index] > heap.root().value) { + // This element should replace the min. + heap.replace_root({index, input[index]}, k); + } + } + + // Sort if wanted. + if (sorted) { + heap.sort(k); + } +} + +// mergeShards performs a top-k merge on `num_shards` many sorted streams that +// are sorted and stored in `entries` in a strided way: +// |s_1 1st|s_2 1st|...s_{num_shards} 1st|s_1 2nd|s_2 2nd|... +// The overall top k elements are written to `top_k_values` and their indices +// to top_k_indices. +// `top_k_heap` is used as temporary storage for the merge heap. 
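// Sizing note (expository helper, not called by the kernel, which does the same
// arithmetic inline): the per-thread heaps plus this merge heap must all fit in
// the 48KB of statically allocated shared memory, i.e.
//   (num_shards + 1) * k * sizeof(Entry<T>) <= 48 << 10.
template <typename T>
constexpr int max_num_shards_for(int k) {
  // One extra heap's worth of space is reserved for the merge scratch heap.
  return static_cast<int>((48u << 10) / (k * sizeof(Entry<T>))) - 1;
}
// e.g. max_num_shards_for<float>(16) == 383; forward_kernel additionally clamps
// the shard count to CUDA_NUM_THREADS and finally to k itself.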
+template +__device__ void mergeShards(int num_shards, + int k, + Entry *__restrict__ entries, + Entry *__restrict__ top_k_heap, + // T *top_k_values, + int *top_k_indices) { + // If k < num_shards, we can use a min-heap with k elements to get the top k + // of the sorted blocks. + // If k > num_shards, we can initialize a min-heap with the top element from + // each sorted block. + int const heap_size = k < num_shards ? k : num_shards; + + // Min-heap part. + { + auto min_heap = IndexedHeap{IndirectLinearData{top_k_heap, entries}}; + // Initialize the heap as a min-heap. + for (int slot = 0; slot < heap_size; slot++) { + min_heap.assign(slot, {slot, entries[slot].value}); + } + min_heap.build(heap_size); + + // Now perform top k with the remaining shards (if num_shards > heap_size). + for (int shard = heap_size; shard < num_shards; shard++) { + auto const entry = entries[shard]; + auto const root = min_heap.root(); + if (entry.value < root.value) { + continue; + } + if (entry.value == root.value && + entry.index > entries[root.index].index) { + continue; + } + // This element should replace the min. + min_heap.replace_root({shard, entry.value}, heap_size); + } + } + + // Max-part. + { + // Turn the min-heap into a max-heap in-place. + auto max_heap = IndexedHeap{IndirectLinearData{top_k_heap, entries}}; + // Heapify into a max heap. + max_heap.build(heap_size); + + // Now extract the minimum k-1 times. + // k is treated specially. + int const last_k = k - 1; + for (int rank = 0; rank < last_k; rank++) { + Entry const &max_element = max_heap.root(); + // top_k_values[rank] = max_element.value; + int shard_index = max_element.index; + top_k_indices[rank] = entries[shard_index].index; + int next_shard_index = shard_index + num_shards; + // For rank < k-1, each top k heap still contains at least 1 element, + // so we can draw a replacement. + max_heap.replace_root({next_shard_index, entries[next_shard_index].value}, + heap_size); + } + + // rank == last_k. 
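+    // The k-th result is read straight from the root; no replacement entry is
+    // drawn from its shard, since nothing further will be extracted.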
+ Entry const &max_element = max_heap.root(); + // top_k_values[last_k] = max_element.value; + int shard_index = max_element.index; + top_k_indices[last_k] = entries[shard_index].index; + } +} + +template +__global__ void arg_topk_forward_kernel(T const *__restrict__ input, + size_t shared_memory_size, + int length, + int k, + bool sorted, + // T *__restrict__ output, + int *__restrict__ indices) { + __shared__ char shared_memory[48 << 10]; + int const batch_index = blockIdx.x; + T const *batch_input = input + batch_index * length; + int const thread_index = threadIdx.x; + int const thread_count = blockDim.x; + Entry *shared_entries = (Entry *)shared_memory; + heapArgTopK( + batch_input, length, k, shared_entries, true, thread_index, thread_count); + __syncthreads(); + if (thread_index == 0) { + int const offset = batch_index * k; + // auto batch_output = output + offset; + auto batch_indices = indices + offset; + Entry *top_k_heap = shared_entries + thread_count * k; + mergeShards(thread_count, + k, + shared_entries, + top_k_heap, + // batch_output, + batch_indices); + } +} + +/*static*/ +template +void ArgTopK::forward_kernel(ArgTopKMeta const *m, + DT const *input_ptr, + // float *output_ptr, + int *indices_ptr, + size_t batch_size, + int length, + int k, + bool sorted, + cudaStream_t stream) { + // Adopted from TensorFlow's ArgTopK implementation + // https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/topk_op_gpu.h + int num_shards = 0; + { + constexpr auto shared_memory_size = 48 << 10; + auto const heap_size = k * sizeof(Entry
); + // shared_memory_size = (num_shards + 1) * heap_size <=> + num_shards = shared_memory_size / heap_size - 1; + assert(num_shards > 0); + if (num_shards > CUDA_NUM_THREADS) { + num_shards = CUDA_NUM_THREADS; + } + } + // We are limited by the amount of shared memory we have per block. + size_t shared_memory_size = (num_shards + 1) * k * sizeof(Entry
); + // size_t num_blocks = (batch_size + num_shards - 1) / num_shards; + size_t num_blocks = batch_size; + assert(num_shards >= (size_t)k); + num_shards = k; + arg_topk_forward_kernel<<>>( + input_ptr, + shared_memory_size, + length, + k, + sorted, + // output_ptr, + indices_ptr); +} + +/*static*/ +void ArgTopK::forward_kernel_wrapper(ArgTopKMeta const *m, + GenericTensorAccessorR const &input, + // float *output_ptr, + GenericTensorAccessorW const &indices, + int batch_size) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + + // Domain in1_domain = runtime->get_index_space_domain( + // ctx, task->regions[0].region.get_index_space()); + // Domain out1_domain = runtime->get_index_space_domain( + // ctx, task->regions[1].region.get_index_space()); + // Domain out2_domain = runtime->get_index_space_domain( + // ctx, task->regions[1].region.get_index_space()); + int numdims = input.domain.get_dim(); + assert(indices.domain.get_dim() == numdims); + + int in_cols = input.domain.hi()[0] - input.domain.lo()[0] + 1; + // int out1_cols = out1_domain.hi()[0] - out1_domain.lo()[0] + 1; + int out2_cols = indices.domain.hi()[0] - indices.domain.lo()[0] + 1; + + // assert(out1_domain == out2_domain); + for (int i = 1; i < input.domain.get_dim(); i++) { + assert(input.domain.lo()[i] == indices.domain.lo()[i]); + assert(input.domain.hi()[i] == indices.domain.hi()[i]); + } + // float const *in_ptr = helperGetTensorPointerRO( + // regions[0], task->regions[0], FID_DATA, ctx, runtime); + // float *value_ptr = helperGetTensorPointerWO( + // regions[1], task->regions[1], FID_DATA, ctx, runtime); + // int *index_ptr = helperGetTensorPointerWO( + // regions[1], task->regions[1], FID_DATA, ctx, runtime); + + int length = input.domain.hi()[0] - input.domain.lo()[0] + 1; + int k = indices.domain.hi()[0] - indices.domain.lo()[0] + + 1; /*TODO: This prints to 5*/ + // batch_size = input.domain.get_volume() / length; + // assert(indices.domain.get_volume() / k == batch_size); + cudaEvent_t t_start, t_end; + if (m->profiling) { + cudaEventCreate(&t_start); + cudaEventCreate(&t_end); + cudaEventRecord(t_start, stream); + } + + if (input.data_type == DT_HALF) { + ArgTopK::forward_kernel(m, + input.get_half_ptr(), + // output_ptr, + indices.get_int32_ptr(), + batch_size, + length, + k, + m->sorted, + stream); + } else if (input.data_type == DT_FLOAT) { + ArgTopK::forward_kernel(m, + input.get_float_ptr(), + // output_ptr, + indices.get_int32_ptr(), + batch_size, + length, + k, + m->sorted, + stream); + } else { + assert(false && "Unsupported data type"); + } + + if (m->profiling) { + cudaEventRecord(t_end, stream); + checkCUDA(cudaEventSynchronize(t_end)); + float elapsed = 0; + checkCUDA(cudaEventElapsedTime(&elapsed, t_start, t_end)); + cudaEventDestroy(t_start); + cudaEventDestroy(t_end); + printf("[ArgTopK] forward time = %.2lfms\n", elapsed); + } +} + +ArgTopKMeta::ArgTopKMeta(FFHandler handler, Op const *op) + : OpMeta(handler, op) {} + +}; // namespace FlexFlow diff --git a/src/ops/argmax.cc b/src/ops/argmax.cc new file mode 100644 index 0000000000..7863931c82 --- /dev/null +++ b/src/ops/argmax.cc @@ -0,0 +1,432 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "flexflow/ops/argmax.h" +#include "flexflow/model.h" +#include "flexflow/utils/hash_utils.h" +#include "legion/legion_utilities.h" +#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) +#include "flexflow/utils/cuda_helper.h" +#else +#include "flexflow/utils/hip_helper.h" +#endif + +namespace FlexFlow { +// declare Legion names +using Legion::ArgumentMap; +using Legion::Context; +using Legion::coord_t; +using Legion::Domain; +using Legion::FutureMap; +using Legion::IndexLauncher; +using Legion::InlineLauncher; +using Legion::Machine; +using Legion::Memory; +using Legion::PhysicalRegion; +using Legion::Predicate; +using Legion::Rect; +using Legion::RegionRequirement; +using Legion::Runtime; +using Legion::Task; +using Legion::TaskArgument; +using Legion::TaskLauncher; +using PCG::Node; + +Tensor FFModel::argmax(const Tensor input, bool beam_search, char const *name) { + Layer *li = new Layer(this, + OP_ARGMAX, + input->data_type, + name, + 1 /*inputs*/, + 0 /*weights*/, + beam_search ? 2 : 1 /*outputs*/, + input); + { + int numdims = input->num_dims; + int dims[MAX_TENSOR_DIM]; + for (int i = 0; i < numdims; i++) { + dims[i] = input->dims[i]; + } + // now just support 1 output + dims[0] = 1; + // li->outputs[0] = create_tensor_legion_ordering( + // numdims, dims, input->data_type, li, 0, true /*create_grad*/); + li->outputs[0] = create_tensor_legion_ordering( + numdims, dims, DT_INT32, li, 0, false /*create_grad*/); + if (beam_search) { + // parent id + li->outputs[1] = create_tensor_legion_ordering( + numdims, dims, DT_INT32, li, 1, false /*create_grad*/); + } + } + li->add_int_property("beam_search", beam_search); + layers.push_back(li); + // outputs[0] = li->outputs[0]; + // outputs[1] = li->outputs[1]; + return li->outputs[0]; +} + +Op *ArgMax::create_operator_from_layer( + FFModel &model, + Layer const *layer, + std::vector const &inputs) { + long long value; + layer->get_int_property("beam_search", value); + bool beam_search = (bool)value; + return new ArgMax(model, inputs[0], beam_search, layer->name); +} + +ArgMaxParams ArgMax::get_params() const { + ArgMaxParams params; + params.beam_search = this->beam_search; + return params; +} + +bool ArgMaxParams::is_valid(ParallelTensorShape const &) const { + return true; +} + +bool operator==(ArgMaxParams const &lhs, ArgMaxParams const &rhs) { + return lhs.beam_search == rhs.beam_search; +} + +ArgMax::ArgMax(FFModel &model, + const ParallelTensor _input, + bool _beam_search, + char const *name) + : Op(model, + OP_ARGMAX, + _input->data_type, + name, + 1 /*inputs*/, + 0 /*weights*/, + _beam_search ? 
2 : 1 /*outputs*/, + _input), + beam_search(_beam_search) { + int numdim = inputs[0]->num_dims; + ParallelDim dims[MAX_TENSOR_DIM]; + for (int i = 0; i < numdim; i++) { + dims[i] = inputs[0]->dims[i]; + } + dims[0].size = 1; + assert(inputs[0]->dims[0].degree == 1); + assert(inputs[0]->dims[0].parallel_idx == -1); + // outputs[0] = model.create_parallel_tensor_legion_ordering( + // numdim, dims, _input->data_type, this, 0 /*owner_idx*/); + outputs[0] = model.create_parallel_tensor_legion_ordering( + numdim, dims, DT_INT32, this, 0 /*owner_idx*/); + if (_beam_search) { + outputs[1] = model.create_parallel_tensor_legion_ordering( + numdim, dims, DT_INT32, this, 1 /*owner_idx*/); + } +} + +ArgMax::ArgMax(FFModel &model, ArgMax const &other, const ParallelTensor input) + : ArgMax(model, input, other.beam_search, other.name) {} + +ArgMax::ArgMax(FFModel &model, + ArgMaxParams const ¶ms, + const ParallelTensor input, + char const *name) + : ArgMax(model, input, params.beam_search, name) {} + +void ArgMax::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(ARGMAX_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(ArgMax)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + +void ArgMax::init(FFModel const &ff) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + set_argumentmap_for_init(ff, argmap); + IndexLauncher launcher(ARGMAX_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(ArgMax)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + outputs[0]->machine_view.hash()); + launcher.add_region_requirement(RegionRequirement(inputs[0]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + outputs[0]->region)); + launcher.add_field(1, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap(ff, fm); +} + +OpMeta *ArgMax::init_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + ArgMax *s = (ArgMax *)task->args; + FFHandler handle = *((FFHandler *)task->local_args); + GenericTensorAccessorW acc_input = + helperGetGenericTensorAccessorRW(s->inputs[0]->data_type, + regions[0], + task->regions[0], + 
FID_DATA, + ctx, + runtime); + Domain input_domain = runtime->get_index_space_domain( + ctx, task->regions[0].region.get_index_space()); + Domain output_domain = runtime->get_index_space_domain( + ctx, task->regions[1].region.get_index_space()); + int length = acc_input.domain.hi()[0] - acc_input.domain.lo()[0] + 1; + int batch_size = acc_input.domain.get_volume() / length; + Memory gpu_mem = Machine::MemoryQuery(Machine::get_machine()) + .only_kind(Memory::GPU_FB_MEM) + .best_affinity_to(task->target_proc) + .first(); + MemoryAllocator gpu_mem_allocator(gpu_mem); + + ArgMaxMeta *m = new ArgMaxMeta(handle, + s, + input_domain, + output_domain, + acc_input, + batch_size, + length * batch_size, + gpu_mem_allocator); + m->profiling = s->profiling; + m->beam_search = s->beam_search; + return m; +} + +void ArgMax::forward(FFModel const &ff) { + // ArgMax does not support forward + assert(false); +} + +FutureMap ArgMax::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + /* std::cout << "ArgMax op machine_view: " << *(MachineView const *)mv + << std::endl; */ + if (beam_search) { + IndexLauncher launcher(ARGMAX_BEAM_INF_TASK_ID, + parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_future(bc); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement( + RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement( + RegionRequirement(batch_outputs[1]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[1]->region)); + launcher.add_field(2, FID_DATA); + return runtime->execute_index_space(ctx, launcher); + } else { + IndexLauncher launcher(ARGMAX_NORM_INF_TASK_ID, + parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_future(bc); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement( + RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + return runtime->execute_index_space(ctx, launcher); + } +} + +BeamInferenceResult + ArgMax::inference_task_beam(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(regions.size() == 3); + assert(task->regions.size() == 3); + BatchConfig const *bc = BatchConfig::from_future(task->futures[0]); + if (bc->num_tokens == 0) { + // Directly return for empty batch config + BeamInferenceResult ir; + return ir; + } + ArgMaxMeta const *m = *((ArgMaxMeta **)task->local_args); + + GenericTensorAccessorW input = helperGetGenericTensorAccessorRW( + 
m->input_type[0], regions[0], task->regions[0], FID_DATA, ctx, runtime); + GenericTensorAccessorW indices = helperGetGenericTensorAccessorWO( + DT_INT32, regions[1], task->regions[1], FID_DATA, ctx, runtime); + int batch_size = bc->num_active_tokens(); + GenericTensorAccessorW parent = helperGetGenericTensorAccessorWO( + DT_INT32, regions[2], task->regions[2], FID_DATA, ctx, runtime); + ArgMax::forward_kernel_wrapper(m, input, indices, parent, batch_size); + + BeamInferenceResult ir; + download_tensor( + indices.get_int32_ptr(), ir.token_ids, batch_size); + download_tensor(m->probs, ir.probs, batch_size); + download_tensor(parent.get_int32_ptr(), ir.parent_id, batch_size); + return ir; +} + +InferenceResult + ArgMax::inference_task_norm(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(regions.size() == 2); + assert(task->regions.size() == 2); + ArgMaxMeta const *m = *((ArgMaxMeta **)task->local_args); + BatchConfig const *bc = BatchConfig::from_future(task->futures[0]); + if (bc->num_tokens == 0) { + // Directly return for empty batch config + InferenceResult ir; + return ir; + } + + GenericTensorAccessorW input = helperGetGenericTensorAccessorRW( + m->input_type[0], regions[0], task->regions[0], FID_DATA, ctx, runtime); + GenericTensorAccessorW indices = helperGetGenericTensorAccessorWO( + DT_INT32, regions[1], task->regions[1], FID_DATA, ctx, runtime); + GenericTensorAccessorW parent; + int batch_size = bc->num_active_tokens(); + ArgMax::forward_kernel_wrapper(m, input, indices, parent, batch_size); + InferenceResult ir; + download_tensor( + indices.get_int32_ptr(), ir.token_ids, batch_size); + return ir; +} + +void ArgMax::backward(FFModel const &ff) { + // ArgMax does not support backward + assert(false); +} + +void ArgMax::serialize(Legion::Serializer &sez) const { + sez.serialize(this->beam_search); +} + +Node ArgMax::deserialize(FFModel &ff, + Legion::Deserializer &dez, + ParallelTensor inputs[], + int num_inputs) { + assert(num_inputs == 1); + bool beam_search; + dez.deserialize(beam_search); + ArgMaxParams params; + params.beam_search = beam_search; + return ff.get_or_create_node(inputs[0], params); +} + +Op *ArgMax::materialize(FFModel &ff, + ParallelTensor inputs[], + int num_inputs) const { + ArgMaxParams params = get_params(); + return new ArgMax(ff, params, inputs[0], this->name); +} + +bool ArgMax::measure_operator_cost(Simulator *sim, + MachineView const &mv, + CostMetrics &cost_metrics) const { + return false; +} + +}; // namespace FlexFlow + +namespace std { +size_t hash::operator()( + FlexFlow::ArgMaxParams const ¶ms) const { + size_t key = 0; + hash_combine(key, params.beam_search); + return key; +} +}; // namespace std \ No newline at end of file diff --git a/src/ops/argmax.cpp b/src/ops/argmax.cpp new file mode 100644 index 0000000000..778ddf3c9d --- /dev/null +++ b/src/ops/argmax.cpp @@ -0,0 +1,74 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "flexflow/ops/argmax.h" +#include "flexflow/ffconst_utils.h" +#include "flexflow/utils/hip_helper.h" +#include + +namespace FlexFlow { + +/*static*/ +template +void ArgMax::forward_kernel(ArgMaxMeta const *m, + DT *input_ptr, + int *indices_ptr, + float *prob_ptr, + int *parent_ptr, + int length, + int batch_size, + ffStream_t stream) {} + +/*static*/ +void ArgMax::forward_kernel_wrapper(ArgMaxMeta const *m, + GenericTensorAccessorW const &input, + GenericTensorAccessorW const &indices, + GenericTensorAccessorW const &parent, + int batch_size) { + hipStream_t stream; + checkCUDA(get_legion_stream(&stream)); + + hipEvent_t t_start, t_end; + if (m->profiling) { + hipEventCreate(&t_start); + hipEventCreate(&t_end); + hipEventRecord(t_start, stream); + } + + handle_unimplemented_hip_kernel(OP_RMS_NORM); + + if (m->profiling) { + hipEventRecord(t_end, stream); + checkCUDA(hipEventSynchronize(t_end)); + float elapsed = 0; + checkCUDA(hipEventElapsedTime(&elapsed, t_start, t_end)); + hipEventDestroy(t_start); + hipEventDestroy(t_end); + } +} + +ArgMaxMeta::ArgMaxMeta(FFHandler handler, + Op const *op, + Legion::Domain const &input_domain, + Legion::Domain const &output_domain, + GenericTensorAccessorW input, + int batch_size, + int total_ele, + MemoryAllocator &gpu_mem_allocator) + : OpMeta(handler, op) {} + +ArgMaxMeta::~ArgMaxMeta(void) {} + +}; // namespace FlexFlow \ No newline at end of file diff --git a/src/ops/argmax.cu b/src/ops/argmax.cu new file mode 100644 index 0000000000..37e067006c --- /dev/null +++ b/src/ops/argmax.cu @@ -0,0 +1,212 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +#include "flexflow/ffconst_utils.h" +#include "flexflow/ops/argmax.h" +#include "flexflow/utils/cuda_helper.h" +#include + +namespace FlexFlow { + +__global__ void init_offset(int batch_size, + int vocab_size, + int total_eles, + int *d_offsets) { + CUDA_KERNEL_LOOP(i, total_eles) { + if (i % vocab_size == 0) { + d_offsets[i / vocab_size] = i; + } + } +} + +template +__global__ void copy_result(cub::KeyValuePair *d_out, + int *indices, + float *prob_ptr, + int batch_size, + bool beam_search) { + CUDA_KERNEL_LOOP(i, batch_size) { + indices[i] = d_out[i].key; + if (beam_search) { + prob_ptr[i] = static_cast(d_out[i].value); + } + } +} + +/*static*/ +template +void ArgMax::forward_kernel(ArgMaxMeta const *m, + DT *input_ptr, + int *indices_ptr, + float *prob_ptr, + int *parent, + int const length, + int const batch_size, + cudaStream_t stream) { + + checkCUDNN(cudnnSetStream(m->handle.dnn, stream)); + DT alpha = 1.0f, beta = 0.0f; + if (m->beam_search) { + // set all parents id zero in arg top1 case. 
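+    // ArgMax is the beam-width-1 case: every sampled token has exactly one
+    // possible parent (0), so the whole parent buffer is zero-filled up front.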
+ checkCUDA(cudaMemset(parent, 0, batch_size * sizeof(int))); + } + size_t temp_storage_bytes = m->temp_storage_bytes; + // use cub + checkCUDA(cub::DeviceSegmentedReduce::ArgMax( + m->d_temp_storage, + temp_storage_bytes, + input_ptr, + static_cast *>(m->d_out), + batch_size, + m->d_offsets, + m->d_offsets + 1, + stream)); + + // copy dout to incides + int parallelism = batch_size; + copy_result<<>>(static_cast *>(m->d_out), + indices_ptr, + prob_ptr, + batch_size, + m->beam_search); +} + +/*static*/ +void ArgMax::forward_kernel_wrapper(ArgMaxMeta const *m, + GenericTensorAccessorW const &input, + GenericTensorAccessorW const &indices, + GenericTensorAccessorW const &parent, + int batch_size) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + + cudaEvent_t t_start, t_end; + if (m->profiling) { + cudaEventCreate(&t_start); + cudaEventCreate(&t_end); + cudaEventRecord(t_start, stream); + } + int length = input.domain.hi()[0] - input.domain.lo()[0] + 1; + + if (input.data_type == DT_HALF) { + ArgMax::forward_kernel(m, + input.get_half_ptr(), + indices.get_int32_ptr(), + m->probs, + m->beam_search ? parent.get_int32_ptr() + : nullptr, + length, + batch_size, + stream); + + } else if (input.data_type == DT_FLOAT) { + ArgMax::forward_kernel(m, + input.get_float_ptr(), + indices.get_int32_ptr(), + m->probs, + m->beam_search ? parent.get_int32_ptr() + : nullptr, + length, + batch_size, + stream); + } else { + assert(false && "Unsupported data type"); + } + + if (m->profiling) { + cudaEventRecord(t_end, stream); + checkCUDA(cudaEventSynchronize(t_end)); + float elapsed = 0; + checkCUDA(cudaEventElapsedTime(&elapsed, t_start, t_end)); + cudaEventDestroy(t_start); + cudaEventDestroy(t_end); + printf("[ArgMax] forward time = %.2lfms\n", elapsed); + } +} + +ArgMaxMeta::ArgMaxMeta(FFHandler handler, + Op const *op, + Legion::Domain const &input_domain, + Legion::Domain const &output_domain, + GenericTensorAccessorW input, + int batch_size, + int total_ele, + MemoryAllocator &gpu_mem_allocator) + : OpMeta(handler, op) { + DataType data_type = op->data_type; + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + + size_t d_offsets_size = batch_size; + size_t prob_size = batch_size; + assert(data_type == DT_FLOAT || data_type == DT_HALF); + size_t total_size = + d_offsets_size * sizeof(int) + + (data_type == DT_FLOAT + ? sizeof(cub::KeyValuePair) * batch_size + : sizeof(cub::KeyValuePair) * batch_size) + + prob_size * sizeof(float); + gpu_mem_allocator.create_legion_instance(reserveInst, total_size); + d_offsets = gpu_mem_allocator.allocate_instance(d_offsets_size); + d_out = data_type == DT_FLOAT + ? 
gpu_mem_allocator.allocate_instance_untyped( + batch_size * sizeof(cub::KeyValuePair)) + : gpu_mem_allocator.allocate_instance_untyped( + batch_size * sizeof(cub::KeyValuePair)); + probs = gpu_mem_allocator.allocate_instance(prob_size); + // init offset + int parallelism = total_ele; + init_offset<<>>( + batch_size, total_ele / batch_size, total_ele, d_offsets); + + if (data_type == DT_FLOAT) { + checkCUDA(cub::DeviceSegmentedReduce::ArgMax( + d_temp_storage, + temp_storage_bytes, + input.get_float_ptr(), + static_cast *>(d_out), + batch_size, + d_offsets, + d_offsets + 1, + stream)); + + } else if (data_type == DT_HALF) { + checkCUDA(cub::DeviceSegmentedReduce::ArgMax( + d_temp_storage, + temp_storage_bytes, + input.get_half_ptr(), + static_cast *>(d_out), + batch_size, + d_offsets, + d_offsets + 1, + stream)); + } + + gpu_mem_allocator.create_legion_instance(reserveInst, temp_storage_bytes); + d_temp_storage = + gpu_mem_allocator.allocate_instance_untyped(temp_storage_bytes); +} + +ArgMaxMeta::~ArgMaxMeta(void) { + if (reserveInst != Realm::RegionInstance::NO_INST) { + reserveInst.destroy(); + } +} +}; // namespace FlexFlow \ No newline at end of file diff --git a/src/ops/attention.cc b/src/ops/attention.cc index 584a84f503..027ea18634 100644 --- a/src/ops/attention.cc +++ b/src/ops/attention.cc @@ -59,8 +59,11 @@ Tensor FFModel::multihead_attention(const Tensor query, bool bias, bool add_bias_kv, bool add_zero_attn, + DataType data_type, Initializer *kernel_initializer, char const *name) { + // Currently only support float for the original attention operator + assert(data_type == DT_NONE || data_type == DT_FLOAT); Layer *li = new Layer(this, OP_MULTIHEAD_ATTENTION, DT_FLOAT, @@ -217,17 +220,12 @@ MultiHeadAttention::MultiHeadAttention(FFModel &model, dims[2].parallel_idx = -1; int seed = std::rand(); Initializer *initializer = new GlorotUniform(seed); -#ifdef USE_NCCL - ParameterSyncType comm_type = ParameterSyncType::NCCL; -#else - ParameterSyncType comm_type = ParameterSyncType::PS; -#endif weights[0] = model.create_parallel_weight<3>(dims, DT_FLOAT, NULL /*owner_op*/, true /*create_grad*/, initializer, - comm_type); + CHOSEN_SYNC_TYPE); } outputs[0] = model.create_parallel_tensor_legion_ordering( @@ -304,17 +302,12 @@ MultiHeadAttention::MultiHeadAttention(FFModel &model, dims[2].size = qParas + kParas + vParas + oParas; int seed = std::rand(); Initializer *initializer = new GlorotUniform(seed); -#ifdef USE_NCCL - ParameterSyncType comm_type = ParameterSyncType::NCCL; -#else - ParameterSyncType comm_type = ParameterSyncType::PS; -#endif weights[0] = model.create_parallel_weight<3>(dims, DT_FLOAT, NULL /*owner_op*/, true /*create_grad*/, initializer, - comm_type); + CHOSEN_SYNC_TYPE); } outputs[0] = model.create_parallel_tensor_legion_ordering( _query->num_dims, dims, DT_FLOAT, this); @@ -372,6 +365,62 @@ MultiHeadAttention::MultiHeadAttention( allocate_weights, name) {} +void MultiHeadAttention::init_inference( + FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(ATTENTION_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(MultiHeadAttention)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_inputs[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[1]->region)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_inputs[2]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[2]->region)); + launcher.add_field(2, FID_DATA); + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(3, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(4, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + void MultiHeadAttention::init(FFModel const &ff) { assert(check_output_input_weight_same_parallel_is()); parallel_is = outputs[0]->parallel_is; @@ -523,6 +572,64 @@ void MultiHeadAttention::forward(FFModel const &ff) { runtime->execute_index_space(ctx, launcher); } +FutureMap MultiHeadAttention::inference( + FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + /* std::cout << "MultiHeadAttention op machine_view: " << *(MachineView const + *)mv + << std::endl; */ + int idx = 0; + IndexLauncher launcher(ATTENTION_FWD_TASK_ID, + parallel_is, + TaskArgument(NULL, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(idx++, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_inputs[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[1]->region)); + launcher.add_field(idx++, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_inputs[2]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[2]->region)); + launcher.add_field(idx++, FID_DATA); + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(idx++, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(4, FID_DATA); + return runtime->execute_index_space(ctx, launcher); +} + /* regions[0](I): query regions[1](I): key diff --git a/src/ops/beam_topk.cc b/src/ops/beam_topk.cc new file mode 100644 index 0000000000..93a6de5a8f --- /dev/null +++ b/src/ops/beam_topk.cc @@ -0,0 +1,486 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "flexflow/ops/beam_topk.h" +#include "flexflow/model.h" +#include "flexflow/utils/hash_utils.h" +#include "legion/legion_utilities.h" +#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) +#include "flexflow/utils/cuda_helper.h" +#else +#include "flexflow/utils/hip_helper.h" +#endif + +namespace FlexFlow { +// declare Legion names +using Legion::ArgumentMap; +using Legion::Context; +using Legion::coord_t; +using Legion::Domain; +using Legion::Future; +using Legion::FutureMap; +using Legion::IndexLauncher; +using Legion::InlineLauncher; +using Legion::Machine; +using Legion::Memory; +using Legion::PhysicalRegion; +using Legion::Predicate; +using Legion::Rect; +using Legion::RegionRequirement; +using Legion::Runtime; +using Legion::Task; +using Legion::TaskArgument; +using Legion::TaskLauncher; +using PCG::Node; + +// For an input tensor, computes the top k entries in each row +// (resp. vector along the last dimension). 
Thus, +// values.shape = indices.shape = input.shape[:-1] + [k] +Tensor FFModel::beam_top_k(const Tensor input, + int max_beam_width, + bool sorted, + char const *name) { + Layer *li = new Layer(this, + OP_BEAM_TOPK, + input->data_type, + name, + 1 /*inputs*/, + 0 /*weights*/, + 3 /*outputs*/, + input); + { + int numdims = input->num_dims; + + int dims[MAX_TENSOR_DIM]; + for (int i = 0; i < numdims; i++) { + dims[i] = input->dims[i]; + } + dims[0] = max_beam_width; + + std::cout << "beam input dimen:" << numdims << "\n"; + for (int i = 0; i < numdims; i++) { + std::cout << input->dims[i] << ", "; + } + + // beam width is dynamic + li->outputs[0] = create_tensor_legion_ordering( + numdims, dims, DT_INT32, li, 0, false /*create_grad*/); + li->outputs[1] = create_tensor_legion_ordering( + numdims, dims, DT_FLOAT, li, 1, false /*create_grad*/); + li->outputs[2] = create_tensor_legion_ordering( + numdims, dims, DT_INT32, li, 1, false /*create_grad*/); + } + li->add_int_property("sorted", sorted); + li->add_int_property("max_beam_width", max_beam_width); + layers.push_back(li); + // outputs[0] = li->outputs[0]; + // outputs[1] = li->outputs[1]; + return li->outputs[1]; +} + +Op *BeamTopK::create_operator_from_layer( + FFModel &model, + Layer const *layer, + std::vector const &inputs) { + long long value; + layer->get_int_property("sorted", value); + bool sorted = (bool)value; + layer->get_int_property("max_beam_width", value); + int max_beam_width = value; + return new BeamTopK( + model, inputs[0], layer->layer_guid, max_beam_width, sorted, layer->name); +} + +BeamTopKParams BeamTopK::get_params() const { + BeamTopKParams params; + params.layer_guid = this->layer_guid; + params.sorted = this->sorted; + params.max_beam_width = this->max_beam_width; + return params; +} + +bool BeamTopKParams::is_valid(ParallelTensorShape const &) const { + // topk is always valid + return true; +} + +bool operator==(BeamTopKParams const &lhs, BeamTopKParams const &rhs) { + return lhs.layer_guid == rhs.layer_guid && lhs.sorted == rhs.sorted && + lhs.max_beam_width == rhs.max_beam_width; +} + +BeamTopK::BeamTopK(FFModel &model, + const ParallelTensor _input, + LayerID const &_layer_guid, + int _max_beam_width, + bool _sorted, + char const *name) + : Op(model, + OP_BEAM_TOPK, + _input->data_type, + name, + 1 /*inputs*/, + 0 /*weights*/, + 3 /*outputs*/, + _input) { + sorted = _sorted; + max_beam_width = _max_beam_width; + layer_guid = _layer_guid; + int numdim = inputs[0]->num_dims; + assert(inputs[0]->dims[0].degree == 1); + assert(inputs[0]->dims[0].parallel_idx == -1); + // outputs[0] = model.create_parallel_tensor_legion_ordering( + // numdim, dims, _input->data_type, this, 0 /*owner_idx*/); + outputs[0] = model.create_parallel_tensor_legion_ordering( + numdim, inputs[0]->dims, DT_INT32, this, 0 /*owner_idx*/); + outputs[1] = model.create_parallel_tensor_legion_ordering( + numdim, inputs[0]->dims, DT_FLOAT, this, 1 /*owner_idx*/); + outputs[2] = model.create_parallel_tensor_legion_ordering( + numdim, inputs[0]->dims, DT_INT32, this, 2 /*owner_idx*/); +} + +BeamTopK::BeamTopK(FFModel &model, + BeamTopK const &other, + const ParallelTensor input) + : BeamTopK(model, + input, + other.layer_guid, + other.max_beam_width, + other.sorted, + other.name) {} + +BeamTopK::BeamTopK(FFModel &model, + BeamTopKParams const ¶ms, + const ParallelTensor input, + char const *name) + : BeamTopK(model, + input, + params.layer_guid, + params.max_beam_width, + params.sorted, + name) {} + +void BeamTopK::init_inference(FFModel const 
&ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(BEAM_TOPK_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(BeamTopK)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[1]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[1]->region)); + launcher.add_field(2, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[2]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[2]->region)); + launcher.add_field(3, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + +void BeamTopK::init(FFModel const &ff) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + set_argumentmap_for_init(ff, argmap); + IndexLauncher launcher(BEAM_TOPK_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(BeamTopK)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + outputs[0]->machine_view.hash()); + launcher.add_region_requirement(RegionRequirement(inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + outputs[0]->region)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement(RegionRequirement(outputs[1]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + outputs[1]->region)); + launcher.add_field(2, FID_DATA); + launcher.add_region_requirement(RegionRequirement(outputs[2]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + outputs[2]->region)); + launcher.add_field(3, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap(ff, fm); +} + +OpMeta *BeamTopK::init_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + BeamTopK *topk = (BeamTopK *)task->args; + FFHandler handle = *((FFHandler *)task->local_args); + Memory gpu_mem = Machine::MemoryQuery(Machine::get_machine()) + .only_kind(Memory::GPU_FB_MEM) + .best_affinity_to(task->target_proc) + .first(); + MemoryAllocator gpu_mem_allocator(gpu_mem); + BeamTopKMeta *m = new BeamTopKMeta(handle, topk, gpu_mem_allocator); + m->profiling = topk->profiling; + m->sorted = topk->sorted; + m->max_beam_width = topk->max_beam_width; + m->input_type[0] = topk->inputs[0]->data_type; 
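+  // Output types are fixed by beam_top_k: token ids (INT32), probabilities
+  // (FLOAT), and parent ids (INT32).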
+ return m; +} + +void BeamTopK::forward(FFModel const &ff) { + assert(false); +} + +FutureMap BeamTopK::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + + IndexLauncher launcher(BEAM_TOPK_INF_TASK_ID, + parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_future(bc); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[1]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[1]->region)); + launcher.add_field(2, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[2]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[2]->region)); + launcher.add_field(3, FID_DATA); + + return runtime->execute_index_space(ctx, launcher); +} + +BeamInferenceResult + BeamTopK::inference_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + + assert(regions.size() == 4); + assert(task->regions.size() == 4); + // BeamSearchBatchConfig const *bc = (BeamSearchBatchConfig *)task->args; + + BeamSearchBatchConfig const &bc = + Future(task->futures[0]).get_result(); + // std::cout << "beam search topk inference: " + // << "\n"; + if (bc.num_tokens == 0) { + BeamInferenceResult ir; + return ir; + } + + BeamTopKMeta const *m = *((BeamTopKMeta **)task->local_args); + Domain in1_domain = runtime->get_index_space_domain( + ctx, task->regions[0].region.get_index_space()); + // Domain out1_domain = runtime->get_index_space_domain( + // ctx, task->regions[1].region.get_index_space()); + Domain out2_domain = runtime->get_index_space_domain( + ctx, task->regions[1].region.get_index_space()); + int numdims = in1_domain.get_dim(); + + // float const *in_ptr = helperGetTensorPointerRO( + // regions[0], task->regions[0], FID_DATA, ctx, runtime); + GenericTensorAccessorR input = helperGetGenericTensorAccessorRO( + m->input_type[0], regions[0], task->regions[0], FID_DATA, ctx, runtime); + // float *value_ptr = helperGetTensorPointerWO( + // regions[1], task->regions[1], FID_DATA, ctx, runtime); + int *index_ptr = helperGetTensorPointerWO( + regions[1], task->regions[1], FID_DATA, ctx, runtime); + + // ); + float *value_ptr = helperGetTensorPointerWO( + regions[2], task->regions[2], FID_DATA, ctx, runtime); + + int *parent_ptr = helperGetTensorPointerWO( + regions[3], task->regions[3], FID_DATA, ctx, runtime); + // embedding size: eg. 
4096 + int length = in1_domain.hi()[0] - in1_domain.lo()[0] + 1; + + // int k = out2_domain.hi()[0] - out2_domain.lo()[0] + 1; + + // total token nums + // size_t tokens_per_request = in1_domain.hi()[1] - in1_domain.lo()[1] + 1; + // size_t batch_size = in1_domain.get_volume() / length; + size_t batch_size = bc.num_active_tokens(); + // std::vector beam_width; + // std::unordered_map sub_requests = bc->sub_requests; + // for (int i = 0; i < bc->MAX_NUM_REQUESTS; i++) { + // if (bc->request_completed[i]) { + // continue; + // } + // // add beam width for each main request + // beam_width.push_back(sub_requests[i]); + // std::cout << "sub req num: " <sorted); + + BeamInferenceResult ir; + + download_tensor(index_ptr, ir.token_ids, batch_size * m->max_beam_width); + download_tensor(value_ptr, ir.probs, batch_size * m->max_beam_width); + // if(m->output_type[0] == DT_FLOAT){ + // download_tensor(value.get_float_ptr(), ir.probs, batch_size * + // m->max_beam_width); + // }else if(m->output_type[0] == DT_HALF){ + // download_tensor(value.get_half_ptr(), ir.probs, batch_size * + // m->max_beam_width); + // } + download_tensor( + parent_ptr, ir.parent_id, batch_size * m->max_beam_width); + return ir; +} + +void BeamTopK::backward(FFModel const &ff) { + assert(false); +} + +void BeamTopK::serialize(Legion::Serializer &sez) const { + sez.serialize(this->layer_guid.id); + sez.serialize(this->layer_guid.transformer_layer_id); + sez.serialize(this->sorted); + sez.serialize(this->max_beam_width); +} + +Node BeamTopK::deserialize(FFModel &ff, + Legion::Deserializer &dez, + ParallelTensor inputs[], + int num_inputs) { + assert(num_inputs == 1); + bool sorted; + size_t id, transformer_layer_id; + int max_beam_width; + dez.deserialize(id); + dez.deserialize(transformer_layer_id); + LayerID layer_guid(id, transformer_layer_id); + dez.deserialize(sorted); + dez.deserialize(max_beam_width); + BeamTopKParams params; + params.layer_guid = layer_guid; + params.sorted = sorted; + params.max_beam_width = max_beam_width; + return ff.get_or_create_node(inputs[0], params); +} + +Op *BeamTopK::materialize(FFModel &ff, + ParallelTensor inputs[], + int num_inputs) const { + BeamTopKParams params = get_params(); + return new BeamTopK(ff, params, inputs[0], this->name); +} + +bool BeamTopK::measure_operator_cost(Simulator *sim, + MachineView const &mv, + CostMetrics &cost_metrics) const { + return false; +} + +}; // namespace FlexFlow + +namespace std { +size_t hash::operator()( + FlexFlow::BeamTopKParams const ¶ms) const { + size_t key = 0; + hash_combine(key, params.layer_guid.id); + hash_combine(key, params.sorted); + hash_combine(key, params.max_beam_width); + return key; +} +}; // namespace std diff --git a/src/ops/beam_topk.cpp b/src/ops/beam_topk.cpp new file mode 100644 index 0000000000..293feecff0 --- /dev/null +++ b/src/ops/beam_topk.cpp @@ -0,0 +1,705 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/ops/beam_topk.h" +#include "flexflow/ffconst_utils.h" +#include "flexflow/utils/hip_helper.h" +#include + +namespace FlexFlow { +// declare Legion names +using Legion::coord_t; + +enum class HeapType { kMinHeap, kMaxHeap }; +enum class PreferIndices { kLower, kHigher }; + +LegionRuntime::Logger::Category log_beam_topk("BeamTopK"); + +template +struct Entry { + int index; + T value; +}; + +template +struct LinearData { + typedef Entry Entry; + + __device__ Entry &operator[](std::size_t index) const { + return data[index]; + } + + __device__ int get_index(int i) const { + return data[i].index; + } + __device__ T get_value(int i) const { + return data[i].value; + } + + Entry *const data; +}; + +template +struct IndirectLinearData { + typedef Entry Entry; + + __device__ Entry &operator[](std::size_t index) const { + return data[index]; + } + + __device__ int get_index(int i) const { + return backing_data[data[i].index].index; + } + __device__ T get_value(int i) const { + return data[i].value; + } + + Entry *const data; + Entry *const backing_data; +}; + +template +struct StridedData { + typedef Entry Entry; + + __device__ Entry &operator[](std::size_t index) const { + return data[index * blockDim.x + threadIdx.x]; + } + + __device__ int get_index(int i) const { + return (*this)[i].index; + } + __device__ T get_value(int i) const { + return (*this)[i].value; + } + + Entry *const data; +}; + +// A heap of Entry that can either work as a min-heap or as a max-heap. +template + class Data, + typename T> +struct IndexedHeap { + typedef typename Data::Entry Entry; + Data const data; + __device__ IndexedHeap(Data const &d) : data(d) {} + + __device__ bool is_above(int left, int right) { + T left_value = data.get_value(left); + T right_value = data.get_value(right); + if (left_value == right_value) { + if (preferIndices == PreferIndices::kLower) { + return data.get_index(left) < data.get_index(right); + } else { + return data.get_index(left) > data.get_index(right); + } + } + if (heapType == HeapType::kMinHeap) { + return left_value < right_value; + } else { + return left_value > right_value; + } + } + + __device__ void assign(int i, Entry const &entry) { + data[i] = entry; + } + + __device__ void push_up(int i) { + int child = i; + int parent; + for (; child > 0; child = parent) { + parent = (child - 1) / 2; + if (!is_above(child, parent)) { + // Heap property satisfied. + break; + } + swap(child, parent); + } + } + + __device__ void swap(int a, int b) { + auto tmp = data[b]; + data[b] = data[a]; + data[a] = tmp; + } + + __device__ void push_root_down(int k) { + push_down(0, k); + } + + // MAX-HEAPIFY in Cormen + __device__ void push_down(int node, int k) { + while (true) { + int const left = 2 * node + 1; + int const right = left + 1; + int smallest = node; + if (left < k && is_above(left, smallest)) { + smallest = left; + } + if (right < k && is_above(right, smallest)) { + smallest = right; + } + if (smallest == node) { + break; + } + swap(smallest, node); + node = smallest; + } + } + + // BUILD-MAX-HEAPIFY in Cormen + __device__ void build(int k) { + for (int node = (k - 1) / 2; node >= 0; node--) { + push_down(node, k); + } + } + + // HEAP-EXTRACT-MAX in Cormen + __device__ void remove_root(int k) { + data[0] = data[k - 1]; + push_root_down(k - 1); + } + + // in-place HEAPSORT in Cormen + // This method destroys the heap property. 
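+  // As used by heapBeamTopK below (a min-heap of the current top-k), sort(k)
+  // leaves the k entries in descending value order, largest first.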
+ __device__ void sort(int k) { + for (int slot = k - 1; slot > 0; slot--) { + // This is like remove_root but we insert the element at the end. + swap(slot, 0); + // Heap is now an element smaller. + push_root_down(/*k=*/slot); + } + } + + __device__ void replace_root(Entry const &entry, int k) { + data[0] = entry; + push_root_down(k); + } + + __device__ Entry const &root() { + return data[0]; + } +}; + +template + class Data, + typename T> +__device__ IndexedHeap + make_indexed_heap(typename Data::Entry *data) { + return IndexedHeap{Data{data}}; +} + +// heapBeamTopK walks over [input, input+length) with `step_size` stride +// starting at `start_index`. It builds a top-`k` heap that is stored in +// `heap_entries` using `Accessor` to access elements in `heap_entries`. If +// sorted=true, the elements will be sorted at the end. +template class Data = LinearData> +__device__ void heapBeamTopK(T const *__restrict__ input, + int batch_index, + int length, + int k, + Entry *__restrict__ heap_entries, + bool sorted = false, + int start_index = 0, + int step_size = 1) { + assert(k <= length); + auto heap = + make_indexed_heap( + heap_entries); + + int heap_end_index = start_index + k * step_size; + if (heap_end_index > length) { + heap_end_index = length; + } + // Initialize the min-heap. + for (int index = start_index, slot = 0; index < heap_end_index; + index += step_size, slot++) { + heap.assign(slot, {index, input[index]}); + } + + heap.build(k); + + // Now iterate over the remaining items. + // If an item is smaller than the min element, it is not amongst the top k. + // Otherwise, replace the min element with it and push upwards. + for (int index = heap_end_index; index < length; index += step_size) { + // We prefer elements with lower indices. This is given here. + // Later elements automatically have higher indices, so can be discarded. + if (input[index] > heap.root().value) { + // This element should replace the min. + heap.replace_root({index, input[index]}, k); + } + } + + // Sort if wanted. + if (sorted) { + heap.sort(k); + } + + // if(batch_index == 0){ + // printf("top elemmments: %d, value %.15f\n", start_index, + // heap.root().value); + // } +} + +template +__device__ void mergeBeamShards(int num_shards, + int batch_index, + int k, + int max_heap_size, + int request_id, + int *parent_id, + T *probs, + Entry *__restrict__ entries, + Entry *__restrict__ top_k_heap, + float *top_k_values, + int *top_k_indices, + int *top_k_parents, + bool verbose) { + // If k < num_shards, we can use a min-heap with k elements to get the top k + // of the sorted blocks. + // If k > num_shards, we can initialize a min-heap with the top element from + // each sorted block. + int const heap_size = k < num_shards ? k : num_shards; + // printf("see value: %f", entries[0].value); + // Min-heap part. + + { + auto min_heap = IndexedHeap{IndirectLinearData{top_k_heap, entries}}; + // Initialize the heap as a min-heap. + for (int slot = 0; slot < heap_size; slot++) { + // int beam = (slot % max_heap_size) / k; + T prob = probs[request_id * BeamSearchBatchConfig::MAX_BEAM_WIDTH + + ((slot % max_heap_size) / k)]; + min_heap.assign(slot, {slot, (entries[slot].value * prob)}); + } + min_heap.build(heap_size); + + // Now perform top k with the remaining shards (if num_shards > heap_size). 
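+    // Candidate scores were already weighted by each sub-request's accumulated
+    // beam probability above; ties keep the entry with the lower original index.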
+ for (int shard = heap_size; shard < num_shards; shard++) { + auto const entry = entries[shard]; + auto const root = min_heap.root(); + + T prob = probs[request_id * BeamSearchBatchConfig::MAX_BEAM_WIDTH + + ((shard % max_heap_size) / k)]; + if (entry.value * prob < root.value) { + continue; + } + if (entry.value * prob == root.value && + entry.index > entries[root.index].index) { + continue; + } + // This element should replace the min. + min_heap.replace_root({shard, entry.value * prob}, heap_size); + } + } + + // Max-part. + { + // Turn the min-heap into a max-heap in-place. + auto max_heap = IndexedHeap{IndirectLinearData{top_k_heap, entries}}; + // Heapify into a max heap. + max_heap.build(heap_size); + + // Now extract the minimum k-1 times. + // k is treated specially. + int const last_k = k - 1; + for (int rank = 0; rank < last_k; rank++) { + Entry const &max_element = max_heap.root(); + top_k_values[rank] = __half2float(max_element.value); + int shard_index = max_element.index; + top_k_indices[rank] = entries[shard_index].index; + top_k_parents[rank] = + parent_id[request_id * BeamSearchBatchConfig::MAX_BEAM_WIDTH + + ((shard_index % max_heap_size) / k)]; + int next_shard_index = shard_index + num_shards; + + T prob = probs[request_id * BeamSearchBatchConfig::MAX_BEAM_WIDTH + + ((next_shard_index % max_heap_size) / k)]; + + max_heap.replace_root( + {next_shard_index, entries[next_shard_index].value * prob}, + heap_size); + } + + // rank == last_k. + Entry const &max_element = max_heap.root(); + top_k_values[last_k] = __half2float(max_element.value); + int shard_index = max_element.index; + top_k_indices[last_k] = entries[shard_index].index; + top_k_parents[last_k] = + parent_id[request_id * BeamSearchBatchConfig::MAX_BEAM_WIDTH + + ((shard_index % max_heap_size) / k)]; + } +} + +template +__global__ void + mergeSubRequestsKernel(int64_t N, T const *X, T const *rstd, T *Y) { + using T_ACC = T; + const int64_t i = blockIdx.x; + for (int64_t j = threadIdx.x; j < N; j += blockDim.x) { + const int64_t index = i * N + j; + Y[index] = static_cast(X[index]) * static_cast(rstd[i]); + } +} + +template +__global__ void beam_topk_forward_kernel(T const *__restrict__ input, + size_t shared_memory_size, + int length, + int k, + int max_heap_size, + int *parent_ids, + T *acc_probs, + int *gpu_block_start_index, + int *gpu_request_id, + int *tokens_per_request, + bool sorted, + float *__restrict__ output, + int *__restrict__ indices, + int *__restrict__ parents, + bool verbose) { + __shared__ char shared_memory[48 << 10]; + int const batch_index = blockIdx.x; + // T const *batch_input = input + batch_index * length; + int const thread_index = threadIdx.x; + int const thread_count = blockDim.x; + int const request_id = gpu_request_id[batch_index]; + int const token_nums = tokens_per_request[batch_index]; + Entry *shared_entries = (Entry *)shared_memory; + + int sub_request_id = thread_index / k; + // if (verbose) { + // printf("beam kernel: batch_index: %d, thread_index %d, sub_request_id %d, + // " + // "request_id %d, token_nums %d\n", + // batch_index, + // thread_index, + // sub_request_id, + // request_id, + // token_nums); + // } + + T const *batch_input = input + gpu_block_start_index[batch_index] + + (sub_request_id * token_nums * length); + + // printf("thread index %d, thread_count %d, batch_index %d\n", thread_index, + // thread_count, batch_index); + heapBeamTopK(batch_input, + batch_index, + length, + k, + shared_entries, + true, + thread_index % k, + k); + __syncthreads(); + // 
printf("beam thread index %d, thread_count %d, thread index %d, batch_index + // " + // "%d, k %d, parent_id %d, acc_prob: %f, sub id: %d, request_id: %d, + // offset: %d, offset2 %d, sub_request_id %d\n", thread_index, + // thread_count, + // thread_index, + // batch_index, + // k, + // parent_ids[request_id * BatchConfig::MAX_NUM_BEAMS + + // sub_request_id], acc_probs[request_id * BatchConfig::MAX_NUM_BEAMS + + // sub_request_id], sub_request_id, request_id, + // gpu_block_start_index[batch_index], + // batch_index * length, + // sub_request_id); + + if (thread_index == 0) { + // merge beam_width heaps and store the parent + // find which req it belongs to, replace the offset + // printf("merge heaps, batch index: %d, sub_request_id %d, value %f\n", + // batch_index, + // sub_request_id, + // acc_probs[request_id * BeamSearchBatchConfig::MAX_BEAM_WIDTH + + // sub_request_id]); + int const offset = batch_index * k; + auto batch_output = output + offset; + auto batch_indices = indices + offset; + auto batch_parents = parents + offset; + Entry *top_k_heap = shared_entries + thread_count * k; + + // if(batch_index == 0 && verbose) { + // for(int i = 0; i < 18; i++){ + // printf("see value: %.15f\n", shared_entries[i].value); + // } + // } + + // get parent/acc based on the sub request and main request + mergeBeamShards(thread_count, + batch_index, + k, + max_heap_size, + request_id, + parent_ids, + acc_probs, + shared_entries, + top_k_heap, + batch_output, + batch_indices, + batch_parents, + verbose /*verbose prints*/); + } +} + +/*static*/ +template +void BeamTopK::forward_kernel(BeamTopKMeta const *m, + BeamSearchBatchConfig const *bc, + DT const *input_ptr, + float *output_ptr, + int *indices_ptr, + int *parent_ptr, + int batch_size, + int length, + bool sorted, + hipStream_t stream) { + // Adopted from TensorFlow's BeamTopK implementation + // https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/topk_op_gpu.h + + int num_shards = 0; + int max_heap_size = 0; + int max_beam_width = 0; + int req_index = 0; + + // sub request + int const *sub_requests = bc->sub_requests; + + // std::vector beam_slots = bc->beam_slots; + // assert(bc->beam_slots.size() > 0); + + int beam_num_blocks = 0; + std::vector beam_block_start_index; + std::vector request_id; + std::vector tokens_per_request; + + int block_start_index = 0; + + // a data structure for prob, parent_id, + int max_total_requests = + BeamSearchBatchConfig::MAX_BEAM_WIDTH * bc->num_active_requests(); + int parent_ids[max_total_requests]; + DT acc_probs[max_total_requests]; + + for (int i = 0; i < bc->MAX_NUM_REQUESTS; i++) { + if (bc->request_completed[i]) { + continue; + } + assert(bc->beamRequestsInfo[i].beam_size > 0); + + // int num_new_tokens = bc->num_processing_tokens[i]; + int num_new_tokens = bc->requestsInfo[i].num_tokens_in_batch; + + // get beam size; + int beam_size = bc->beamRequestsInfo[i].beam_size; + + // initial request + log_beam_topk.debug() << "sub_requests: " << i << ", " << sub_requests[i] + << "\n"; + assert(sub_requests[i] > 0); + // process sub requests + for (int j = 0; j < sub_requests[i]; j++) { + parent_ids[req_index * BeamSearchBatchConfig::MAX_BEAM_WIDTH + j] = j; + // beam_slots[i].parent_id[j]; + acc_probs[req_index * BeamSearchBatchConfig::MAX_BEAM_WIDTH + j] = + bc->beamRequestsInfo[i].probs[j]; + log_beam_topk.debug() + << "probbbb req: " << i + << ", sub req probability : " << bc->beamRequestsInfo[i].probs[j] + << ", sub request id " << j << ", parent id " + << 
bc->beamRequestsInfo[i].parent_id[j] << ", data inddd" + << req_index * BeamSearchBatchConfig::MAX_BEAM_WIDTH + j << "\n"; + } + + // process tokens + for (int k = 0; k < num_new_tokens; k++) { + beam_block_start_index.push_back(block_start_index); + request_id.push_back(i); + tokens_per_request.push_back(num_new_tokens); + block_start_index += length; + beam_num_blocks++; + } + + max_heap_size = std::max(max_heap_size, beam_size * sub_requests[i]); + max_beam_width = std::max(max_beam_width, beam_size); + req_index += 1; + block_start_index += (sub_requests[i] - 1) * num_new_tokens * length; + } + log_beam_topk.debug() << "what index: " << block_start_index + << ", block num: " << beam_num_blocks << "\n"; + + assert(batch_size >= beam_num_blocks); + assert(bc->num_active_requests() == req_index); + + { + constexpr auto shared_memory_size = 48 << 10; + auto const heap_size = max_heap_size * sizeof(Entry
); + // shared_memory_size = (num_shards + 1) * heap_size <=> + num_shards = shared_memory_size / heap_size - 1; + assert(num_shards > 0); + if (num_shards > CUDA_NUM_THREADS) { + num_shards = CUDA_NUM_THREADS; + } + log_beam_topk.debug() << "maxheap size: " << max_heap_size << "\n"; + log_beam_topk.debug() << "maxbeam width: " << max_beam_width + << ", heap size: " << heap_size << "\n"; + } + // We are limited by the amount of shared memory we have per block. + size_t shared_memory_size = + (num_shards + 1) * max_heap_size * sizeof(Entry
); + + assert(num_shards >= (size_t)max_heap_size); + num_shards = max_heap_size; + + checkCUDA(hipMemcpy(m->parent_ids, + parent_ids, + sizeof(int) * max_total_requests, + hipMemcpyHostToDevice)); + checkCUDA(hipMemcpy(m->acc_probs, + acc_probs, + sizeof(DT) * max_total_requests, + hipMemcpyHostToDevice)); + checkCUDA(hipMemcpy(m->block_start_index, + beam_block_start_index.data(), + sizeof(int) * beam_num_blocks, + hipMemcpyHostToDevice)); + checkCUDA(hipMemcpy(m->request_id, + request_id.data(), + sizeof(int) * beam_num_blocks, + hipMemcpyHostToDevice)); + checkCUDA(hipMemcpy(m->tokens_per_request, + tokens_per_request.data(), + sizeof(int) * beam_num_blocks, + hipMemcpyHostToDevice)); + // int depth = + // bc->beamRequestsInfo[bc->tokensInfo[0].request_index].current_depth; + beam_topk_forward_kernel<<>>( + input_ptr, + shared_memory_size, + length, + max_beam_width, + max_heap_size, + m->parent_ids, + static_cast
(m->acc_probs), + m->block_start_index, + m->request_id, + m->tokens_per_request, + sorted, + output_ptr, + indices_ptr, + parent_ptr, + false /*verbose*/ // depth == 1 + ); + + // merge sub +} + +/*static*/ +void BeamTopK::forward_kernel_wrapper(BeamTopKMeta const *m, + BeamSearchBatchConfig const *bc, + GenericTensorAccessorR const &input, + float *output_ptr, + int *indices_ptr, + int *parent_ptr, + int batch_size, + int length, + bool sorted) { + hipStream_t stream; + checkCUDA(get_legion_stream(&stream)); + + hipEvent_t t_start, t_end; + if (m->profiling) { + hipEventCreate(&t_start); + hipEventCreate(&t_end); + hipEventRecord(t_start, stream); + } + + if (input.data_type == DT_HALF) { + BeamTopK::forward_kernel(m, + bc, + input.get_half_ptr(), + output_ptr, + indices_ptr, + parent_ptr, + batch_size, + length, + sorted, + stream); + } else if (input.data_type == DT_FLOAT) { + BeamTopK::forward_kernel(m, + bc, + input.get_float_ptr(), + output_ptr, + indices_ptr, + parent_ptr, + batch_size, + length, + sorted, + stream); + } + + if (m->profiling) { + hipEventRecord(t_end, stream); + checkCUDA(hipEventSynchronize(t_end)); + float elapsed = 0; + checkCUDA(hipEventElapsedTime(&elapsed, t_start, t_end)); + hipEventDestroy(t_start); + hipEventDestroy(t_end); + printf("[BeamTopK] forward time = %.2lfms\n", elapsed); + } +} + +BeamTopKMeta::BeamTopKMeta(FFHandler handler, + Op const *op, + MemoryAllocator &gpu_mem_allocator) + : OpMeta(handler) { + DataType data_type = op->inputs[0]->data_type; + checkCUDA(hipMalloc(&parent_ids, + sizeof(int) * BeamSearchBatchConfig::MAX_BEAM_WIDTH * + BeamSearchBatchConfig::MAX_NUM_REQUESTS)); + checkCUDA(hipMalloc(&acc_probs, + sizeof(data_type_size(data_type)) * + BeamSearchBatchConfig::MAX_BEAM_WIDTH * + BeamSearchBatchConfig::MAX_NUM_REQUESTS)); + checkCUDA(hipMalloc(&block_start_index, + sizeof(int) * BeamSearchBatchConfig::MAX_NUM_TOKENS * + BeamSearchBatchConfig::MAX_NUM_REQUESTS)); + checkCUDA(hipMalloc(&request_id, + sizeof(int) * BeamSearchBatchConfig::MAX_NUM_TOKENS * + BeamSearchBatchConfig::MAX_NUM_REQUESTS)); + checkCUDA(hipMalloc(&tokens_per_request, + sizeof(int) * BeamSearchBatchConfig::MAX_NUM_TOKENS * + BeamSearchBatchConfig::MAX_NUM_REQUESTS)); +} + +BeamTopKMeta::~BeamTopKMeta(void) {} +}; // namespace FlexFlow diff --git a/src/ops/beam_topk.cu b/src/ops/beam_topk.cu new file mode 100644 index 0000000000..42fa7a5ab5 --- /dev/null +++ b/src/ops/beam_topk.cu @@ -0,0 +1,756 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/ffconst_utils.h" +#include "flexflow/ops/beam_topk.h" +#include "flexflow/utils/cuda_helper.h" + +namespace FlexFlow { +// declare Legion names +using Legion::coord_t; + +enum class HeapType { kMinHeap, kMaxHeap }; +enum class PreferIndices { kLower, kHigher }; + +LegionRuntime::Logger::Category log_beam_topk("BeamTopK"); + +template +struct Entry { + int index; + T value; +}; + +template +struct LinearData { + typedef Entry Entry; + + __device__ Entry &operator[](std::size_t index) const { + return data[index]; + } + + __device__ int get_index(int i) const { + return data[i].index; + } + __device__ T get_value(int i) const { + return data[i].value; + } + + Entry *const data; +}; + +template +struct IndirectLinearData { + typedef Entry Entry; + + __device__ Entry &operator[](std::size_t index) const { + return data[index]; + } + + __device__ int get_index(int i) const { + return backing_data[data[i].index].index; + } + __device__ T get_value(int i) const { + return data[i].value; + } + + Entry *const data; + Entry *const backing_data; +}; + +template +struct StridedData { + typedef Entry Entry; + + __device__ Entry &operator[](std::size_t index) const { + return data[index * blockDim.x + threadIdx.x]; + } + + __device__ int get_index(int i) const { + return (*this)[i].index; + } + __device__ T get_value(int i) const { + return (*this)[i].value; + } + + Entry *const data; +}; + +// A heap of Entry that can either work as a min-heap or as a max-heap. +template + class Data, + typename T> +struct IndexedHeap { + typedef typename Data::Entry Entry; + Data const data; + __device__ IndexedHeap(Data const &d) : data(d) {} + + __device__ bool is_above(int left, int right) { + T left_value = data.get_value(left); + T right_value = data.get_value(right); + if (left_value == right_value) { + if (preferIndices == PreferIndices::kLower) { + return data.get_index(left) < data.get_index(right); + } else { + return data.get_index(left) > data.get_index(right); + } + } + if (heapType == HeapType::kMinHeap) { + return left_value < right_value; + } else { + return left_value > right_value; + } + } + + __device__ void assign(int i, Entry const &entry) { + data[i] = entry; + } + + __device__ void push_up(int i) { + int child = i; + int parent; + for (; child > 0; child = parent) { + parent = (child - 1) / 2; + if (!is_above(child, parent)) { + // Heap property satisfied. + break; + } + swap(child, parent); + } + } + + __device__ void swap(int a, int b) { + auto tmp = data[b]; + data[b] = data[a]; + data[a] = tmp; + } + + __device__ void push_root_down(int k) { + push_down(0, k); + } + + // MAX-HEAPIFY in Cormen + __device__ void push_down(int node, int k) { + while (true) { + int const left = 2 * node + 1; + int const right = left + 1; + int smallest = node; + if (left < k && is_above(left, smallest)) { + smallest = left; + } + if (right < k && is_above(right, smallest)) { + smallest = right; + } + if (smallest == node) { + break; + } + swap(smallest, node); + node = smallest; + } + } + + // BUILD-MAX-HEAPIFY in Cormen + __device__ void build(int k) { + for (int node = (k - 1) / 2; node >= 0; node--) { + push_down(node, k); + } + } + + // HEAP-EXTRACT-MAX in Cormen + __device__ void remove_root(int k) { + data[0] = data[k - 1]; + push_root_down(k - 1); + } + + // in-place HEAPSORT in Cormen + // This method destroys the heap property. + __device__ void sort(int k) { + for (int slot = k - 1; slot > 0; slot--) { + // This is like remove_root but we insert the element at the end. 
+ swap(slot, 0); + // Heap is now an element smaller. + push_root_down(/*k=*/slot); + } + } + + __device__ void replace_root(Entry const &entry, int k) { + data[0] = entry; + push_root_down(k); + } + + __device__ Entry const &root() { + return data[0]; + } +}; + +template + class Data, + typename T> +__device__ IndexedHeap + make_indexed_heap(typename Data::Entry *data) { + return IndexedHeap{Data{data}}; +} + +// heapBeamTopK walks over [input, input+length) with `step_size` stride +// starting at `start_index`. It builds a top-`k` heap that is stored in +// `heap_entries` using `Accessor` to access elements in `heap_entries`. If +// sorted=true, the elements will be sorted at the end. +template class Data = LinearData> +__device__ void heapBeamTopK(T const *__restrict__ input, + int batch_index, + int length, + int k, + Entry *__restrict__ heap_entries, + bool sorted = false, + int start_index = 0, + int step_size = 1) { + assert(k <= length); + auto heap = + make_indexed_heap( + heap_entries); + + int heap_end_index = start_index + k * step_size; + if (heap_end_index > length) { + heap_end_index = length; + } + // Initialize the min-heap. + for (int index = start_index, slot = 0; index < heap_end_index; + index += step_size, slot++) { + heap.assign(slot, {index, input[index]}); + } + + heap.build(k); + + // Now iterate over the remaining items. + // If an item is smaller than the min element, it is not amongst the top k. + // Otherwise, replace the min element with it and push upwards. + for (int index = heap_end_index; index < length; index += step_size) { + // We prefer elements with lower indices. This is given here. + // Later elements automatically have higher indices, so can be discarded. + if (input[index] > heap.root().value) { + // This element should replace the min. + heap.replace_root({index, input[index]}, k); + } + } + + // Sort if wanted. + if (sorted) { + heap.sort(k); + } + + // if(batch_index == 0){ + // printf("top elemmments: %d, value %.15f\n", start_index, + // heap.root().value); + // } +} + +template +__device__ void mergeBeamShards(int num_shards, + int batch_index, + int k, + int max_heap_size, + int request_id, + int *parent_id, + T *probs, + Entry *__restrict__ entries, + Entry *__restrict__ top_k_heap, + float *top_k_values, + int *top_k_indices, + int *top_k_parents, + bool verbose) { + // If k < num_shards, we can use a min-heap with k elements to get the top k + // of the sorted blocks. + // If k > num_shards, we can initialize a min-heap with the top element from + // each sorted block. + int const heap_size = k < num_shards ? k : num_shards; + // printf("see value: %f", entries[0].value); + // Min-heap part. + + { + auto min_heap = IndexedHeap{IndirectLinearData{top_k_heap, entries}}; + // Initialize the heap as a min-heap. + for (int slot = 0; slot < heap_size; slot++) { + // int beam = (slot % max_heap_size) / k; + T prob = probs[request_id * BeamSearchBatchConfig::MAX_BEAM_WIDTH + + ((slot % max_heap_size) / k)]; + min_heap.assign(slot, {slot, (entries[slot].value * prob)}); + if (verbose && batch_index == 0) { + printf("slot %d, value %.15f, prob %15f\n", + slot, + static_cast(entries[slot].value), + static_cast(prob)); + } + } + min_heap.build(heap_size); + + // Now perform top k with the remaining shards (if num_shards > heap_size). 
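Before the remaining-shard loop below, note that the scores being compared are not the raw shard values: each entry is multiplied by the probability of the beam (sub-request) it belongs to, recovered through the `(slot % max_heap_size) / k` mapping used above. The sketch below isolates that rescaling on the host; the `Candidate` struct and `rescaled_score` helper are illustrative names, not FlexFlow APIs.

```cpp
// Host-side sketch of the score rescaling that mergeBeamShards applies before
// comparing candidates across beams: each shard's raw value is multiplied by
// the probability of the beam (sub-request) that produced it. `beam` mirrors
// the (slot % max_heap_size) / k indexing in the kernel.
#include <cstdio>
#include <vector>

struct Candidate {
  int shard;        // which per-thread heap produced the entry
  int token_index;  // index into the logit vector
  float raw_value;  // value stored in the shard's Entry
};

float rescaled_score(Candidate const &c,
                     std::vector<float> const &beam_probs,
                     int max_heap_size,
                     int k) {
  int beam = (c.shard % max_heap_size) / k;  // same mapping as the kernel
  return c.raw_value * beam_probs[beam];
}

int main() {
  // Two beams with different accumulated probabilities.
  std::vector<float> beam_probs = {0.8f, 0.2f};
  std::vector<Candidate> candidates = {
      {0, 17, 0.50f},  // shard 0 -> beam 0 (with max_heap_size=4, k=2)
      {2, 42, 0.90f},  // shard 2 -> beam 1
  };
  for (auto const &c : candidates) {
    // The higher raw value from the low-probability beam loses after rescaling.
    printf("token %d score %.3f\n",
           c.token_index,
           rescaled_score(c, beam_probs, /*max_heap_size=*/4, /*k=*/2));
  }
  return 0;
}
```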
+ for (int shard = heap_size; shard < num_shards; shard++) { + auto const entry = entries[shard]; + auto const root = min_heap.root(); + + T prob = probs[request_id * BeamSearchBatchConfig::MAX_BEAM_WIDTH + + ((shard % max_heap_size) / k)]; + if (verbose && batch_index == 0) { + printf("shard %d, index %d, value %.15f, prob %.15f\n", + shard, + entry.index, + static_cast(entry.value), + static_cast(prob)); + } + if (entry.value * prob < root.value) { + continue; + } + if (entry.value * prob == root.value && + entry.index > entries[root.index].index) { + continue; + } + // This element should replace the min. + min_heap.replace_root({shard, entry.value * prob}, heap_size); + } + } + + // Max-part. + { + // Turn the min-heap into a max-heap in-place. + auto max_heap = IndexedHeap{IndirectLinearData{top_k_heap, entries}}; + // Heapify into a max heap. + max_heap.build(heap_size); + + // Now extract the minimum k-1 times. + // k is treated specially. + int const last_k = k - 1; + for (int rank = 0; rank < last_k; rank++) { + Entry const &max_element = max_heap.root(); + top_k_values[rank] = __half2float(max_element.value); + int shard_index = max_element.index; + top_k_indices[rank] = entries[shard_index].index; + top_k_parents[rank] = + parent_id[request_id * BeamSearchBatchConfig::MAX_BEAM_WIDTH + + ((shard_index % max_heap_size) / k)]; + int next_shard_index = shard_index + num_shards; + + T prob = probs[request_id * BeamSearchBatchConfig::MAX_BEAM_WIDTH + + ((next_shard_index % max_heap_size) / k)]; + // if (batch_index == 0) { + // printf("next_shard_index %d, value %.15f, prob %.15f\n", + // next_shard_index, + // entries[next_shard_index].value, + // prob); + // } + max_heap.replace_root( + {next_shard_index, entries[next_shard_index].value * prob}, + heap_size); + } + + // rank == last_k. 
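The final extraction that follows relies on a rank-major layout, as the `shard_index + num_shards` stepping suggests: entry `r * num_shards + s` holds the r-th best candidate of shard `s`, so the next candidate from the same shard is simply `num_shards` positions away. Below is a minimal host-side sketch of that merge, using `std::make_heap` instead of `IndexedHeap` and omitting the probability rescaling and parent bookkeeping; the names are illustrative.

```cpp
// Host-side sketch of the final extraction in mergeBeamShards: per-shard
// sorted results are stored rank-major (entry r*num_shards + s is the r-th
// best of shard s), so after emitting a shard's current best we refill the
// heap with entries[index + num_shards], i.e. that shard's next-best entry.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Entry {
  int index;    // position in the rank-major buffer
  float value;  // already-rescaled score
};

std::vector<Entry> mergeSortedShards(std::vector<Entry> const &entries,
                                     int num_shards,
                                     int k) {
  // Seed a max-heap with each shard's best (rank 0) entry.
  auto cmp = [](Entry const &a, Entry const &b) { return a.value < b.value; };
  std::vector<Entry> heap(entries.begin(), entries.begin() + num_shards);
  std::make_heap(heap.begin(), heap.end(), cmp);

  std::vector<Entry> top_k;
  for (int rank = 0; rank < k; rank++) {
    std::pop_heap(heap.begin(), heap.end(), cmp);
    Entry best = heap.back();
    heap.pop_back();
    top_k.push_back(best);
    // Advance within the same shard: next rank lives num_shards further on.
    int next = best.index + num_shards;
    if (next < (int)entries.size()) {
      heap.push_back(entries[next]);
      std::push_heap(heap.begin(), heap.end(), cmp);
    }
  }
  return top_k;
}

int main() {
  // Two shards, two entries each, rank-major: [s0r0, s1r0, s0r1, s1r1].
  std::vector<Entry> entries = {{0, 0.9f}, {1, 0.8f}, {2, 0.7f}, {3, 0.1f}};
  for (auto const &e : mergeSortedShards(entries, /*num_shards=*/2, /*k=*/3)) {
    printf("index %d value %.2f\n", e.index, e.value);
  }
  return 0;
}
```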
+ Entry const &max_element = max_heap.root(); + top_k_values[last_k] = __half2float(max_element.value); + int shard_index = max_element.index; + top_k_indices[last_k] = entries[shard_index].index; + top_k_parents[last_k] = + parent_id[request_id * BeamSearchBatchConfig::MAX_BEAM_WIDTH + + ((shard_index % max_heap_size) / k)]; + } +} + +template +__global__ void + mergeSubRequestsKernel(int64_t N, T const *X, T const *rstd, T *Y) { + using T_ACC = T; + const int64_t i = blockIdx.x; + for (int64_t j = threadIdx.x; j < N; j += blockDim.x) { + const int64_t index = i * N + j; + Y[index] = static_cast(X[index]) * static_cast(rstd[i]); + } +} + +template +__global__ void beam_topk_forward_kernel(T const *__restrict__ input, + size_t shared_memory_size, + int length, + int k, + int max_heap_size, + int *parent_ids, + T *acc_probs, + int *gpu_block_start_index, + int *gpu_request_id, + int *tokens_per_request, + bool sorted, + float *__restrict__ output, + int *__restrict__ indices, + int *__restrict__ parents, + bool verbose) { + __shared__ char shared_memory[48 << 10]; + int const batch_index = blockIdx.x; + // T const *batch_input = input + batch_index * length; + int const thread_index = threadIdx.x; + int const thread_count = blockDim.x; + int const request_id = gpu_request_id[batch_index]; + int const token_nums = tokens_per_request[batch_index]; + Entry *shared_entries = (Entry *)shared_memory; + + int sub_request_id = thread_index / k; + // if (verbose) { + // printf("beam kernel: batch_index: %d, thread_index %d, sub_request_id %d, + // " + // "request_id %d, token_nums %d\n", + // batch_index, + // thread_index, + // sub_request_id, + // request_id, + // token_nums); + // } + + T const *batch_input = input + gpu_block_start_index[batch_index] + + (sub_request_id * token_nums * length); + + if (verbose && batch_index == 0) { + printf("request 0 start index: thread index %d, offset %d, batch_input %p, " + "acc index %d acc " + "prob %f, thread_count %d, request_id %d\n", + thread_index, + gpu_block_start_index[batch_index] + + (sub_request_id * token_nums * length), + batch_input, + request_id * BeamSearchBatchConfig::MAX_BEAM_WIDTH + sub_request_id, + static_cast( + acc_probs[request_id * BeamSearchBatchConfig::MAX_BEAM_WIDTH + + sub_request_id]), + thread_count, + request_id); + } + // printf("thread index %d, thread_count %d, batch_index %d\n", thread_index, + // thread_count, batch_index); + heapBeamTopK(batch_input, + batch_index, + length, + k, + shared_entries, + true, + thread_index % k, + k); + __syncthreads(); + // printf("beam thread index %d, thread_count %d, thread index %d, batch_index + // " + // "%d, k %d, parent_id %d, acc_prob: %f, sub id: %d, request_id: %d, + // offset: %d, offset2 %d, sub_request_id %d\n", thread_index, + // thread_count, + // thread_index, + // batch_index, + // k, + // parent_ids[request_id * BatchConfig::MAX_NUM_BEAMS + + // sub_request_id], acc_probs[request_id * BatchConfig::MAX_NUM_BEAMS + + // sub_request_id], sub_request_id, request_id, + // gpu_block_start_index[batch_index], + // batch_index * length, + // sub_request_id); + + if (thread_index == 0) { + // merge beam_width heaps and store the parent + // find which req it belongs to, replace the offset + // printf("merge heaps, batch index: %d, sub_request_id %d, value %f\n", + // batch_index, + // sub_request_id, + // acc_probs[request_id * BeamSearchBatchConfig::MAX_BEAM_WIDTH + + // sub_request_id]); + int const offset = batch_index * k; + auto batch_output = output + offset; + auto 
batch_indices = indices + offset; + auto batch_parents = parents + offset; + Entry *top_k_heap = shared_entries + thread_count * k; + + // if(batch_index == 0 && verbose) { + // for(int i = 0; i < 18; i++){ + // printf("see value: %.15f\n", shared_entries[i].value); + // } + // } + + // get parent/acc based on the sub request and main request + mergeBeamShards(thread_count, + batch_index, + k, + max_heap_size, + request_id, + parent_ids, + acc_probs, + shared_entries, + top_k_heap, + batch_output, + batch_indices, + batch_parents, + verbose /*verbose prints*/); + } +} + +/*static*/ +template +void BeamTopK::forward_kernel(BeamTopKMeta const *m, + BeamSearchBatchConfig const *bc, + DT const *input_ptr, + float *output_ptr, + int *indices_ptr, + int *parent_ptr, + int batch_size, + int length, + bool sorted, + cudaStream_t stream) { + // Adopted from TensorFlow's BeamTopK implementation + // https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/topk_op_gpu.h + + int num_shards = 0; + int max_heap_size = 0; + int max_beam_width = 0; + int req_index = 0; + + // sub request + int const *sub_requests = bc->sub_requests; + + // std::vector beam_slots = bc->beam_slots; + // assert(bc->beam_slots.size() > 0); + + int beam_num_blocks = 0; + std::vector beam_block_start_index; + std::vector request_id; + std::vector tokens_per_request; + + int block_start_index = 0; + + // a data structure for prob, parent_id, + int max_total_requests = + BeamSearchBatchConfig::MAX_BEAM_WIDTH * bc->num_active_requests(); + int parent_ids[max_total_requests]; + DT acc_probs[max_total_requests]; + + for (int i = 0; i < bc->MAX_NUM_REQUESTS; i++) { + if (bc->request_completed[i]) { + continue; + } + assert(bc->beamRequestsInfo[i].beam_size > 0); + + // int num_new_tokens = bc->num_processing_tokens[i]; + int num_new_tokens = bc->requestsInfo[i].num_tokens_in_batch; + + // get beam size; + int beam_size = bc->beamRequestsInfo[i].beam_size; + + // initial request + log_beam_topk.debug() << "sub_requests: " << i << ", " << sub_requests[i] + << "\n"; + assert(sub_requests[i] > 0); + // process sub requests + for (int j = 0; j < sub_requests[i]; j++) { + parent_ids[req_index * BeamSearchBatchConfig::MAX_BEAM_WIDTH + j] = j; + // beam_slots[i].parent_id[j]; + acc_probs[req_index * BeamSearchBatchConfig::MAX_BEAM_WIDTH + j] = + bc->beamRequestsInfo[i].probs[j]; + log_beam_topk.debug() + << "probbbb req: " << i + << ", sub req probability : " << bc->beamRequestsInfo[i].probs[j] + << ", sub request id " << j << ", parent id " + << bc->beamRequestsInfo[i].parent_id[j] << ", data inddd" + << req_index * BeamSearchBatchConfig::MAX_BEAM_WIDTH + j << "\n"; + } + + // process tokens + for (int k = 0; k < num_new_tokens; k++) { + beam_block_start_index.push_back(block_start_index); + request_id.push_back(i); + tokens_per_request.push_back(num_new_tokens); + block_start_index += length; + beam_num_blocks++; + } + + max_heap_size = std::max(max_heap_size, beam_size * sub_requests[i]); + max_beam_width = std::max(max_beam_width, beam_size); + req_index += 1; + block_start_index += (sub_requests[i] - 1) * num_new_tokens * length; + } + log_beam_topk.debug() << "what index: " << block_start_index + << ", block num: " << beam_num_blocks << "\n"; + + assert(batch_size >= beam_num_blocks); + assert(bc->num_active_requests() == req_index); + + { + constexpr auto shared_memory_size = 48 << 10; + auto const heap_size = max_heap_size * sizeof(Entry
); + // shared_memory_size = (num_shards + 1) * heap_size <=> + num_shards = shared_memory_size / heap_size - 1; + assert(num_shards > 0); + if (num_shards > CUDA_NUM_THREADS) { + num_shards = CUDA_NUM_THREADS; + } + log_beam_topk.debug() << "maxheap size: " << max_heap_size << "\n"; + log_beam_topk.debug() << "maxbeam width: " << max_beam_width + << ", heap size: " << heap_size << "\n"; + } + // We are limited by the amount of shared memory we have per block. + size_t shared_memory_size = + (num_shards + 1) * max_heap_size * sizeof(Entry
); + + assert(num_shards >= (size_t)max_heap_size); + num_shards = max_heap_size; + + checkCUDA(cudaMemcpy(m->parent_ids, + parent_ids, + sizeof(int) * max_total_requests, + cudaMemcpyHostToDevice)); + checkCUDA(cudaMemcpy(m->acc_probs, + acc_probs, + sizeof(DT) * max_total_requests, + cudaMemcpyHostToDevice)); + checkCUDA(cudaMemcpy(m->block_start_index, + beam_block_start_index.data(), + sizeof(int) * beam_num_blocks, + cudaMemcpyHostToDevice)); + checkCUDA(cudaMemcpy(m->request_id, + request_id.data(), + sizeof(int) * beam_num_blocks, + cudaMemcpyHostToDevice)); + checkCUDA(cudaMemcpy(m->tokens_per_request, + tokens_per_request.data(), + sizeof(int) * beam_num_blocks, + cudaMemcpyHostToDevice)); + // int depth = + // bc->beamRequestsInfo[bc->tokensInfo[0].request_index].current_depth; + beam_topk_forward_kernel<<>>( + input_ptr, + shared_memory_size, + length, + max_beam_width, + max_heap_size, + m->parent_ids, + static_cast
(m->acc_probs), + m->block_start_index, + m->request_id, + m->tokens_per_request, + sorted, + output_ptr, + indices_ptr, + parent_ptr, + false /*verbose*/ // depth == 1 + ); + + // merge sub +} + +/*static*/ +void BeamTopK::forward_kernel_wrapper(BeamTopKMeta const *m, + BeamSearchBatchConfig const *bc, + GenericTensorAccessorR const &input, + float *output_ptr, + int *indices_ptr, + int *parent_ptr, + int batch_size, + int length, + bool sorted) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + + cudaEvent_t t_start, t_end; + if (m->profiling) { + cudaEventCreate(&t_start); + cudaEventCreate(&t_end); + cudaEventRecord(t_start, stream); + } + + if (input.data_type == DT_HALF) { + BeamTopK::forward_kernel(m, + bc, + input.get_half_ptr(), + output_ptr, + indices_ptr, + parent_ptr, + batch_size, + length, + sorted, + stream); + } else if (input.data_type == DT_FLOAT) { + BeamTopK::forward_kernel(m, + bc, + input.get_float_ptr(), + output_ptr, + indices_ptr, + parent_ptr, + batch_size, + length, + sorted, + stream); + } + + if (m->profiling) { + cudaEventRecord(t_end, stream); + checkCUDA(cudaEventSynchronize(t_end)); + float elapsed = 0; + checkCUDA(cudaEventElapsedTime(&elapsed, t_start, t_end)); + cudaEventDestroy(t_start); + cudaEventDestroy(t_end); + printf("[BeamTopK] forward time = %.2lfms\n", elapsed); + } +} + +BeamTopKMeta::BeamTopKMeta(FFHandler handler, + Op const *op, + MemoryAllocator &gpu_mem_allocator) + : OpMeta(handler) { + DataType data_type = op->inputs[0]->data_type; + size_t parent_id_size = BeamSearchBatchConfig::MAX_BEAM_WIDTH * + BeamSearchBatchConfig::MAX_NUM_REQUESTS; + size_t acc_probs_size = BeamSearchBatchConfig::MAX_BEAM_WIDTH * + BeamSearchBatchConfig::MAX_NUM_REQUESTS; + size_t block_start_index_size = BeamSearchBatchConfig::MAX_NUM_TOKENS * + BeamSearchBatchConfig::MAX_NUM_REQUESTS; + size_t request_id_size = BeamSearchBatchConfig::MAX_NUM_TOKENS * + BeamSearchBatchConfig::MAX_NUM_REQUESTS; + size_t tokens_per_request_size = BeamSearchBatchConfig::MAX_NUM_TOKENS * + BeamSearchBatchConfig::MAX_NUM_REQUESTS; + size_t totalSize = sizeof(int) * parent_id_size + + data_type_size(data_type) * acc_probs_size + + sizeof(int) * block_start_index_size + + sizeof(int) * request_id_size + + sizeof(int) * tokens_per_request_size; + + gpu_mem_allocator.create_legion_instance(reserveInst, totalSize); + parent_ids = gpu_mem_allocator.allocate_instance(parent_id_size); + if (data_type == DT_FLOAT) { + acc_probs = gpu_mem_allocator.allocate_instance(acc_probs_size); + } else if (data_type == DT_HALF) { + acc_probs = gpu_mem_allocator.allocate_instance(acc_probs_size); + } else { + assert(false); + } + + block_start_index = + gpu_mem_allocator.allocate_instance(block_start_index_size); + request_id = gpu_mem_allocator.allocate_instance(request_id_size); + tokens_per_request = + gpu_mem_allocator.allocate_instance(tokens_per_request_size); +} + +BeamTopKMeta::~BeamTopKMeta(void) { + if (reserveInst != Realm::RegionInstance::NO_INST) { + reserveInst.destroy(); + } +} +}; // namespace FlexFlow diff --git a/src/ops/cast.cc b/src/ops/cast.cc index 25f8e168b1..d98a54fe62 100644 --- a/src/ops/cast.cc +++ b/src/ops/cast.cc @@ -146,6 +146,44 @@ void Cast::init(FFModel const &ff) { set_opmeta_from_futuremap(ff, fm); } +void Cast::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + 
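Stepping back to the `BeamTopKMeta` constructor in this CUDA file: unlike the HIP version earlier in the patch, which issues one `hipMalloc` per buffer, it sums the byte sizes of all per-request arrays, reserves a single Legion instance, and carves typed sub-buffers out of it. The sketch below shows that "one reservation, many typed views" idea with plain `malloc` standing in for `MemoryAllocator`; the `Arena` struct and the sizes used are illustrative, not FlexFlow code.

```cpp
// Minimal sketch of carving typed sub-buffers out of one reservation, the
// pattern BeamTopKMeta uses with MemoryAllocator::create_legion_instance /
// allocate_instance. Plain malloc stands in for the Legion instance here.
#include <cstddef>
#include <cstdio>
#include <cstdlib>

struct Arena {
  char *base = nullptr;
  size_t offset = 0;
  size_t capacity = 0;

  void reserve(size_t total_bytes) {
    base = static_cast<char *>(std::malloc(total_bytes));
    capacity = total_bytes;
    offset = 0;
  }

  template <typename T>
  T *allocate(size_t count) {
    T *ptr = reinterpret_cast<T *>(base + offset);
    offset += sizeof(T) * count;
    // A real allocator would also align `offset` and check `capacity`.
    return ptr;
  }
};

int main() {
  size_t const max_requests = 8, max_beam_width = 3, max_tokens = 64;
  size_t const parent_id_size = max_beam_width * max_requests;
  size_t const acc_probs_size = max_beam_width * max_requests;
  size_t const block_start_index_size = max_tokens * max_requests;

  Arena arena;
  arena.reserve(sizeof(int) * parent_id_size + sizeof(float) * acc_probs_size +
                sizeof(int) * block_start_index_size);

  int *parent_ids = arena.allocate<int>(parent_id_size);
  float *acc_probs = arena.allocate<float>(acc_probs_size);
  int *block_start_index = arena.allocate<int>(block_start_index_size);

  printf("arena uses %zu of %zu bytes\n", arena.offset, arena.capacity);
  std::free(arena.base);
  (void)parent_ids; (void)acc_probs; (void)block_start_index;
  return 0;
}
```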
ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + + IndexLauncher launcher(CAST_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(Cast)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(1, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + OpMeta *Cast::init_task(Task const *task, std::vector const ®ions, Context ctx, @@ -186,6 +224,42 @@ void Cast::forward(FFModel const &ff) { runtime->execute_index_space(ctx, launcher); } +FutureMap Cast::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + + IndexLauncher launcher(CAST_FWD_TASK_ID, + parallel_is, + TaskArgument(NULL, false), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + return runtime->execute_index_space(ctx, launcher); +} + template void Cast::forward_task_with_1_type(Task const *task, std::vector const ®ions, diff --git a/src/ops/conv_2d.cc b/src/ops/conv_2d.cc index 786c3427e9..ce7b6ebc01 100644 --- a/src/ops/conv_2d.cc +++ b/src/ops/conv_2d.cc @@ -1012,6 +1012,7 @@ bool Conv2D::estimate_sync_cost(Simulator *sim, void Conv2D::serialize(Legion::Serializer &sez) const { sez.serialize(this->layer_guid.id); + sez.serialize(this->layer_guid.transformer_layer_id); sez.serialize(this->out_channels); sez.serialize(this->kernel_h); sez.serialize(this->kernel_w); @@ -1036,9 +1037,10 @@ Node Conv2D::deserialize(FFModel &ff, padding_w, groups; bool use_bias; ActiMode activation; - size_t id; + size_t id, transformer_layer_id; dez.deserialize(id); - LayerID layer_guid(id); + dez.deserialize(transformer_layer_id); + LayerID layer_guid(id, transformer_layer_id); dez.deserialize(out_channels); dez.deserialize(kernel_h); dez.deserialize(kernel_w); diff --git a/src/ops/element_binary.cc b/src/ops/element_binary.cc index dfe60fcee8..2cd5ba100e 100644 --- a/src/ops/element_binary.cc +++ b/src/ops/element_binary.cc @@ -45,8 +45,11 @@ Tensor FFModel::binary(OperatorType op, assert(broadcastable(in1, in2)); if (in1->data_type < in2->data_type) { dtype = in2->data_type; - 
std::string str(name); - Tensor new_in1 = cast(in1, dtype, (str + "input1_pre_cast").c_str()); + std::string str; + if (name != nullptr) { + str = std::string(name) + "input1_pre_cast"; + } + Tensor new_in1 = cast(in1, dtype, str.c_str()); ele = new Layer(this, op, dtype, @@ -58,8 +61,11 @@ Tensor FFModel::binary(OperatorType op, in2); } else if (in1->data_type > in2->data_type) { dtype = in1->data_type; - std::string str(name); - Tensor new_in2 = cast(in2, dtype, (str + "input2_pre_cast").c_str()); + std::string str; + if (name != nullptr) { + str = std::string(name) + "input2_pre_cast"; + } + Tensor new_in2 = cast(in2, dtype, str.c_str()); ele = new Layer(this, op, dtype, @@ -97,8 +103,13 @@ Op *ElementBinary::create_operator_from_layer( long long value; layer->get_int_property("inplace_a", value); bool inplace_a = (bool)value; - return new ElementBinary( - model, layer->op_type, inputs[0], inputs[1], inplace_a, layer->name); + return new ElementBinary(model, + layer->layer_guid, + layer->op_type, + inputs[0], + inputs[1], + inplace_a, + layer->name); } Tensor FFModel::add(const Tensor in1, @@ -166,10 +177,12 @@ bool ElementBinaryParams::is_valid( bool operator==(ElementBinaryParams const &lhs, ElementBinaryParams const &rhs) { - return lhs.type == rhs.type; + return lhs.type == rhs.type && lhs.layer_guid == rhs.layer_guid && + lhs.inplace_a == rhs.inplace_a; } ElementBinary::ElementBinary(FFModel &model, + LayerID const &_layer_guid, OperatorType _op_type, const ParallelTensor in1, const ParallelTensor in2, @@ -185,6 +198,8 @@ ElementBinary::ElementBinary(FFModel &model, in1, in2), inplace_a(_inplace_a) { + // overwrite layer_guid + layer_guid = _layer_guid; numOutputs = 1; numWeights = 0; assert(in1->data_type == in2->data_type); @@ -217,10 +232,14 @@ ElementBinary::ElementBinary( FFModel &model, ElementBinaryParams const ¶ms, std::pair const &inputs, - char const *name, - bool inplace_a) - : ElementBinary( - model, params.type, inputs.first, inputs.second, inplace_a, name) {} + char const *name) + : ElementBinary(model, + params.layer_guid, + params.type, + inputs.first, + inputs.second, + params.inplace_a, + name) {} void ElementBinary::map_output_tensors(FFModel &ff) { if (has_inplace_output()) { @@ -260,6 +279,74 @@ void ElementBinary::do_inplace_output(void) { inplace_a = true; } +void ElementBinary::init_inference( + FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + // Check if we have the same oprands + has_same_operands = (batch_inputs[0]->region == batch_inputs[1]->region); + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(ELEMENTBINARY_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(ElementBinary)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + int rid = 0; + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(rid++, FID_DATA); + if (!has_same_operands) { + launcher.add_region_requirement(RegionRequirement(batch_inputs[1]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + batch_inputs[1]->region)); + launcher.add_field(rid++, FID_DATA); + } else { + assert(batch_inputs[0]->part == batch_inputs[1]->part); + } + if (!inplace_a) { + launcher.add_region_requirement( + RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(rid++, FID_DATA); + } else { + assert(batch_outputs[0]->part == batch_inputs[0]->part); + assert(batch_outputs[0]->region == batch_inputs[0]->region); + } + // launcher.add_region_requirement( + // RegionRequirement(input_grad_lps[0], 0/*projection id*/, + // WRITE_ONLY, EXCLUSIVE, inputs[0]->region_grad)); + // launcher.add_field(3, FID_DATA); + // if (inputs[0]->region_grad != inputs[1]->region_grad) { + // regions[4](I/O): input1_grad + // launcher.add_region_requirement( + // RegionRequirement(input_grad_lps[1], 0/*projection id*/, + // WRITE_ONLY, EXCLUSIVE, inputs[1]->region_grad)); + // launcher.add_field(4, FID_DATA); + //} + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + void ElementBinary::init(FFModel const &ff) { // Check if we have the same oprands has_same_operands = (inputs[0]->region == inputs[1]->region); @@ -327,7 +414,7 @@ OpMeta *ElementBinary::init_task(Task const *task, Runtime *runtime) { ElementBinary *eb = (ElementBinary *)task->args; FFHandler handle = *((FFHandler *)task->local_args); - ElementBinaryMeta *m = new ElementBinaryMeta(handle); + ElementBinaryMeta *m = new ElementBinaryMeta(handle, eb); for (int i = 0; i < eb->numInputs; i++) { m->trainableInputs[i] = eb->trainableInputs[i]; } @@ -438,6 +525,84 @@ void ElementBinary::forward(FFModel const &ff) { runtime->execute_index_space(ctx, launcher); } +FutureMap + ElementBinary::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + /* std::cout << "ElementBinary op machine_view: " << *(MachineView const *)mv + << std::endl; */ + IndexLauncher launcher(ELEMENTBINARY_FWD_TASK_ID, + parallel_is, + TaskArgument(NULL, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + if (inplace_a) { + assert(batch_outputs[0]->part == batch_inputs[0]->part); + assert(batch_outputs[0]->region == batch_inputs[0]->region); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + if (has_same_operands) { + // do nothing else + } else { + launcher.add_region_requirement( + RegionRequirement(batch_inputs[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[1]->region)); + launcher.add_field(1, FID_DATA); + } + } else { + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + if (has_same_operands) { + launcher.add_region_requirement( + RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + } else { + launcher.add_region_requirement( + RegionRequirement(batch_inputs[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[1]->region)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement( + RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(2, FID_DATA); + } + } + return runtime->execute_index_space(ctx, launcher); +} + /* regions[0](I): in1 regions[1](I): in2 @@ -450,8 +615,11 @@ __host__ void Runtime *runtime) { // const ElementBinary* ele = (const ElementBinary*) task->args; ElementBinaryMeta const *m = *((ElementBinaryMeta **)task->local_args); + GenericTensorAccessorR in1, in2; + GenericTensorAccessorW out; Domain in1_domain = runtime->get_index_space_domain( ctx, task->regions[0].region.get_index_space()); + if (!m->has_same_operands) { Domain in2_domain = runtime->get_index_space_domain( ctx, task->regions[1].region.get_index_space()); @@ -461,53 +629,78 @@ __host__ void m->op_type == OP_EW_MUL); } } - float const *in1_ptr = NULL, *in2_ptr = NULL; - float *out_ptr = NULL; + if (m->inplace_a) { if (m->has_same_operands) { assert(regions.size() == 1); assert(task->regions.size() == 1); - out_ptr = helperGetTensorPointerRW( - regions[0], task->regions[0], FID_DATA, ctx, runtime); - in2_ptr = out_ptr; - in1_ptr = out_ptr; + out = helperGetGenericTensorAccessorRW(m->output_type[0], + regions[0], + task->regions[0], + FID_DATA, + ctx, + runtime); + in2 = out; + in1 = out; } else { assert(regions.size() == 2); assert(task->regions.size() == 2); - out_ptr = helperGetTensorPointerRW( - regions[0], task->regions[0], FID_DATA, ctx, runtime); - in2_ptr = helperGetTensorPointerRO( - regions[1], task->regions[1], FID_DATA, ctx, runtime); - in1_ptr = out_ptr; + out = helperGetGenericTensorAccessorRW(m->output_type[0], + regions[0], + task->regions[0], + FID_DATA, + ctx, + runtime); + in2 = helperGetGenericTensorAccessorRO(m->input_type[1], + regions[1], + task->regions[1], + FID_DATA, + ctx, + runtime); + in1 = out; } } else { if (m->has_same_operands) { 
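The branches above and below encode a small contract: the number of regions attached to the ElementBinary forward/inference task is determined by `inplace_a` (does the output alias input 0?) and `has_same_operands` (are the two inputs the same region?). Here is a standalone sketch of that mapping, with a made-up helper name since the real code inlines the logic in the launcher and in `forward_task`.

```cpp
// Sketch of how many region requirements the ElementBinary forward/inference
// task is launched with, as a function of the two flags used above.
// The helper name is illustrative; the real code inlines this logic.
#include <cassert>

int expected_region_count(bool inplace_a, bool has_same_operands) {
  if (inplace_a) {
    // Output aliases input 0, so it is never passed separately.
    return has_same_operands ? 1   // one READ_WRITE region: in0 == in1 == out
                             : 2;  // in0/out (READ_WRITE) + in1 (READ_ONLY)
  }
  // Output is its own region.
  return has_same_operands ? 2     // in0 (== in1) + out
                           : 3;    // in0 + in1 + out
}

int main() {
  assert(expected_region_count(true, true) == 1);
  assert(expected_region_count(true, false) == 2);
  assert(expected_region_count(false, true) == 2);
  assert(expected_region_count(false, false) == 3);
  return 0;
}
```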
assert(regions.size() == 2); assert(task->regions.size() == 2); - Domain out_domain = runtime->get_index_space_domain( - ctx, task->regions[1].region.get_index_space()); - // assert(out_domain == in1_domain); - in1_ptr = helperGetTensorPointerRO( - regions[0], task->regions[0], FID_DATA, ctx, runtime); - in2_ptr = in1_ptr; - out_ptr = helperGetTensorPointerWO( - regions[1], task->regions[1], FID_DATA, ctx, runtime); + in1 = helperGetGenericTensorAccessorRO(m->input_type[0], + regions[0], + task->regions[0], + FID_DATA, + ctx, + runtime); + in2 = in1; + out = helperGetGenericTensorAccessorWO(m->output_type[0], + regions[1], + task->regions[1], + FID_DATA, + ctx, + runtime); } else { assert(regions.size() == 3); assert(task->regions.size() == 3); - Domain out_domain = runtime->get_index_space_domain( - ctx, task->regions[2].region.get_index_space()); - // assert(out_domain == in1_domain); - in1_ptr = helperGetTensorPointerRO( - regions[0], task->regions[0], FID_DATA, ctx, runtime); - in2_ptr = helperGetTensorPointerRO( - regions[1], task->regions[1], FID_DATA, ctx, runtime); - out_ptr = helperGetTensorPointerWO( - regions[2], task->regions[2], FID_DATA, ctx, runtime); + in1 = helperGetGenericTensorAccessorRO(m->input_type[0], + regions[0], + task->regions[0], + FID_DATA, + ctx, + runtime); + in2 = helperGetGenericTensorAccessorRO(m->input_type[1], + regions[1], + task->regions[1], + FID_DATA, + ctx, + runtime); + out = helperGetGenericTensorAccessorWO(m->output_type[0], + regions[2], + task->regions[2], + FID_DATA, + ctx, + runtime); } } - forward_kernel_wrapper(m, in1_ptr, in2_ptr, out_ptr); + forward_kernel_wrapper(m, in1, in2, out); } void ElementBinary::backward(FFModel const &ff) { @@ -709,7 +902,7 @@ bool ElementBinary::measure_operator_cost(Simulator *sim, if (!inputs[1]->get_sub_tensor(mv, sub_input2)) { return false; } - ElementBinaryMeta *m = sim->ele_binary_meta; + ElementBinaryMeta *m = new ElementBinaryMeta(sim->handler, this); m->op_type = op_type; m->profiling = this->profiling; m->inplace_a = this->inplace_a; @@ -725,8 +918,12 @@ bool ElementBinary::measure_operator_cost(Simulator *sim, sim->free_all(); float *input1_ptr = (float *)sim->allocate(sub_input1.get_volume(), DT_FLOAT); assert(input1_ptr != NULL); + GenericTensorAccessorR input1_acc( + inputs[0]->data_type, input1_domain, input1_ptr); float *input2_ptr = (float *)sim->allocate(sub_input2.get_volume(), DT_FLOAT); assert(input2_ptr != NULL); + GenericTensorAccessorR input2_acc( + inputs[1]->data_type, input2_domain, input2_ptr); cost_metrics.inputs_memory += cost_metrics.total_mem_diff_from(sim->offset); float *output_ptr = NULL; @@ -736,13 +933,15 @@ bool ElementBinary::measure_operator_cost(Simulator *sim, output_ptr = (float *)sim->allocate(sub_output.get_volume(), DT_FLOAT); } assert(output_ptr != NULL); + GenericTensorAccessorW output_acc( + outputs[0]->data_type, output_domain, output_ptr); cost_metrics.outputs_memory += cost_metrics.total_mem_diff_from(sim->offset); assert(m->profiling == false); std::function forward, backward; forward = [&] { - forward_kernel_wrapper(m, input1_ptr, input2_ptr, output_ptr); + forward_kernel_wrapper(m, input1_acc, input2_acc, output_acc); }; if (sim->computationMode == COMP_MODE_TRAINING) { float *input1_grad_ptr = @@ -791,12 +990,45 @@ bool ElementBinary::measure_operator_cost(Simulator *sim, cost_metrics.forward_time); } + delete m; return true; } +void ElementBinary::serialize(Legion::Serializer &sez) const { + sez.serialize(this->layer_guid.id); + 
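One easy-to-miss invariant in the serialization hunks of this patch: `deserialize` must read fields back in exactly the order `serialize` wrote them, so the newly added `transformer_layer_id` is written immediately after `layer_guid.id` here and read back in the same position in the matching `deserialize` below. A tiny standalone sketch of that ordering contract, with a byte-buffer writer/reader standing in for `Legion::Serializer`/`Deserializer` (both structs below are illustrative):

```cpp
// Standalone sketch of the ordering contract between serialize() and
// deserialize(): fields must be read back in exactly the order they were
// written (id, then the newly added transformer_layer_id). The byte-buffer
// writer/reader is a stand-in for Legion::Serializer/Deserializer.
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

struct ByteWriter {
  std::vector<char> buf;
  template <typename T> void serialize(T const &v) {
    char const *p = reinterpret_cast<char const *>(&v);
    buf.insert(buf.end(), p, p + sizeof(T));
  }
};

struct ByteReader {
  char const *cur;
  template <typename T> void deserialize(T &v) {
    std::memcpy(&v, cur, sizeof(T));
    cur += sizeof(T);
  }
};

int main() {
  size_t const id = 42, transformer_layer_id = 7;

  ByteWriter sez;
  sez.serialize(id);                    // written first...
  sez.serialize(transformer_layer_id);  // ...written second

  ByteReader dez{sez.buf.data()};
  size_t got_id, got_layer;
  dez.deserialize(got_id);              // ...so it must be read first
  dez.deserialize(got_layer);
  assert(got_id == 42 && got_layer == 7);
  return 0;
}
```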
sez.serialize(this->layer_guid.transformer_layer_id); + sez.serialize(this->op_type); + sez.serialize(this->inplace_a); +} + +using PCG::Node; +/*static*/ +Node ElementBinary::deserialize(FFModel &ff, + Legion::Deserializer &dez, + ParallelTensor inputs[], + int num_inputs) { + assert(num_inputs == 2); + OperatorType op_type; + size_t id, transformer_layer_id; + bool inplace_a; + dez.deserialize(id); + dez.deserialize(transformer_layer_id); + LayerID layer_guid(id, transformer_layer_id); + dez.deserialize(op_type); + dez.deserialize(inplace_a); + + ElementBinaryParams params; + params.layer_guid = layer_guid; + params.type = op_type; + params.inplace_a = inplace_a; + return ff.get_or_create_node({inputs[0], inputs[1]}, params); +} + ElementBinaryParams ElementBinary::get_params() const { ElementBinaryParams params; + params.layer_guid = this->layer_guid; params.type = this->op_type; + params.inplace_a = this->inplace_a; return params; } @@ -806,7 +1038,9 @@ namespace std { size_t hash::operator()( FlexFlow::ElementBinaryParams const ¶ms) const { size_t key = 0; + hash_combine(key, params.layer_guid.id); hash_combine(key, params.type); + hash_combine(key, params.inplace_a); return key; } }; // namespace std diff --git a/src/ops/element_unary.cc b/src/ops/element_unary.cc index 252f66b7e8..5ecb812b68 100644 --- a/src/ops/element_unary.cc +++ b/src/ops/element_unary.cc @@ -27,11 +27,11 @@ Tensor FFModel::unary(OperatorType op, char const *name, float scalar) { Layer *ele = nullptr; - DataType dtype; - // FIXME: currently cast input to float if it has a lower type - if (x->data_type < DT_FLOAT) { + DataType dtype = x->data_type; + // if (x->data_type < DT_FLOAT) { + if (false) { dtype = DT_FLOAT; - std::string str(name); + std::string str = nullptr ? "" : std::string(name); Tensor new_x = cast(x, dtype, (str + "input_pre_cast").c_str()); ele = new Layer(this, op, @@ -298,6 +298,56 @@ void ElementUnary::init(FFModel const &ff) { set_opmeta_from_futuremap(ff, fm); } +void ElementUnary::init_inference( + FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher init_launcher(ELEMENTUNARY_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(ElementUnary)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + if (!inplace) { + init_launcher.add_region_requirement( + RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + init_launcher.add_field(0, FID_DATA); + init_launcher.add_region_requirement( + RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + init_launcher.add_field(1, FID_DATA); + } else { + init_launcher.add_region_requirement( + RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + batch_inputs[0]->region)); + init_launcher.add_field(0, FID_DATA); + } + FutureMap fm = runtime->execute_index_space(ctx, init_launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + OpMeta *ElementUnary::init_task(Task const *task, std::vector const ®ions, Context ctx, @@ -368,12 +418,64 @@ void ElementUnary::forward(FFModel const &ff) { runtime->execute_index_space(ctx, launcher); } +FutureMap + ElementUnary::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + + IndexLauncher launcher(ELEMENTUNARY_FWD_TASK_ID, + parallel_is, + TaskArgument(NULL, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + if (inplace) { + assert(batch_outputs[0]->part == batch_inputs[0]->part); + assert(batch_outputs[0]->region == batch_inputs[0]->region); + launcher.add_region_requirement( + RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(0, FID_DATA); + } else { + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement( + RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + } + return runtime->execute_index_space(ctx, launcher); +} + void ElementUnary::forward_task(Task const *task, std::vector const ®ions, Context ctx, Runtime *runtime) { ElementUnaryMeta const *m = *((ElementUnaryMeta **)task->local_args); - if (m->data_type == DT_FLOAT) { + if (m->data_type == DT_HALF) { + forward_task_with_type(task, regions, ctx, runtime); + } else if (m->data_type == DT_FLOAT) { forward_task_with_type(task, regions, ctx, runtime); } else if (m->data_type == DT_DOUBLE) { forward_task_with_type(task, regions, ctx, runtime); @@ -570,6 +672,7 @@ void ElementUnary::serialize(Legion::Serializer &sez) const { sez.serialize(this->inplace); sez.serialize(scalar); sez.serialize(this->layer_guid.id); + 
sez.serialize(this->layer_guid.transformer_layer_id); } bool ElementUnary::measure_operator_cost(Simulator *sim, @@ -680,9 +783,10 @@ Node ElementUnary::deserialize(FFModel &ff, dez.deserialize(op_type); dez.deserialize(inplace); dez.deserialize(scalar); - size_t id; + size_t id, transformer_layer_id; dez.deserialize(id); - LayerID layer_guid(id); + dez.deserialize(transformer_layer_id); + LayerID layer_guid(id, transformer_layer_id); ElementUnaryParams params; params.op_type = op_type; diff --git a/src/ops/element_unary.cpp b/src/ops/element_unary.cpp index 43c84b0c41..424e739e13 100644 --- a/src/ops/element_unary.cpp +++ b/src/ops/element_unary.cpp @@ -45,10 +45,11 @@ void ElementUnary::init_kernel(ElementUnaryMeta *m, assert(false); } checkCUDNN(miopenSetActivationDescriptor(m->actiDesc, mode, 0.0, 0.0, 0.0)); - checkCUDNN(cudnnSetTensorDescriptorFromDomain(m->inputTensor, input_domain)); + checkCUDNN(cudnnSetTensorDescriptorFromDomain( + m->inputTensor, input_domain, m->data_type)); // input_domain == output_domain - checkCUDNN( - cudnnSetTensorDescriptorFromDomain(m->outputTensor, output_domain)); + checkCUDNN(cudnnSetTensorDescriptorFromDomain( + m->outputTensor, output_domain, m->data_type)); } template @@ -81,7 +82,9 @@ __global__ void elewise_unary_forward_kernel( break; } case OP_GELU: { - out[i] = (T)(in[i] * 0.5 * erfc(-in[i] * M_SQRT1_2)); + out[i] = (T)(in[i] * static_cast(0.5f) * + static_cast(erfc(static_cast( + -in[i] * static_cast(M_SQRT1_2))))); break; } case OP_RSQRT: { @@ -189,7 +192,7 @@ __global__ void elewise_unary_backward_kernel(coord_t volume, case OP_GELU: { input_grad[i] = (T)(output_grad[i] * - (0.5 * erfc(-input[i] * M_SQRT1_2) - + (0.5 * static_cast(erfc(-input[i] * M_SQRT1_2)) - 0.5 * M_SQRT1_2 * input[i] * exp(-input[i] * input[i] * 0.5))); break; } @@ -284,6 +287,11 @@ ElementUnaryMeta::ElementUnaryMeta(FFHandler handler) : OpMeta(handler) { checkCUDNN(miopenCreateActivationDescriptor(&actiDesc)); } +template void + ElementUnary::forward_kernel_wrapper(ElementUnaryMeta const *m, + half const *input_ptr, + half *output_ptr, + size_t num_elements); template void ElementUnary::forward_kernel_wrapper(ElementUnaryMeta const *m, float const *input_ptr, diff --git a/src/ops/element_unary.cu b/src/ops/element_unary.cu index d6e5bcfdc3..4a38dabe52 100644 --- a/src/ops/element_unary.cu +++ b/src/ops/element_unary.cu @@ -45,10 +45,11 @@ void ElementUnary::init_kernel(ElementUnaryMeta *m, } checkCUDNN(cudnnSetActivationDescriptor( m->actiDesc, mode, CUDNN_PROPAGATE_NAN, 0.0)); - checkCUDNN(cudnnSetTensorDescriptorFromDomain(m->inputTensor, input_domain)); + checkCUDNN(cudnnSetTensorDescriptorFromDomain( + m->inputTensor, input_domain, m->data_type)); // input_domain == output_domain - checkCUDNN( - cudnnSetTensorDescriptorFromDomain(m->outputTensor, output_domain)); + checkCUDNN(cudnnSetTensorDescriptorFromDomain( + m->outputTensor, output_domain, m->data_type)); } template @@ -81,7 +82,9 @@ __global__ void elewise_unary_forward_kernel( break; } case OP_GELU: { - out[i] = (T)(in[i] * 0.5 * erfc(-in[i] * M_SQRT1_2)); + out[i] = (T)(in[i] * static_cast(0.5f) * + static_cast(erfc(static_cast( + -in[i] * static_cast(M_SQRT1_2))))); break; } case OP_RSQRT: { @@ -202,7 +205,7 @@ __global__ void elewise_unary_backward_kernel(coord_t volume, case OP_GELU: { input_grad[i] = (T)(output_grad[i] * - (0.5 * erfc(-input[i] * M_SQRT1_2) - + (0.5 * static_cast(erfc(-input[i] * M_SQRT1_2)) - 0.5 * M_SQRT1_2 * input[i] * exp(-input[i] * input[i] * 0.5))); break; } @@ -293,6 +296,11 @@ 
ElementUnaryMeta::ElementUnaryMeta(FFHandler handler) : OpMeta(handler) { checkCUDNN(cudnnCreateActivationDescriptor(&actiDesc)); } +template void + ElementUnary::forward_kernel_wrapper(ElementUnaryMeta const *m, + half const *input_ptr, + half *output_ptr, + size_t num_elements); template void ElementUnary::forward_kernel_wrapper(ElementUnaryMeta const *m, float const *input_ptr, @@ -313,7 +321,6 @@ template void int64_t const *input_ptr, int64_t *output_ptr, size_t num_elements); - template void ElementUnary::backward_kernel_wrapper(ElementUnaryMeta const *m, float const *input_ptr, diff --git a/src/ops/embedding.cc b/src/ops/embedding.cc index 3b53213b91..409dcb398e 100644 --- a/src/ops/embedding.cc +++ b/src/ops/embedding.cc @@ -369,6 +369,45 @@ void Embedding::init(FFModel const &ff) { set_opmeta_from_futuremap(ff, fm); } +void Embedding::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(EMBED_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(Embedding)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(0, FID_DATA); + // regions[2]: weight + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(1, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + OpMeta *Embedding::init_task(Task const *task, std::vector const ®ions, Context ctx, @@ -419,6 +458,53 @@ void Embedding::forward(FFModel const &ff) { runtime->execute_index_space(ctx, launcher); } +FutureMap Embedding::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + + IndexLauncher launcher(EMBED_FWD_TASK_ID, + parallel_is, + TaskArgument(NULL, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + // regions[0]: input + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + // regions[1]: output + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region, + MAP_TO_ZC_MEMORY)); + launcher.add_field(1, FID_DATA); + // regions[2]: weight + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(2, FID_DATA); + return runtime->execute_index_space(ctx, launcher); +} + /* regions[0](I): input regions[1](O): output diff --git a/src/ops/experts.cc b/src/ops/experts.cc new file mode 100644 index 0000000000..c8b0ec0f26 --- /dev/null +++ b/src/ops/experts.cc @@ -0,0 +1,1159 @@ +/* Copyright 2022 CMU + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/ops/experts.h" +#ifdef INFERENCE_TESTS +#include "flexflow/utils/cuda_helper.h" +#endif +#include "legion/legion_utilities.h" + +namespace FlexFlow { + +// declare Legion names +using Legion::ArgumentMap; +using Legion::Context; +using Legion::coord_t; +using Legion::Domain; +using Legion::Future; +using Legion::FutureMap; +using Legion::IndexLauncher; +using Legion::PhysicalRegion; +using Legion::Predicate; +using Legion::Rect; +using Legion::RegionRequirement; +using Legion::Runtime; +using Legion::Task; +using Legion::TaskArgument; +using Legion::TaskLauncher; +using PCG::Node; + +static constexpr int KERNEL_IDX = 0; +static constexpr int BIAS_IDX = 1; +#ifdef INFERENCE_TESTS +static bool DEBUG_MODE = false; +#endif + +// For now, we use one input and one output per expert +Tensor FFModel::experts(Tensor const *inputs, + int num_experts, + int experts_start_idx, + int experts_output_dim_size, + float alpha, + int experts_num_layers, + int experts_internal_dim_size, + char const *name) { + + // Check that there are three inputs: the input tensor, the indices and the + // topk_gate_preds + assert(inputs[0] != nullptr); + int num_dims = inputs[0]->num_dims; + assert(inputs[1]->num_dims == num_dims); + assert(inputs[2]->num_dims == num_dims); + int topk = inputs[1]->dims[0]; + assert(inputs[2]->dims[0] == topk); + for (int i = 1; i < num_dims; i++) { + assert(inputs[0]->dims[i] == inputs[1]->dims[i]); + assert(inputs[1]->dims[i] == inputs[2]->dims[i]); + } + + assert(inputs[1]->data_type == DT_INT32 || inputs[1]->data_type == DT_INT64); + + assert(experts_num_layers >= 1); + assert(experts_num_layers <= 2 && "Multi-layer experts not implemented yet."); + assert(experts_num_layers == 1 || experts_internal_dim_size > 0); + + // parameters for the FFN implementing the experts. We can make these + // FFModel::experts(...) function parameters if needed. + bool use_bias = true; + ActiMode activation = AC_MODE_RELU; + + Layer *e = new Layer(this, + OP_EXPERTS, + DT_FLOAT, + name, + 3 /*inputs*/, + (1 + use_bias) /*weights*/, + 1 /*outputs*/, + inputs); + { + int dims[MAX_TENSOR_DIM]; + for (int i = 1; i < num_dims; i++) { + dims[i] = inputs[0]->dims[i]; + } + dims[0] = experts_output_dim_size; + e->outputs[0] = create_tensor_legion_ordering( + num_dims, dims, DT_FLOAT, e, 0, true /*create_grad*/); + assert(e->outputs[0] != nullptr); + } + { + int nparams = (experts_num_layers == 1) + ? (inputs[0]->dims[0] * experts_output_dim_size) + : experts_internal_dim_size * + (inputs[0]->dims[0] + experts_output_dim_size); + int dims[2] = {nparams, num_experts}; + e->weights[0] = create_weight_legion_ordering( + 2, dims, DT_FLOAT, e, true /*create_grad*/, nullptr, CHOSEN_SYNC_TYPE); + } + if (use_bias) { + int nparams = (experts_num_layers == 1) + ? 
experts_output_dim_size + : (experts_internal_dim_size + experts_output_dim_size); + int dims[2] = {nparams, num_experts}; + e->weights[1] = create_weight_legion_ordering( + 2, dims, DT_FLOAT, e, true /*create_grad*/, nullptr, CHOSEN_SYNC_TYPE); + } + + e->add_int_property("num_experts", num_experts); + e->add_int_property("experts_start_idx", experts_start_idx); + e->add_int_property("experts_output_dim_size", experts_output_dim_size); + e->add_float_property("alpha", alpha); + e->add_int_property("experts_num_layers", experts_num_layers); + e->add_int_property("experts_internal_dim_size", experts_internal_dim_size); + e->add_int_property("use_bias", use_bias); + e->add_int_property("activation", activation); + layers.push_back(e); + + return e->outputs[0]; +} + +Op *Experts::create_operator_from_layer( + FFModel &model, + Layer const *layer, + std::vector const &inputs) { + long long value; + layer->get_int_property("num_experts", value); + int num_experts = value; + layer->get_int_property("experts_start_idx", value); + int experts_start_idx = value; + layer->get_int_property("experts_output_dim_size", value); + int experts_output_dim_size = value; + float value2; + layer->get_float_property("alpha", value2); + float alpha = value2; + layer->get_int_property("experts_num_layers", value); + int experts_num_layers = value; + layer->get_int_property("experts_internal_dim_size", value); + int experts_internal_dim_size = value; + layer->get_int_property("use_bias", value); + bool use_bias = (bool)value; + layer->get_int_property("activation", value); + ActiMode activation = (ActiMode)value; + return new Experts(model, + layer->layer_guid, + inputs.data(), + num_experts, + experts_start_idx, + experts_output_dim_size, + alpha, + experts_num_layers, + experts_internal_dim_size, + use_bias, + activation, + false /*allocate_weights*/, + layer->name); +} + +ExpertsParams Experts::get_params() const { + ExpertsParams params; + params.layer_guid = this->layer_guid; + params.num_experts = num_experts; + params.experts_start_idx = experts_start_idx; + params.experts_output_dim_size = experts_output_dim_size; + params.alpha = alpha; + params.experts_num_layers = experts_num_layers; + params.experts_internal_dim_size = experts_internal_dim_size; + params.use_bias = use_bias; + params.activation = activation; + return params; +} + +bool ExpertsParams::is_valid( + std::vector const &inputs) const { + if (inputs.size() != 3) { + printf("Number of inputs to the Experts layer is wrong\n"); + return false; + } + if (!inputs[0].is_valid()) { + printf("The first tensor passed to the Experts layer is not valid\n"); + return false; + } + if (!inputs[1].is_valid()) { + printf("The second tensor passed to the Experts layer is not valid\n"); + return false; + } + if (!inputs[2].is_valid()) { + printf("The third tensor passed to the Experts layer is not valid\n"); + return false; + } + if (inputs[0].num_dims != inputs[1].num_dims || + inputs[1].num_dims != inputs[2].num_dims) { + printf("Mismatch found between the number of dimensions of the three input " + "tensors for the Expert layer\n"); + return false; + } + if (inputs[0].data_type != DT_FLOAT) { + printf("Data type of the first input to the Experts layer is wrong!\n"); + return false; + } + if (inputs[1].data_type != DT_INT32 && inputs[1].data_type != DT_INT64) { + printf("Data type of the second input to the Experts layer is wrong!\n"); + return false; + } + if (inputs[2].data_type != DT_FLOAT) { + printf("Data type of the third input to the Experts layer is 
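To make the per-expert weight and bias sizing in FFModel::experts() concrete, here is a quick worked example; the numbers are hypothetical and chosen only for illustration. With data_dim = 1024, experts_output_dim_size = 1024 and a single layer, each expert holds 1024 · 1024 = 1,048,576 weights plus 1024 biases; with experts_num_layers = 2 and experts_internal_dim_size = 4096, it holds 4096 · (1024 + 1024) = 8,388,608 weights plus 4096 + 1024 = 5120 biases. The sketch below just evaluates the same two formulas:

```cpp
// Hypothetical sizes; the formulas mirror the nparams computation in FFModel::experts().
#include <cstdio>

int main() {
  int data_dim = 1024, out_dim = 1024, hidden = 4096, num_experts = 8;
  long long w1 = 1LL * data_dim * out_dim;            // experts_num_layers == 1
  long long w2 = 1LL * hidden * (data_dim + out_dim); // experts_num_layers == 2
  long long b1 = out_dim;                             // bias, one layer
  long long b2 = hidden + out_dim;                    // bias, two layers
  std::printf("1-layer: %lld weights + %lld biases per expert (x%d experts)\n",
              w1, b1, num_experts);
  std::printf("2-layer: %lld weights + %lld biases per expert (x%d experts)\n",
              w2, b2, num_experts);
  return 0;
}
```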
wrong!\n"); + return false; + } + if (inputs[1].dims[0] != inputs[2].dims[0]) { + printf( + "Dimension mismatch between indices and topk_gate_preds tensors passed " + "to the Experts layer.\n"); + return false; + } + for (int i = 1; i < inputs[0].num_dims; i++) { + if (inputs[0].dims[i] != inputs[1].dims[i] || + inputs[1].dims[i] != inputs[2].dims[i]) { + printf("Dimension mismatch among the input tensors passed to the Experts " + "layer.\n"); + return false; + } + } + return true; +} + +bool operator==(ExpertsParams const &lhs, ExpertsParams const &rhs) { + return lhs.layer_guid == rhs.layer_guid && + lhs.num_experts == rhs.num_experts && + lhs.experts_start_idx == rhs.experts_start_idx && + lhs.experts_output_dim_size == rhs.experts_output_dim_size && + lhs.alpha == rhs.alpha && + lhs.experts_num_layers == rhs.experts_num_layers && + lhs.experts_internal_dim_size == rhs.experts_internal_dim_size && + lhs.use_bias == rhs.use_bias && lhs.activation == rhs.activation; +} + +Experts::Experts(FFModel &model, + ExpertsParams const ¶ms, + std::vector const &inputs, + bool allocate_weights, + char const *name) + : Experts(model, + params.layer_guid, + inputs.data(), + params.num_experts, + params.experts_start_idx, + params.experts_output_dim_size, + params.alpha, + params.experts_num_layers, + params.experts_internal_dim_size, + params.use_bias, + params.activation, + allocate_weights, + name) {} + +Experts::Experts(FFModel &model, + LayerID const &_layer_guid, + ParallelTensor const *inputs, + int _num_experts, + int _experts_start_idx, + int _experts_output_dim_size, + float _alpha, + int _experts_num_layers, + int _experts_internal_dim_size, + bool _use_bias, + ActiMode _activation, + bool allocate_weights, + char const *name) + : Op(model, + OP_EXPERTS, + DT_FLOAT, + name, + 3 /*inputs*/, + (1 + _use_bias) /*weights*/, + 1 /*outputs*/, + inputs), + num_experts(_num_experts), experts_start_idx(_experts_start_idx), + experts_output_dim_size(_experts_output_dim_size), alpha(_alpha), + experts_num_layers(_experts_num_layers), + experts_internal_dim_size(_experts_internal_dim_size), + use_bias(_use_bias), activation(_activation) { + + // overwrite layer_guid + layer_guid = _layer_guid; + + // Check number of inputs, output, weights + assert(num_experts > 0); + assert(numInputs == 3); + assert(numOutputs == 1); + assert(numWeights == (1 + use_bias)); + + // Check input dimensions + int num_dims = inputs[0]->num_dims; + int topk = inputs[1]->dims[0].size; + assert(inputs[0] != nullptr); + assert(inputs[1]->num_dims == num_dims); + assert(inputs[2]->num_dims == num_dims); + assert(inputs[2]->dims[0].size == topk); + for (int i = 1; i < num_dims; i++) { + assert(inputs[0]->dims[i] == inputs[1]->dims[i]); + assert(inputs[1]->dims[i] == inputs[2]->dims[i]); + } + // Assume that we don't parallelize the channel dim of input + // nor the expert_assigned dim of indices + assert(inputs[0]->dims[0].degree == 1); + assert(inputs[1]->dims[0].degree == 1); + assert(inputs[2]->dims[0].degree == 1); + // check data type of indices input + assert(inputs[1]->data_type == DT_INT32 || inputs[1]->data_type == DT_INT64); + assert(experts_num_layers >= 1); + assert(experts_num_layers <= 2 && "Multi-layer experts not implemented yet."); + assert(experts_num_layers == 1 || experts_internal_dim_size > 0); + + // save the token embedding dimension (data_dim) and the effective batch size + data_dim = inputs[0]->dims[0].size; + effective_batch_size = 1; + for (int i = 1; i <= num_dims - 2; i++) { + effective_batch_size *= 
inputs[0]->dims[i].size; + } + num_chosen_experts = topk; + + out_dim = _experts_output_dim_size; + + // Create the parallel tensor for the output + ParallelDim out_dims[MAX_TENSOR_DIM]; + for (int i = 0; i < num_dims; i++) { + out_dims[i] = inputs[0]->dims[i]; + } + out_dims[0].size = experts_output_dim_size; + outputs[0] = model.create_parallel_tensor_legion_ordering( + num_dims, out_dims, inputs[0]->data_type, this, 0 /*owner_idx*/); + assert(outputs[0] != nullptr); + + if (allocate_weights) { + { + ParallelDim dims[3]; + int nparams = (experts_num_layers == 1) + ? (data_dim * experts_output_dim_size) + : experts_internal_dim_size * + (data_dim + experts_output_dim_size); + dims[0].size = nparams; + dims[0].degree = 1; + dims[0].parallel_idx = -1; + dims[1] = inputs[0]->dims[num_dims - 1]; + dims[1].size = num_experts; + dims[2] = inputs[0]->dims[num_dims - 2]; + dims[2].size = dims[0].degree; + Initializer *kernel_initializer = new GlorotUniform(std::rand() /*seed*/); + // assert(kernel_shape.dims[2].size == num_experts); + weights[0] = + model.create_parallel_weight_legion_ordering(3, + dims, + DT_FLOAT, + NULL /*owner_op*/, + true /*create_grad*/, + kernel_initializer, + CHOSEN_SYNC_TYPE); + assert(weights[0] != nullptr); + } + if (use_bias) { + Initializer *bias_initializer = new ZeroInitializer(); + // assert(bias_shape.dims[1].size == num_experts); + ParallelDim dims[3]; + int nparams = (experts_num_layers == 1) + ? experts_output_dim_size + : (experts_internal_dim_size + experts_output_dim_size); + dims[0].size = nparams; + dims[0].degree = 1; + dims[0].parallel_idx = -1; + dims[1] = inputs[0]->dims[num_dims - 1]; + dims[1].size = num_experts; + dims[2] = inputs[0]->dims[num_dims - 2]; + dims[2].size = dims[0].degree; + weights[1] = + model.create_parallel_weight_legion_ordering(3, + dims, + DT_FLOAT, + NULL /*owner_op*/, + true /*create_grad*/, + bias_initializer, + CHOSEN_SYNC_TYPE); + assert(weights[1] != nullptr); + } + } + assert(check_output_input_weight_parallel_dims(allocate_weights)); +} + +void Experts::serialize(Legion::Serializer &sez) const { + ExpertsParams params = get_params(); + sez.serialize(params.layer_guid.id); + sez.serialize(params.layer_guid.transformer_layer_id); + sez.serialize(params.num_experts); + sez.serialize(params.experts_start_idx); + sez.serialize(params.experts_output_dim_size); + sez.serialize(params.alpha); + sez.serialize(params.experts_num_layers); + sez.serialize(params.experts_internal_dim_size); + sez.serialize(params.use_bias); + sez.serialize(params.activation); +} + +using PCG::Node; +Node Experts::deserialize(FFModel &ff, + Legion::Deserializer &dez, + std::vector const &inputs, + int num_inputs) { + int num_experts, experts_start_idx, experts_output_dim_size, + experts_num_layers, experts_internal_dim_size; + float alpha; + ActiMode activation; + bool use_bias; + size_t id, transformer_layer_id; + dez.deserialize(id); + dez.deserialize(transformer_layer_id); + LayerID layer_guid(id, transformer_layer_id); + dez.deserialize(num_experts); + dez.deserialize(experts_start_idx); + dez.deserialize(experts_output_dim_size); + dez.deserialize(alpha); + dez.deserialize(experts_num_layers); + dez.deserialize(experts_internal_dim_size); + dez.deserialize(use_bias); + dez.deserialize(activation); + + assert(num_inputs == 3); + + ExpertsParams params; + params.layer_guid = layer_guid; + params.num_experts = num_experts; + params.experts_start_idx = experts_start_idx; + params.experts_output_dim_size = experts_output_dim_size; + params.alpha = 
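The Experts::serialize / Experts::deserialize pair above only works because both sides agree on field order and count; the ElementUnary hunk at the top of this section is exactly such a fix (the newly serialized transformer_layer_id must also be read back before constructing the LayerID). A self-contained sketch of that invariant, using a plain byte buffer instead of Legion's Serializer/Deserializer; the helper type and names are hypothetical:

```cpp
// Toy stand-in for a serializer/deserializer pair: fields must be written and
// read back in exactly the same order. Hypothetical example, not FlexFlow code.
#include <cassert>
#include <cstring>
#include <vector>

struct ByteStream {
  std::vector<char> buf;
  size_t off = 0;
  template <typename T> void put(T const &v) {
    char const *p = reinterpret_cast<char const *>(&v);
    buf.insert(buf.end(), p, p + sizeof(T));
  }
  template <typename T> void get(T &v) {
    std::memcpy(&v, buf.data() + off, sizeof(T));
    off += sizeof(T);
  }
};

int main() {
  size_t id = 42, transformer_layer_id = 7;
  int num_experts = 16;
  ByteStream s;
  s.put(id); // order on the write side ...
  s.put(transformer_layer_id);
  s.put(num_experts);

  size_t id2, tl2;
  int ne2;
  s.get(id2); // ... must match the read side exactly
  s.get(tl2);
  s.get(ne2);
  assert(id2 == id && tl2 == transformer_layer_id && ne2 == num_experts);
  return 0;
}
```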
alpha; + params.experts_num_layers = experts_num_layers; + params.experts_internal_dim_size = experts_internal_dim_size; + params.use_bias = use_bias; + params.activation = activation; + + return ff.get_or_create_node(inputs, params); +} + +void Experts::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(EXPERTS_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(Experts)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + // expert predictions + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + // expert assignment indices + launcher.add_region_requirement(RegionRequirement(batch_inputs[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[1]->region)); + launcher.add_field(1, FID_DATA); + // topk_gate_preds + launcher.add_region_requirement(RegionRequirement(batch_inputs[2]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[2]->region)); + launcher.add_field(2, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(3, FID_DATA); + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(4, FID_DATA); + if (use_bias) { + launcher.add_region_requirement(RegionRequirement(weights[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[1]->region)); + launcher.add_field(5, FID_DATA); + } + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + +void Experts::init(FFModel const &ff) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + set_argumentmap_for_init(ff, argmap); + IndexLauncher launcher(EXPERTS_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(Experts)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + outputs[0]->machine_view.hash()); + // expert predictions + launcher.add_region_requirement(RegionRequirement(inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + inputs[0]->region)); + launcher.add_field(0, FID_DATA); + // expert assignment indices + launcher.add_region_requirement(RegionRequirement(inputs[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + inputs[1]->region)); + launcher.add_field(1, FID_DATA); + // topk_gate_preds + launcher.add_region_requirement(RegionRequirement(inputs[2]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + inputs[2]->region)); + launcher.add_field(2, FID_DATA); + launcher.add_region_requirement(RegionRequirement(outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + 
outputs[0]->region)); + launcher.add_field(3, FID_DATA); + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(4, FID_DATA); + if (use_bias) { + launcher.add_region_requirement(RegionRequirement(weights[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[1]->region)); + launcher.add_field(5, FID_DATA); + } + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap(ff, fm); +} + +OpMeta *Experts::init_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + Experts const *exp = (Experts *)task->args; + FFHandler handle = *((FFHandler const *)task->local_args); + ExpertsMeta *m = new ExpertsMeta(handle, + exp->num_experts, + exp->experts_start_idx, + exp->data_dim, + exp->out_dim, + exp->experts_num_layers, + exp->experts_internal_dim_size, + exp->effective_batch_size, + exp->num_chosen_experts, + exp->alpha, + exp->use_bias, + exp->activation); + m->profiling = exp->profiling; + return m; +} + +void Experts::forward(FFModel const &ff) { + // assert(false && "Experts is designed for inference only"); + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + set_argumentmap_for_forward(ff, argmap); + IndexLauncher launcher(EXPERTS_FWD_TASK_ID, + parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + outputs[0]->machine_view.hash()); + // expert predictions + launcher.add_region_requirement(RegionRequirement(inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + inputs[0]->region)); + launcher.add_field(0, FID_DATA); + // expert assignment indices + launcher.add_region_requirement(RegionRequirement(inputs[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + inputs[1]->region)); + launcher.add_field(1, FID_DATA); + // topk_gate_preds + launcher.add_region_requirement(RegionRequirement(inputs[2]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + inputs[2]->region)); + launcher.add_field(2, FID_DATA); + // expert output per token (only the chosen experts have non-zero + // contributions) + launcher.add_region_requirement(RegionRequirement(outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + outputs[0]->region)); + launcher.add_field(3, FID_DATA); + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(4, FID_DATA); + if (use_bias) { + launcher.add_region_requirement(RegionRequirement(weights[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[1]->region)); + launcher.add_field(5, FID_DATA); + } + runtime->execute_index_space(ctx, launcher); +} + +FutureMap Experts::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + /* std::cout << "Experts op machine_view: " << *(MachineView const *)mv + << std::endl; */ + // int num_active_tokens = bc->num_active_tokens(); + IndexLauncher launcher(EXPERTS_INF_TASK_ID, + parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_future(bc); + // expert predictions + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + // expert assignment indices + launcher.add_region_requirement(RegionRequirement(batch_inputs[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[1]->region)); + launcher.add_field(1, FID_DATA); + // topk_gate_preds + launcher.add_region_requirement(RegionRequirement(batch_inputs[2]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[2]->region)); + launcher.add_field(2, FID_DATA); + // expert output per token (only the chosen experts have non-zero + // contributions) + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(3, FID_DATA); + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(4, FID_DATA); + if (use_bias) { + launcher.add_region_requirement(RegionRequirement(weights[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[1]->region)); + launcher.add_field(5, FID_DATA); + } + return runtime->execute_index_space(ctx, launcher); +} + +void Experts::inference_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(regions.size() == task->regions.size()); + + ExpertsMeta const *m = *((ExpertsMeta **)task->local_args); + BatchConfig const *bc = BatchConfig::from_future(task->futures[0]); + if (bc->num_tokens == 0) { + return; + } + + int num_experts = m->num_experts; + bool use_bias = m->use_bias; + assert(regions.size() - 4 == (1 + use_bias)); + + // get input, indices, topk_gate_preds, outputs + float const *input_ptr = helperGetTensorPointerRO( + regions[0], task->regions[0], FID_DATA, ctx, runtime); + int const *indices_ptr = helperGetTensorPointerRO( + regions[1], task->regions[1], FID_DATA, ctx, runtime); + float const *topk_gate_pred_ptr = helperGetTensorPointerRO( + regions[2], task->regions[2], FID_DATA, ctx, runtime); + float *output_ptr = helperGetTensorPointerWO( + regions[3], task->regions[3], FID_DATA, ctx, runtime); + assert(input_ptr != nullptr && indices_ptr != nullptr && + topk_gate_pred_ptr != nullptr && output_ptr != nullptr); + + Domain input_domain = runtime->get_index_space_domain( + ctx, task->regions[0].region.get_index_space()); + Domain indices_domain = runtime->get_index_space_domain( + ctx, task->regions[1].region.get_index_space()); + Domain topk_gate_pred_domain = runtime->get_index_space_domain( + ctx, task->regions[2].region.get_index_space()); + Domain output_domain = runtime->get_index_space_domain( + ctx, task->regions[3].region.get_index_space()); + + int input_dims = input_domain.get_dim(); + int indices_dims = indices_domain.get_dim(); + int topk_gate_pred_dims = topk_gate_pred_domain.get_dim(); + int output_dims = 
output_domain.get_dim(); + assert(input_dims == indices_dims); + assert(indices_dims == topk_gate_pred_dims); + assert(input_dims == output_dims); + + int replica_dim = input_dims - 1; + int samples_index = input_dims - 2; + + coord_t data_dim = input_domain.hi()[0] - input_domain.lo()[0] + 1; + coord_t batch_size = + input_domain.hi()[samples_index] - input_domain.lo()[samples_index] + 1; + coord_t chosen_experts = indices_domain.hi()[0] - indices_domain.lo()[0] + 1; + coord_t out_dim = output_domain.hi()[0] - output_domain.lo()[0] + 1; + coord_t num_replicas = + input_domain.hi()[replica_dim] - input_domain.lo()[replica_dim] + 1; + assert(data_dim == m->data_dim); + assert(out_dim == m->out_dim); + assert(chosen_experts == m->num_chosen_experts); + assert(chosen_experts == + topk_gate_pred_domain.hi()[0] - topk_gate_pred_domain.lo()[0] + 1); + + for (int i = 1; i < input_dims; i++) { + int a = input_domain.hi()[i] - input_domain.lo()[i] + 1; + int b = indices_domain.hi()[i] - indices_domain.lo()[i] + 1; + int c = topk_gate_pred_domain.hi()[i] - topk_gate_pred_domain.lo()[i] + 1; + assert(a == b && b == c); + if (i >= 1 && i < samples_index) { + batch_size *= a; + } + } + assert(batch_size == m->effective_batch_size); + + assert(batch_size <= MAX_BATCH_SIZE && + "batch size exceeds MAX_BATCH_SIZE defined in experts.h"); + assert( + num_experts <= MAX_EXPERTS_PER_BLOCK && + "number of experts exceeds MAX_EXPERTS_PER_BLOCK defined in experts.h"); + + for (int j = 1; j < input_dims; j++) { + int a = input_domain.hi()[j] - input_domain.lo()[j] + 1; + int b = output_domain.hi()[j] - output_domain.lo()[j] + 1; + assert(a == b); + } + + // get weights + float const *weights_ptr = helperGetTensorPointerRO( + regions[4], task->regions[4], FID_DATA, ctx, runtime); + assert(weights_ptr != nullptr); + Domain weights_domain = runtime->get_index_space_domain( + ctx, task->regions[4].region.get_index_space()); + int weights_dims = weights_domain.get_dim(); + assert(weights_dims == 3); + int nparams_weight = + (m->experts_num_layers == 1) + ? (data_dim * out_dim) + : m->experts_internal_dim_size * (data_dim + out_dim); + assert(weights_domain.hi()[0] - weights_domain.lo()[0] + 1 == nparams_weight); + assert(weights_domain.hi()[1] - weights_domain.lo()[1] + 1 == num_experts); + assert(weights_domain.hi()[2] - weights_domain.lo()[2] + 1 == num_replicas); + + float const *bias_ptr = nullptr; + int nparams_bias = -1; + if (use_bias) { + bias_ptr = helperGetTensorPointerRO( + regions[5], task->regions[5], FID_DATA, ctx, runtime); + Domain bias_domain = runtime->get_index_space_domain( + ctx, task->regions[5].region.get_index_space()); + int bias_dims = bias_domain.get_dim(); + assert(bias_dims == 3); + nparams_bias = (m->experts_num_layers == 1) + ? 
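A small arithmetic note on the domain bookkeeping in Experts::inference_task: Legion domains carry inclusive bounds, so every extent is hi − lo + 1, and the effective batch size is the product of all extents strictly between the channel dim (dim 0) and the replica dim (the last dim). A plain C++ sketch of that computation, with made-up bounds:

```cpp
// Inclusive-bounds extent arithmetic, mirroring the checks in Experts::inference_task.
// The bounds below are made up for illustration.
#include <cstdio>

int main() {
  // lo/hi per dimension, innermost (channel) first: [0,1023] x [0,7] x [0,3] x [0,0]
  long lo[] = {0, 0, 0, 0};
  long hi[] = {1023, 7, 3, 0};
  int ndims = 4;
  int replica_dim = ndims - 1;       // last dim holds replicas
  long data_dim = hi[0] - lo[0] + 1; // 1024
  long batch = 1;
  for (int i = 1; i < replica_dim; i++) {
    batch *= hi[i] - lo[i] + 1; // 8 * 4 = 32 tokens per replica
  }
  long num_replicas = hi[replica_dim] - lo[replica_dim] + 1;
  std::printf("data_dim=%ld effective_batch=%ld replicas=%ld\n",
              data_dim, batch, num_replicas);
  return 0;
}
```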
out_dim + : (m->experts_internal_dim_size + out_dim); + assert(bias_domain.hi()[0] - bias_domain.lo()[0] + 1 == nparams_bias); + assert(bias_domain.hi()[1] - bias_domain.lo()[1] + 1 == num_experts); + assert(bias_domain.hi()[2] - bias_domain.lo()[2] + 1 == num_replicas); + } + +#ifdef INFERENCE_TESTS + if (DEBUG_MODE) { + std::cout << "forward_kernel_wrapper" << std::endl + << "-------------------------------" << std::endl; + std::cout << m->data_dim << std::endl; + std::cout << m->out_dim << std::endl; + std::cout << m->num_chosen_experts << std::endl; + std::cout << m->effective_batch_size << std::endl; + std::cout << m->experts_num_layers << std::endl; + std::cout << m->experts_internal_dim_size << std::endl; + std::cout << m->num_experts << std::endl; + std::cout << m->use_bias << std::endl; + + /* ----------------Input Token--------------*/ + float *cpu_input_ptr = new float[data_dim]; + checkCUDA(cudaMemcpy(cpu_input_ptr, + input_ptr, + data_dim * sizeof(float), + cudaMemcpyDeviceToHost)); + + srand(42); + float cpu_sum = 0; + for (int i = 0; i < data_dim; i++) { + // cpu_input_ptr[i] = (float)rand() / (float)RAND_MAX; + cpu_input_ptr[i] = float(i) / (float)data_dim; + cpu_sum += cpu_input_ptr[i]; + } + std::cout << "[CPU] Token 0 sum = " << cpu_sum << std::endl; + std::cout << "Total token number = " << batch_size << std::endl; + for (int i = 0; i < batch_size; i++) { + checkCUDA(cudaMemcpy((float *)(input_ptr + i * data_dim), + cpu_input_ptr, + data_dim * sizeof(float), + cudaMemcpyHostToDevice)); + } + free(cpu_input_ptr); + + /* ----------------indices--------------*/ + int *cpu_indices_ptr = new int[chosen_experts * batch_size]; + checkCUDA(cudaMemcpy(cpu_indices_ptr, + indices_ptr, + chosen_experts * batch_size * sizeof(int), + cudaMemcpyDeviceToHost)); + for (int i = 0; i < chosen_experts * 10; i++) { + if (i % 2 == 1) { + cpu_indices_ptr[i] += chosen_experts; + } + } + checkCUDA(cudaMemcpy((int *)indices_ptr, + cpu_indices_ptr, + chosen_experts * batch_size * sizeof(int), + cudaMemcpyHostToDevice)); + free(cpu_indices_ptr); + + /* ----------------coefficient--------------*/ + float *cpu_topk_gate_pred_ptr = new float[chosen_experts * batch_size]; + checkCUDA(cudaMemcpy(cpu_topk_gate_pred_ptr, + topk_gate_pred_ptr, + chosen_experts * batch_size * sizeof(float), + cudaMemcpyDeviceToHost)); + for (int i = 0; i < chosen_experts * batch_size; i++) { + if (i % 2 == 0) { + cpu_topk_gate_pred_ptr[i] = 0.5; + } else { + cpu_topk_gate_pred_ptr[i] = 0.1; + } + } + checkCUDA(cudaMemcpy((float *)topk_gate_pred_ptr, + cpu_topk_gate_pred_ptr, + chosen_experts * batch_size * sizeof(float), + cudaMemcpyHostToDevice)); + free(cpu_topk_gate_pred_ptr); + + /* ----------------Expert Weights--------------*/ + assert(m->experts_num_layers == 2 || m->experts_num_layers == 1); + size_t layer0_size = m->experts_num_layers == 1 + ? data_dim * out_dim + : data_dim * m->experts_internal_dim_size; + size_t layer1_size = m->experts_internal_dim_size * out_dim; + float *cpu_experts_0_layer0 = new float[layer0_size]; + float *cpu_experts_1_layer0 = new float[layer0_size]; + float *cpu_experts_0_layer1 = + m->experts_num_layers == 1 ? nullptr : new float[layer1_size]; + float *cpu_experts_1_layer1 = + m->experts_num_layers == 1 ? 
nullptr : new float[layer1_size]; + /*checkCUDA(cudaMemcpy(cpu_experts_0_layer0, + weights_ptr, + layer0_size * sizeof(float), + cudaMemcpyDeviceToHost)); + checkCUDA(cudaMemcpy(cpu_experts_1_layer0, + weights_ptr[nparams_weight], + layer0_size * sizeof(float), + cudaMemcpyDeviceToHost)); + if (m->experts_num_layers == 2) { + checkCUDA(cudaMemcpy(cpu_experts_0_layer1, + weights_ptr[layer0_size], + layer1_size * sizeof(float), + cudaMemcpyDeviceToHost)); + checkCUDA(cudaMemcpy(cpu_experts_1_layer1, + weights_ptr[nparams_weight + layer0_size], + layer1_size * sizeof(float), + cudaMemcpyDeviceToHost)); + }*/ + cpu_sum = 0; + for (int i = 0; i < layer0_size; i++) { + cpu_experts_0_layer0[i] = float(i) / float(nparams_weight); + cpu_sum += cpu_experts_0_layer0[i]; + } + if (m->experts_num_layers == 2) { + for (int i = 0; i < layer1_size; i++) { + cpu_experts_0_layer1[i] = + float(layer0_size + i) / float(nparams_weight); + cpu_sum += cpu_experts_0_layer1[i]; + } + } + std::cout << "[CPU] Experts 0 weights sum = " << cpu_sum << std::endl; + + cpu_sum = 0; + for (int i = 0; i < layer0_size; i++) { + cpu_experts_1_layer0[i] = + float(nparams_weight - i) / float(nparams_weight); + assert(cpu_experts_1_layer0[i] > 0); + cpu_sum += cpu_experts_1_layer0[i]; + } + if (m->experts_num_layers == 2) { + for (int i = 0; i < layer1_size; i++) { + cpu_experts_1_layer1[i] = + float(nparams_weight - layer0_size + i) / float(nparams_weight); + assert(cpu_experts_1_layer1[i] > 0); + cpu_sum += cpu_experts_1_layer1[i]; + } + } + std::cout << "[CPU] Experts 1 weights sum = " << cpu_sum << std::endl; + + for (int i = 0; i < num_experts; i++) { + // first layer + checkCUDA( + cudaMemcpy((float *)&weights_ptr[nparams_weight * i], + i % 2 == 0 ? cpu_experts_0_layer0 : cpu_experts_1_layer0, + layer0_size * sizeof(float), + cudaMemcpyHostToDevice)); + // second layer + if (m->experts_num_layers == 2) { + checkCUDA( + cudaMemcpy((float *)&weights_ptr[nparams_weight * i + layer0_size], + i % 2 == 0 ? cpu_experts_0_layer1 : cpu_experts_1_layer1, + layer1_size * sizeof(float), + cudaMemcpyHostToDevice)); + } + } + free(cpu_experts_0_layer0); + free(cpu_experts_1_layer0); + free(cpu_experts_0_layer1); + free(cpu_experts_1_layer1); + + /* ----------------Expert Bias--------------*/ + if (use_bias) { + size_t layer0_size = + m->experts_num_layers == 1 ? out_dim : m->experts_internal_dim_size; + size_t layer1_size = out_dim; + float *bias_experts_0_layer0 = new float[layer0_size]; + float *bias_experts_0_layer1 = + m->experts_num_layers == 1 ? 
nullptr : new float[layer1_size]; + + checkCUDA(cudaMemcpy(bias_experts_0_layer0, + bias_ptr, + layer0_size * sizeof(float), + cudaMemcpyDeviceToHost)); + cpu_sum = 0; + for (int i = 0; i < layer0_size; i++) { + cpu_sum += bias_experts_0_layer0[i]; + // bias_experts_1[i] = 1.0f; + } + std::cout << "[CPU] Bias expert 0 (layer 0) sum = " << cpu_sum + << std::endl; + + if (m->experts_num_layers == 2) { + checkCUDA(cudaMemcpy(bias_experts_0_layer1, + (float *)&bias_ptr[layer0_size], + layer1_size * sizeof(float), + cudaMemcpyDeviceToHost)); + cpu_sum = 0; + for (int i = 0; i < layer1_size; i++) { + cpu_sum += bias_experts_0_layer1[i]; + // bias_experts_1[i] = 1.0f; + } + std::cout << "[CPU] Bias expert 0 (layer 1) sum = " << cpu_sum + << std::endl; + } + + for (int i = 0; i < num_experts; i++) { + checkCUDA(cudaMemcpy((float *)&bias_ptr[nparams_bias * i], + bias_experts_0_layer0, + layer0_size * sizeof(float), + cudaMemcpyHostToDevice)); + if (m->experts_num_layers == 2) { + checkCUDA( + cudaMemcpy((float *)&bias_ptr[nparams_bias * i + layer0_size], + bias_experts_0_layer1, + layer1_size * sizeof(float), + cudaMemcpyHostToDevice)); + } + } + free(bias_experts_0_layer0); + free(bias_experts_0_layer1); + } + } +#endif + Experts::forward_kernel_wrapper(m, + input_ptr, + indices_ptr, + topk_gate_pred_ptr, + output_ptr, + weights_ptr, + bias_ptr, + bc->num_active_tokens(), + chosen_experts, + batch_size, + out_dim); +#ifdef INFERENCE_TESTS + if (DEBUG_MODE) { + /* ----------------Output after computation--------------*/ + float *cpu_output_ptr = new float[batch_size * out_dim]; + float cpu_sum = 0; + checkCUDA(cudaMemcpy(cpu_output_ptr, + output_ptr, + batch_size * out_dim * sizeof(float), + cudaMemcpyDeviceToHost)); + for (int j = 0; j < batch_size * out_dim; j += out_dim) { + cpu_sum = 0; + for (int i = 0; i < out_dim; i++) { + cpu_sum += cpu_output_ptr[j + i]; + } + // if ((j/out_dim) < 50) std::cout << "[CPU] output " << (j/out_dim) << " + // sum = " << cpu_sum << std::endl; + if (cpu_sum > 0.0f) { + std::cout << "[CPU] output " << (j / out_dim) << " sum = " << cpu_sum + << std::endl; + } + } + std::cout << "[CPU] output 0's 10th element = " << cpu_output_ptr[10] + << std::endl; + std::cout << "[CPU] output 0's 99th element = " << cpu_output_ptr[99] + << std::endl; + std::cout << "[CPU] output 0's 123th element = " << cpu_output_ptr[123] + << std::endl; + + /* refrence output */ + /* + * Input token sum = 391.5 + * Expert 0 weights sum = 307327.5 + * Expert 1 weights sum = 307328.47 + * ------------------ + * experts 0's reulst = 153533.1 + * experts 1's reulst = 153402.9 + * Aggreated Result = 92106.836 + * 10th element = 41.28053 + * 99th element = 59.057823 + * 123th element = 63.8517 + */ + + free(cpu_output_ptr); + } +#endif +} + +void Experts::forward_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(false && "Experts is designed for inference only"); +} + +void Experts::backward(FFModel const &ff) { + assert(false && "Experts is designed for inference only"); +} + +void Experts::backward_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(false && "Experts is designed for inference only"); +} + +void Experts::print_layer(FFModel const &ff) { + return; +} + +bool Experts::measure_operator_cost(Simulator *sim, + MachineView const &c, + CostMetrics &cost_metrics) const { + // This is an inference only operator + assert(false && "Experts is designed for inference only"); + return false; +} + +}; // 
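A quick consistency check on the reference values in the comment above, under the assumption that this debug run used the single-layer case with data_dim = out_dim = 784: the deterministic fill sets input[i] = i / data_dim, so the token sum is (784 − 1) / 2 = 391.5; nparams_weight is then 784² = 614,656, giving expert 0 a weight sum of (614,656 − 1) / 2 = 307,327.5 and expert 1 a sum of (614,656 + 1) / 2 = 307,328.5 (printed as 307328.47 after float accumulation). The per-token results further down also depend on the bias contents, so they are not re-derived here.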
namespace FlexFlow + +namespace std { +size_t hash::operator()( + FlexFlow::ExpertsParams const ¶ms) const { + size_t key = 0; + hash_combine(key, params.layer_guid.id); + hash_combine(key, params.num_experts); + hash_combine(key, params.experts_start_idx); + hash_combine(key, params.experts_output_dim_size); + hash_combine(key, params.alpha); + hash_combine(key, params.experts_num_layers); + hash_combine(key, params.experts_internal_dim_size); + hash_combine(key, params.use_bias); + hash_combine(key, params.activation); + return key; +} +}; // namespace std diff --git a/src/ops/experts.cpp b/src/ops/experts.cpp new file mode 100644 index 0000000000..c06f02a647 --- /dev/null +++ b/src/ops/experts.cpp @@ -0,0 +1,59 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "flexflow/ops/experts.h" +#include "flexflow/utils/hip_helper.h" +#include + +namespace FlexFlow { + +/*static*/ +void Experts::forward_kernel_wrapper(ExpertsMeta const *m, + float const *input, + int const *indices, + float const *topk_gate_preds, + float *output, + float const *weights, + float const *biases, + int num_active_tokens, + int chosen_experts, + int batch_size, + int out_dim) { + // TODO: write the HIP version of the kernel after finishing the CUDA kernel + handle_unimplemented_hip_kernel(OP_EXPERTS); +} + +ExpertsMeta::ExpertsMeta(FFHandler handler, + int _num_experts, + int _experts_start_idx, + int _data_dim, + int _out_dim, + int _experts_num_layers, + int _experts_internal_dim_size, + int _effective_batch_size, + int _num_chosen_experts, + float _alpha, + bool _use_bias, + ActiMode _activation) + : OpMeta(handler), num_experts(_num_experts), + experts_start_idx(_experts_start_idx), data_dim(_data_dim), + out_dim(_out_dim), experts_num_layers(_experts_num_layers), + experts_internal_dim_size(_experts_internal_dim_size), + effective_batch_size(_effective_batch_size), + num_chosen_experts(_num_chosen_experts), alpha(_alpha), + use_bias(_use_bias), activation(_activation) {} +ExpertsMeta::~ExpertsMeta(void) {} + +}; // namespace FlexFlow diff --git a/src/ops/experts.cu b/src/ops/experts.cu new file mode 100644 index 0000000000..ce15cdff55 --- /dev/null +++ b/src/ops/experts.cu @@ -0,0 +1,1447 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/ops/experts.h" +#include "flexflow/utils/cuda_helper.h" +#include +#include + +// Thrust-related headers +#define THRUST_IGNORE_DEPRECATED_CPP_DIALECT 1 +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +namespace FlexFlow { + +struct exceeds_expert_capacity { + int _expert_capacity; + exceeds_expert_capacity(int expert_capacity) + : _expert_capacity(expert_capacity){}; + __host__ __device__ bool operator()(int x) { + return x > _expert_capacity; + } +}; + +void experts_forward_thrust_wrapper(ExpertsMeta const *m, + int const *indices, + int num_indices, + int experts_start_idx, + int num_experts_per_block, + int expert_capacity, + int *lb_index, + int *ub_index, + int *num_valid_assignments, + int *non_zero_experts_count, + int *start_indexes, + int *gemm_batch_count, + ffStream_t stream) { + // sort the indices and coefficients by expert. Keep track of the original + // position of each index/coefficient using the original_indices array + thrust::device_ptr thrust_indices = + thrust::device_pointer_cast(indices); + thrust::device_ptr sorted_indices = + thrust::device_pointer_cast(m->sorted_indices); + thrust::copy(thrust::cuda::par.on(stream), + thrust_indices, + thrust_indices + num_indices, + sorted_indices); + + thrust::device_ptr original_indices = + thrust::device_pointer_cast(m->original_indices); + thrust::sequence(thrust::cuda::par.on(stream), + original_indices, + original_indices + num_indices); + + thrust::stable_sort_by_key(thrust::cuda::par.on(stream), + sorted_indices, + sorted_indices + num_indices, + original_indices); + + // get lower and upper bound of token->expert assignments corresponding to + // experts in the block + thrust::device_ptr lb = thrust::lower_bound(thrust::cuda::par.on(stream), + sorted_indices, + sorted_indices + num_indices, + experts_start_idx); + thrust::device_ptr ub = + thrust::upper_bound(thrust::cuda::par.on(stream), + sorted_indices, + sorted_indices + num_indices, + experts_start_idx + num_experts_per_block - 1); + // lowest index in the sorted indices array corresponding to an expert within + // the block + *lb_index = lb - sorted_indices; + // 1 + largest index in the sorted indices array corresponding to an expert + // within the block + *ub_index = ub - sorted_indices; + *num_valid_assignments = (*ub_index) - (*lb_index); + if ((*num_valid_assignments) == 0) { + return; + } + + thrust::device_ptr non_zero_expert_labels = + thrust::device_pointer_cast(m->non_zero_expert_labels); + // non_zero_expert_labels: a list of global labels of the experts in this + // block receiving nonzero tokens + thrust::device_ptr non_zero_expert_labels_end = thrust::unique_copy( + thrust::cuda::par.on(stream), lb, ub, non_zero_expert_labels); + // number of experts in this block receiving at least one token + *non_zero_experts_count = non_zero_expert_labels_end - non_zero_expert_labels; + + using namespace thrust::placeholders; + // convert global labels to local labelling (e.g. expert 65->index 65-64=1 in + // block containing experts 64-96) by substracting the experts_start_idx, + // inplace. 
+ thrust::for_each(thrust::cuda::par.on(stream), + non_zero_expert_labels, + non_zero_expert_labels + (*non_zero_experts_count), + _1 -= experts_start_idx); + + thrust::device_ptr temp_sequence = + thrust::device_pointer_cast(m->temp_sequence); + thrust::sequence(thrust::cuda::par.on(stream), + temp_sequence, + temp_sequence + (*non_zero_experts_count)); + + // create "exp_local_label_to_index", a mapping from local expert label to its + // non-zero expert index (i.e. expert with index i is the i-th expert in the + // block to receive at least 1 token) + thrust::device_ptr exp_local_label_to_index = + thrust::device_pointer_cast(m->exp_local_label_to_index); + thrust::scatter(thrust::cuda::par.on(stream), + temp_sequence, + temp_sequence + (*non_zero_experts_count), + non_zero_expert_labels, + exp_local_label_to_index); + + // get local start index (within lower/upper bound) for each expert receiving + // non-zero tokens + thrust::device_ptr expert_start_indexes = + thrust::device_pointer_cast(m->expert_start_indexes); + thrust::sequence(thrust::cuda::par.on(stream), + expert_start_indexes, + expert_start_indexes + (*num_valid_assignments)); + *start_indexes = (thrust::unique_by_key_copy(thrust::cuda::par.on(stream), + lb, + ub, + expert_start_indexes, + temp_sequence, + expert_start_indexes)) + .first - + temp_sequence; + assert((*start_indexes) == (*non_zero_experts_count)); + + // append ub_index + expert_start_indexes[(*start_indexes)] = (*ub_index) - (*lb_index); + + // get number of token assignment to each expert + thrust::device_ptr num_assignments_per_expert = + thrust::device_pointer_cast(m->num_assignments_per_expert); + thrust::transform(thrust::cuda::par.on(stream), + expert_start_indexes + 1, + expert_start_indexes + (*non_zero_experts_count) + 1, + expert_start_indexes, + num_assignments_per_expert, + thrust::minus()); + + // build destination_start_index array, telling us the first slot that belongs + // to each expert in the destination array (after factoring in expert + // capacity) + thrust::device_ptr destination_start_indices = + thrust::device_pointer_cast(m->destination_start_indices); + thrust::replace_copy_if(thrust::cuda::par.on(stream), + num_assignments_per_expert, + num_assignments_per_expert + + (*non_zero_experts_count), + destination_start_indices, + exceeds_expert_capacity(expert_capacity), + expert_capacity); + + *gemm_batch_count = + thrust::reduce(thrust::cuda::par.on(stream), + destination_start_indices, + destination_start_indices + (*non_zero_experts_count)); + + thrust::exclusive_scan(thrust::cuda::par.on(stream), + destination_start_indices, + destination_start_indices + (*non_zero_experts_count), + destination_start_indices, + 0); +} + +__global__ void experts_forward_prepare_kernel( + int num_valid_assignments, + int expert_capacity, + int lb_index, + int experts_start_idx, + int num_experts_per_block, + int num_chosen_experts, + int data_dim, + int out_dim, + int experts_num_layers, + int experts_internal_dim_size, + bool use_bias, + int *sorted_indices, + int *expert_start_indexes, + int *exp_local_label_to_index, + int *destination_start_indices, + int *original_indices, + float const *input, // @In: Tokens' values (in_dim, batch_size) + float *output, + float const **token_idx_array, // @Out: Barray for GemmBatchedEx + float const *weights, // @In: Experts' weights + float const *biases, // @In: Experts' biases + float const **weight_idx_array1, // @Out: Aarray for GemmBatchedEx + float const **weight_idx_array2, + float const 
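The Thrust pipeline in experts_forward_thrust_wrapper is easiest to follow as its CPU equivalent: argsort the token→expert assignments, take the slice that falls inside this block's expert range, count tokens per expert with a cap at expert_capacity, and exclusive-scan the capped counts into destination start offsets. Below is a standalone sketch with made-up inputs; none of these names are FlexFlow code:

```cpp
// CPU reference of what experts_forward_thrust_wrapper computes on the GPU.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
  std::vector<int> indices = {5, 64, 65, 64, 70, 2, 65, 65}; // token->expert ids
  int experts_start_idx = 64, num_experts_per_block = 32, expert_capacity = 2;

  // argsort: order[i] is the original position of the i-th smallest assignment
  std::vector<int> order(indices.size());
  std::iota(order.begin(), order.end(), 0);
  std::stable_sort(order.begin(), order.end(),
                   [&](int a, int b) { return indices[a] < indices[b]; });
  std::vector<int> sorted(indices.size());
  for (size_t i = 0; i < order.size(); i++) {
    sorted[i] = indices[order[i]];
  }

  // slice of assignments handled by this block of experts
  auto lb = std::lower_bound(sorted.begin(), sorted.end(), experts_start_idx);
  auto ub = std::upper_bound(sorted.begin(), sorted.end(),
                             experts_start_idx + num_experts_per_block - 1);

  // per-expert token counts, capped at expert_capacity
  std::vector<int> counts;
  for (auto it = lb; it != ub;) {
    auto next = std::upper_bound(it, ub, *it);
    counts.push_back(std::min<int>(next - it, expert_capacity));
    it = next;
  }
  int gemm_batch_count = std::accumulate(counts.begin(), counts.end(), 0);

  // destination start offsets (exclusive scan of the capped counts)
  std::vector<int> dest_start(counts.size(), 0);
  std::exclusive_scan(counts.begin(), counts.end(), dest_start.begin(), 0);

  std::printf("non-zero experts: %zu, gemm_batch_count: %d\n",
              counts.size(), gemm_batch_count);
  return 0;
}
```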
**bias_idx_array1, // @Out: Experts' bias + float const **bias_idx_array2, + float const *coefficients, // @In: topk_gate_predss coefficients tensor + // (num_chosen_experts, batch_size) + float const **coefficient_idx_array, // @Out: Barray for Aggregation + float **output_idx_array) { + + CUDA_KERNEL_LOOP(i, num_valid_assignments) { + int global_expert_label = sorted_indices[lb_index + i]; + assert(global_expert_label >= experts_start_idx && + global_expert_label < experts_start_idx + num_experts_per_block); + int local_expert_label = global_expert_label - experts_start_idx; + int expert_index = exp_local_label_to_index[local_expert_label]; + int within_expert_offset = i - expert_start_indexes[expert_index]; + int weight_params_count = + experts_num_layers == 1 + ? data_dim * out_dim + : experts_internal_dim_size * (data_dim + out_dim); + if (within_expert_offset < expert_capacity) { + int rev_idx = original_indices[i + lb_index]; + int token_idx = (rev_idx / num_chosen_experts); + + token_idx_array[destination_start_indices[expert_index] + + within_expert_offset] = &input[token_idx * data_dim]; + weight_idx_array1[destination_start_indices[expert_index] + + within_expert_offset] = + &weights[local_expert_label * weight_params_count]; + if (experts_num_layers == 2) { + weight_idx_array2[destination_start_indices[expert_index] + + within_expert_offset] = + &weights[local_expert_label * weight_params_count + + (data_dim * experts_internal_dim_size)]; + } + if (use_bias) { + int bias_params_count = (experts_num_layers == 1) + ? out_dim + : (experts_internal_dim_size + out_dim); + bias_idx_array1[destination_start_indices[expert_index] + + within_expert_offset] = + &biases[local_expert_label * bias_params_count]; + if (experts_num_layers == 2) { + bias_idx_array2[destination_start_indices[expert_index] + + within_expert_offset] = + &biases[local_expert_label * bias_params_count + + experts_internal_dim_size]; + } + } + coefficient_idx_array[destination_start_indices[expert_index] + + within_expert_offset] = &coefficients[rev_idx]; + output_idx_array[destination_start_indices[expert_index] + + within_expert_offset] = &output[token_idx * out_dim]; + } + } +} + +bool use_activation(ActiMode mode) { + switch (mode) { + case AC_MODE_RELU: + case AC_MODE_SIGMOID: + case AC_MODE_TANH: + return true; + case AC_MODE_NONE: + return false; + default: + assert(0); + break; + } + return false; +} + +void experts_forward_GemmBatched_kernel(ExpertsMeta const *m, + void const **weights_ptr1, + void const **weights_ptr2, + void const **input_ptr, + void **results_ptr1, + void **results_ptr2, + void const **bias_ptr1, + void const **bias_ptr2, + ActiMode activation, + int in_dim, + int out_dim, + int experts_num_layers, + int experts_internal_dim_size, + int num_tokens, + int num_chosen_experts, + int gemm_batch_count, + ffStream_t stream) { + + checkCUDA(cublasSetStream(m->handle.blas, stream)); + checkCUDNN(cudnnSetStream(m->handle.dnn, stream)); + + float alpha = 1.0f, beta = 0.0f; + + // cudaDataType_t input_type = ff_to_cuda_datatype(m->input_type); + // cudaDataType_t weight_type = ff_to_cuda_datatype(m->weight_type); + // cudaDataType_t output_type = ff_to_cuda_datatype(m->output_type); + cudaDataType_t input_type = CUDA_R_32F; + cudaDataType_t weight_type = CUDA_R_32F; + cudaDataType_t output_type = CUDA_R_32F; + + cublasComputeType_t compute_type = CUBLAS_COMPUTE_32F; + + int m_ = out_dim; + int n = 1; + int k = in_dim; + void const **A = weights_ptr1; + void const **B = input_ptr; + void **C = 
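The pointer setup in experts_forward_prepare_kernel assumes each weight/bias tensor is one flat buffer laid out expert by expert, with the second layer stored right after the first inside each expert's slab. A small sketch of that offset arithmetic, with hypothetical sizes:

```cpp
// Offsets into the flat expert weight/bias buffers, as used by
// experts_forward_prepare_kernel. Sizes here are hypothetical.
#include <cstdio>

int main() {
  int data_dim = 1024, out_dim = 1024, hidden = 4096;
  int experts_num_layers = 2;
  int weight_params_count = (experts_num_layers == 1)
                                ? data_dim * out_dim
                                : hidden * (data_dim + out_dim);
  int bias_params_count =
      (experts_num_layers == 1) ? out_dim : (hidden + out_dim);

  int local_expert = 3; // fourth expert within this block
  long w_layer1 = 1L * local_expert * weight_params_count;
  long w_layer2 = w_layer1 + 1L * data_dim * hidden; // only if two layers
  long b_layer1 = 1L * local_expert * bias_params_count;
  long b_layer2 = b_layer1 + hidden;                 // only if two layers
  std::printf("W1 @ %ld, W2 @ %ld, b1 @ %ld, b2 @ %ld\n",
              w_layer1, w_layer2, b_layer1, b_layer2);
  return 0;
}
```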
results_ptr1; + int lda = in_dim; + int ldb = in_dim; + int ldc = out_dim; + if (experts_num_layers == 2) { + m_ = ldc = experts_internal_dim_size; + } + checkCUDA(cublasGemmBatchedEx( + m->handle.blas, + CUBLAS_OP_T, // Tranpose Weight, shape (in_dim, out_dim) => (out_dim, + // in_dim) + CUBLAS_OP_N, // Input_token, shape (in_dim, 1) + m_, // num_row of (A, C) = out_dim + n, // num_col of (B, C) = 1 + k, // num_col of A and num_rows of B = in_dim + &alpha, + A, // Aarray (num_tokens * chosen_experts, in_dim, out_dim) + weight_type, + lda, // Leading Dimension of weight before transpose + B, // Barray (num_tokens * chosen_experts, in_dim, 1) + input_type, + ldb, // Leading Dimension of input_token + &beta, + C, // Carray (num_tokens * chosen_experts, out_dim, 1) + output_type, + ldc, // Leading Dimension of output + gemm_batch_count, // Total submatrixes + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + + if (m->use_bias) { + m_ = out_dim; + n = 1; + k = 1; + A = bias_ptr1; + B = (void const **)m->one_ptr_array; + C = results_ptr1; + lda = out_dim; + ldb = 1; + ldc = out_dim; + if (experts_num_layers == 2) { + m_ = lda = ldc = experts_internal_dim_size; + } + alpha = 1.0f, beta = 0.0f; + checkCUDA(cublasGemmBatchedEx( + m->handle.blas, + CUBLAS_OP_N, // Bias, shape (out_dim, 1) + CUBLAS_OP_N, // Coefficient, shape (1, 1) + m_, // num_row of (A, C) = out_dim + n, // num_col of (B, C) = 1 + k, // num_col of A and num_rows of B = 1 + &alpha, + A, // bias tensor (out_dim, 1) + weight_type, + lda, // Leading Dimension of bias tensor + B, // all-one tensor (1, 1) + CUDA_R_32F, + ldb, // Leading Dimension of all-one tensor + &alpha, + C, // Carray (num_tokens * chosen_experts, out_dim, 1) + output_type, + ldc, // Leading Dimension of output + gemm_batch_count, // Total submatrixs + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + } + + if (use_activation(activation)) { + alpha = 1.0f, beta = 0.0f; + checkCUDNN(cudnnActivationForward(m->handle.dnn, + m->actiDesc, + &alpha, + m->resultTensorDesc1, + m->batch_outputs1[0], + &beta, + m->resultTensorDesc1, + m->batch_outputs1[0])); + } + + if (experts_num_layers == 2) { + m_ = out_dim; + n = 1; + k = experts_internal_dim_size; + A = weights_ptr2; + B = (void const **)results_ptr1; + C = results_ptr2; + lda = experts_internal_dim_size; + ldb = experts_internal_dim_size; + ldc = out_dim; + alpha = 1.0f, beta = 0.0f; + checkCUDA(cublasGemmBatchedEx( + m->handle.blas, + CUBLAS_OP_T, // Tranpose Weight, shape (in_dim, out_dim) => (out_dim, + // in_dim) + CUBLAS_OP_N, // Input_token, shape (in_dim, 1) + m_, // num_row of (A, C) = out_dim + n, // num_col of (B, C) = 1 + k, // num_col of A and num_rows of B = in_dim + &alpha, + A, // Aarray (num_tokens * chosen_experts, in_dim, out_dim) + weight_type, + lda, // Leading Dimension of weight before transpose + B, // Barray (num_tokens * chosen_experts, in_dim, 1) + input_type, + ldb, // Leading Dimension of input_token + &beta, + C, // Carray (num_tokens * chosen_experts, out_dim, 1) + output_type, + ldc, // Leading Dimension of output + gemm_batch_count, // Total submatrixes + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + + if (m->use_bias) { + m_ = out_dim; + n = 1; + k = 1; + A = bias_ptr2; + B = (void const **)m->one_ptr_array; + C = results_ptr2; + lda = out_dim; + ldb = 1; + ldc = out_dim; + alpha = 1.0f, beta = 0.0f; + checkCUDA(cublasGemmBatchedEx( + m->handle.blas, + CUBLAS_OP_N, // Bias, shape (out_dim, 1) + CUBLAS_OP_N, // Coefficient, shape (1, 1) + m_, // num_row of (A, C) = out_dim + n, 
// num_col of (B, C) = 1 + k, // num_col of A and num_rows of B = 1 + &alpha, + A, // bias tensor (out_dim, 1) + weight_type, + lda, // Leading Dimension of bias tensor + B, // all-one tensor (1, 1) + CUDA_R_32F, + ldb, // Leading Dimension of all-one tensor + &alpha, + C, // Carray (num_tokens * chosen_experts, out_dim, 1) + output_type, + ldc, // Leading Dimension of output + gemm_batch_count, // Total submatrixs + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + } + + if (use_activation(activation)) { + alpha = 1.0f, beta = 0.0f; + checkCUDNN(cudnnActivationForward(m->handle.dnn, + m->actiDesc, + &alpha, + m->resultTensorDesc2, + m->batch_outputs2[0], + &beta, + m->resultTensorDesc2, + m->batch_outputs2[0])); + } + } +} + +__global__ void experts_forward_aggregate_kernel(int num_tokens, + int gemm_batch_count, + int out_dim, + float *output, + float **results_ptr, + float const **coefficient_ptr, + float **output_ptr) { + + CUDA_KERNEL_LOOP(i, num_tokens * out_dim) { + output[i] = 0.0f; + } + + __syncthreads(); + + CUDA_KERNEL_LOOP(i, gemm_batch_count * out_dim) { + int token_index = i / out_dim; + int emb_index = i % out_dim; + float res = + results_ptr[token_index][emb_index] * (*coefficient_ptr[token_index]); + atomicAdd(output_ptr[token_index] + emb_index, res); + } +} + +/*static*/ +void Experts::forward_kernel_wrapper(ExpertsMeta const *m, + float const *input, + int const *indices, + float const *topk_gate_preds, + float *output, + float const *weights, + float const *biases, + int num_active_tokens, + int chosen_experts, + int batch_size, + int out_dim) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + + cudaEvent_t t_start, t_end; + if (m->profiling) { + cudaEventCreate(&t_start); + cudaEventCreate(&t_end); + cudaEventRecord(t_start, stream); + } + + assert(num_active_tokens > 0); + assert(num_active_tokens <= m->effective_batch_size); + assert(m->effective_batch_size == batch_size); + + int num_experts_per_block = m->num_experts; + int experts_start_idx = m->experts_start_idx; + bool use_bias = m->use_bias; + ActiMode activation = m->activation; + int data_dim = m->data_dim; + int num_chosen_experts = m->num_chosen_experts; + // int num_tokens = m->effective_batch_size; + int num_tokens = num_active_tokens; + int expert_capacity = m->expert_capacity; + + assert(chosen_experts == num_chosen_experts); + // assert(num_tokens == batch_size); + assert(out_dim == m->out_dim); + + assert(weights != nullptr); + assert(use_bias == (biases != nullptr)); + + int num_indices = num_tokens * num_chosen_experts; + // values below are set by Thrust in the experts_forward_thrust_wrapper + // function + int lb_index = 0; + int ub_index = 0; + int num_valid_assignments = 0; + int non_zero_experts_count = 0; + int start_indexes = 0; + int gemm_batch_count = 0; + + experts_forward_thrust_wrapper(m, + indices, + num_indices, + experts_start_idx, + num_experts_per_block, + expert_capacity, + &lb_index, + &ub_index, + &num_valid_assignments, + &non_zero_experts_count, + &start_indexes, + &gemm_batch_count, + stream); + + // checkCUDA(cudaStreamSynchronize(stream)); + +#ifdef INFERENCE_TESTS + // Checking + // 1. 
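Putting the batched GEMMs and the aggregation kernel together, the per-token result is output[t] = Σ_k coeff[t][k] · f_e(x[t]) over the chosen experts handled by this block, where f_e applies the (transposed) expert weight, adds the bias, and runs the activation. Below is a plain CPU sketch of the single-layer case; expert-capacity dropping and the two-layer variant are omitted for brevity, and all sizes and values are made up:

```cpp
// CPU reference of the end-to-end computation in Experts::forward_kernel_wrapper
// (single-layer case): for each (token, chosen expert) pair handled by this block,
// compute y = relu(W_e^T x + b_e) and accumulate coefficient * y into the output row.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  int data_dim = 4, out_dim = 3, num_experts = 2, num_tokens = 2, k = 1;
  int experts_start_idx = 0;
  std::vector<float> input(num_tokens * data_dim, 1.0f);               // tokens
  std::vector<float> weights(num_experts * data_dim * out_dim, 0.01f); // (in,out) per expert
  std::vector<float> biases(num_experts * out_dim, 0.1f);
  std::vector<int> indices = {0, 1};       // chosen expert per (token, k)
  std::vector<float> coeff = {0.7f, 0.3f}; // gating coefficient per (token, k)
  std::vector<float> output(num_tokens * out_dim, 0.0f);

  for (int t = 0; t < num_tokens; t++) {
    for (int j = 0; j < k; j++) {
      int e = indices[t * k + j] - experts_start_idx;
      if (e < 0 || e >= num_experts) {
        continue; // expert handled by another block
      }
      float const *W = &weights[e * data_dim * out_dim];
      float const *b = &biases[e * out_dim];
      float const *x = &input[t * data_dim];
      for (int o = 0; o < out_dim; o++) {
        float y = b[o];
        for (int i = 0; i < data_dim; i++) {
          y += W[o * data_dim + i] * x[i]; // W^T x, as in the CUBLAS_OP_T GEMM
        }
        y = std::max(y, 0.0f); // AC_MODE_RELU
        output[t * out_dim + o] += coeff[t * k + j] * y;
      }
    }
  }
  std::printf("output[0][0] = %f\n", output[0]);
  return 0;
}
```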
check that m->sorted_indices contains indices sorted + int *indices_cpu = download_tensor(indices, num_indices); + // assert(indices_cpu != nullptr); + std::vector<int> indices_vec(indices_cpu, indices_cpu + num_indices); + std::vector<int> indices_vec_sorted(indices_vec.size()); + std::copy(indices_vec.begin(), indices_vec.end(), indices_vec_sorted.begin()); + std::stable_sort(indices_vec_sorted.begin(), indices_vec_sorted.end()); + + int *thrust_sorted_indices_cpu = download_tensor( + m->sorted_indices, m->num_chosen_experts * m->effective_batch_size); + // assert(thrust_sorted_indices_cpu != nullptr); + std::vector<int> thrust_sorted_indices_vec( + thrust_sorted_indices_cpu, thrust_sorted_indices_cpu + num_indices); + for (int i = 0; i < num_indices; i++) { + if (indices_vec_sorted[i] != thrust_sorted_indices_vec[i]) { + printf("i=%i\n", i); + printf("indices: "); + std::copy(indices_vec.begin(), + indices_vec.end(), + std::ostream_iterator<int>(std::cout, " ")); + std::cout << std::endl; + printf("indices_vec_sorted: "); + std::copy(indices_vec_sorted.begin(), + indices_vec_sorted.end(), + std::ostream_iterator<int>(std::cout, " ")); + std::cout << std::endl; + printf("thrust_sorted_indices_vec: "); + std::copy(thrust_sorted_indices_vec.begin(), + thrust_sorted_indices_vec.end(), + std::ostream_iterator<int>(std::cout, " ")); + std::cout << std::endl; + } + assert(indices_vec_sorted[i] == thrust_sorted_indices_vec[i]); + } + // 2. check that indices[m->original_indices[i]] = i + int *thrust_original_indices_cpu = download_tensor( + m->original_indices, m->num_chosen_experts * m->effective_batch_size); + // assert(thrust_original_indices_cpu != nullptr); + std::vector<int> thrust_original_indices_vec( + thrust_original_indices_cpu, thrust_original_indices_cpu + num_indices); + for (int i = 0; i < num_indices; i++) { + assert(indices_vec[thrust_original_indices_vec[i]] == + thrust_sorted_indices_vec[i]); + } + + // 3. check that lb_index is the index of the first element greater or equal + // to expert_start_idx + // 4. check that ub_index is greater than last, or outside array + std::vector<int>::iterator low, up; + low = std::lower_bound( + indices_vec_sorted.begin(), indices_vec_sorted.end(), experts_start_idx); + up = std::upper_bound(indices_vec_sorted.begin(), + indices_vec_sorted.end(), + experts_start_idx + num_experts_per_block - 1); + int lb_index_check = low - indices_vec_sorted.begin(), + ub_index_check = up - indices_vec_sorted.begin(); + + if (lb_index_check != lb_index || ub_index_check != ub_index) { + printf("experts_start_idx: %i, num_experts_per_block: %i, lb_index: %i, " + "lb_index_check: %i, ub_index: %i, ub_index_check: %i\n", + experts_start_idx, + num_experts_per_block, + lb_index, + lb_index_check, + ub_index, + ub_index_check); + printf("indices_vec_sorted: "); + std::copy(indices_vec_sorted.begin(), + indices_vec_sorted.end(), + std::ostream_iterator<int>(std::cout, " ")); + std::cout << std::endl; + } + assert(lb_index_check == lb_index); + assert(ub_index_check == ub_index); + + // 5. compute num_valid_assignments manually, and check that is equal to value + // computed in thrust + int num_valid_assignments_manual = ub_index_check - lb_index_check; + assert(num_valid_assignments_manual == num_valid_assignments); + + // 6. 
check m->non_zero_expert_labels, *non_zero_experts_count + std::set non_zero_experts_check; + for (int i = 0; i < num_indices; i++) { + if (indices_vec_sorted[i] >= experts_start_idx && + indices_vec_sorted[i] < experts_start_idx + num_experts_per_block) { + non_zero_experts_check.insert(indices_vec_sorted[i]); + } + } + assert(non_zero_experts_count == non_zero_experts_check.size()); + // 7. check exp_local_label_to_index + int *non_zero_expert_labels_cpu = + download_tensor(m->non_zero_expert_labels, non_zero_experts_count); + // assert(non_zero_expert_labels_cpu != nullptr); + std::vector non_zero_expert_labels_vec(non_zero_expert_labels_cpu, + non_zero_expert_labels_cpu + + non_zero_experts_count); + assert(std::is_sorted(non_zero_expert_labels_vec.begin(), + non_zero_expert_labels_vec.end())); + std::vector non_zero_experts_check_vec; + for (auto el : non_zero_experts_check) { + non_zero_experts_check_vec.push_back(el - experts_start_idx); + } + assert(std::is_sorted(non_zero_experts_check_vec.begin(), + non_zero_experts_check_vec.end())); + assert(non_zero_expert_labels_vec == non_zero_experts_check_vec); + + int *exp_local_label_to_index = + download_tensor(m->exp_local_label_to_index, non_zero_experts_count); + // assert(exp_local_label_to_index != nullptr); + std::vector exp_local_label_to_index_vec(exp_local_label_to_index, + exp_local_label_to_index + + non_zero_experts_count); + int z = 0; + for (int i = 0; i < non_zero_experts_count; i++) { + if (non_zero_experts_check.find(i) != non_zero_experts_check.end()) { + assert(exp_local_label_to_index_vec[i] == z); + z++; + } + } + + // 8. Check expert_start_indexes + int *expert_start_indices_thrust = + download_tensor(m->expert_start_indexes, non_zero_experts_count + 1); + // assert(expert_start_indices_thrust != nullptr); + std::vector expert_start_indices_thrust_vec( + expert_start_indices_thrust, + expert_start_indices_thrust + non_zero_experts_count + 1); + std::vector expert_start_indices_cpu; + std::set exp_label; + + std::vector num_assignments_per_expert_cpu; + + for (int i = lb_index; i < ub_index; i++) { + assert(indices_vec_sorted[i] >= experts_start_idx && + indices_vec_sorted[i] < experts_start_idx + num_experts_per_block); + if (exp_label.find(indices_vec_sorted[i]) == exp_label.end()) { + exp_label.insert(indices_vec_sorted[i]); + expert_start_indices_cpu.push_back(i - lb_index); + + num_assignments_per_expert_cpu.push_back(1); + } else { + num_assignments_per_expert_cpu[num_assignments_per_expert_cpu.size() - + 1] += 1; + } + } + expert_start_indices_cpu.push_back(ub_index - lb_index); + assert(num_assignments_per_expert_cpu.size() == non_zero_experts_count); + /* std::cout << "indices_vec_sorted: "; + for (int i=lb_index; i(m->num_assignments_per_expert, + num_assignments_per_expert_thrust, + non_zero_experts_count)); + assert(num_assignments_per_expert_thrust != nullptr); + std::vector num_assignments_per_expert_thrust_vec( + num_assignments_per_expert_thrust, + num_assignments_per_expert_thrust + non_zero_experts_count); + assert(num_assignments_per_expert_cpu == + num_assignments_per_expert_thrust_vec); + + int *destination_start_indices_thrust = + (int *)calloc(non_zero_experts_count, sizeof(int)); + assert(destination_start_indices_thrust != nullptr); + assert(download_tensor(m->destination_start_indices, + destination_start_indices_thrust, + non_zero_experts_count)); + assert(destination_start_indices_thrust != nullptr); + std::vector destination_start_indices_thrust_vec( + destination_start_indices_thrust, + 
destination_start_indices_thrust + non_zero_experts_count); + std::vector destination_start_indices_cpu; + int gemm_batch_count_cpu = 0; + for (int i = 0; i < num_assignments_per_expert_cpu.size(); i++) { + if (i == 0) { + destination_start_indices_cpu.push_back(0); + } else { + destination_start_indices_cpu.push_back( + std::min(expert_capacity, num_assignments_per_expert_cpu[i - 1])); + } + } + for (int i = 0; i < num_assignments_per_expert_cpu.size(); i++) { + gemm_batch_count_cpu += + std::min(expert_capacity, num_assignments_per_expert_cpu[i]); + } + for (int i = 1; i < destination_start_indices_cpu.size(); i++) { + destination_start_indices_cpu[i] += destination_start_indices_cpu[i - 1]; + } + /* + std::cout << "destination_start_indices_cpu: "; + for (int i=0; i= non_zero_experts_count); + assert(non_zero_experts_count <= num_experts_per_block); + if (non_zero_experts_count == 0) { + assert(num_valid_assignments == 0 && gemm_batch_count == 0); + } else { + assert(num_valid_assignments > 0 && gemm_batch_count > 0); + } + assert(num_valid_assignments <= num_indices); + assert(gemm_batch_count <= num_valid_assignments); + + if (num_valid_assignments == 0) { + if (m->profiling) { + cudaEventRecord(t_end, stream); + cudaEventSynchronize(t_end); + float milliseconds = 0; + cudaEventElapsedTime(&milliseconds, t_start, t_end); + printf("forward_kernel_wrapper: %f ms\n", milliseconds); + } + return; + } + + experts_forward_prepare_kernel<<>>(num_valid_assignments, + expert_capacity, + lb_index, + experts_start_idx, + num_experts_per_block, + num_chosen_experts, + data_dim, + out_dim, + m->experts_num_layers, + m->experts_internal_dim_size, + use_bias, + m->sorted_indices, + m->expert_start_indexes, + m->exp_local_label_to_index, + m->destination_start_indices, + m->original_indices, + input, + output, + m->token_idx_array, + weights, + biases, + m->weight_idx_array1, + m->weight_idx_array2, + m->bias_idx_array1, + m->bias_idx_array2, + topk_gate_preds, + m->coefficient_idx_array, + m->output_idx_array); + + // checkCUDA(cudaStreamSynchronize(stream)); + +#ifdef INFERENCE_TESTS + std::vector token_ptrs, weight_ptrs, bias_ptrs, + coefficient_ptrs; + std::vector output_ptrs; + std::map num_t_per_exp; + for (int i = 0; i < num_indices; i++) { + int global_exp_label = indices_vec[i]; + + if (global_exp_label >= experts_start_idx && + global_exp_label < experts_start_idx + num_experts_per_block && + (num_t_per_exp.find(global_exp_label) == num_t_per_exp.end() || + num_t_per_exp[global_exp_label] < expert_capacity)) { + if (num_t_per_exp.find(global_exp_label) == num_t_per_exp.end()) { + num_t_per_exp[global_exp_label] = 1; + } else { + num_t_per_exp[global_exp_label] = num_t_per_exp[global_exp_label] + 1; + } + int token_idx = i / num_chosen_experts; + // std::cout << "Push back token_idx (" << token_idx << ") * data_dim (" + // << data_dim << "): " << token_idx*data_dim << std::endl; + + token_ptrs.push_back(&input[token_idx * data_dim]); + coefficient_ptrs.push_back(&topk_gate_preds[i]); + int local_exp_label = global_exp_label - experts_start_idx; + weight_ptrs.push_back(&weights[local_exp_label * (out_dim * data_dim)]); + output_ptrs.push_back(&output[token_idx * out_dim]); + if (use_bias) { + bias_ptrs.push_back(&biases[local_exp_label * out_dim]); + } + } + } + + int i = 0, s = 0; + for (auto it : num_t_per_exp) { + int num_t = it.second; + s += num_t; + /* if (num_assignments_per_expert_cpu[i] != num_t) { + std::cout << "num_assignments_per_expert_cpu: "; + for (int j=0; j 
token_ptrs_sorted(token_ptrs.size()), + weight_ptrs_sorted(weight_ptrs.size()), + bias_ptrs_sorted(bias_ptrs.size()), + coefficient_ptrs_sorted(coefficient_ptrs.size()); + std::vector output_ptrs_sorted(output_ptrs.size()); + std::copy(token_ptrs.begin(), token_ptrs.end(), token_ptrs_sorted.begin()); + std::sort(token_ptrs_sorted.begin(), token_ptrs_sorted.end()); + std::copy(weight_ptrs.begin(), weight_ptrs.end(), weight_ptrs_sorted.begin()); + std::sort(weight_ptrs_sorted.begin(), weight_ptrs_sorted.end()); + std::copy(bias_ptrs.begin(), bias_ptrs.end(), bias_ptrs_sorted.begin()); + std::sort(bias_ptrs_sorted.begin(), bias_ptrs_sorted.end()); + std::copy(coefficient_ptrs.begin(), + coefficient_ptrs.end(), + coefficient_ptrs_sorted.begin()); + std::sort(coefficient_ptrs_sorted.begin(), coefficient_ptrs_sorted.end()); + std::copy(output_ptrs.begin(), output_ptrs.end(), output_ptrs_sorted.begin()); + std::sort(output_ptrs_sorted.begin(), output_ptrs_sorted.end()); + + // Download + float const **token_idx_array_thrust = + (float const **)calloc(gemm_batch_count, sizeof(float const *)); + assert(token_idx_array_thrust); + checkCUDA(cudaMemcpy(token_idx_array_thrust, + m->token_idx_array, + sizeof(float const *) * gemm_batch_count, + cudaMemcpyDeviceToHost)); + std::vector token_idx_array_thrust_vec( + token_idx_array_thrust, token_idx_array_thrust + gemm_batch_count); + float const **weight_idx_array_thrust = + (float const **)calloc(gemm_batch_count, sizeof(float const *)); + assert(weight_idx_array_thrust); + checkCUDA(cudaMemcpy(weight_idx_array_thrust, + m->weight_idx_array1, + sizeof(float const *) * gemm_batch_count, + cudaMemcpyDeviceToHost)); + std::vector weight_idx_array_thrust_vec( + weight_idx_array_thrust, weight_idx_array_thrust + gemm_batch_count); + float const **coefficient_idx_array_thrust = + (float const **)calloc(gemm_batch_count, sizeof(float const *)); + assert(coefficient_idx_array_thrust); + checkCUDA(cudaMemcpy(coefficient_idx_array_thrust, + m->coefficient_idx_array, + sizeof(float const *) * gemm_batch_count, + cudaMemcpyDeviceToHost)); + std::vector coefficient_idx_array_thrust_vec( + coefficient_idx_array_thrust, + coefficient_idx_array_thrust + gemm_batch_count); + float const **bias_idx_array_thrust = + (float const **)calloc(gemm_batch_count, sizeof(float const *)); + assert(bias_idx_array_thrust); + if (use_bias) { + checkCUDA(cudaMemcpy(bias_idx_array_thrust, + m->bias_idx_array1, + sizeof(float const *) * gemm_batch_count, + cudaMemcpyDeviceToHost)); + } + std::vector bias_idx_array_thrust_vec( + bias_idx_array_thrust, bias_idx_array_thrust + gemm_batch_count); + float **output_idx_array_thrust = + (float **)calloc(gemm_batch_count, sizeof(float *)); + assert(output_idx_array_thrust); + checkCUDA(cudaMemcpy(output_idx_array_thrust, + m->output_idx_array, + sizeof(float *) * gemm_batch_count, + cudaMemcpyDeviceToHost)); + std::vector output_idx_array_thrust_vec( + output_idx_array_thrust, output_idx_array_thrust + gemm_batch_count); + + std::vector token_idx_array_thrust_vec_sorted( + token_idx_array_thrust_vec.size()), + weight_idx_array_thrust_vec_sorted(weight_idx_array_thrust_vec.size()), + coefficient_idx_array_thrust_vec_sorted( + coefficient_idx_array_thrust_vec.size()), + bias_idx_array_thrust_vec_sorted(bias_idx_array_thrust_vec.size()); + std::vector output_idx_array_thrust_vec_sorted( + output_idx_array_thrust_vec.size()); + std::copy(token_idx_array_thrust_vec.begin(), + token_idx_array_thrust_vec.end(), + 
token_idx_array_thrust_vec_sorted.begin()); + std::sort(token_idx_array_thrust_vec_sorted.begin(), + token_idx_array_thrust_vec_sorted.end()); + std::copy(weight_idx_array_thrust_vec.begin(), + weight_idx_array_thrust_vec.end(), + weight_idx_array_thrust_vec_sorted.begin()); + std::sort(weight_idx_array_thrust_vec_sorted.begin(), + weight_idx_array_thrust_vec_sorted.end()); + std::copy(coefficient_idx_array_thrust_vec.begin(), + coefficient_idx_array_thrust_vec.end(), + coefficient_idx_array_thrust_vec_sorted.begin()); + std::sort(coefficient_idx_array_thrust_vec_sorted.begin(), + coefficient_idx_array_thrust_vec_sorted.end()); + std::copy(bias_idx_array_thrust_vec.begin(), + bias_idx_array_thrust_vec.end(), + bias_idx_array_thrust_vec_sorted.begin()); + std::sort(bias_idx_array_thrust_vec_sorted.begin(), + bias_idx_array_thrust_vec_sorted.end()); + std::copy(output_idx_array_thrust_vec.begin(), + output_idx_array_thrust_vec.end(), + output_idx_array_thrust_vec_sorted.begin()); + std::sort(output_idx_array_thrust_vec_sorted.begin(), + output_idx_array_thrust_vec_sorted.end()); + + if (token_ptrs_sorted != token_idx_array_thrust_vec_sorted) { + std::cout << "token_ptrs: "; + for (int i = 0; i < token_ptrs_sorted.size(); i++) { + std::cout << token_ptrs_sorted[i] << " "; + } + std::cout << std::endl; + std::cout << "token_idx_array_thrust_vec: "; + for (int i = 0; i < token_idx_array_thrust_vec_sorted.size(); i++) { + std::cout << token_idx_array_thrust_vec_sorted[i] << " "; + } + std::cout << std::endl; + std::cout << "Input: " << input << std::endl; + std::cout << "data_dim: " << data_dim << std::endl; + std::cout << "out_dim: " << out_dim << std::endl; + std::cout << "expert_start_idx: " << experts_start_idx << std::endl; + std::cout << "indices: "; + for (int i = 0; i < indices_vec.size(); i++) { + std::cout << indices_vec[i] << " "; + } + std::cout << std::endl; + std::cout << "indices_vec_sorted: "; + for (int i = 0; i < indices_vec_sorted.size(); i++) { + std::cout << indices_vec_sorted[i] << " "; + } + std::cout << std::endl; + } + assert(token_ptrs_sorted == token_idx_array_thrust_vec_sorted); + assert(weight_ptrs_sorted == weight_idx_array_thrust_vec_sorted); + if (coefficient_ptrs_sorted != coefficient_idx_array_thrust_vec_sorted) { + std::cout << "coefficient_ptrs_sorted: "; + for (int i = 0; i < coefficient_ptrs_sorted.size(); i++) { + std::cout << coefficient_ptrs_sorted[i] << " "; + } + std::cout << std::endl; + std::cout << "coefficient_idx_array_thrust_vec_sorted: "; + for (int i = 0; i < coefficient_idx_array_thrust_vec_sorted.size(); i++) { + std::cout << coefficient_idx_array_thrust_vec_sorted[i] << " "; + } + std::cout << std::endl; + std::cout << "topk_gate_preds: " << topk_gate_preds << std::endl; + std::cout << "data_dim: " << data_dim << std::endl; + std::cout << "out_dim: " << out_dim << std::endl; + std::cout << "expert_start_idx: " << experts_start_idx << std::endl; + std::cout << "indices: "; + for (int i = 0; i < indices_vec.size(); i++) { + std::cout << indices_vec[i] << " "; + } + std::cout << std::endl; + std::cout << "indices_vec_sorted: "; + for (int i = 0; i < indices_vec_sorted.size(); i++) { + std::cout << indices_vec_sorted[i] << " "; + } + std::cout << std::endl; + } + assert(coefficient_ptrs_sorted == coefficient_idx_array_thrust_vec_sorted); + if (use_bias) { + assert(bias_ptrs_sorted == bias_idx_array_thrust_vec_sorted); + } + assert(output_ptrs_sorted == output_idx_array_thrust_vec_sorted); + + assert(token_ptrs_sorted.size() == gemm_batch_count && 
+ weight_ptrs_sorted.size() == gemm_batch_count && + coefficient_ptrs_sorted.size() == gemm_batch_count && + (!use_bias || bias_ptrs_sorted.size() == gemm_batch_count) && + output_ptrs_sorted.size() == gemm_batch_count); + + for (int i = 0; i < token_ptrs_sorted.size(); i++) { + assert(token_ptrs_sorted[i]); + assert(weight_ptrs_sorted[i]); + assert(coefficient_ptrs_sorted[i]); + if (use_bias) { + assert(bias_ptrs_sorted[i]); + } + assert(output_ptrs_sorted[i]); + } + + free(token_idx_array_thrust); + free(weight_idx_array_thrust); + free(coefficient_idx_array_thrust); + free(bias_idx_array_thrust); + free(output_idx_array_thrust); + + checkCUDA(cudaFreeHost(indices_cpu)); + indices_vec.clear(); + indices_vec.shrink_to_fit(); + indices_vec_sorted.clear(); + indices_vec_sorted.shrink_to_fit(); + num_assignments_per_expert_cpu.clear(); + num_assignments_per_expert_cpu.shrink_to_fit(); + + token_ptrs.clear(); + token_ptrs.shrink_to_fit(); + token_ptrs_sorted.clear(); + token_ptrs_sorted.shrink_to_fit(); + weight_ptrs.clear(); + weight_ptrs.shrink_to_fit(); + weight_ptrs_sorted.clear(); + weight_ptrs_sorted.shrink_to_fit(); + bias_ptrs.clear(); + bias_ptrs.shrink_to_fit(); + bias_ptrs_sorted.clear(); + bias_ptrs_sorted.shrink_to_fit(); + coefficient_ptrs.clear(); + coefficient_ptrs.shrink_to_fit(); + output_ptrs.clear(); + output_ptrs.shrink_to_fit(); + output_ptrs_sorted.clear(); + output_ptrs_sorted.shrink_to_fit(); + + token_idx_array_thrust_vec_sorted.clear(); + token_idx_array_thrust_vec_sorted.shrink_to_fit(); + weight_idx_array_thrust_vec_sorted.clear(); + weight_idx_array_thrust_vec_sorted.shrink_to_fit(); + coefficient_idx_array_thrust_vec_sorted.clear(); + coefficient_idx_array_thrust_vec_sorted.shrink_to_fit(); + bias_idx_array_thrust_vec_sorted.clear(); + bias_idx_array_thrust_vec_sorted.shrink_to_fit(); + output_idx_array_thrust_vec_sorted.clear(); + output_idx_array_thrust_vec_sorted.shrink_to_fit(); + + // Check batch output pointers + assert(gemm_batch_count <= m->effective_batch_size); + float **dev_batch_outputs_cuda = (float **)calloc( + num_chosen_experts * m->effective_batch_size, sizeof(float *)); + assert(dev_batch_outputs_cuda); + checkCUDA( + cudaMemcpy(dev_batch_outputs_cuda, + m->dev_batch_outputs1, + sizeof(float *) * num_chosen_experts * m->effective_batch_size, + cudaMemcpyDeviceToHost)); + std::vector dev_batch_outputs_cuda_vec( + dev_batch_outputs_cuda, + dev_batch_outputs_cuda + num_chosen_experts * m->effective_batch_size); + + std::vector batch_outputs_host_vec( + m->batch_outputs1, + m->batch_outputs1 + num_chosen_experts * m->effective_batch_size); + assert(batch_outputs_host_vec == dev_batch_outputs_cuda_vec); + + /* std::cout << "dev_batch_outputs_cuda_vec[i]: "; + for (int i=0; i0) { + assert(dev_batch_outputs_cuda_vec[i] == dev_batch_outputs_cuda_vec[i-1] + + out_dim); + } + std::cout << dev_batch_outputs_cuda_vec[i] << " "; + } + std::cout << std::endl; */ + + free(dev_batch_outputs_cuda); +#endif + + experts_forward_GemmBatched_kernel(m, + (void const **)m->weight_idx_array1, + (void const **)m->weight_idx_array2, + (void const **)m->token_idx_array, + (void **)m->dev_batch_outputs1, + (void **)m->dev_batch_outputs2, + (void const **)m->bias_idx_array1, + (void const **)m->bias_idx_array2, + activation, + data_dim, + out_dim, + m->experts_num_layers, + m->experts_internal_dim_size, + num_tokens, + num_chosen_experts, + gemm_batch_count, + stream); + + // checkCUDA(cudaStreamSynchronize(stream)); + + int aggregation_parallelism = + std::max(num_tokens, 
gemm_batch_count) * out_dim; + experts_forward_aggregate_kernel<<>>(num_tokens, + gemm_batch_count, + out_dim, + output, + m->experts_num_layers == 1 + ? m->dev_batch_outputs1 + : m->dev_batch_outputs2, + m->coefficient_idx_array, + m->output_idx_array); + + if (m->profiling) { + cudaEventRecord(t_end, stream); + checkCUDA(cudaEventSynchronize(t_end)); + float elapsed = 0; + checkCUDA(cudaEventElapsedTime(&elapsed, t_start, t_end)); + cudaEventDestroy(t_start); + cudaEventDestroy(t_end); + printf("[Experts] forward time = %.2lfms\n", elapsed); + } +} + +ExpertsMeta::ExpertsMeta(FFHandler handler, + int _num_experts, + int _experts_start_idx, + int _data_dim, + int _out_dim, + int _experts_num_layers, + int _experts_internal_dim_size, + int _effective_batch_size, + int _num_chosen_experts, + float _alpha, + bool _use_bias, + ActiMode _activation) + : OpMeta(handler), num_experts(_num_experts), + experts_start_idx(_experts_start_idx), data_dim(_data_dim), + out_dim(_out_dim), experts_num_layers(_experts_num_layers), + experts_internal_dim_size(_experts_internal_dim_size), + effective_batch_size(_effective_batch_size), + num_chosen_experts(_num_chosen_experts), alpha(_alpha), + use_bias(_use_bias), activation(_activation) { + expert_capacity = + ceil(alpha * num_chosen_experts / num_experts * effective_batch_size); + + checkCUDA( + cudaMalloc(&sorted_indices, + num_chosen_experts * effective_batch_size * sizeof(int))); + checkCUDA( + cudaMalloc(&original_indices, + num_chosen_experts * effective_batch_size * sizeof(int))); + checkCUDA(cudaMalloc(&non_zero_expert_labels, num_experts * sizeof(int))); + checkCUDA(cudaMalloc( + &temp_sequence, + std::max(num_experts, num_chosen_experts * effective_batch_size) * + sizeof(int))); + checkCUDA(cudaMalloc(&exp_local_label_to_index, num_experts * sizeof(int))); + // expert_start_indexes needs one more slot to save the upper bound index. + // Initial sequence can require more space, though. + checkCUDA(cudaMalloc( + &expert_start_indexes, + std::max(num_experts + 1, num_chosen_experts * effective_batch_size) * + sizeof(int))); + checkCUDA(cudaMalloc(&num_assignments_per_expert, num_experts * sizeof(int))); + checkCUDA(cudaMalloc(&destination_start_indices, num_experts * sizeof(int))); + + checkCUDA( + cudaMalloc(&token_idx_array, + num_chosen_experts * effective_batch_size * sizeof(float *))); + checkCUDA( + cudaMalloc(&weight_idx_array1, + num_chosen_experts * effective_batch_size * sizeof(float *))); + checkCUDA( + cudaMalloc(&bias_idx_array1, + num_chosen_experts * effective_batch_size * sizeof(float *))); + checkCUDA( + cudaMalloc(&coefficient_idx_array, + num_chosen_experts * effective_batch_size * sizeof(float *))); + checkCUDA( + cudaMalloc(&output_idx_array, + num_chosen_experts * effective_batch_size * sizeof(float *))); + batch_outputs1 = new float *[num_chosen_experts * effective_batch_size]; + int batch_outputs1_dim = + (experts_num_layers == 1) ? 
out_dim : experts_internal_dim_size; + checkCUDA(cudaMalloc(&batch_outputs1[0], + batch_outputs1_dim * num_chosen_experts * + effective_batch_size * sizeof(float))); + checkCUDA(cudaMemset(batch_outputs1[0], + 0, + batch_outputs1_dim * num_chosen_experts * + effective_batch_size * sizeof(float))); + for (int i = 1; i < num_chosen_experts * effective_batch_size; i++) { + batch_outputs1[i] = batch_outputs1[i - 1] + batch_outputs1_dim; + } + checkCUDA( + cudaMalloc(&dev_batch_outputs1, + num_chosen_experts * effective_batch_size * sizeof(float *))); + checkCUDA( + cudaMemcpy(dev_batch_outputs1, + batch_outputs1, + num_chosen_experts * effective_batch_size * sizeof(float *), + cudaMemcpyHostToDevice)); + if (experts_num_layers == 2) { + checkCUDA(cudaMalloc(&weight_idx_array2, + num_chosen_experts * effective_batch_size * + sizeof(float *))); + checkCUDA(cudaMalloc(&bias_idx_array2, + num_chosen_experts * effective_batch_size * + sizeof(float *))); + batch_outputs2 = new float *[num_chosen_experts * effective_batch_size]; + checkCUDA(cudaMalloc(&batch_outputs2[0], + out_dim * num_chosen_experts * effective_batch_size * + sizeof(float))); + checkCUDA(cudaMemset(batch_outputs2[0], + 0, + out_dim * num_chosen_experts * effective_batch_size * + sizeof(float))); + for (int i = 1; i < num_chosen_experts * effective_batch_size; i++) { + batch_outputs2[i] = batch_outputs2[i - 1] + out_dim; + } + checkCUDA(cudaMalloc(&dev_batch_outputs2, + num_chosen_experts * effective_batch_size * + sizeof(float *))); + checkCUDA( + cudaMemcpy(dev_batch_outputs2, + batch_outputs2, + num_chosen_experts * effective_batch_size * sizeof(float *), + cudaMemcpyHostToDevice)); + } + // Bias + float *dram_one_ptr = (float *)malloc(sizeof(float) * 1); + for (int i = 0; i < 1; i++) { + dram_one_ptr[i] = 1.0f; + } + float *fb_one_ptr; + checkCUDA(cudaMalloc(&fb_one_ptr, sizeof(float) * 1)); + checkCUDA(cudaMemcpy( + fb_one_ptr, dram_one_ptr, sizeof(float) * 1, cudaMemcpyHostToDevice)); + one_ptr = (float const *)fb_one_ptr; + free((void *)dram_one_ptr); + checkCUDA( + cudaMalloc(&one_ptr_array, + num_chosen_experts * effective_batch_size * sizeof(float *))); + for (int i = 0; i < num_chosen_experts * effective_batch_size; i++) { + checkCUDA(cudaMemcpy(&one_ptr_array[i], + &fb_one_ptr, + sizeof(float *), + cudaMemcpyHostToDevice)); + } + // Activation + checkCUDNN(cudnnCreateActivationDescriptor(&actiDesc)); + checkCUDNN(cudnnCreateTensorDescriptor(&resultTensorDesc1)); + if (experts_num_layers == 2) { + checkCUDNN(cudnnCreateTensorDescriptor(&resultTensorDesc2)); + } + if (use_activation(activation)) { + cudnnActivationMode_t mode; + switch (activation) { + case AC_MODE_RELU: + mode = CUDNN_ACTIVATION_RELU; + break; + case AC_MODE_SIGMOID: + mode = CUDNN_ACTIVATION_SIGMOID; + break; + default: + // Unsupported activation mode + assert(false); + } + checkCUDNN( + cudnnSetActivationDescriptor(actiDesc, mode, CUDNN_PROPAGATE_NAN, 0.0)); + if (experts_num_layers == 1) { + checkCUDNN( + cudnnSetTensor4dDescriptor(resultTensorDesc1, + CUDNN_TENSOR_NCHW, + // CUDNN_DATA_FLOAT, + cuda_to_cudnn_datatype(CUDA_R_32F), + num_chosen_experts * effective_batch_size, + out_dim, + 1, + 1)); + } else { + checkCUDNN( + cudnnSetTensor4dDescriptor(resultTensorDesc1, + CUDNN_TENSOR_NCHW, + // CUDNN_DATA_FLOAT, + cuda_to_cudnn_datatype(CUDA_R_32F), + num_chosen_experts * effective_batch_size, + experts_internal_dim_size, + 1, + 1)); + checkCUDNN( + cudnnSetTensor4dDescriptor(resultTensorDesc2, + CUDNN_TENSOR_NCHW, + // CUDNN_DATA_FLOAT, + 
cuda_to_cudnn_datatype(CUDA_R_32F), + num_chosen_experts * effective_batch_size, + out_dim, + 1, + 1)); + } + } +} +ExpertsMeta::~ExpertsMeta(void) { + + checkCUDA(cudaFree(sorted_indices)); + checkCUDA(cudaFree(original_indices)); + checkCUDA(cudaFree(non_zero_expert_labels)); + checkCUDA(cudaFree(temp_sequence)); + checkCUDA(cudaFree(exp_local_label_to_index)); + checkCUDA(cudaFree(expert_start_indexes)); + checkCUDA(cudaFree(num_assignments_per_expert)); + checkCUDA(cudaFree(destination_start_indices)); + checkCUDA(cudaFree(token_idx_array)); + checkCUDA(cudaFree(weight_idx_array1)); + checkCUDA(cudaFree(weight_idx_array2)); + checkCUDA(cudaFree(coefficient_idx_array)); + checkCUDA(cudaFree(output_idx_array)); + checkCUDA(cudaFree(dev_batch_outputs1)); + checkCUDA(cudaFree(dev_batch_outputs2)); + checkCUDA(cudaFree(bias_idx_array1)); + checkCUDA(cudaFree(bias_idx_array2)); + checkCUDA(cudaFree(batch_outputs1[0])); + checkCUDA(cudaFree(batch_outputs2[0])); + delete[] batch_outputs1; + delete[] batch_outputs2; + // Bias + checkCUDA(cudaFree((void *)one_ptr)); + checkCUDA(cudaFree((void *)one_ptr_array)); + // Activation + checkCUDNN(cudnnDestroyActivationDescriptor(actiDesc)); + checkCUDNN(cudnnDestroyTensorDescriptor(resultTensorDesc1)); + checkCUDNN(cudnnDestroyTensorDescriptor(resultTensorDesc2)); +} + +}; // namespace FlexFlow diff --git a/src/ops/fused.cc b/src/ops/fused.cc index 3dc442708f..1d5db2f461 100644 --- a/src/ops/fused.cc +++ b/src/ops/fused.cc @@ -100,6 +100,7 @@ FusedOp::FusedOp(FFModel &model, Op *op) op_num_outputs[0] = op->numOutputs; op_op_type[0] = op->op_type; operators[0] = op; + layer_guid = op->layer_guid; // for (int i = 0; i < numInputs; i++) { // op_input_source[i] = SOURCE_INPUT; // op_input_idx[i] = i; @@ -127,9 +128,9 @@ bool FusedOp::add_operator(FFModel &model, Op *op) { // assert(model.config.find_parallel_config(my_domain.get_dim(), name, // my_config)); assert(model.config.find_parallel_config(op_domain.get_dim(), // op->name, op_config)); - // Cannot fuse parallel operators since they have different paralel_is - // in forward and backward - assert(!op->is_parallel_op()); + // Cannot fuse parallel operators (except allreduce) since they have different + // paralel_is in forward and backward + assert(!op->is_parallel_op() || op->op_type == OP_ALLREDUCE); // Currently don't consider nested fusion assert(op->op_type != OP_FUSED); MachineView my_view = outputs[0]->machine_view; @@ -149,12 +150,14 @@ bool FusedOp::add_operator(FFModel &model, Op *op) { (weight_offset + op->numWeights > MAX_NUM_FUSED_TENSORS) || (output_offset + op->numOutputs > MAX_NUM_FUSED_TENSORS)) { fprintf(stderr, "Cannot fuse. Consider increase MAX_NUM_FUSED_TENSORS\n"); + assert(false); return false; } if (numOperators + 1 > MAX_NUM_FUSED_OPERATORS) { fprintf( stderr, "Reach to the fusion limit. 
Consider increase MAX_NUM_FUSED_OPERATORS"); + assert(false); return false; } // Set inputs @@ -331,6 +334,92 @@ void FusedOp::init(FFModel const &ff) { } } +void FusedOp::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + // Call init methods in individual operators + Domain domain = runtime->get_index_space_domain(ctx, parallel_is); + int ioff = 0, ooff = 0; + for (int op = 0; op < numOperators; op++) { + // prepare batch_inputs, batch_outputs for operators[i] + std::vector my_batch_inputs; + std::vector my_batch_outputs; + for (int i = 0; i < op_num_inputs[op]; i++) { + int my_off = op_input_idx[i + ioff]; + if (op_input_source[i + ioff] == SOURCE_INPUT) { + my_batch_inputs.push_back(batch_inputs[my_off]); + } else if (op_input_source[i + ioff] == SOURCE_OUTPUT) { + my_batch_inputs.push_back(batch_outputs[my_off]); + } else { + assert(false); + } + } + for (int i = 0; i < op_num_outputs[op]; i++) { + assert(op_output_source[i + ooff] == SOURCE_OUTPUT); + my_batch_outputs.push_back(batch_outputs[i + ooff]); + } + ioff += op_num_inputs[op]; + ooff += op_num_outputs[op]; + operators[op]->init_inference(ff, my_batch_inputs, my_batch_outputs, mv); + for (size_t j = 0; j < domain.get_volume(); j++) { + fused_meta[j].meta[op] = + operators[op]->inference_meta[my_batch_outputs[0]][j]; + } + } + for (size_t j = 0; j < domain.get_volume(); j++) { + fused_meta[j].numOperators = numOperators; + } + switch (domain.get_dim()) { +#define DIMFUNC(DIM) \ + case DIM: { \ + Rect rect = domain; \ + int idx = 0; \ + for (PointInRectIterator it(rect); it(); it++) { \ + argmap.set_point(*it, \ + TaskArgument(&fused_meta[idx++], sizeof(FusedOpMeta))); \ + } \ + break; \ + } + LEGION_FOREACH_N(DIMFUNC) +#undef DIMFUNC + default: + assert(false); + } + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + IndexLauncher launcher(FUSEDOP_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(FusedOp)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + switch (domain.get_dim()) { +#define DIMFUNC(DIM) \ + case DIM: { \ + Rect rect = domain; \ + int idx = 0; \ + for (PointInRectIterator it(rect); it(); it++) { \ + inference_meta[batch_outputs[0]][idx++] = fm.get_result(*it); \ + } \ + break; \ + } + LEGION_FOREACH_N(DIMFUNC) +#undef DIMFUNC + default: + assert(false); + } +} + void FusedOp::forward(FFModel const &ff) { // Set iter_config iter_config = ff.iter_config; @@ -380,6 +469,67 @@ void FusedOp::forward(FFModel const &ff) { runtime->execute_index_space(ctx, launcher); } +FutureMap FusedOp::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + // Set iter_config + iter_config = ff.iter_config; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + // bc is one of BatchConfig, TreeVerifyBatchConfig, and BeamSearchBatchConfig + // so we transfer the maximum of them + // size_t batch_config_size = + // std::max(sizeof(TreeVerifyBatchConfig), sizeof(BeamSearchBatchConfig)); + IndexLauncher launcher(FUSEDOP_INF_TASK_ID, + parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_future(bc); + int offset = 0; + for (int i = 0; i < numInputs; i++) { + assert(inputs[i]->part != LogicalPartition::NO_PART); + assert(inputs[i]->region != LogicalRegion::NO_REGION); + launcher.add_region_requirement(RegionRequirement(batch_inputs[i]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[i]->region)); + launcher.add_field(offset + i, FID_DATA); + } + offset += numInputs; + for (int i = 0; i < numWeights; i++) { + assert(weights[i]->region != LogicalRegion::NO_REGION); + launcher.add_region_requirement(RegionRequirement(weights[i]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[i]->region)); + launcher.add_field(offset + i, FID_DATA); + } + offset += numWeights; + for (int i = 0; i < numOutputs; i++) { + assert(outputs[i]->region != LogicalRegion::NO_REGION); + launcher.add_region_requirement( + RegionRequirement(batch_outputs[i]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[i]->region)); + launcher.add_field(offset + i, FID_DATA); + } + return runtime->execute_index_space(ctx, launcher); +} + void FusedOp::backward(FFModel const &ff) { // Set iter_config iter_config = ff.iter_config; diff --git a/src/ops/fused.cpp b/src/ops/fused.cpp index a602c5d6b1..c717881e66 100644 --- a/src/ops/fused.cpp +++ b/src/ops/fused.cpp @@ -14,20 +14,29 @@ */ #include "flexflow/ops/fused.h" +#include "flexflow/accessor.h" #include "flexflow/model.h" #include "flexflow/ops/batch_norm.h" #include "flexflow/ops/element_unary.h" +#include "flexflow/ops/embedding.h" +#include "flexflow/ops/inc_multihead_self_attention.h" #include "flexflow/ops/kernels/batch_matmul_kernels.h" #include "flexflow/ops/kernels/concat_kernels.h" #include "flexflow/ops/kernels/conv_2d_kernels.h" #include "flexflow/ops/kernels/dropout_kernels.h" #include "flexflow/ops/kernels/element_binary_kernels.h" +#include "flexflow/ops/kernels/embedding_kernels.h" #include "flexflow/ops/kernels/flat_kernels.h" #include "flexflow/ops/kernels/linear_kernels.h" #include "flexflow/ops/kernels/pool_2d_kernels.h" #include "flexflow/ops/kernels/reshape_kernels.h" +#include "flexflow/ops/kernels/rms_norm_kernels.h" #include "flexflow/ops/kernels/transpose_kernels.h" +#include "flexflow/ops/layer_norm.h" #include "flexflow/ops/linear.h" +#include "flexflow/ops/spec_inc_multihead_self_attention.h" +#include "flexflow/ops/tree_inc_multihead_self_attention.h" +#include "flexflow/parallel_ops/kernels/allreduce_kernels.h" #include "flexflow/utils/hip_helper.h" #include @@ -284,11 +293,10 @@ __host__ void FusedOp::forward_task(Task const *task, assert(my_input_accessor[0].domain == my_input_accessor[1].domain); assert(my_input_accessor[0].domain == my_output_accessor[0].domain); ElementBinaryMeta *m = (ElementBinaryMeta *)metas->meta[op]; - Kernels::ElementBinary::forward_kernel_wrapper( - m, - my_input_accessor[0].get_float_ptr(), - my_input_accessor[1].get_float_ptr(), - my_output_accessor[0].get_float_ptr()); + Kernels::ElementBinary::forward_kernel_wrapper(m, + my_input_accessor[0], + 
my_input_accessor[1], + my_output_accessor[0]); break; break; } @@ -374,6 +382,414 @@ __host__ void FusedOp::forward_task(Task const *task, // "[Fused:forward:output]"); } +/* + regions[...](I): inputs + regions[...](I): weights + regions[...](I): outputs +*/ +__host__ void + FusedOp::inference_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + // const FusedOp* fused = (FusedOp*) task->args; + FusedOpMeta const *metas = *((FusedOpMeta **)task->local_args); + FusedOp const *fused = metas->fused_op; + BatchConfig const *bc = (BatchConfig *)task->args; + assert(metas->numOperators == fused->numOperators); + assert(regions.size() == task->regions.size()); + assert((int)regions.size() == + fused->numInputs + fused->numWeights + fused->numOutputs); + GenericTensorAccessorR input_accessor[MAX_NUM_INPUTS]; + GenericTensorAccessorR weight_accessor[MAX_NUM_WEIGHTS]; + GenericTensorAccessorW output_accessor[MAX_NUM_OUTPUTS]; + assert(fused->numInputs <= MAX_NUM_INPUTS); + for (int i = 0; i < fused->numInputs; i++) { + input_accessor[i] = + helperGetGenericTensorAccessorRO(fused->input_data_types[i], + regions[i], + task->regions[i], + FID_DATA, + ctx, + runtime); + } + int roff = fused->numInputs; + assert(fused->numWeights <= MAX_NUM_WEIGHTS); + for (int i = 0; i < fused->numWeights; i++) { + weight_accessor[i] = + helperGetGenericTensorAccessorRO(fused->weight_data_types[i], + regions[i + roff], + task->regions[i + roff], + FID_DATA, + ctx, + runtime); + } + roff += fused->numWeights; + assert(fused->numOutputs <= MAX_NUM_OUTPUTS); + for (int i = 0; i < fused->numOutputs; i++) { + output_accessor[i] = + helperGetGenericTensorAccessorWO(fused->output_data_types[i], + regions[i + roff], + task->regions[i + roff], + FID_DATA, + ctx, + runtime); + } + // Assert that all meta share the same dnn/blas handler + int start = 0; + for (start = 0; start < fused->numOperators; start++) { + if (metas->meta[start] != NULL) { + break; + } + } + for (int op = start + 1; op < fused->numOperators; op++) { + if (metas->meta[op] != NULL) { + assert(metas->meta[start]->handle.blas == metas->meta[op]->handle.blas); + assert(metas->meta[start]->handle.dnn == metas->meta[op]->handle.dnn); + } + } + + hipStream_t stream; + if (start < fused->numOperators) { + checkCUDA(get_legion_stream(&stream)); + } + + int ioff = 0, woff = 0, ooff = 0; + for (int op = 0; op < fused->numOperators; op++) { + GenericTensorAccessorR my_input_accessor[MAX_NUM_INPUTS]; + GenericTensorAccessorR my_weight_accessor[MAX_NUM_WEIGHTS]; + GenericTensorAccessorW my_output_accessor[MAX_NUM_OUTPUTS]; + for (int i = 0; i < fused->op_num_inputs[op]; i++) { + int my_off = fused->op_input_idx[i + ioff]; + if (fused->op_input_source[i + ioff] == SOURCE_INPUT) { + my_input_accessor[i] = input_accessor[my_off]; + } else if (fused->op_input_source[i + ioff] == SOURCE_OUTPUT) { + my_input_accessor[i] = output_accessor[my_off]; + } else { + assert(false); + } + } + for (int i = 0; i < fused->op_num_weights[op]; i++) { + assert(fused->op_weight_source[i + woff] == SOURCE_WEIGHT); + my_weight_accessor[i] = weight_accessor[fused->op_weight_idx[i + woff]]; + } + for (int i = 0; i < fused->op_num_outputs[op]; i++) { + assert(fused->op_output_source[i + ooff] == SOURCE_OUTPUT); + my_output_accessor[i] = output_accessor[i + ooff]; + } + switch (fused->op_op_type[op]) { + case OP_CONCAT: { + assert(fused->op_num_weights[op] == 0); + assert(fused->op_num_outputs[op] == 1); + ConcatMeta *m = (ConcatMeta *)metas->meta[op]; + int 
num_inputs = fused->op_num_inputs[op]; + Kernels::Concat::forward_kernel_wrapper(m, + my_output_accessor[0], + my_input_accessor, + num_inputs, + m->legion_axis); + break; + } + case OP_BATCHNORM: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_outputs[op] == 1); + assert(my_input_accessor[0].domain.get_dim() == 5); + assert(my_output_accessor[0].domain.get_dim() == 5); + assert(my_weight_accessor[0].domain.get_dim() == 2); + assert(my_weight_accessor[1].domain.get_dim() == 2); + BatchNormMeta *m = (BatchNormMeta *)metas->meta[op]; + BatchNorm::forward_kernel(m, + my_input_accessor[0].get_float_ptr(), + my_output_accessor[0].get_float_ptr(), + my_weight_accessor[0].get_float_ptr(), + my_weight_accessor[1].get_float_ptr()); + break; + } + case OP_LINEAR: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_outputs[op] == 1); + Domain kernel_domain = my_weight_accessor[0].domain; + int in_dim = kernel_domain.hi()[0] - kernel_domain.lo()[0] + 1; + int out_dim = kernel_domain.hi()[1] - kernel_domain.lo()[1] + 1; + int batch_size = my_input_accessor[0].domain.get_volume() / in_dim; + assert(my_output_accessor[0].domain.get_volume() == + out_dim * batch_size); + assert(my_input_accessor[0].domain.get_volume() == in_dim * batch_size); + void const *bias_ptr = nullptr; + if (fused->op_num_weights[op] == 2) { + assert(my_weight_accessor[1].domain.get_volume() == out_dim); + bias_ptr = my_weight_accessor[1].ptr; + } else { + assert(fused->op_num_weights[op] == 1); + } + LinearMeta *m = (LinearMeta *)metas->meta[op]; + assert(m->input_type[0] == my_input_accessor[0].data_type); + assert(m->input_type[0] == my_output_accessor[0].data_type); + batch_size = bc->num_active_tokens(); + Kernels::Linear::forward_kernel_wrapper(m, + my_input_accessor[0].ptr, + my_output_accessor[0].ptr, + my_weight_accessor[0].ptr, + bias_ptr, + in_dim, + out_dim, + batch_size); + break; + } + case OP_BATCHMATMUL: { + assert(fused->op_num_inputs[op] == 2); + assert(fused->op_num_weights[op] == 0); + assert(fused->op_num_outputs[op] == 1); + Domain out_domain = my_output_accessor[0].domain; + Domain a_domain = my_input_accessor[0].domain; + Domain b_domain = my_input_accessor[1].domain; + int m = b_domain.hi()[0] - b_domain.lo()[0] + 1; + assert(m == out_domain.hi()[0] - out_domain.lo()[0] + 1); + int n = a_domain.hi()[1] - a_domain.lo()[1] + 1; + assert(n == out_domain.hi()[1] - out_domain.lo()[1] + 1); + int k = a_domain.hi()[0] - a_domain.lo()[0] + 1; + assert(k == b_domain.hi()[1] - b_domain.lo()[1] + 1); + assert(a_domain.get_dim() == b_domain.get_dim()); + assert(a_domain.get_dim() == out_domain.get_dim()); + int batch = 1; + for (int i = 2; i < a_domain.get_dim(); i++) { + int dim_size = a_domain.hi()[i] - a_domain.lo()[i] + 1; + assert(dim_size == b_domain.hi()[i] - b_domain.lo()[i] + 1); + assert(dim_size == out_domain.hi()[i] - out_domain.lo()[i] + 1); + batch *= dim_size; + } + BatchMatmulMeta *meta = (BatchMatmulMeta *)metas->meta[op]; + Kernels::BatchMatmul::forward_kernel_wrapper( + meta, + my_output_accessor[0].get_float_ptr(), + my_input_accessor[0].get_float_ptr(), + my_input_accessor[1].get_float_ptr(), + (float const *)nullptr, + m, + n, + k, + batch, + meta->a_seq_length_dim, + meta->b_seq_length_dim, + fused->iter_config.seq_length); + break; + } + case OP_EW_ADD: + case OP_EW_SUB: + case OP_EW_MUL: + case OP_EW_DIV: + case OP_EW_MAX: + case OP_EW_MIN: { + assert(fused->op_num_inputs[op] == 2); + assert(fused->op_num_weights[op] == 0); + assert(fused->op_num_outputs[op] == 
1); + assert(my_input_accessor[0].domain == my_input_accessor[1].domain); + assert(my_input_accessor[0].domain == my_output_accessor[0].domain); + ElementBinaryMeta *m = (ElementBinaryMeta *)metas->meta[op]; + Kernels::ElementBinary::forward_kernel_wrapper(m, + my_input_accessor[0], + my_input_accessor[1], + my_output_accessor[0]); + break; + break; + } + case OP_EMBEDDING: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_weights[op] == 1); + assert(fused->op_num_outputs[op] == 1); + EmbeddingMeta *m = (EmbeddingMeta *)metas->meta[op]; + if (m->aggr == AGGR_MODE_NONE) { + // assert(kernel_domain.get_dim() == 2); + assert(my_input_accessor[0].domain.get_dim() + 1 == + my_output_accessor[0].domain.get_dim()); + for (size_t i = 0; i < my_input_accessor[0].domain.get_dim(); i++) { + assert(my_input_accessor[0].domain.hi()[i] == + my_output_accessor[0].domain.hi()[i + 1]); + assert(my_input_accessor[0].domain.lo()[i] == + my_output_accessor[0].domain.lo()[i + 1]); + } + assert(my_weight_accessor[0].domain.hi()[0] - + my_weight_accessor[0].domain.lo()[0] == + my_output_accessor[0].domain.hi()[0] - + my_output_accessor[0].domain.lo()[0]); + } else { + assert(my_input_accessor[0].domain.get_dim() == + my_output_accessor[0].domain.get_dim()); + for (size_t i = 1; i < my_input_accessor[0].domain.get_dim(); i++) { + assert(my_input_accessor[0].domain.hi()[i] == + my_output_accessor[0].domain.hi()[i]); + assert(my_input_accessor[0].domain.lo()[i] == + my_output_accessor[0].domain.lo()[i]); + } + assert(my_weight_accessor[0].domain.hi()[0] - + my_weight_accessor[0].domain.lo()[0] == + my_output_accessor[0].domain.hi()[0] - + my_output_accessor[0].domain.lo()[0]); + } + int in_dim, out_dim, effective_batch_size; + if (m->aggr == AGGR_MODE_NONE) { + in_dim = 1; + out_dim = my_output_accessor[0].domain.hi()[0] - + my_output_accessor[0].domain.lo()[0] + 1; + effective_batch_size = + my_output_accessor[0].domain.get_volume() / out_dim; + assert(effective_batch_size * in_dim == + my_input_accessor[0].domain.get_volume()); + } else { + assert(m->aggr == AGGR_MODE_AVG || m->aggr == AGGR_MODE_SUM); + in_dim = my_input_accessor[0].domain.hi()[0] - + my_input_accessor[0].domain.lo()[0] + 1; + out_dim = my_output_accessor[0].domain.hi()[0] - + my_output_accessor[0].domain.lo()[0] + 1; + effective_batch_size = + my_output_accessor[0].domain.get_volume() / out_dim; + assert(effective_batch_size * in_dim == + my_input_accessor[0].domain.get_volume()); + } + + assert(my_input_accessor[0].data_type == DT_INT32 || + my_input_accessor[0].data_type == DT_INT64); + Kernels::Embedding::forward_kernel_wrapper(m, + my_input_accessor[0], + my_output_accessor[0], + my_weight_accessor[0], + in_dim, + out_dim, + effective_batch_size); + break; + } + case OP_RELU: + case OP_SIGMOID: + case OP_TANH: + case OP_ELU: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_weights[op] == 0); + assert(fused->op_num_outputs[op] == 1); + assert(my_input_accessor[0].domain == my_output_accessor[0].domain); + ElementUnaryMeta *m = (ElementUnaryMeta *)metas->meta[op]; + ElementUnary::forward_kernel_wrapper( + m, + my_input_accessor[0].get_float_ptr(), + my_output_accessor[0].get_float_ptr(), + my_input_accessor[0].domain.get_volume()); + break; + } + case OP_RMS_NORM: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_weights[op] == 1); + assert(fused->op_num_outputs[op] == 1); + RMSNormMeta const *m = (RMSNormMeta *)metas->meta[op]; + Kernels::RMSNorm::forward_kernel_wrapper(m, + 
my_input_accessor[0], + my_weight_accessor[0], + my_output_accessor[0]); + break; + } + case OP_INC_MULTIHEAD_SELF_ATTENTION: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_outputs[op] == 1); + IncMultiHeadSelfAttentionMeta const *m = + (IncMultiHeadSelfAttentionMeta *)metas->meta[op]; + assert(fused->op_num_weights[op] == (1 + (int)(*m->bias))); + GenericTensorAccessorR biases; + if (*m->bias) { + assert(fused->op_num_weights[op] == 2); + biases = my_weight_accessor[1]; + } + IncMultiHeadSelfAttention::inference_kernel_wrapper( + m, + bc, + task->index_point.point_data[0], + my_input_accessor[0], + my_weight_accessor[0], + my_output_accessor[0], + biases); + break; + } + case OP_TREE_INC_MULTIHEAD_SELF_ATTENTION: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_outputs[op] == 1); + TreeIncMultiHeadSelfAttentionMeta *m = + (TreeIncMultiHeadSelfAttentionMeta *)metas->meta[op]; + TreeVerifyBatchConfig const *tree_bc = + (TreeVerifyBatchConfig *)task->args; + assert(fused->op_num_weights[op] == (1 + (int)(*m->bias))); + GenericTensorAccessorR biases; + if (*m->bias) { + assert(fused->op_num_weights[op] == 2); + biases = my_weight_accessor[1]; + } + TreeIncMultiHeadSelfAttention::inference_kernel_wrapper( + m, + tree_bc, + task->index_point.point_data[0], + my_input_accessor[0], + my_weight_accessor[0], + my_output_accessor[0], + biases); + break; + } + case OP_SPEC_INC_MULTIHEAD_SELF_ATTENTION: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_outputs[op] == 1); + SpecIncMultiHeadSelfAttentionMeta const *m = + (SpecIncMultiHeadSelfAttentionMeta *)metas->meta[op]; + BeamSearchBatchConfig const *beam_bc = + (BeamSearchBatchConfig *)task->args; + assert(fused->op_num_weights[op] == (1 + (int)(*m->bias))); + GenericTensorAccessorR biases; + if (*m->bias) { + assert(fused->op_num_weights[op] == 2); + biases = my_weight_accessor[1]; + } + SpecIncMultiHeadSelfAttention::inference_kernel_wrapper( + m, + beam_bc, + task->index_point.point_data[0], + my_input_accessor[0], + my_weight_accessor[0], + my_output_accessor[0], + biases); + break; + } + case OP_LAYERNORM: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_outputs[op] == 1); + LayerNormMeta const *m = (LayerNormMeta *)metas->meta[op]; + assert(fused->op_num_weights[op] == 2 * (int)(m->elementwise_affine)); + GenericTensorAccessorR gamma, beta; + if (m->elementwise_affine) { + gamma = my_weight_accessor[0]; + beta = my_weight_accessor[1]; + } + LayerNorm::forward_kernel_wrapper( + m, my_input_accessor[0], my_output_accessor[0], gamma, beta); + break; + } + case OP_ALLREDUCE: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_outputs[op] == 1); + AllReduceMeta const *m = (AllReduceMeta *)metas->meta[op]; + Kernels::AllReduce::forward_kernel_wrapper( + m, my_input_accessor[0], my_output_accessor[0]); + break; + } + default: { + fprintf(stderr, + "Fusion currently does not support type = %d\n", + fused->op_op_type[op]); + assert(false && "Fusion currently does not support type"); + } + } + ioff += fused->op_num_inputs[op]; + woff += fused->op_num_weights[op]; + ooff += fused->op_num_outputs[op]; + } + // for (int i = 0; i < fused->numOutputs; i++) + // print_tensor(output_ptr[i], output_domain[i].get_volume(), + // "[Fused:forward:output]"); +} + /* regions[...](I): input regions[...](I): weight diff --git a/src/ops/fused.cu b/src/ops/fused.cu index ca2a331984..b834073064 100644 --- a/src/ops/fused.cu +++ b/src/ops/fused.cu @@ -20,6 +20,7 @@ #include 
"flexflow/ops/embedding.h" #include "flexflow/ops/flat.h" #include "flexflow/ops/fused.h" +#include "flexflow/ops/inc_multihead_self_attention.h" #include "flexflow/ops/kernels/batch_matmul_kernels.h" #include "flexflow/ops/kernels/concat_kernels.h" #include "flexflow/ops/kernels/conv_2d_kernels.h" @@ -30,7 +31,13 @@ #include "flexflow/ops/kernels/linear_kernels.h" #include "flexflow/ops/kernels/pool_2d_kernels.h" #include "flexflow/ops/kernels/reshape_kernels.h" +#include "flexflow/ops/kernels/rms_norm_kernels.h" +#include "flexflow/ops/kernels/softmax_kernels.h" #include "flexflow/ops/kernels/transpose_kernels.h" +#include "flexflow/ops/layer_norm.h" +#include "flexflow/ops/spec_inc_multihead_self_attention.h" +#include "flexflow/ops/tree_inc_multihead_self_attention.h" +#include "flexflow/parallel_ops/kernels/allreduce_kernels.h" #include "flexflow/utils/cuda_helper.h" namespace FlexFlow { @@ -38,8 +45,10 @@ namespace FlexFlow { using Legion::Context; using Legion::coord_t; using Legion::Domain; +using Legion::Future; using Legion::LogicalPartition; using Legion::LogicalRegion; +using Legion::Memory; using Legion::PhysicalRegion; using Legion::Runtime; using Legion::Task; @@ -62,7 +71,7 @@ OpMeta *FusedOp::init_task(Task const *task, /* regions[...](I): inputs regions[...](I): weights - regions[...](I): outputs + regions[...](O): outputs */ __host__ void FusedOp::forward_task(Task const *task, std::vector const ®ions, @@ -229,13 +238,15 @@ __host__ void FusedOp::forward_task(Task const *task, out_dim * batch_size); assert(my_input_accessor[0].domain.get_volume() == in_dim * batch_size); float const *bias_ptr = nullptr; + LinearMeta *m = (LinearMeta *)metas->meta[op]; if (fused->op_num_weights[op] == 2) { assert(my_weight_accessor[1].domain.get_volume() == out_dim); - bias_ptr = my_weight_accessor[1].get_float_ptr(); + if (!m->add_bias_only_once || task->index_point.point_data[0] == 0) { + bias_ptr = my_weight_accessor[1].get_float_ptr(); + } } else { assert(fused->op_num_weights[op] == 1); } - LinearMeta *m = (LinearMeta *)metas->meta[op]; Kernels::Linear::forward_kernel_wrapper( m, my_input_accessor[0].get_float_ptr(), @@ -297,11 +308,10 @@ __host__ void FusedOp::forward_task(Task const *task, assert(my_input_accessor[0].domain == my_input_accessor[1].domain); assert(my_input_accessor[0].domain == my_output_accessor[0].domain); ElementBinaryMeta *m = (ElementBinaryMeta *)metas->meta[op]; - Kernels::ElementBinary::forward_kernel_wrapper( - m, - my_input_accessor[0].get_float_ptr(), - my_input_accessor[1].get_float_ptr(), - my_output_accessor[0].get_float_ptr()); + Kernels::ElementBinary::forward_kernel_wrapper(m, + my_input_accessor[0], + my_input_accessor[1], + my_output_accessor[0]); break; } case OP_EMBEDDING: { @@ -358,7 +368,8 @@ __host__ void FusedOp::forward_task(Task const *task, my_input_accessor[0].domain.get_volume()); } - assert(my_input_accessor[0].data_type == DT_INT64); + assert(my_input_accessor[0].data_type == DT_INT32 || + my_input_accessor[0].data_type == DT_INT64); Kernels::Embedding::forward_kernel_wrapper(m, my_input_accessor[0], my_output_accessor[0], @@ -368,6 +379,7 @@ __host__ void FusedOp::forward_task(Task const *task, effective_batch_size); break; } + case OP_GELU: case OP_RELU: case OP_SIGMOID: case OP_TANH: @@ -408,6 +420,26 @@ __host__ void FusedOp::forward_task(Task const *task, my_input_accessor[0].domain.get_volume()); break; } + case OP_SOFTMAX: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_weights[op] == 0); + 
assert(fused->op_num_outputs[op] == 1); + assert(my_input_accessor[0].domain.get_volume() == + my_output_accessor[0].domain.get_volume()); + SoftmaxMeta *m = (SoftmaxMeta *)metas->meta[op]; + if (m->input_type == DT_HALF) { + Kernels::Softmax::forward_kernel_wrapper( + m, + my_input_accessor[0].get_half_ptr(), + my_output_accessor[0].get_half_ptr()); + } else if (m->input_type == DT_FLOAT) { + Kernels::Softmax::forward_kernel_wrapper( + m, + my_input_accessor[0].get_float_ptr(), + my_output_accessor[0].get_float_ptr()); + } + break; + } case OP_RESHAPE: { assert(fused->op_num_inputs[op] == 1); assert(fused->op_num_weights[op] == 0); @@ -451,6 +483,470 @@ __host__ void FusedOp::forward_task(Task const *task, // "[Fused:forward:output]"); } +/* + regions[...](I): inputs + regions[...](I): weights + regions[...](O): outputs +*/ +__host__ void + FusedOp::inference_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + // const FusedOp* fused = (FusedOp*) task->args; + FusedOpMeta const *metas = *((FusedOpMeta **)task->local_args); + FusedOp const *fused = metas->fused_op; + // BatchConfig const *bc = (BatchConfig *)task->args; + BatchConfig const *bc = BatchConfig::from_future(task->futures[0]); + // Return if no active tokens + if (bc->num_tokens == 0) { + return; + } + + assert(metas->numOperators == fused->numOperators); + assert(regions.size() == task->regions.size()); + assert((int)regions.size() == + fused->numInputs + fused->numWeights + fused->numOutputs); + // Domain input_domain[MAX_NUM_INPUTS]; + // Domain weight_domain[MAX_NUM_WEIGHTS]; + // Domain output_domain[MAX_NUM_OUTPUTS]; + GenericTensorAccessorR input_accessor[MAX_NUM_INPUTS]; + GenericTensorAccessorR weight_accessor[MAX_NUM_WEIGHTS]; + GenericTensorAccessorW output_accessor[MAX_NUM_OUTPUTS]; + assert(fused->numInputs <= MAX_NUM_INPUTS); + for (int i = 0; i < fused->numInputs; i++) { + // input_domain[i] = runtime->get_index_space_domain( + // ctx, task->regions[i].region.get_index_space()); + input_accessor[i] = + helperGetGenericTensorAccessorRO(fused->input_data_types[i], + regions[i], + task->regions[i], + FID_DATA, + ctx, + runtime); + } + int roff = fused->numInputs; + assert(fused->numWeights <= MAX_NUM_WEIGHTS); + for (int i = 0; i < fused->numWeights; i++) { + // weight_domain[i] = runtime->get_index_space_domain( + // ctx, task->regions[i + roff].region.get_index_space()); + weight_accessor[i] = + helperGetGenericTensorAccessorRO(fused->weight_data_types[i], + regions[i + roff], + task->regions[i + roff], + FID_DATA, + ctx, + runtime); + } + roff += fused->numWeights; + assert(fused->numOutputs <= MAX_NUM_OUTPUTS); + for (int i = 0; i < fused->numOutputs; i++) { + // output_domain[i] = runtime->get_index_space_domain( + // ctx, task->regions[i + roff].region.get_index_space()); + output_accessor[i] = + helperGetGenericTensorAccessorWO(fused->output_data_types[i], + regions[i + roff], + task->regions[i + roff], + FID_DATA, + ctx, + runtime); + } + // Assert that all meta share the same dnn/blas handler + int start = 0; + for (start = 0; start < fused->numOperators; start++) { + if (metas->meta[start] != NULL) { + break; + } + } + for (int op = start + 1; op < fused->numOperators; op++) { + if (metas->meta[op] != NULL) { + assert(metas->meta[start]->handle.blas == metas->meta[op]->handle.blas); + assert(metas->meta[start]->handle.dnn == metas->meta[op]->handle.dnn); + } + } + + int ioff = 0, woff = 0, ooff = 0; + for (int op = 0; op < fused->numOperators; op++) { + // Domain 
my_id[MAX_NUM_INPUTS]; + // Domain my_wd[MAX_NUM_WEIGHTS]; + // Domain my_od[MAX_NUM_OUTPUTS]; + GenericTensorAccessorR my_input_accessor[MAX_NUM_INPUTS]; + GenericTensorAccessorR my_weight_accessor[MAX_NUM_WEIGHTS]; + GenericTensorAccessorW my_output_accessor[MAX_NUM_OUTPUTS]; + for (int i = 0; i < fused->op_num_inputs[op]; i++) { + int my_off = fused->op_input_idx[i + ioff]; + if (fused->op_input_source[i + ioff] == SOURCE_INPUT) { + // my_id[i] = input_domain[my_off]; + my_input_accessor[i] = input_accessor[my_off]; + } else if (fused->op_input_source[i + ioff] == SOURCE_OUTPUT) { + // my_id[i] = output_domain[my_off]; + my_input_accessor[i] = output_accessor[my_off]; + } else { + assert(false); + } + } + for (int i = 0; i < fused->op_num_weights[op]; i++) { + assert(fused->op_weight_source[i + woff] == SOURCE_WEIGHT); + // my_wd[i] = weight_domain[fused->op_weight_idx[i + woff]]; + // my_wp[i] = weight_ptr[fused->op_weight_idx[i + woff]]; + my_weight_accessor[i] = weight_accessor[fused->op_weight_idx[i + woff]]; + } + for (int i = 0; i < fused->op_num_outputs[op]; i++) { + assert(fused->op_output_source[i + ooff] == SOURCE_OUTPUT); + // my_od[i] = output_domain[fused->op_output_idx[i + ooff]]; + // my_op[i] = output_ptr[fused->op_output_idx[i + ooff]]; + my_output_accessor[i] = output_accessor[i + ooff]; + } + switch (fused->op_op_type[op]) { + case OP_CONCAT: { + assert(fused->op_num_weights[op] == 0); + assert(fused->op_num_outputs[op] == 1); + ConcatMeta *m = (ConcatMeta *)metas->meta[op]; + int num_inputs = fused->op_num_inputs[op]; + Kernels::Concat::forward_kernel_wrapper(m, + my_output_accessor[0], + my_input_accessor, + num_inputs, + m->legion_axis); + break; + } + case OP_BATCHNORM: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_outputs[op] == 1); + assert(my_input_accessor[0].domain.get_dim() == 5); + assert(my_output_accessor[0].domain.get_dim() == 5); + assert(my_weight_accessor[0].domain.get_dim() == 2); + assert(my_weight_accessor[1].domain.get_dim() == 2); + BatchNormMeta *m = (BatchNormMeta *)metas->meta[op]; + BatchNorm::forward_kernel(m, + my_input_accessor[0].get_float_ptr(), + my_output_accessor[0].get_float_ptr(), + my_weight_accessor[0].get_float_ptr(), + my_weight_accessor[1].get_float_ptr()); + break; + } + case OP_LINEAR: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_outputs[op] == 1); + Domain kernel_domain = my_weight_accessor[0].domain; + int in_dim = kernel_domain.hi()[0] - kernel_domain.lo()[0] + 1; + int out_dim = kernel_domain.hi()[1] - kernel_domain.lo()[1] + 1; + int batch_size = my_input_accessor[0].domain.get_volume() / in_dim; + assert(my_output_accessor[0].domain.get_volume() == + out_dim * batch_size); + assert(my_input_accessor[0].domain.get_volume() == in_dim * batch_size); + void const *bias_ptr = nullptr; + LinearMeta *m = (LinearMeta *)metas->meta[op]; + if (fused->op_num_weights[op] == 2) { + assert(my_weight_accessor[1].domain.get_volume() == out_dim); + if (!m->add_bias_only_once || task->index_point.point_data[0] == 0) { + bias_ptr = my_weight_accessor[1].ptr; + } + } else { + assert(fused->op_num_weights[op] == 1); + } + assert(m->input_type[0] == my_input_accessor[0].data_type); + assert(m->input_type[0] == my_output_accessor[0].data_type); + batch_size = bc->num_active_tokens(); + Kernels::Linear::forward_kernel_wrapper(m, + my_input_accessor[0].ptr, + my_output_accessor[0].ptr, + my_weight_accessor[0].ptr, + bias_ptr, + in_dim, + out_dim, + batch_size); + break; + } + case OP_BATCHMATMUL: { + 
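One easy-to-miss detail in the `OP_LINEAR` inference branch above: the batch size passed to the kernel comes from the `BatchConfig`, not from the tensor volume, so the GEMM only covers the tokens that are active in the current decoding step rather than the maximum batch the tensors were allocated for. A sketch reusing the hunk's names:

```cpp
// in_dim/out_dim are read from the weight domain as before, but the batch
// dimension is clamped to the number of active tokens in this step.
int in_dim = kernel_domain.hi()[0] - kernel_domain.lo()[0] + 1;
int out_dim = kernel_domain.hi()[1] - kernel_domain.lo()[1] + 1;
int batch_size = bc->num_active_tokens();
Kernels::Linear::forward_kernel_wrapper(m,
                                        my_input_accessor[0].ptr,
                                        my_output_accessor[0].ptr,
                                        my_weight_accessor[0].ptr,
                                        bias_ptr,
                                        in_dim,
                                        out_dim,
                                        batch_size);
```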
assert(fused->op_num_inputs[op] == 2); + assert(fused->op_num_weights[op] == 0); + assert(fused->op_num_outputs[op] == 1); + Domain out_domain = my_output_accessor[0].domain; + Domain a_domain = my_input_accessor[0].domain; + Domain b_domain = my_input_accessor[1].domain; + int m = b_domain.hi()[0] - b_domain.lo()[0] + 1; + assert(m == out_domain.hi()[0] - out_domain.lo()[0] + 1); + int n = a_domain.hi()[1] - a_domain.lo()[1] + 1; + assert(n == out_domain.hi()[1] - out_domain.lo()[1] + 1); + int k = a_domain.hi()[0] - a_domain.lo()[0] + 1; + assert(k == b_domain.hi()[1] - b_domain.lo()[1] + 1); + assert(a_domain.get_dim() == b_domain.get_dim()); + assert(a_domain.get_dim() == out_domain.get_dim()); + int batch = 1; + for (int i = 2; i < a_domain.get_dim(); i++) { + int dim_size = a_domain.hi()[i] - a_domain.lo()[i] + 1; + assert(dim_size == b_domain.hi()[i] - b_domain.lo()[i] + 1); + assert(dim_size == out_domain.hi()[i] - out_domain.lo()[i] + 1); + batch *= dim_size; + } + BatchMatmulMeta *meta = (BatchMatmulMeta *)metas->meta[op]; + Kernels::BatchMatmul::forward_kernel_wrapper( + meta, + my_output_accessor[0].get_float_ptr(), + my_input_accessor[0].get_float_ptr(), + my_input_accessor[1].get_float_ptr(), + (float const *)nullptr, + m, + n, + k, + batch, + meta->a_seq_length_dim, + meta->b_seq_length_dim, + fused->iter_config.seq_length); + break; + } + case OP_EW_ADD: + case OP_EW_SUB: + case OP_EW_MUL: + case OP_EW_DIV: + case OP_EW_MAX: + case OP_EW_MIN: { + assert(fused->op_num_inputs[op] == 2); + assert(fused->op_num_weights[op] == 0); + assert(fused->op_num_outputs[op] == 1); + assert(my_input_accessor[0].domain == my_input_accessor[1].domain); + assert(my_input_accessor[0].domain == my_output_accessor[0].domain); + ElementBinaryMeta *m = (ElementBinaryMeta *)metas->meta[op]; + Kernels::ElementBinary::forward_kernel_wrapper(m, + my_input_accessor[0], + my_input_accessor[1], + my_output_accessor[0]); + break; + } + case OP_EMBEDDING: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_weights[op] == 1); + assert(fused->op_num_outputs[op] == 1); + EmbeddingMeta *m = (EmbeddingMeta *)metas->meta[op]; + if (m->aggr == AGGR_MODE_NONE) { + // assert(kernel_domain.get_dim() == 2); + assert(my_input_accessor[0].domain.get_dim() + 1 == + my_output_accessor[0].domain.get_dim()); + for (size_t i = 0; i < my_input_accessor[0].domain.get_dim(); i++) { + assert(my_input_accessor[0].domain.hi()[i] == + my_output_accessor[0].domain.hi()[i + 1]); + assert(my_input_accessor[0].domain.lo()[i] == + my_output_accessor[0].domain.lo()[i + 1]); + } + assert(my_weight_accessor[0].domain.hi()[0] - + my_weight_accessor[0].domain.lo()[0] == + my_output_accessor[0].domain.hi()[0] - + my_output_accessor[0].domain.lo()[0]); + } else { + assert(my_input_accessor[0].domain.get_dim() == + my_output_accessor[0].domain.get_dim()); + for (size_t i = 1; i < my_input_accessor[0].domain.get_dim(); i++) { + assert(my_input_accessor[0].domain.hi()[i] == + my_output_accessor[0].domain.hi()[i]); + assert(my_input_accessor[0].domain.lo()[i] == + my_output_accessor[0].domain.lo()[i]); + } + assert(my_weight_accessor[0].domain.hi()[0] - + my_weight_accessor[0].domain.lo()[0] == + my_output_accessor[0].domain.hi()[0] - + my_output_accessor[0].domain.lo()[0]); + } + int in_dim, out_dim, effective_batch_size; + if (m->aggr == AGGR_MODE_NONE) { + in_dim = 1; + out_dim = my_output_accessor[0].domain.hi()[0] - + my_output_accessor[0].domain.lo()[0] + 1; + effective_batch_size = + my_output_accessor[0].domain.get_volume() 
/ out_dim; + assert(effective_batch_size * in_dim == + my_input_accessor[0].domain.get_volume()); + } else { + assert(m->aggr == AGGR_MODE_AVG || m->aggr == AGGR_MODE_SUM); + in_dim = my_input_accessor[0].domain.hi()[0] - + my_input_accessor[0].domain.lo()[0] + 1; + out_dim = my_output_accessor[0].domain.hi()[0] - + my_output_accessor[0].domain.lo()[0] + 1; + effective_batch_size = + my_output_accessor[0].domain.get_volume() / out_dim; + assert(effective_batch_size * in_dim == + my_input_accessor[0].domain.get_volume()); + } + + assert(my_input_accessor[0].data_type == DT_INT32 || + my_input_accessor[0].data_type == DT_INT64); + Kernels::Embedding::forward_kernel_wrapper(m, + my_input_accessor[0], + my_output_accessor[0], + my_weight_accessor[0], + in_dim, + out_dim, + effective_batch_size); + break; + } + case OP_GELU: + case OP_RELU: + case OP_SIGMOID: + case OP_TANH: + case OP_ELU: + case OP_SCALAR_TRUE_DIV: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_weights[op] == 0); + assert(fused->op_num_outputs[op] == 1); + assert(my_input_accessor[0].domain == my_output_accessor[0].domain); + ElementUnaryMeta *m = (ElementUnaryMeta *)metas->meta[op]; + if (m->data_type == DT_HALF) { + ElementUnary::forward_kernel_wrapper( + m, + my_input_accessor[0].get_half_ptr(), + my_output_accessor[0].get_half_ptr(), + my_input_accessor[0].domain.get_volume()); + } else if (m->data_type == DT_FLOAT) { + ElementUnary::forward_kernel_wrapper( + m, + my_input_accessor[0].get_float_ptr(), + my_output_accessor[0].get_float_ptr(), + my_input_accessor[0].domain.get_volume()); + } else { + assert(false && "Unsupported data type in ElementUnary forward"); + } + break; + } + case OP_RMS_NORM: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_weights[op] == 1); + assert(fused->op_num_outputs[op] == 1); + RMSNormMeta const *m = (RMSNormMeta *)metas->meta[op]; + Kernels::RMSNorm::forward_kernel_wrapper(m, + my_input_accessor[0], + my_weight_accessor[0], + my_output_accessor[0]); + break; + } + case OP_INC_MULTIHEAD_SELF_ATTENTION: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_outputs[op] == 1); + IncMultiHeadSelfAttentionMeta const *m = + (IncMultiHeadSelfAttentionMeta *)metas->meta[op]; + assert(fused->op_num_weights[op] == (1 + (int)(*m->bias))); + GenericTensorAccessorR biases; + if (*m->bias) { + assert(fused->op_num_weights[op] == 2); + biases = my_weight_accessor[1]; + } + IncMultiHeadSelfAttention::inference_kernel_wrapper( + m, + bc, + task->index_point.point_data[0], + my_input_accessor[0], + my_weight_accessor[0], + my_output_accessor[0], + biases); + break; + } + case OP_TREE_INC_MULTIHEAD_SELF_ATTENTION: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_outputs[op] == 1); + TreeIncMultiHeadSelfAttentionMeta *m = + (TreeIncMultiHeadSelfAttentionMeta *)metas->meta[op]; + // TreeVerifyBatchConfig const *tree_bc = + // (TreeVerifyBatchConfig *)task->args; + TreeVerifyBatchConfig const &tree_bc = + Future(task->futures[0]).get_result(); + assert(fused->op_num_weights[op] == (1 + (int)(*m->bias))); + GenericTensorAccessorR biases; + if (*m->bias) { + assert(fused->op_num_weights[op] == 2); + biases = my_weight_accessor[1]; + } + TreeIncMultiHeadSelfAttention::inference_kernel_wrapper( + m, + &tree_bc, + task->index_point.point_data[0], + my_input_accessor[0], + my_weight_accessor[0], + my_output_accessor[0], + biases); + break; + } + case OP_SPEC_INC_MULTIHEAD_SELF_ATTENTION: { + assert(fused->op_num_inputs[op] == 1); + 
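For the attention branches above, the same future that carries the `BatchConfig` is reinterpreted with the richer config type that each speculative-decoding operator needs. A sketch of that retrieval, assuming Legion's templated `Future::get_result` (the template arguments are an assumption inferred from the declared variable types in the hunk; each operator fetches only the config matching its mode):

```cpp
// Tree verification and beam search consume specialized batch configs that
// extend the base BatchConfig carried by the launcher's first future.
TreeVerifyBatchConfig const &tree_bc =
    Future(task->futures[0]).get_result<TreeVerifyBatchConfig>();
BeamSearchBatchConfig const &beam_bc =
    Future(task->futures[0]).get_result<BeamSearchBatchConfig>();
```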
assert(fused->op_num_outputs[op] == 1); + SpecIncMultiHeadSelfAttentionMeta const *m = + (SpecIncMultiHeadSelfAttentionMeta *)metas->meta[op]; + // BeamSearchBatchConfig const *beam_bc = + // (BeamSearchBatchConfig *)task->args; + BeamSearchBatchConfig const &beam_bc = + Future(task->futures[0]).get_result(); + assert(fused->op_num_weights[op] == (1 + (int)(*m->bias))); + GenericTensorAccessorR biases; + if (*m->bias) { + assert(fused->op_num_weights[op] == 2); + biases = my_weight_accessor[1]; + } + SpecIncMultiHeadSelfAttention::inference_kernel_wrapper( + m, + &beam_bc, + task->index_point.point_data[0], + my_input_accessor[0], + my_weight_accessor[0], + my_output_accessor[0], + biases); + break; + } + case OP_LAYERNORM: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_outputs[op] == 1); + LayerNormMeta const *m = (LayerNormMeta *)metas->meta[op]; + assert(fused->op_num_weights[op] == 2 * (int)(m->elementwise_affine)); + GenericTensorAccessorR gamma, beta; + if (m->elementwise_affine) { + gamma = my_weight_accessor[0]; + beta = my_weight_accessor[1]; + } + LayerNorm::forward_kernel_wrapper( + m, my_input_accessor[0], my_output_accessor[0], gamma, beta); + break; + } + case OP_SOFTMAX: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_weights[op] == 0); + assert(fused->op_num_outputs[op] == 1); + assert(my_input_accessor[0].domain.get_volume() == + my_output_accessor[0].domain.get_volume()); + SoftmaxMeta *m = (SoftmaxMeta *)metas->meta[op]; + if (m->input_type == DT_HALF) { + Kernels::Softmax::forward_kernel_wrapper( + m, + my_input_accessor[0].get_half_ptr(), + my_output_accessor[0].get_half_ptr()); + } else if (m->input_type == DT_FLOAT) { + Kernels::Softmax::forward_kernel_wrapper( + m, + my_input_accessor[0].get_float_ptr(), + my_output_accessor[0].get_float_ptr()); + } + break; + } + case OP_ALLREDUCE: { + assert(fused->op_num_inputs[op] == 1); + assert(fused->op_num_outputs[op] == 1); + AllReduceMeta const *m = (AllReduceMeta *)metas->meta[op]; + Kernels::AllReduce::inference_kernel_wrapper( + m, bc, my_input_accessor[0], my_output_accessor[0]); + break; + } + default: { + fprintf(stderr, + "Fusion currently does not support type = %d\n", + fused->op_op_type[op]); + assert(false && "Fusion currently does not support type"); + } + } + ioff += fused->op_num_inputs[op]; + woff += fused->op_num_weights[op]; + ooff += fused->op_num_outputs[op]; + } + // for (int i = 0; i < fused->numOutputs; i++) + // print_tensor(output_ptr[i], output_domain[i].get_volume(), + // "[Fused:forward:output]"); +} + /* regions[...](I): input regions[...](I): weight @@ -459,7 +955,6 @@ __host__ void FusedOp::forward_task(Task const *task, regions[...](I/O): weight_grad regions[...](I/O): output_grad */ - __host__ void FusedOp::backward_task(Task const *task, std::vector const ®ions, Context ctx, @@ -830,6 +1325,7 @@ __host__ void FusedOp::backward_task(Task const *task, batch_size); break; } + case OP_GELU: case OP_RELU: case OP_SIGMOID: case OP_TANH: diff --git a/src/ops/gather.cc b/src/ops/gather.cc index f094fe38b0..635c741d8b 100644 --- a/src/ops/gather.cc +++ b/src/ops/gather.cc @@ -166,6 +166,7 @@ void Gather::serialize(Legion::Serializer &sez) const { GatherParams params = get_params(); sez.serialize(params.legion_dim); sez.serialize(this->layer_guid.id); + sez.serialize(this->layer_guid.transformer_layer_id); } using PCG::Node; @@ -177,9 +178,10 @@ Node Gather::deserialize(FFModel &ff, assert(num_inputs == 2); int legion_dim; dez.deserialize(legion_dim); - size_t id; 
+ size_t id, transformer_layer_id; dez.deserialize(id); - LayerID layer_guid(id); + dez.deserialize(transformer_layer_id); + LayerID layer_guid(id, transformer_layer_id); GatherParams params; params.legion_dim = legion_dim; diff --git a/src/ops/group_by.cc b/src/ops/group_by.cc index 850a5c4587..f2f94234c3 100644 --- a/src/ops/group_by.cc +++ b/src/ops/group_by.cc @@ -164,6 +164,56 @@ Group_by::Group_by(FFModel &model, : Group_by( model, inputs.first, inputs.second, params.n, params.alpha, name) {} +void Group_by::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(GROUP_BY_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(Group_by)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + // data + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + // assign + launcher.add_region_requirement(RegionRequirement(batch_inputs[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[1]->region)); + launcher.add_field(1, FID_DATA); + + // output + for (int i = 0; i < n; i++) { + launcher.add_region_requirement( + RegionRequirement(batch_outputs[i]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[i]->region)); + launcher.add_field(i + 2, FID_DATA); + } + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + void Group_by::init(FFModel const &ff) { assert(check_output_input_weight_same_parallel_is()); parallel_is = outputs[0]->parallel_is; @@ -214,7 +264,7 @@ OpMeta *Group_by::init_task(Task const *task, Runtime *runtime) { Group_by *gb = (Group_by *)task->args; FFHandler handle = *((FFHandler *)task->local_args); - GroupByMeta *m = new GroupByMeta(handle, gb->n); + GroupByMeta *m = new GroupByMeta(handle, gb->n, gb->alpha); m->profiling = gb->profiling; return m; } @@ -226,7 +276,7 @@ void Group_by::forward(FFModel const &ff) { set_argumentmap_for_forward(ff, argmap); IndexLauncher launcher(GROUP_BY_FWD_TASK_ID, parallel_is, - TaskArgument(this, sizeof(Group_by)), + TaskArgument(NULL, 0), argmap, Predicate::TRUE_PRED, false /*must*/, @@ -261,16 +311,62 @@ void Group_by::forward(FFModel const &ff) { runtime->execute_index_space(ctx, launcher); } +FutureMap Group_by::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = + mv ? 
mv->hash() : batch_outputs[0]->machine_view.hash(); + /* std::cout << "GroupBy op machine_view: " << *(MachineView const *)mv + << std::endl; */ + IndexLauncher launcher(GROUP_BY_FWD_TASK_ID, + parallel_is, + TaskArgument(NULL, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + // data + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + + // assign + launcher.add_region_requirement(RegionRequirement(batch_outputs[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_outputs[1]->region)); + launcher.add_field(1, FID_DATA); + + // output + for (int i = 0; i < n; i++) { + launcher.add_region_requirement( + RegionRequirement(batch_outputs[i]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[i]->region)); + launcher.add_field(i + 2, FID_DATA); + } + + return runtime->execute_index_space(ctx, launcher); +} + void Group_by::forward_task(Task const *task, std::vector const ®ions, Context ctx, Runtime *runtime) { - // Get n, alpha - Group_by const *gb = (Group_by *)task->args; - int n = gb->n; - float alpha = gb->alpha; - - assert((int)regions.size() == n + 2); + int n = (int)regions.size() - 2; assert((int)task->regions.size() == n + 2); GroupByMeta const *m = *((GroupByMeta **)task->local_args); @@ -297,7 +393,6 @@ void Group_by::forward_task(Task const *task, // Each entry in the "outputs" vector points to the Legion tensor that will // contain the tockens dispatched to the corresponding expert float *outputs[n]; - int exp_output_rows = (int)ceil(alpha * k / n * batch_size); for (int i = 0; i < n; i++) { Domain out_domain = runtime->get_index_space_domain( ctx, task->regions[i + 2].region.get_index_space()); @@ -306,7 +401,6 @@ void Group_by::forward_task(Task const *task, coord_t output_rows = out_domain.hi()[1] - out_domain.lo()[1] + 1; coord_t output_cols = out_domain.hi()[0] - out_domain.lo()[0] + 1; - assert((int)output_rows == exp_output_rows); assert(output_cols == input_cols); } @@ -316,7 +410,6 @@ void Group_by::forward_task(Task const *task, outputs, n, k, - alpha, batch_size, data_dim); } @@ -328,7 +421,7 @@ void Group_by::backward(FFModel const &ff) { set_argumentmap_for_backward(ff, argmap); IndexLauncher launcher(GROUP_BY_BWD_TASK_ID, parallel_is, - TaskArgument(this, sizeof(Group_by)), + TaskArgument(NULL, 0), argmap, Predicate::TRUE_PRED, false /*must*/, @@ -368,13 +461,9 @@ void Group_by::backward_task(Task const *task, std::vector const ®ions, Context ctx, Runtime *runtime) { - // Get n, alpha GroupByMeta const *m = *((GroupByMeta **)task->local_args); - Group_by const *gb = (Group_by *)task->args; - int n = gb->n; - float alpha = gb->alpha; - assert((int)regions.size() == n + 2); + int n = (int)regions.size() - 2; assert((int)task->regions.size() == n + 2); // get input and assign regions @@ -396,7 +485,6 @@ void Group_by::backward_task(Task const *task, // get output float *output_grads[n]; - int exp_output_rows = (int)ceil(alpha * k / n * batch_size); for (int i = 0; i < n; i++) { Domain out_domain = runtime->get_index_space_domain( ctx, task->regions[i + 2].region.get_index_space()); @@ -405,7 +493,6 @@ void Group_by::backward_task(Task const *task, coord_t output_rows = out_domain.hi()[1] - out_domain.lo()[1] + 1; coord_t output_cols = out_domain.hi()[0] - out_domain.lo()[0] + 1; - assert((int)output_rows == exp_output_rows); assert(output_cols == 
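The Group_by changes above and below follow one theme: the operator object is no longer shipped through `TaskArgument`, so the tasks recover `n` and `alpha` another way. A compact sketch of the new convention, reusing names from these hunks:

```cpp
// Tasks launched with TaskArgument(NULL, 0) recover their parameters from
// the region list and from GroupByMeta instead of unpacking a Group_by*:
int n = (int)regions.size() - 2;  // regions = {data, assign, n expert outputs}
GroupByMeta const *m = *((GroupByMeta **)task->local_args);
// ...and inside the kernel wrappers (group_by.cu / group_by.cpp):
float alpha = m->alpha;           // stored by GroupByMeta(handler, n, alpha)
```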
input_cols); } @@ -415,7 +502,6 @@ void Group_by::backward_task(Task const *task, output_grads, n, k, - alpha, batch_size, data_dim); } @@ -466,7 +552,7 @@ bool Group_by::measure_operator_cost(Simulator *sim, } } - GroupByMeta *m = new GroupByMeta(sim->handler, n); + GroupByMeta *m = new GroupByMeta(sim->handler, n, alpha); // allocate sim->free_all(); @@ -500,15 +586,8 @@ bool Group_by::measure_operator_cost(Simulator *sim, int data_dim = in_domain.hi()[0] - in_domain.lo()[0] + 1; forward = [&] { - forward_kernel_wrapper(m, - input_ptr, - assign_ptr, - output_ptrs, - n, - k, - alpha, - batch_size, - data_dim); + forward_kernel_wrapper( + m, input_ptr, assign_ptr, output_ptrs, n, k, batch_size, data_dim); }; inner_measure_operator_cost(sim, forward, backward, cost_metrics); diff --git a/src/ops/group_by.cpp b/src/ops/group_by.cpp index f45e9092a5..51bcd7d7b4 100644 --- a/src/ops/group_by.cpp +++ b/src/ops/group_by.cpp @@ -118,16 +118,17 @@ __global__ void } /*static*/ -void Group_by::forward_kernel_wrapper( - GroupByMeta const *m, - float const *input, - int const *exp_assign, - float **outputs, - int n, // num experts - int k, // chosen experts - float alpha, // factor additional memory assigned - int batch_size, - int data_dim) { +void Group_by::forward_kernel_wrapper(GroupByMeta const *m, + float const *input, + int const *exp_assign, + float **outputs, + int n, // num experts + int k, // chosen experts + int batch_size, + int data_dim) { + + float alpha = m->alpha; + // TODO: why cublas/cudnn stream is needed here? hipStream_t stream; checkCUDA(get_legion_stream(&stream)); @@ -151,16 +152,17 @@ void Group_by::forward_kernel_wrapper( data_dim); } -void Group_by::backward_kernel_wrapper( - GroupByMeta const *m, - float *input_grad, - int const *exp_assign, - float **output_grads, - int n, // num experts - int k, // chosen experts - float alpha, // factor additional memory assigned - int batch_size, - int data_dim) { +void Group_by::backward_kernel_wrapper(GroupByMeta const *m, + float *input_grad, + int const *exp_assign, + float **output_grads, + int n, // num experts + int k, // chosen experts + int batch_size, + int data_dim) { + + float alpha = m->alpha; + // TODO: why cublas/cudnn stream is needed here hipStream_t stream; checkCUDA(get_legion_stream(&stream)); @@ -186,7 +188,8 @@ void Group_by::backward_kernel_wrapper( data_dim); } -GroupByMeta::GroupByMeta(FFHandler handler, int n) : OpMeta(handler) { +GroupByMeta::GroupByMeta(FFHandler handler, int n, float _alpha) + : OpMeta(handler), alpha(_alpha) { checkCUDA(hipMalloc(&dev_region_ptrs, n * sizeof(float *))); } GroupByMeta::~GroupByMeta(void) { diff --git a/src/ops/group_by.cu b/src/ops/group_by.cu index ee0b18337c..0ed09e20b3 100644 --- a/src/ops/group_by.cu +++ b/src/ops/group_by.cu @@ -106,17 +106,18 @@ __global__ void } /*static*/ -void Group_by::forward_kernel_wrapper( - GroupByMeta const *m, - float const *input, - int const *exp_assign, - float **outputs, - int n, // num experts - int k, // chosen experts - float alpha, // factor additional memory assigned - int batch_size, - int data_dim) { +void Group_by::forward_kernel_wrapper(GroupByMeta const *m, + float const *input, + int const *exp_assign, + float **outputs, + int n, // num experts + int k, // chosen experts + int batch_size, + int data_dim) { // TODO: why cublas/cudnn stream is needed here? 
+ + float alpha = m->alpha; + cudaStream_t stream; checkCUDA(get_legion_stream(&stream)); cudaEvent_t t_start, t_end; @@ -148,16 +149,17 @@ void Group_by::forward_kernel_wrapper( } } -void Group_by::backward_kernel_wrapper( - GroupByMeta const *m, - float *input_grad, - int const *exp_assign, - float **output_grads, - int n, // num experts - int k, // chosen experts - float alpha, // factor additional memory assigned - int batch_size, - int data_dim) { +void Group_by::backward_kernel_wrapper(GroupByMeta const *m, + float *input_grad, + int const *exp_assign, + float **output_grads, + int n, // num experts + int k, // chosen experts + int batch_size, + int data_dim) { + + float alpha = m->alpha; + // TODO: why cublas/cudnn stream is needed here cudaStream_t stream; checkCUDA(get_legion_stream(&stream)); @@ -196,7 +198,8 @@ void Group_by::backward_kernel_wrapper( } } -GroupByMeta::GroupByMeta(FFHandler handler, int n) : OpMeta(handler) { +GroupByMeta::GroupByMeta(FFHandler handler, int n, float _alpha) + : OpMeta(handler), alpha(_alpha) { checkCUDA(cudaMalloc(&dev_region_ptrs, n * sizeof(float *))); } GroupByMeta::~GroupByMeta(void) { diff --git a/src/ops/inc_multihead_self_attention.cc b/src/ops/inc_multihead_self_attention.cc new file mode 100644 index 0000000000..f4f64aee8a --- /dev/null +++ b/src/ops/inc_multihead_self_attention.cc @@ -0,0 +1,1686 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/ops/inc_multihead_self_attention.h" +#include "flexflow/ffconst_utils.h" +#include "flexflow/model.h" +#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) +#include "flexflow/utils/cuda_helper.h" +#else +#include "flexflow/utils/hip_helper.h" +#endif +#include "flexflow/utils/hash_utils.h" +#include "legion/legion_utilities.h" +#ifdef INFERENCE_TESTS +#include +using namespace at::indexing; +#endif + +namespace FlexFlow { + +// declare Legion names +using Legion::ArgumentMap; +using Legion::Context; +using Legion::coord_t; +using Legion::Domain; +using Legion::Future; +using Legion::FutureMap; +using Legion::IndexLauncher; +using Legion::Machine; +using Legion::Memory; +using Legion::PhysicalRegion; +using Legion::Predicate; +using Legion::Rect; +using Legion::RegionRequirement; +using Legion::Runtime; +using Legion::Task; +using Legion::TaskArgument; +using Legion::TaskLauncher; +using PCG::Node; + +LegionRuntime::Logger::Category log_inc_mha("IncrementalMHA"); + +bool IncMultiHeadSelfAttentionParams::is_valid( + ParallelTensorShape const &input) const { + bool is_valid = input.is_valid(); + return is_valid; +} + +Tensor FFModel::inc_multihead_self_attention(const Tensor input, + int embed_dim, + int num_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + DataType data_type, + Initializer *kernel_initializer, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name) { + return inc_multiquery_self_attention(input, + embed_dim, + num_heads, + num_heads, + kdim, + vdim, + dropout, + bias, + add_bias_kv, + add_zero_attn, + data_type, + kernel_initializer, + apply_rotary_embedding, + scaling_query, + scaling_factor, + qk_prod_scaling, + name); +} + +Tensor FFModel::inc_multiquery_self_attention(const Tensor input, + int embed_dim, + int num_q_heads, + int num_kv_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + DataType data_type, + Initializer *kernel_initializer, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name) { + if (data_type == DT_NONE) { + data_type = input->data_type; + } + DataType quantization_type = cpu_offload ? config.quantization_type : DT_NONE; + bool offload = cpu_offload; + Layer *li = nullptr; + int weight_num = bias ? 2 : 1; + if (data_type != input->data_type) { + Tensor casted_input = cast(input, data_type, "type cast for IncMHA"); + li = new Layer(this, + OP_INC_MULTIHEAD_SELF_ATTENTION, + data_type, + name, + 1 /*inputs*/, + weight_num /*weights*/, + 1 /*outputs*/, + casted_input); + } else { + li = new Layer(this, + OP_INC_MULTIHEAD_SELF_ATTENTION, + data_type, + name, + 1 /*inputs*/, + weight_num /*weights*/, + 1 /*outputs*/, + input); + } + { + int numdims = input->num_dims; + int dims[MAX_TENSOR_DIM]; + for (int i = 0; i < numdims; i++) { + dims[i] = input->dims[i]; + } + dims[0] = embed_dim; + li->outputs[0] = create_tensor_legion_ordering( + numdims, dims, data_type, li, 0, true /*create_grad*/); + } + // Compute weight size + int qProjSize = kdim, kProjSize = kdim, vProjSize = kdim, + oProjSize = embed_dim; + int qSize = input->dims[0], kSize = input->dims[0], vSize = input->dims[0]; + int qParas = qProjSize * qSize; + int kParas = kProjSize * kSize; + int vParas = vProjSize * vSize; + int oParas = oProjSize * (vProjSize > 0 ? 
vProjSize : vSize); + int weight_size = qParas * num_q_heads + kParas * num_kv_heads + + vParas * num_kv_heads + oParas * num_q_heads; + int one_head_size = qParas + kParas + vParas + oParas; + + { + // compress the weight size if quantization. + if (quantization_type != DT_NONE) { + one_head_size = get_quantization_to_byte_size( + data_type, quantization_type, one_head_size); + } + int dims[1] = {weight_size}; + li->weights[0] = create_weight_legion_ordering( + 1, + dims, + quantization_type == DT_NONE ? data_type : quantization_type, + li, + true /*create_grad*/, + kernel_initializer, + CHOSEN_SYNC_TYPE); + } + if (bias) { + // q, k, v, o + int dims[1] = {qProjSize * num_q_heads + + (kProjSize + vProjSize) * num_kv_heads + oProjSize}; + li->weights[1] = create_weight_legion_ordering(1, + dims, + data_type, + li, + true /*create_grad*/, + kernel_initializer, + CHOSEN_SYNC_TYPE); + } + li->data_type = data_type; + li->add_int_property("embed_dim", embed_dim); + li->add_int_property("num_q_heads", num_q_heads); + li->add_int_property("num_kv_heads", num_kv_heads); + li->add_int_property("kdim", kdim); + li->add_int_property("vdim", vdim); + li->add_int_property("bias", bias); + li->add_int_property("add_bias_kv", add_bias_kv); + li->add_int_property("add_zero_attn", add_zero_attn); + li->add_float_property("dropout", dropout); + li->add_int_property("apply_rotary_embedding", apply_rotary_embedding); + li->add_int_property("scaling_query", scaling_query); + li->add_float_property("scaling_factor", scaling_factor); + li->add_int_property("qk_prod_scaling", qk_prod_scaling); + li->add_int_property("quantization_type", quantization_type); + li->add_int_property("offload", offload); + li->add_int_property("tensor_parallelism_degree", + config.tensor_parallelism_degree); + layers.push_back(li); + + return li->outputs[0]; +} + +Op *IncMultiHeadSelfAttention::create_operator_from_layer( + FFModel &model, + Layer const *layer, + std::vector const &inputs) { + long long value; + layer->get_int_property("embed_dim", value); + int embed_dim = value; + layer->get_int_property("num_q_heads", value); + int num_q_heads = value; + layer->get_int_property("num_kv_heads", value); + int num_kv_heads = value; + layer->get_int_property("kdim", value); + int kdim = value; + layer->get_int_property("vdim", value); + int vdim = value; + float dropout; + layer->get_float_property("dropout", dropout); + layer->get_int_property("bias", value); + bool bias = (bool)value; + layer->get_int_property("add_bias_kv", value); + bool add_bias_kv = (bool)value; + layer->get_int_property("add_zero_attn", value); + bool add_zero_attn = (bool)value; + layer->get_int_property("apply_rotary_embedding", value); + bool apply_rotary_embedding = (bool)value; + layer->get_int_property("scaling_query", value); + bool scaling_query = (bool)value; + float scaling_factor; + layer->get_float_property("scaling_factor", scaling_factor); + layer->get_int_property("qk_prod_scaling", value); + bool qk_prod_scaling = (bool)value; + layer->get_int_property("quantization_type", value); + DataType quantization_type = (DataType)value; + layer->get_int_property("offload", value); + bool offload = (bool)value; + layer->get_int_property("tensor_parallelism_degree", value); + int tensor_parallelism_degree = (int)value; + + return new IncMultiHeadSelfAttention(model, + layer->layer_guid, + inputs[0], + embed_dim, + num_q_heads, + num_kv_heads, + kdim, + vdim, + dropout, + bias, + add_bias_kv, + add_zero_attn, + apply_rotary_embedding, + scaling_query, + 
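The weight sizing computed above is where multi-query attention differs from standard multi-head attention: Q and output projections are allocated per query head, while K and V projections are allocated per KV head. The following standalone arithmetic sketch uses made-up dimensions (illustrative assumptions, not taken from any particular model config):

```cpp
#include <cstdio>

int main() {
  // Hypothetical dimensions for illustration only.
  int embed_dim = 4096, kdim = 128, num_q_heads = 32, num_kv_heads = 4;
  int qSize = embed_dim, kSize = embed_dim, vSize = embed_dim;
  int qProjSize = kdim, kProjSize = kdim, vProjSize = kdim,
      oProjSize = embed_dim;
  int qParas = qProjSize * qSize;
  int kParas = kProjSize * kSize;
  int vParas = vProjSize * vSize;
  int oParas = oProjSize * (vProjSize > 0 ? vProjSize : vSize);
  // Same formula as the hunk above: K/V terms scale with num_kv_heads only.
  long long weight_size = (long long)qParas * num_q_heads +
                          (long long)kParas * num_kv_heads +
                          (long long)vParas * num_kv_heads +
                          (long long)oParas * num_q_heads;
  printf("per-layer attention weight elements: %lld\n", weight_size);
  return 0;
}
```

With `num_kv_heads` smaller than `num_q_heads`, the K/V terms shrink proportionally, which is the parameter saving the multi-query path is after.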
scaling_factor, + qk_prod_scaling, + false /*allocate_weights*/, + quantization_type, + offload, + tensor_parallelism_degree, + layer->name); +} + +IncMultiHeadSelfAttention::IncMultiHeadSelfAttention( + FFModel &model, + LayerID const &_layer_guid, + const ParallelTensor _input, + int _embed_dim, + int _num_q_heads, + int _num_kv_heads, + int _kdim, + int _vdim, + float _dropout, + bool _bias, + bool _add_bias_kv, + bool _add_zero_attn, + bool _apply_rotary_embedding, + bool _scaling_query, + float _scaling_factor, + bool _qk_prod_scaling, + bool allocate_weights, + DataType _quantization_type, + bool _offload, + int _tensor_parallelism_degree, + char const *name) + // Initializer* _bias_initializer) + : Op(model, + OP_INC_MULTIHEAD_SELF_ATTENTION, + _input->data_type, + name, + 1 /*inputs*/, + (_bias ? 2 : 1), /*weights*/ + 1 /*outputs*/, + _input), + num_q_heads(_num_q_heads), num_kv_heads(_num_kv_heads), dropout(_dropout), + bias(_bias), add_bias_kv(_add_bias_kv), add_zero_attn(_add_zero_attn), + apply_rotary_embedding(_apply_rotary_embedding), + qSize(_input->dims[0].size), kSize(_input->dims[0].size), + vSize(_input->dims[0].size), qProjSize(_kdim), kProjSize(_kdim), + vProjSize(_vdim), oProjSize(_embed_dim), + qoSeqLength(_input->dims[1].size), kvSeqLength(_input->dims[1].size), + scaling_query(_scaling_query), scaling_factor(_scaling_factor), + qk_prod_scaling(_qk_prod_scaling), quantization_type(_quantization_type), + offload(_offload), tensor_parallelism_degree(_tensor_parallelism_degree) { + // overwrite layer_guid + layer_guid = _layer_guid; + numOutputs = 1; + int numdim = _input->num_dims; + ParallelDim dims[MAX_TENSOR_DIM]; + size_t x = 1; + for (int i = 0; i < numdim; i++) { + dims[i] = _input->dims[i]; + x *= _input->dims[i].size; + } + dims[0].size = _embed_dim; + // Currently require no parallelism along this dim + assert(dims[0].degree == 1); + if (allocate_weights) { + // Create weight tensor + int num_dims = inputs[0]->num_dims; + // Compute weight size + int qParas = this->qProjSize * this->qSize; + int kParas = this->kProjSize * this->kSize; + int vParas = this->vProjSize * this->vSize; + int oParas = + this->oProjSize * (this->vProjSize > 0 ? this->vProjSize : this->vSize); + ParallelDim dims[2]; + dims[0] = inputs[0]->dims[num_dims - 2]; + dims[0].size = dims[0].degree; + dims[1] = inputs[0]->dims[num_dims - 1]; + dims[1].size = this->num_q_heads * (qParas + oParas) + + this->num_kv_heads * (kParas + vParas); + dims[1].is_replica_dim = false; + + if (quantization_type != DT_NONE) { + dims[1].size = get_quantization_to_byte_size( + data_type, quantization_type, (qParas + kParas + vParas + oParas)); + } + int seed = std::rand(); + Initializer *initializer = new GlorotUniform(seed); + weights[0] = model.create_parallel_weight<2>( + dims, + quantization_type == DT_NONE ? 
this->data_type : quantization_type, + nullptr /*owner_op*/, + true /*create_grad*/, + initializer, + CHOSEN_SYNC_TYPE); + if (bias) { + ParallelTensorShape bias_shape = _input->get_shape(); + bias_shape.dims[0].size = qProjSize * num_q_heads + + (kProjSize + vProjSize) * num_kv_heads + + oProjSize; + bias_shape.dims[1].size = bias_shape.dims[2].size = 1; + weights[1] = + model.create_parallel_weight_legion_ordering(bias_shape.num_dims, + bias_shape.dims, + this->data_type, + nullptr /*owner_op*/, + true /*create_grad*/, + initializer, + CHOSEN_SYNC_TYPE); + } + } + + outputs[0] = model.create_parallel_tensor_legion_ordering( + _input->num_dims, dims, this->data_type, this); + /* for (int i = 0; i < numdim; i++) { */ + /* register_output_input_parallel_dims(outputs[0], i, inputs[0], i); */ + /* } */ + /* // Check correctness */ + /* assert(check_output_input_weight_parallel_dims()); */ +} + +IncMultiHeadSelfAttention::IncMultiHeadSelfAttention( + FFModel &model, + const ParallelTensor _input, + const ParallelTensor _weight, + int _embed_dim, + int _num_q_heads, + int _num_kv_heads, + int _kdim, + int _vdim, + float _dropout, + bool _bias, + bool _add_bias_kv, + bool _add_zero_attn, + bool _apply_rotary_embedding, + bool _scaling_query, + float _scaling_factor, + bool _qk_prod_scaling, + bool allocate_weights, + DataType _quantization_type, + bool _offload, + int _tensor_parallelism_degree, + char const *name) + // Initializer* _bias_initializer) + : Op(model, + OP_INC_MULTIHEAD_SELF_ATTENTION, + _input->data_type, + name, + 1 /*inputs*/, + (_bias ? 2 : 1), /*weights*/ + 1 /*outputs*/, + _input, + _weight), + num_q_heads(_num_q_heads), num_kv_heads(_num_kv_heads), dropout(_dropout), + bias(_bias), add_bias_kv(_add_bias_kv), add_zero_attn(_add_zero_attn), + apply_rotary_embedding(_apply_rotary_embedding), + qSize(_input->dims[0].size), kSize(_input->dims[0].size), + vSize(_input->dims[0].size), qProjSize(_kdim), kProjSize(_kdim), + vProjSize(_vdim), oProjSize(_embed_dim), + qoSeqLength(_input->dims[1].size), kvSeqLength(_input->dims[1].size), + scaling_query(_scaling_query), scaling_factor(_scaling_factor), + qk_prod_scaling(_qk_prod_scaling), quantization_type(_quantization_type), + offload(_offload), tensor_parallelism_degree(_tensor_parallelism_degree) +// bias_initializer(_bias_initializer) +{ + numOutputs = 1; + int numdim = _input->num_dims; + ParallelDim dims[MAX_TENSOR_DIM]; + for (int i = 0; i < numdim; i++) { + dims[i] = _input->dims[i]; + } + dims[0].size = _embed_dim; + // Currently require no parallelism along this dim + assert(dims[0].degree == 1); + if (allocate_weights) { + // Create weight tensor + int num_dims = inputs[0]->num_dims; + // Compute weight size + int qParas = this->qProjSize * this->qSize; + int kParas = this->kProjSize * this->kSize; + int vParas = this->vProjSize * this->vSize; + int oParas = + this->oProjSize * (this->vProjSize > 0 ? 
this->vProjSize : this->vSize); + ParallelDim dims[2]; + dims[0] = inputs[0]->dims[num_dims - 2]; + dims[0].size = dims[0].degree; + dims[1] = inputs[0]->dims[num_dims - 1]; + dims[1].size = this->num_q_heads * (qParas + oParas) + + this->num_kv_heads * (kParas + vParas); + dims[1].is_replica_dim = false; + // dims[2].size = this->num_q_heads * (qParas + oParas) + this->num_kv_heads + // * (kParas + vParas); + if (quantization_type != DT_NONE) { + dims[1].size = get_quantization_to_byte_size( + data_type, quantization_type, (qParas + kParas + vParas + oParas)); + } + int seed = std::rand(); + Initializer *initializer = new GlorotUniform(seed); + weights[0] = model.create_parallel_weight<2>( + dims, + quantization_type == DT_NONE ? this->data_type : quantization_type, + NULL /*owner_op*/, + true /*create_grad*/, + initializer, + CHOSEN_SYNC_TYPE); + if (bias) { + ParallelTensorShape bias_shape = _input->get_shape(); + bias_shape.dims[0].size = qProjSize * num_q_heads + + (kProjSize + vProjSize) * num_kv_heads + + oProjSize; + bias_shape.dims[1].size = bias_shape.dims[2].size = 1; + weights[1] = + model.create_parallel_weight_legion_ordering(bias_shape.num_dims, + bias_shape.dims, + this->data_type, + nullptr /*owner_op*/, + true /*create_grad*/, + initializer, + CHOSEN_SYNC_TYPE); + } + } + + outputs[0] = model.create_parallel_tensor_legion_ordering( + _input->num_dims, dims, this->data_type, this); + + /* for (int i = 0; i < numdim; i++) { */ + /* register_output_input_parallel_dims(outputs[0], i, inputs[0], i); */ + /* } */ + /* register_output_weight_parallel_dims(outputs[0], numdim-1, _weight, 1); */ + /* register_output_weight_parallel_dims(outputs[0], numdim-2, _weight, 2); */ + // Check correctness + /* assert(check_output_input_weight_parallel_dims()); */ +} + +IncMultiHeadSelfAttention::IncMultiHeadSelfAttention( + FFModel &model, + IncMultiHeadSelfAttention const &other, + const ParallelTensor input, + bool allocate_weights) + : IncMultiHeadSelfAttention(model, + other.layer_guid, + input, + other.oProjSize, + other.num_q_heads, + other.num_kv_heads, + other.qProjSize, + other.vProjSize, + other.dropout, + other.bias, + other.add_bias_kv, + other.add_zero_attn, + other.apply_rotary_embedding, + other.scaling_query, + other.scaling_factor, + other.qk_prod_scaling, + allocate_weights, + other.quantization_type, + other.offload, + other.tensor_parallelism_degree, + other.name) {} + +IncMultiHeadSelfAttention::IncMultiHeadSelfAttention( + FFModel &model, + IncMultiHeadSelfAttentionParams const ¶ms, + ParallelTensor const &input, + bool allocate_weights, + char const *name) + : IncMultiHeadSelfAttention(model, + params.layer_guid, + input, + params.embed_dim, + params.num_q_heads, + params.num_kv_heads, + params.kdim, + params.vdim, + params.dropout, + params.bias, + params.add_bias_kv, + params.add_zero_attn, + params.apply_rotary_embedding, + params.scaling_query, + params.scaling_factor, + params.qk_prod_scaling, + allocate_weights, + params.quantization_type, + params.offload, + params.tensor_parallelism_degree, + name) {} + +void IncMultiHeadSelfAttention::init_inference( + FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(INC_MULTIHEAD_SELF_ATTENTION_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(IncMultiHeadSelfAttention)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement( + RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region, + ff.cpu_offload ? MAP_TO_ZC_MEMORY : 0)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(2, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + +void IncMultiHeadSelfAttention::init(FFModel const &ff) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + set_argumentmap_for_init(ff, argmap); + IndexLauncher launcher(INC_MULTIHEAD_SELF_ATTENTION_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(IncMultiHeadSelfAttention)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + outputs[0]->machine_view.hash()); + launcher.add_region_requirement(RegionRequirement(inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement(RegionRequirement(outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + outputs[0]->region)); + launcher.add_field(2, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap(ff, fm); +} + +/* + regions[0](I): input + regions[1](I): weight + regions[2](O): output +*/ +OpMeta *IncMultiHeadSelfAttention::init_task( + Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + + IncMultiHeadSelfAttention const *attn = + (IncMultiHeadSelfAttention *)task->args; + FFHandler handle = *((FFHandler const *)task->local_args); + + GenericTensorAccessorR input = + helperGetGenericTensorAccessorRO(attn->inputs[0]->data_type, + regions[0], + task->regions[0], + FID_DATA, + ctx, + runtime); + GenericTensorAccessorR weight = + helperGetGenericTensorAccessorRO(attn->weights[0]->data_type, + regions[1], + task->regions[1], + FID_DATA, + ctx, + runtime); + GenericTensorAccessorW output = + helperGetGenericTensorAccessorWO(attn->outputs[0]->data_type, + regions[2], + task->regions[2], + FID_DATA, + ctx, + runtime); + + int num_samples = input.domain.hi()[2] - input.domain.lo()[2] + 1; + assert(attn->qoSeqLength == input.domain.hi()[1] - input.domain.lo()[1] + 1); + assert(attn->kvSeqLength == input.domain.hi()[1] - input.domain.lo()[1] + 1); + int num_q_heads = attn->num_q_heads / attn->tensor_parallelism_degree; + int num_kv_heads = attn->num_kv_heads / 
attn->tensor_parallelism_degree; + + assert(attn->oProjSize == output.domain.hi()[0] - output.domain.lo()[0] + 1); + + Memory gpu_mem = Machine::MemoryQuery(Machine::get_machine()) + .only_kind(Memory::GPU_FB_MEM) + .best_affinity_to(task->target_proc) + .first(); + MemoryAllocator gpu_mem_allocator(gpu_mem); + if (attn->offload) { + // cpu-offload enabled + // use offload_reserved_space + gpu_mem_allocator.register_reserved_work_space( + handle.offload_reserve_space, handle.offload_reserve_space_size); + } + IncMultiHeadSelfAttentionMeta *m = + new IncMultiHeadSelfAttentionMeta(handle, + attn, + weight, + gpu_mem_allocator, + num_samples, + num_q_heads, + num_kv_heads); + if (handle.offload_reserve_space == nullptr) { + // assert that we didn't over allocate memory + assert(gpu_mem_allocator.reserved_allocated_size == + gpu_mem_allocator.reserved_total_size); + } + m->profiling = attn->profiling; + if (attn->quantization_type == DT_NONE) { + assert(weight.domain.get_volume() * data_type_size(weight.data_type) == + m->weightSize); + } + + return m; +} + +void IncMultiHeadSelfAttention::forward(FFModel const &ff) { + // IncMultiHeadSelfAttention doesn't support forward + assert(false); +} + +FutureMap IncMultiHeadSelfAttention::inference( + FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + int idx = 0; + // log_inc_mha.debug("BatchConfig, num_tokens: %d, num_requests: %d", + // bc->num_tokens, + // bc->num_active_requests()); + IndexLauncher launcher(INC_MULTIHEAD_SELF_ATTENTION_INF_TASK_ID, + parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_future(bc); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(idx++, FID_DATA); + launcher.add_region_requirement( + RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region, + ff.cpu_offload ? MAP_TO_ZC_MEMORY : 0)); + launcher.add_field(idx++, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(idx++, FID_DATA); + + if (bias) { + launcher.add_region_requirement( + RegionRequirement(weights[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[1]->region, + ff.cpu_offload ? 
MAP_TO_ZC_MEMORY : 0)); + launcher.add_field(idx++, FID_DATA); + } + return runtime->execute_index_space(ctx, launcher); +} + +/* + regions[0](I): input + regions[3](I): weight + regions[4](O): output +*/ +void IncMultiHeadSelfAttention::inference_task( + Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + + assert(task->regions.size() == regions.size()); + + // BatchConfig const *bc = (BatchConfig *)task->args; + BatchConfig const *bc = BatchConfig::from_future(task->futures[0]); + log_inc_mha.debug("BatchConfig, num_tokens: %d, num_requests: %d", + bc->num_tokens, + bc->num_active_requests()); + if (bc->num_tokens == 0) { + return; + } + + IncMultiHeadSelfAttentionMeta const *m = + *((IncMultiHeadSelfAttentionMeta **)task->local_args); + + assert((*m->bias ? regions.size() == 4 : regions.size() == 3)); + + GenericTensorAccessorR input = helperGetGenericTensorAccessorRO( + m->input_type[0], regions[0], task->regions[0], FID_DATA, ctx, runtime); + GenericTensorAccessorR weight = helperGetGenericTensorAccessorRO( + m->weight_type[0], regions[1], task->regions[1], FID_DATA, ctx, runtime); + GenericTensorAccessorW output = helperGetGenericTensorAccessorWO( + m->output_type[0], regions[2], task->regions[2], FID_DATA, ctx, runtime); + GenericTensorAccessorR biases; + if (*m->bias) { + biases = helperGetGenericTensorAccessorRO(m->weight_type[1], + regions[3], + task->regions[3], + FID_DATA, + ctx, + runtime); + Domain bias_domain = runtime->get_index_space_domain( + ctx, task->regions[3].region.get_index_space()); + assert(bias_domain.get_dim() == 4); + } + + Domain input_domain = runtime->get_index_space_domain( + ctx, task->regions[0].region.get_index_space()); + Domain weight_domain = runtime->get_index_space_domain( + ctx, task->regions[1].region.get_index_space()); + Domain output_domain = runtime->get_index_space_domain( + ctx, task->regions[2].region.get_index_space()); + + assert(input_domain.get_dim() == 4); + assert(weight_domain.get_dim() == 2); + assert(output_domain.get_dim() == 4); + + assert(task->index_point.get_dim() == 1); + + IncMultiHeadSelfAttention::inference_kernel_wrapper( + m, bc, task->index_point.point_data[0], input, weight, output, biases); +#ifdef INFERENCE_TESTS + printf("Checking IncMultiHeadSelfAttention computations...\n"); + + // ============================================================================= + // Define helper functions to handle row-major arrays + // ============================================================================= + + auto set_value_row_major = [](float *arr, + std::vector const &shape, + std::vector const &indices, + float value) -> void { + int offset = 0; + for (int i = 0; i < shape.size(); i++) { + int index = indices[i]; + int stride = 1; + for (int j = i + 1; j < shape.size(); j++) { + stride *= shape[j]; + } + offset += index * stride; + } + *(arr + offset) = value; + }; + + // ============================================================================= + // Load input/output/weights and parse general configs + // ============================================================================= + + float *input_cpu = + download_tensor(input.get_float_ptr(), input_domain.get_volume()); + assert(input_cpu != nullptr); + float *weight_cpu = download_tensor(weight.get_float_ptr(), + weight_domain.get_volume()); + assert(weight_cpu != nullptr); + float *output_cpu = download_tensor(output.get_float_ptr(), + output_domain.get_volume()); + assert(output_cpu != nullptr); + + // Input tensor dimensions + 
coord_t data_dim = input_domain.hi()[0] - input_domain.lo()[0] + 1; + coord_t max_sequence_length = input_domain.hi()[1] - input_domain.lo()[1] + 1; + coord_t batch_size = input_domain.hi()[2] - input_domain.lo()[2] + 1; + coord_t replica_dim = input_domain.hi()[3] - input_domain.lo()[3] + 1; + assert(replica_dim == 1); + + size_t effective_batch_size = max_sequence_length * batch_size; + float inputs_arr[data_dim][effective_batch_size] = {0}; + for (size_t i = 0; i < data_dim * bc->num_active_tokens(); i++) { + size_t data_index = i % data_dim; + size_t token_index = i / data_dim; + assert(data_index < data_dim); + assert(token_index < effective_batch_size); + inputs_arr[data_index][token_index] = input_cpu[i]; + } + torch::Tensor torch_input = torch::from_blob( + inputs_arr, {data_dim, (long int)effective_batch_size}, torch::kFloat32); + + // Weight tensor dimensions + coord_t all_weight_params = weight_domain.hi()[0] - weight_domain.lo()[0] + 1; + coord_t num_q_heads = weight_domain.hi()[1] - weight_domain.lo()[1] + 1; + replica_dim = weight_domain.hi()[2] - weight_domain.lo()[2] + 1; + size_t qParas = m->qProjSize * m->qSize; + size_t kParas = m->kProjSize * m->kSize; + size_t vParas = m->vProjSize * m->vSize; + size_t oParas = m->oProjSize * (m->vProjSize > 0 ? m->vProjSize : m->vSize); + + assert(all_weight_params == qParas + kParas + vParas + oParas); + assert(num_q_heads == m->num_q_heads); + assert(replica_dim == 1); + + assert(m->qSize == m->kSize && m->kSize == m->vSize); + // printf("m->qSize: %i\n", m->qSize); + // keep things simple for now + assert(m->qProjSize == m->kProjSize && m->kProjSize == m->vProjSize); + long int proj_sum = m->qProjSize + m->kProjSize + m->vProjSize; + // load weight manually because Torch can't easily read a tensor serialized in + // column-major order. 
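The comment above ("Torch can't easily read a tensor serialized in column-major order") is why the test code fills row-major buffers by hand before calling `torch::from_blob`. A standalone sketch of the offset arithmetic the `set_value_row_major` helper relies on (illustrative; the function name below is mine, not FlexFlow's):

```cpp
#include <cstddef>
#include <vector>

// Row-major offset of `indices` into an array of the given `shape`:
// the last dimension varies fastest, so strides are built from the back.
size_t row_major_offset(std::vector<int> const &shape,
                        std::vector<int> const &indices) {
  size_t offset = 0, stride = 1;
  for (int i = (int)shape.size() - 1; i >= 0; i--) {
    offset += (size_t)indices[i] * stride;
    stride *= (size_t)shape[i];
  }
  return offset;
}
```

Writing each weight element through such an offset yields a buffer that `torch::from_blob` can wrap directly with the `{qSize, qProjSize, 3, num_q_heads}` shape used above.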
+ + // printf("m->kProjSize: %i, BatchConfig::MAX_NUM_TOKENS: %i, " + // "bc->num_active_tokens(): %i, num_q_heads: %lli, + // BatchConfig::MAX_NUM_REQUESTS: %i, " "bc->num_active_requests(): %i\n", + // m->kProjSize, BatchConfig::MAX_NUM_TOKENS, bc->num_active_tokens(), + // num_q_heads, BatchConfig::MAX_NUM_REQUESTS, bc->num_active_requests()); + // for (int t=0; t < bc->num_active_tokens(); t++) { + // printf("token %i has request_index: %li and token_position: %li\n", + // t, bc->token2ids.token_indexes[t].request_index, + // bc->token2ids.token_indexes[t].token_position); + // } + + // ============================================================================= + // Load the output tensor (with CUDA results), and create a Torch tensor + // ============================================================================= + + float output_cuda[m->oProjSize][effective_batch_size] = {0}; + for (int i = 0; i < m->oProjSize * effective_batch_size; i++) { + int row_idx = i % m->oProjSize; + int col_idx = i / m->oProjSize; + assert(row_idx < m->oProjSize && col_idx < effective_batch_size); + output_cuda[row_idx][col_idx] = output_cpu[i]; + } + torch::Tensor torch_out_cuda = + torch::from_blob(output_cuda, + {m->oProjSize, (int64_t)effective_batch_size}, + torch::kFloat32); + + // ============================================================================= + // Load the Q/K/V projection weights, and create a Torch tensor + // ============================================================================= + std::vector w_qkv_shape = {m->qSize, m->qProjSize, 3, (int)num_q_heads}; + float *w_qkv = + (float *)calloc(m->qSize * m->qProjSize * 3 * num_q_heads, sizeof(float)); + assert(w_qkv[0] == 0.0f); + + for (int h = 0; h < num_q_heads; h++) { + for (size_t i = 0; i < m->qProjSize * m->qSize; i++) { + int row_index = i % m->qSize; + int column_index = i / m->qSize; + // Q + set_value_row_major(w_qkv, + w_qkv_shape, + {row_index, column_index, 0, h}, + weight_cpu[all_weight_params * h + + m->qSize * column_index + row_index]); + // K + set_value_row_major( + w_qkv, + w_qkv_shape, + {row_index, column_index, 1, h}, + weight_cpu[all_weight_params * h + m->qProjSize * m->qSize + + m->qSize * column_index + row_index]); + // V + set_value_row_major( + w_qkv, + w_qkv_shape, + {row_index, column_index, 2, h}, + weight_cpu[all_weight_params * h + 2 * m->qProjSize * m->qSize + + m->qSize * column_index + row_index]); + } + } + // convert weights to torch tensor + torch::Tensor torch_w_qkv = torch::from_blob( + w_qkv, {m->qSize, m->qProjSize, 3, (int)num_q_heads}, torch::kFloat32); + + /* std::cout << "Torch projection weights size: " << torch_w_qkv.sizes() + << std::endl; + std::cout << "Torch input size: " << torch_input.sizes() << std::endl; + std::cout << "Number of active tokens: " << bc->num_active_tokens() + << std::endl; */ + // std::cout << "torch_w_qkv:" << std::endl << torch_w_qkv << std::endl; + + // ============================================================================= + // Compute the Q/K/V projections, and compare the results with CUDA + // ============================================================================= + + // ----------------------- C++ computations & checks ------------------------ + torch::Tensor qkv_projs = torch::einsum( + "ijkl,im->jmkl", + {torch_w_qkv, + torch_input.index({Slice(), Slice(0, bc->num_active_tokens())})}); + // std::cout << "qkv_projs size: " << qkv_projs.sizes() << std::endl; + assert(qkv_projs.sizes()[0] == m->qProjSize); + assert(qkv_projs.sizes()[1] 
== bc->num_active_tokens() && + qkv_projs.sizes()[1] <= effective_batch_size); + assert(qkv_projs.sizes()[2] == 3); + assert(qkv_projs.sizes()[3] == num_q_heads); + free(w_qkv); + + // ----------------------- Loading CUDA results for this step --------------- + float *QKVProjArray_cpu = download_tensor( + m->devQKVProjArray, + BatchConfig::MAX_NUM_TOKENS * proj_sum * m->num_q_heads); + assert(QKVProjArray_cpu != nullptr); + + std::vector QKVProjArray_converted_shape = { + m->qProjSize, bc->num_active_tokens(), 3, (int)num_q_heads}; + float *QKVProjArray_converted = (float *)calloc( + m->qProjSize * bc->num_active_tokens() * 3 * num_q_heads, sizeof(float)); + + // skip over padding at the end of QKVProjArray_cpu + // convert from column order to 3D matrix because torch cannot automatically + // import matrices flattened in column order + for (size_t i = 0; i < proj_sum * bc->num_active_tokens() * num_q_heads; + i++) { + int proj_size_index = i % m->qProjSize; + int head_index = i / (proj_sum * bc->num_active_tokens()); + int token_index = + ((i - head_index * proj_sum * bc->num_active_tokens()) / m->qProjSize) % + bc->num_active_tokens(); + int qkv_offset = (i - head_index * proj_sum * bc->num_active_tokens()) / + (m->qProjSize * bc->num_active_tokens()); + assert(proj_size_index < proj_sum); + assert(head_index < num_q_heads); + assert(token_index < bc->num_active_tokens()); + assert(qkv_offset < 3); + set_value_row_major(QKVProjArray_converted, + QKVProjArray_converted_shape, + {proj_size_index, token_index, qkv_offset, head_index}, + QKVProjArray_cpu[i]); + } + torch::Tensor QKVProjArray_torch = + torch::from_blob(QKVProjArray_converted, + {m->qProjSize, bc->num_active_tokens(), 3, num_q_heads}, + torch::kFloat32); + + // ----------------------- Comparing C++ & CUDA results --------------------- + // std::cout << "QKVProjArray_torch" << std::endl; + // for (int i=0; inum_active_tokens(); t++) { + for (size_t d = 0; d < m->kProjSize; d++) { + size_t kcache_idx = + d * MAX_SEQ_LEN * m->num_q_heads * BatchConfig::MAX_NUM_REQUESTS + + bc->tokensInfo[t].abs_depth_in_request * m->num_q_heads * + BatchConfig::MAX_NUM_REQUESTS + + h * BatchConfig::MAX_NUM_REQUESTS + bc->tokensInfo[t].request_index; + m->kcache[kcache_idx] = + qkv_projs.index({(int64_t)d, (int64_t)t, 1, (int64_t)h}) + .item(); + } + for (size_t d = 0; d < m->vProjSize; d++) { + size_t vcache_idx = + d * MAX_SEQ_LEN * m->num_q_heads * BatchConfig::MAX_NUM_REQUESTS + + bc->tokensInfo[t].abs_depth_in_request * m->num_q_heads * + BatchConfig::MAX_NUM_REQUESTS + + h * BatchConfig::MAX_NUM_REQUESTS + bc->tokensInfo[t].request_index; + m->vcache[vcache_idx] = + qkv_projs.index({(int64_t)d, (int64_t)t, 2, (int64_t)h}) + .item(); + } + } + } + // Create torch tensors from the arrays + torch::Tensor K_t = torch::from_blob( + m->kcache, + {m->kProjSize, MAX_SEQ_LEN, num_q_heads, BatchConfig::MAX_NUM_REQUESTS}, + torch::kFloat32); + torch::Tensor V_t = torch::from_blob( + m->vcache, + {m->vProjSize, MAX_SEQ_LEN, num_q_heads, BatchConfig::MAX_NUM_REQUESTS}, + torch::kFloat32); + + // Compute useful indices + std::vector req_idxs; + std::vector r_first_idx; + std::vector r_num_tokens; + for (size_t t = 0; t < bc->num_active_tokens(); t++) { + size_t rid = bc->tokensInfo[t].request_index; + if (req_idxs.size() == 0 || req_idxs[req_idxs.size() - 1] != rid) { + req_idxs.push_back(rid); + r_first_idx.push_back(t); + r_num_tokens.push_back(1); + } else { + r_num_tokens[r_num_tokens.size() - 1]++; + } + assert(req_idxs.size() == r_first_idx.size() 
&& + r_first_idx.size() == r_num_tokens.size()); + } + assert(req_idxs.size() == bc->num_active_requests()); + assert(std::accumulate(r_num_tokens.begin(), + r_num_tokens.end(), + decltype(r_num_tokens)::value_type(0)) == + bc->num_active_tokens()); + + // ----------------------- Loading CUDA results for this step --------------- + float *keyCache_cpu = + download_tensor(m->keyCache, + m->num_q_heads * m->kProjSize * + BatchConfig::MAX_NUM_REQUESTS * MAX_SEQ_LEN); + float *valueCache_cpu = + download_tensor(m->valueCache, + m->num_q_heads * m->vProjSize * + BatchConfig::MAX_NUM_REQUESTS * MAX_SEQ_LEN); + assert(keyCache_cpu != nullptr); + assert(valueCache_cpu != nullptr); + + float *kcache_cuda = + (float *)calloc(m->kProjSize * MAX_SEQ_LEN * m->num_q_heads * + BatchConfig::MAX_NUM_REQUESTS, + sizeof(float)); + float *vcache_cuda = + (float *)calloc(m->vProjSize * MAX_SEQ_LEN * m->num_q_heads * + BatchConfig::MAX_NUM_REQUESTS, + sizeof(float)); + int index = 0; + for (int i = 0; i < m->kProjSize; i++) { + for (int j = 0; j < MAX_SEQ_LEN; j++) { + for (int k = 0; k < m->num_q_heads; k++) { + for (int l = 0; l < BatchConfig::MAX_NUM_REQUESTS; l++) { + int col_major_index = + l * m->kProjSize * MAX_SEQ_LEN * m->num_q_heads + + k * m->kProjSize * MAX_SEQ_LEN + j * m->kProjSize + i; + kcache_cuda[index++] = keyCache_cpu[col_major_index]; + } + } + } + } + index = 0; + for (int i = 0; i < m->vProjSize; i++) { + for (int j = 0; j < MAX_SEQ_LEN; j++) { + for (int k = 0; k < m->num_q_heads; k++) { + for (int l = 0; l < BatchConfig::MAX_NUM_REQUESTS; l++) { + int col_major_index = + l * m->vProjSize * MAX_SEQ_LEN * m->num_q_heads + + k * m->vProjSize * MAX_SEQ_LEN + j * m->vProjSize + i; + vcache_cuda[index++] = valueCache_cpu[col_major_index]; + } + } + } + } + torch::Tensor K_t_cuda = torch::from_blob( + kcache_cuda, + {m->kProjSize, MAX_SEQ_LEN, num_q_heads, BatchConfig::MAX_NUM_REQUESTS}, + torch::kFloat32); + torch::Tensor V_t_cuda = torch::from_blob( + vcache_cuda, + {m->vProjSize, MAX_SEQ_LEN, num_q_heads, BatchConfig::MAX_NUM_REQUESTS}, + torch::kFloat32); + + // ----------------------- Comparing C++ & CUDA results --------------------- + + // std::cout << "kcache differences:" << std::endl; + // for (int i=0; i < bc->num_active_requests() + 1; i++) { + // for (int j=0; j < num_q_heads; j++) { + // for (int l=0; l < m->kProjSize; l++) { + // for (int k=0; k < MAX_SEQ_LEN; k++) { + // size_t kcache_idx = + // l * MAX_SEQ_LEN * num_q_heads * BatchConfig::MAX_NUM_REQUESTS + + // k * num_q_heads * BatchConfig::MAX_NUM_REQUESTS + + // j * BatchConfig::MAX_NUM_REQUESTS + + // i; + // if ( abs(m->kcache[kcache_idx] - keyCache_cpu[ + // i * m->kProjSize * MAX_SEQ_LEN * num_q_heads + + // j * m->kProjSize * MAX_SEQ_LEN + + // k * m->kProjSize + + // l + // ]) > 0.00001) { + // printf("req: %i (rid: %i), head: %i, data_dim: %i, token_pos: + // %i\n", + // i, req_idxs[i], j, l, k); + // } + // } + // } + // } + // } + + // std::cout << "keyCache from CUDA:" << std::endl; + // for (int i=0; inum_active_requests()+1; i++) { + // for (int j=0; jkProjSize; l++) { + // for (int k=0; k< MAX_SEQ_LEN; k++) { + // printf("%f ", + // keyCache_cpu[i * m->kProjSize * MAX_SEQ_LEN * num_q_heads + + // j * m->kProjSize * MAX_SEQ_LEN + + // k * m->kProjSize + + // l + // ]); + // } + // printf("\n"); + // } + // printf("\n"); + // } + // printf("\n"); + // } + + // std::cout << "valueCache from CUDA:" << std::endl; + // for (int i=0; inum_active_requests()+1; i++) { + // for (int j=0; jvProjSize; l++) { + // for (int 
k=0; k< MAX_SEQ_LEN; k++) { + // printf("%f ", + // valueCache_cpu[ + // i * m->vProjSize * MAX_SEQ_LEN * num_q_heads + + // j * m->vProjSize * MAX_SEQ_LEN + + // k * m->vProjSize + + // l]); + // } + // printf("\n"); + // } + // printf("\n"); + // } + // printf("\n"); + // } + + // printf("\n"); + + // std::cout << "C++ kcache:" << std::endl; + // for (int i=0; inum_active_requests()+1; i++) { + // for (int j=0; j < num_q_heads; j++) { + // for (int l=0; l < m->kProjSize; l++) { + // for (int k=0; k < MAX_SEQ_LEN; k++) { + // size_t kcache_idx = + // l * MAX_SEQ_LEN * num_q_heads * BatchConfig::MAX_NUM_REQUESTS + + // k * num_q_heads * BatchConfig::MAX_NUM_REQUESTS + + // j * BatchConfig::MAX_NUM_REQUESTS + + // i; + // printf("%f ", m->kcache[kcache_idx]); + // } + // printf("\n"); + // } + // printf("\n"); + // } + // printf("\n"); + // } + + // std::cout << "C++ vcache:" << std::endl; + // for (int i=0; inum_active_requests()+1; i++) { + // for (int j=0; jvProjSize; l++) { + // for (int k=0; k< MAX_SEQ_LEN; k++) { + // size_t vcache_idx = + // l * MAX_SEQ_LEN * num_q_heads * BatchConfig::MAX_NUM_REQUESTS + // + k * num_q_heads * BatchConfig::MAX_NUM_REQUESTS + j * + // BatchConfig::MAX_NUM_REQUESTS + i; + // printf("%f ", m->vcache[vcache_idx]); + // } + // printf("\n"); + // } + // printf("\n"); + // } + // printf("\n"); + // } + + assert(torch::allclose(K_t_cuda, K_t, 1e-05, 1e-05)); + assert(torch::allclose(V_t_cuda, V_t, 1e-05, 1e-05)); + free(kcache_cuda); + free(vcache_cuda); + + // ============================================================================= + // Load the W_out projection weights + // ============================================================================= + + // ----------------------- C++ operations & checks -------------------------- + float *w_out = (float *)calloc(m->vProjSize * m->num_q_heads * m->oProjSize, + sizeof(float)); + std::vector w_out_shape = {m->vProjSize, m->num_q_heads, m->oProjSize}; + assert(m->qProjSize == m->kProjSize && m->kProjSize == m->vProjSize); + for (int h = 0; h < num_q_heads; h++) { + for (int v = 0; v < m->vProjSize; v++) { + for (int o = 0; o < m->oProjSize; o++) { + set_value_row_major( + w_out, + w_out_shape, + {v, h, o}, + weight_cpu[all_weight_params * h + 3 * m->qProjSize * m->qSize + + m->vProjSize * o + v]); + } + } + } + // convert weights to torch tensor + torch::Tensor torch_w_out = torch::from_blob( + w_out, {m->vProjSize, m->num_q_heads, m->oProjSize}, torch::kFloat32); + + // ----------------------- Loading CUDA results for this step --------------- + float *w_out_cuda = download_tensor( + m->W_out_contiguous, m->vProjSize * m->oProjSize * m->num_q_heads); + assert(w_out_cuda != nullptr); + float *converted_wout_tensor = (float *)calloc( + m->vProjSize * m->num_q_heads * m->oProjSize, sizeof(float)); + std::vector converted_wout_tensor_shape = { + m->vProjSize, m->num_q_heads, m->oProjSize}; + + for (int i = 0; i < m->vProjSize * m->num_q_heads * m->oProjSize; i++) { + int v_idx = i % m->vProjSize; + int h_idx = (i / m->vProjSize) % m->num_q_heads; + int o_idx = i / (m->vProjSize * m->num_q_heads); + assert(v_idx < m->vProjSize && h_idx < m->num_q_heads && + o_idx < m->oProjSize); + set_value_row_major(converted_wout_tensor, + converted_wout_tensor_shape, + {v_idx, h_idx, o_idx}, + w_out_cuda[i]); + } + torch::Tensor w_out_cuda_tensor = + torch::from_blob(converted_wout_tensor, + {m->vProjSize, m->num_q_heads, m->oProjSize}, + torch::kFloat32); + + // ----------------------- Comparing C++ & CUDA results 
--------------------- + assert(torch::allclose(w_out_cuda_tensor, torch_w_out, 1e-05, 1e-05)); + free(converted_wout_tensor); + + // ============================================================================= + // Compute the softmax(QK^T/sqrt(d_k))V product, request by request + // ============================================================================= + + // ----------------------- C++ initialization steps ------------------------- + torch::Tensor Q_projs = qkv_projs.index({Slice(), Slice(), 0, Slice()}) + .reshape({qkv_projs.sizes()[0], + qkv_projs.sizes()[1], + qkv_projs.sizes()[3]}); + + torch::Tensor qk_products[bc->num_active_requests()]; + torch::Tensor qk_softmax[bc->num_active_requests()]; + torch::Tensor attn_heads[bc->num_active_requests()]; + + torch::Tensor cpp_output = + torch::zeros({m->oProjSize, bc->num_active_tokens()}); + + // ----------------------- Loading CUDA results for this step --------------- + float *qk_prods_cpu = download_tensor( + m->qk_prods, + BatchConfig::MAX_NUM_TOKENS * BatchConfig::MAX_NUM_TOKENS * num_q_heads); + assert(qk_prods_cpu != nullptr); + + float *qk_prods_softmax_cpu = download_tensor( + m->qk_prods_softmax, + BatchConfig::MAX_NUM_TOKENS * BatchConfig::MAX_NUM_TOKENS * num_q_heads); + assert(qk_prods_softmax_cpu != nullptr); + + float *attn_heads_cpu = download_tensor( + m->attn_heads, + BatchConfig::MAX_NUM_TOKENS * m->num_q_heads * m->vProjSize); + assert(attn_heads_cpu != nullptr); + + // ----------------------- Main loop (request by request) ------------------- + size_t qk_prods_cpu_offset = 0; + + for (size_t r = 0; r < bc->num_active_requests(); r++) { + // Compute pre-request parameters + size_t num_new_tokens = r_num_tokens[r]; + int64_t rid = (int64_t)(req_idxs[r]); + int64_t num_tokens_received_so_far = + (int64_t)(bc->requestsInfo[rid].token_start_offset + + bc->requestsInfo[rid].num_tokens_in_batch); + assert(num_new_tokens == bc->requestsInfo[rid].num_tokens_in_batch); + assert(num_tokens_received_so_far >= (int64_t)num_new_tokens); + + // ----------------------- C++ computations ------------------------------- + // Get the slice of the Q projection tensor with the tokens in the current + // request + torch::Tensor Q_req = + Q_projs.index({Slice(), + Slice(r_first_idx[r], r_first_idx[r] + num_new_tokens), + Slice()}); + // std::cout << "Q_req.sizes(): " << Q_req.sizes() << std::endl; + assert(Q_req.sizes()[0] == m->qProjSize); + assert(Q_req.sizes()[1] == num_new_tokens); + assert(Q_req.sizes()[2] == num_q_heads); + + /*printf("\n------------ QK multiplication (C++) -------------\n"); + printf("Request r=%lu. num_new_tokens: %lu, num_tokens_received_so_far: %li, + rid: %li, Qproj slice: (%i, %i)\n", r, num_new_tokens, + num_tokens_received_so_far, rid, r_first_idx[r], r_first_idx[r] + + num_new_tokens); + + std::cout << "Q_req matrix (idk dims):" << std::endl << + Q_req.index({Slice(), Slice(), 0}) << std::endl << std::endl; std::cout << + "K_t matrix (ilk dims):" << std::endl << K_t.index({Slice(), Slice(0, + num_tokens_received_so_far), 0, rid}) << std::endl << std::endl; std::cout + << "C++ alpha: " << (1.0f / sqrt(m->kProjSize)) << std::endl;*/ + + // Compute (Q*K^T)/sqrt(d_k) matmul + qk_products[r] = + torch::einsum("ijk,ilk->jlk", + {Q_req, + K_t.index({Slice(), + Slice(0, num_tokens_received_so_far), + Slice(), + rid})}) * + (1.0f / sqrt(m->kProjSize)); + + // Set entries above diagonal to -inf to make attention causal. 
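+ // A brief sketch of the masking applied below: columns for keys from + // earlier steps stay fully visible, while inside the num_new_tokens x + // num_new_tokens block each new query keeps its lower triangle (tril) and + // the entries strictly above the diagonal are overwritten with -inf before + // the softmax, so a token never attends to positions after its own.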
+ for (int h = 0; h < num_q_heads; h++) { + qk_products[r].index( + {Slice(), Slice(num_tokens_received_so_far - num_new_tokens), h}) = + qk_products[r] + .index({Slice(), + Slice(num_tokens_received_so_far - num_new_tokens), + h}) + .tril() + + torch::full({(int64_t)num_new_tokens, (int64_t)num_new_tokens}, + -INFINITY) + .triu() + .fill_diagonal_(0); + } + // Compute softmax for each request block + qk_softmax[r] = torch::softmax(qk_products[r], -2); + assert(qk_softmax[r].sizes()[0] == num_new_tokens); + assert(qk_softmax[r].sizes()[1] == num_tokens_received_so_far); + assert(qk_softmax[r].sizes()[2] == m->num_q_heads); + + // ------------------- Loading CUDA results for this step --------------- + float *converted_qk_prod = (float *)calloc( + num_new_tokens * num_tokens_received_so_far * num_q_heads, + sizeof(float)); + float *converted_qk_prod_softmax = (float *)calloc( + num_new_tokens * num_tokens_received_so_far * num_q_heads, + sizeof(float)); + std::vector converted_qk_prod_shape = { + (int)num_new_tokens, (int)num_tokens_received_so_far, (int)num_q_heads}; + + for (size_t i = 0; + i < num_new_tokens * num_tokens_received_so_far * num_q_heads; + i++) { + size_t new_t_idx = i % num_new_tokens; + size_t all_t_idx = (i / num_new_tokens) % num_tokens_received_so_far; + size_t head_idx = i / (num_new_tokens * num_tokens_received_so_far); + assert(new_t_idx < num_new_tokens && + all_t_idx < num_tokens_received_so_far && head_idx < num_q_heads); + set_value_row_major(converted_qk_prod, + converted_qk_prod_shape, + {(int)new_t_idx, (int)all_t_idx, (int)head_idx}, + qk_prods_cpu[i + qk_prods_cpu_offset]); + set_value_row_major(converted_qk_prod_softmax, + converted_qk_prod_shape, + {(int)new_t_idx, (int)all_t_idx, (int)head_idx}, + qk_prods_softmax_cpu[i + qk_prods_cpu_offset]); + } + torch::Tensor qk_prods_cuda = torch::from_blob( + converted_qk_prod, + {(int64_t)num_new_tokens, num_tokens_received_so_far, num_q_heads}, + torch::kFloat32); + torch::Tensor qk_prods_softmax_cuda = torch::from_blob( + converted_qk_prod_softmax, + {(int64_t)num_new_tokens, num_tokens_received_so_far, num_q_heads}, + torch::kFloat32); + + // ------------------- Comparing C++ & CUDA results ------------------ + /* std::cout << "C++:" <vProjSize); + assert( + V_t.index({Slice(), Slice(0, num_tokens_received_so_far), Slice(), rid}) + .sizes()[1] == num_tokens_received_so_far); + assert( + V_t.index({Slice(), Slice(0, num_tokens_received_so_far), Slice(), rid}) + .sizes()[2] == m->num_q_heads); + attn_heads[r] = torch::einsum( + "ijk,ljk->ilk", + {qk_softmax[r], + V_t.index( + {Slice(), Slice(0, num_tokens_received_so_far), Slice(), rid})}); + assert(attn_heads[r].sizes()[0] == num_new_tokens); + assert(attn_heads[r].sizes()[1] == m->vProjSize); + assert(attn_heads[r].sizes()[2] == m->num_q_heads); + + // ------------------- Loading CUDA results for this step --------------- + float converted_attn_heads_cpu[num_new_tokens][m->vProjSize] + [m->num_q_heads] = {0}; + for (int i = 0; i < num_new_tokens * m->vProjSize * m->num_q_heads; i++) { + int token_ix = i % num_new_tokens; + int vproj_idx = (i / num_new_tokens) % m->vProjSize; + int head_idx = i / (num_new_tokens * m->vProjSize); + assert(token_ix < num_new_tokens && vproj_idx < m->vProjSize && + head_idx < m->num_q_heads); + converted_attn_heads_cpu[token_ix][vproj_idx][head_idx] = + attn_heads_cpu[r_first_idx[r] * m->vProjSize * m->num_q_heads + i]; + } + torch::Tensor converted_attn_heads_cuda = torch::from_blob( + converted_attn_heads_cpu, + 
{(int64_t)num_new_tokens, m->vProjSize, m->num_q_heads}, + torch::kFloat32); + + // -------------------- Comparing C++ & CUDA results ------------------- + /* std::cout << "CUDA attn head for req " << r << ":" <num_q_heads; h++) { + std::cout << converted_attn_heads_cuda.index({Slice(), Slice(), h}) << + std::endl; + } + std::cout << "C++ attn head for req " << r << ":" <num_q_heads; h++) { + std::cout << attn_heads[r].index({Slice(), Slice(), h}) << std::endl; + } */ + assert(torch::allclose( + converted_attn_heads_cuda, attn_heads[r], 1e-05, 1e-05)); + + // ----------------------- C++ computations ---------------------------- + // Compute output values by projecting all heads to output space + cpp_output.index( + {Slice(), + Slice(r_first_idx[r], r_first_idx[r] + (int64_t)num_new_tokens)}) = + torch::einsum("jkl,ijk->li", {torch_w_out, attn_heads[r]}); + + // increment main loop's auxiliary index + qk_prods_cpu_offset += + num_new_tokens * num_tokens_received_so_far * num_q_heads; + } + + // ----------------------- Comparing C++ & CUDA results --------------------- + /* std::cout << "C++:" <oProjSize; i++) { + std::cout << cpp_output.index({i, Slice()}) << std::endl; + } + std::cout << "CUDA:" <oProjSize; i++) { + std::cout << torch_out_cuda.index({i, Slice(0, + (int64_t)bc->num_active_tokens())}) << std::endl; + } */ + + assert(torch::allclose( + torch_out_cuda.index( + {Slice(), Slice(0, (int64_t)bc->num_active_tokens())}), + cpp_output, + 1e-05, + 1e-05)); + + // ============================================================================= + // Cleanup + // ============================================================================= + free(w_out); + checkCUDA(cudaFreeHost(input_cpu)); + checkCUDA(cudaFreeHost(weight_cpu)); + checkCUDA(cudaFreeHost(output_cpu)); + checkCUDA(cudaFreeHost(QKVProjArray_cpu)); + checkCUDA(cudaFreeHost(keyCache_cpu)); + checkCUDA(cudaFreeHost(valueCache_cpu)); + checkCUDA(cudaFreeHost(qk_prods_cpu)); + checkCUDA(cudaFreeHost(qk_prods_softmax_cpu)); + checkCUDA(cudaFreeHost(attn_heads_cpu)); + checkCUDA(cudaFreeHost(w_out_cuda)); + // assert(false && "All good if you see this assert failure! 
:)"); +#endif + // Done with INFERENCE_TESTS block +} + +void IncMultiHeadSelfAttention::backward(FFModel const &ff) { + // IncMultiHeadSelfAttention does not support backward + assert(false); +} + +bool IncMultiHeadSelfAttention::get_int_parameter(PMParameter para, + int *value) const { + switch (para) { + case PM_NUM_HEADS: + *value = num_q_heads; + return true; + default: + return Op::get_int_parameter(para, value); + } +} + +bool IncMultiHeadSelfAttention::measure_operator_cost( + Simulator *sim, MachineView const &mv, CostMetrics &cost_metrics) const { + return false; +} + +bool operator==(IncMultiHeadSelfAttentionParams const &lhs, + IncMultiHeadSelfAttentionParams const &rhs) { + return lhs.layer_guid == rhs.layer_guid && lhs.embed_dim == rhs.embed_dim && + lhs.num_q_heads == rhs.num_q_heads && lhs.kdim == rhs.kdim && + lhs.vdim == rhs.vdim && lhs.dropout == rhs.dropout && + lhs.bias == rhs.bias && lhs.add_bias_kv == rhs.add_bias_kv && + lhs.add_zero_attn == rhs.add_zero_attn && + lhs.apply_rotary_embedding == rhs.apply_rotary_embedding && + lhs.scaling_query == rhs.scaling_query && + lhs.scaling_factor == rhs.scaling_factor && + lhs.qk_prod_scaling == rhs.qk_prod_scaling; +} + +IncMultiHeadSelfAttentionParams IncMultiHeadSelfAttention::get_params() const { + IncMultiHeadSelfAttentionParams params; + params.layer_guid = this->layer_guid; + params.embed_dim = this->oProjSize; + params.num_q_heads = this->num_q_heads; + params.kdim = this->kProjSize; + params.vdim = this->vProjSize; + params.dropout = this->dropout; + params.bias = this->bias; + params.add_bias_kv = this->add_bias_kv; + params.add_zero_attn = this->add_zero_attn; + params.apply_rotary_embedding = this->apply_rotary_embedding; + params.scaling_query = this->scaling_query; + params.scaling_factor = this->scaling_factor; + params.qk_prod_scaling = this->qk_prod_scaling; + params.tensor_parallelism_degree = this->tensor_parallelism_degree, + params.quantization_type = this->quantization_type; + params.offload = this->offload; + params.num_kv_heads = this->num_kv_heads; + + return params; +} + +}; // namespace FlexFlow + +namespace std { +size_t hash::operator()( + FlexFlow::IncMultiHeadSelfAttentionParams const ¶ms) const { + size_t key = 0; + hash_combine(key, params.layer_guid.id); + hash_combine(key, params.embed_dim); + hash_combine(key, params.num_q_heads); + hash_combine(key, params.num_kv_heads); + hash_combine(key, params.kdim); + hash_combine(key, params.vdim); + hash_combine(key, params.dropout); + hash_combine(key, params.bias); + hash_combine(key, params.add_bias_kv); + hash_combine(key, params.add_zero_attn); + hash_combine(key, params.apply_rotary_embedding); + hash_combine(key, params.scaling_query); + hash_combine(key, params.scaling_factor); + hash_combine(key, params.qk_prod_scaling); + hash_combine(key, params.quantization_type); + hash_combine(key, params.offload); + hash_combine(key, params.tensor_parallelism_degree); + return key; +} +}; // namespace std diff --git a/src/ops/inc_multihead_self_attention.cpp b/src/ops/inc_multihead_self_attention.cpp new file mode 100644 index 0000000000..b7ed189040 --- /dev/null +++ b/src/ops/inc_multihead_self_attention.cpp @@ -0,0 +1,109 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "flexflow/ops/inc_multihead_self_attention.h" +#include "flexflow/utils/hip_helper.h" +#include + +namespace FlexFlow { + +// declare Legion names +using Legion::coord_t; +using Legion::Memory; + +/*static*/ +void IncMultiHeadSelfAttention::inference_kernel_wrapper( + IncMultiHeadSelfAttentionMeta const *m, + BatchConfig const *bc, + int shard_id, + GenericTensorAccessorR const &input, + GenericTensorAccessorR const &weight, + GenericTensorAccessorW const &output, + GenericTensorAccessorR const &bias) { + hipStream_t stream; + checkCUDA(get_legion_stream(&stream)); + + hipEvent_t t_start, t_end; + if (m->profiling) { + hipEventCreate(&t_start); + hipEventCreate(&t_end); + hipEventRecord(t_start, stream); + } + + handle_unimplemented_hip_kernel(OP_INC_MULTIHEAD_SELF_ATTENTION); + + if (m->profiling) { + hipEventRecord(t_end, stream); + checkCUDA(hipEventSynchronize(t_end)); + float elapsed = 0; + checkCUDA(hipEventElapsedTime(&elapsed, t_start, t_end)); + hipEventDestroy(t_start); + hipEventDestroy(t_end); + printf("IncMultiHeadSelfAttention forward time = %.2fms\n", elapsed); + // print_tensor<3, float>(acc_query.ptr, acc_query.rect, + // "[Attention:forward:query]"); print_tensor<3, float>(acc_output.ptr, + // acc_output.rect, "[Attention:forward:output]"); + } +} + +IncMultiHeadSelfAttentionMeta::IncMultiHeadSelfAttentionMeta( + FFHandler handler, + IncMultiHeadSelfAttention const *attn, + GenericTensorAccessorR const &weight, + MemoryAllocator &gpu_mem_allocator, + int num_samples, + int _num_q_heads, + int _num_kv_heads) + : OpMeta(handler, attn) { + hipStream_t stream; + checkCUDA(get_legion_stream(&stream)); + checkCUDNN(miopenSetStream(handler.dnn, stream)); +} + +IncMultiHeadSelfAttentionMeta::IncMultiHeadSelfAttentionMeta( + FFHandler handler, + InferenceMode infer_mode, + Op const *attn, + int _qSize, + int _kSize, + int _vSize, + int _qProjSize, + int _kProjSize, + int _vProjSize, + int _oProjSize, + bool _apply_rotary_embedding, + bool _bias, + bool _scaling_query, + bool _qk_prod_scaling, + bool _add_bias_kv, + float _scaling_factor, + GenericTensorAccessorR const &weight, + MemoryAllocator &gpu_mem_allocator, + int num_samples, + int _global_num_q_heads, + int _global_num_kv_heads, + int _num_q_heads, + int _num_kv_heads, + DataType _quantization_type, + bool _offload) + : OpMeta(handler, attn) { + hipStream_t stream; + checkCUDA(get_legion_stream(&stream)); + checkCUDNN(miopenSetStream(handler.dnn, stream)); +} + +IncMultiHeadSelfAttentionMeta::~IncMultiHeadSelfAttentionMeta(void) {} + +}; // namespace FlexFlow diff --git a/src/ops/inc_multihead_self_attention.cu b/src/ops/inc_multihead_self_attention.cu new file mode 100644 index 0000000000..b694797830 --- /dev/null +++ b/src/ops/inc_multihead_self_attention.cu @@ -0,0 +1,1183 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) +#include "cuComplex.h" +#endif +#include "flexflow/ffconst_utils.h" +#include "flexflow/ops/inc_multihead_self_attention.h" +#include "flexflow/ops/kernels/decompress_kernels.h" +#include "flexflow/ops/kernels/inc_multihead_self_attention_kernels.h" +#include "flexflow/utils/cuda_helper.h" + +namespace FlexFlow { + +// declare Legion names +using Legion::coord_t; +using Legion::Memory; + +namespace Kernels { +namespace IncMultiHeadAttention { + +template +__global__ void apply_proj_bias_w(DT *input_ptr, + DT const *bias_ptr, + int num_tokens, + int qkv_weight_size, + int oProjSize) { + CUDA_KERNEL_LOOP(i, num_tokens * oProjSize) { + int bias_idx = qkv_weight_size + i % oProjSize; + input_ptr[i] += bias_ptr[bias_idx]; + } +} + +template +__global__ void apply_proj_bias_qkv(DT *input_ptr, + DT const *bias_ptr, + int shard_id, + int num_tokens, + int qProjSize, + int kProjSize, + int vProjSize, + int global_num_q_heads, + int global_num_kv_heads, + int num_q_heads, + int num_kv_heads, + bool scaling_query, + float scaling_factor) { + CUDA_KERNEL_LOOP(i, + num_tokens * + (qProjSize * num_q_heads + kProjSize * num_kv_heads + + vProjSize * num_kv_heads)) { + // for simplicity, assume q, k, v is in same shape + // 0->q, 1->k, 2->v + // int qkv_index = i / (num_tokens * qProjSize) % 3; + + int qkv_index = i < num_tokens * qProjSize * num_q_heads + ? 0 + : (i < num_tokens * (qProjSize * num_q_heads + + kProjSize * num_kv_heads) + ? 1 + : 2); + + // int head_idx = i / (num_tokens * (qProjSize + kProjSize + vProjSize)); + // int qkv_block_size = (qProjSize + kProjSize + vProjSize) * num_tokens; + int q_block_size = qProjSize * num_tokens * num_q_heads; + int k_block_size = kProjSize * num_tokens * num_kv_heads; + + // int idx = i % (num_tokens * (qProjSize)); + + // int real_part_index = + // head_idx * qkv_block_size + qkv_index * q_block_size + idx; + int bias_idx = 0; + if (qkv_index == 0) { + int head_idx = i / (num_tokens * qProjSize); + int global_head_idx = head_idx + shard_id * num_q_heads; + int global_i = i + shard_id * num_q_heads * num_tokens * qProjSize; + bias_idx = global_head_idx * qProjSize + + (global_i % (num_tokens * (qProjSize)) % qProjSize); + } else { + + int idx = + qkv_index == 1 ? i - q_block_size : i - q_block_size - k_block_size; + int pre_length = qkv_index == 1 ? 
qProjSize * global_num_q_heads + : qProjSize * global_num_q_heads + + kProjSize * global_num_kv_heads; + + int head_idx = idx / (num_tokens * kProjSize); + int global_head_idx = head_idx + shard_id * num_kv_heads; + int global_idx = idx + shard_id * num_tokens * num_kv_heads * kProjSize; + + bias_idx = pre_length + global_head_idx * kProjSize + + (global_idx % (num_tokens * (qProjSize)) % qProjSize); + } + // int bias_idx = qkv_index * qProjSize * global_num_q_heads + + // global_head_idx * qProjSize + (idx % qProjSize); + + input_ptr[i] += bias_ptr[bias_idx]; + + if (scaling_query && qkv_index == 0) { + input_ptr[i] *= scaling_factor; + } + } +} + +template +__global__ void + apply_rotary_embedding_native(DT *input_ptr, + cuFloatComplex *complex_input, + BatchConfig::PerTokenInfo const *tokenInfos, + int qProjSize, + int kProjSize, + int num_q_heads, + int num_tokens, + int num_kv_heads, + int q_block_size, + int k_block_size, + int q_array_size) { + CUDA_KERNEL_LOOP( + i, + num_tokens * (qProjSize * num_q_heads + kProjSize * num_kv_heads) / 2) { + // create complex number + bool q_tensor = i < (q_array_size / 2); + int proj_size = q_tensor ? qProjSize : kProjSize; + int real_i = q_tensor ? i : i - q_array_size / 2; + + int head_idx = real_i / (num_tokens * proj_size / 2); + int idx = real_i % (num_tokens * proj_size / 2); + int real_part_index = idx * 2 + + head_idx * (q_tensor ? q_block_size : k_block_size) + + (q_tensor ? 0 : q_array_size); + + int complex_part_index = real_part_index + 1; + + complex_input[i] = {input_ptr[real_part_index], + input_ptr[complex_part_index]}; + + int token_idx = + (real_i - head_idx * (num_tokens * proj_size / 2)) / (proj_size / 2); + size_t pos = tokenInfos[token_idx].abs_depth_in_request; + + // float before_real = complex_input[i].x, before_complex = + // complex_input[i].y; + + int pos_i = real_i % (proj_size / 2); + float freq = pos * (1.0 / pow(10000.0, (float)2 * pos_i / proj_size)); + cuFloatComplex complex_pos = {cos(freq), sin(freq)}; + + complex_input[i] = cuCmulf(complex_input[i], complex_pos); + input_ptr[real_part_index] = complex_input[i].x; + input_ptr[complex_part_index] = complex_input[i].y; + } +} + +template +__global__ void + apply_rotary_embedding_hf(DT *input_ptr, + cuFloatComplex *complex_input, + BatchConfig::PerTokenInfo const *tokenInfos, + int qProjSize, + int kProjSize, + int num_q_heads, + int num_tokens, + int num_kv_heads, + int q_block_size, + int k_block_size, + int q_array_size) { + CUDA_KERNEL_LOOP( + i, + num_tokens * (qProjSize * num_q_heads + kProjSize * num_kv_heads) / 2) { + // create complex number + bool q_tensor = i < (q_array_size / 2); + int proj_size = q_tensor ? qProjSize : kProjSize; + int real_i = q_tensor ? i : i - q_array_size / 2; + + int head_idx = real_i / (num_tokens * proj_size / 2); + int idx = real_i % (num_tokens * proj_size / 2); + int token_idx = + (real_i - head_idx * (num_tokens * proj_size / 2)) / (proj_size / 2); + + int real_part_index = idx + token_idx * (proj_size / 2) + + head_idx * (q_tensor ? q_block_size : k_block_size) + + (q_tensor ? 
0 : q_array_size); + int complex_part_index = real_part_index + (proj_size / 2); + + complex_input[i] = {input_ptr[real_part_index], + input_ptr[complex_part_index]}; + + // get the freq_cis: shape 1 * (qProjSize/2) = 1 * 64 + // apply a Cartesian coordinate transformation + // multiple with input & /copy back to q/k + + // get position of token + + // size_t pos = id_map[token_idx].token_position; + size_t pos = tokenInfos[token_idx].abs_depth_in_request; + + // float before_real = complex_input[i].x, before_complex = + int pos_i = real_i % (proj_size / 2); + float freq = pos * (1.0 / pow(10000.0, (float)2 * pos_i / proj_size)); + cuFloatComplex complex_pos = {cos(freq), sin(freq)}; + + complex_input[i] = cuCmulf(complex_input[i], complex_pos); + input_ptr[real_part_index] = complex_input[i].x; + input_ptr[complex_part_index] = complex_input[i].y; + } +} + +template +void compute_qkv_kernel(IncMultiHeadSelfAttentionMeta const *m, + BatchConfig const *bc, + int shard_id, + DT const *input_ptr, + DT const *weight_ptr, + DT *output_ptr, + DT const *bias_ptr, + cudaStream_t stream) { + + checkCUDA(cublasSetStream(m->handle.blas, stream)); + checkCUDNN(cudnnSetStream(m->handle.dnn, stream)); + DT alpha = 1.0f, beta = 0.0f; + assert(m->qSize == m->vSize && m->qSize == m->kSize); + cudaDataType_t cublas_data_type = ff_to_cuda_datatype(m->output_type[0]); +#if CUDA_VERSION >= 11000 + // TODO: currently set the default to CUBLAS_COMPUTE_16F for best performance + cublasComputeType_t compute_type = CUBLAS_COMPUTE_16F; +#else + cudaDataType_t compute_type = cublas_data_type; +#endif + // Compute (W^T)x matmul: einsum(ijkl,im->jmkl) + // Weights: qSize x qProjSize x 3 x num_q_heads + // Input: qSize x num_tokens + // Output >>> qProjSize x num_tokens x 3 x num_q_heads + int m_q = m->qProjSize; + int m_k = m->kProjSize; + int m_v = m->vProjSize; + assert(m_q == m_k && m_k == m_v); // keep things simple for now + int n = bc->num_active_tokens(); + int k = m->qSize; + int m_ = m_q; + int lda = k, ldb = k, ldc = m_q; + + size_t strideA = m_q * k; // query weight head size + size_t strideB = 0; // input stays the same for all heads. + size_t strideC = m_q * n; // size of the output block for each head. 
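+ // In the strided batched GEMM below, each batch entry multiplies one + // head's (qSize x qProjSize) weight slice with the shared + // (qSize x num_tokens) input (strideB = 0 reuses the same input for every + // head), producing a (qProjSize x num_tokens) block per Q, K, or V head.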
+ + // compute QKV + checkCUDA(cublasGemmStridedBatchedEx(m->handle.blas, + CUBLAS_OP_T, + CUBLAS_OP_N, + m_, + n, + k, + &alpha, + weight_ptr, + cublas_data_type, + lda, + strideA, + input_ptr, + cublas_data_type, + ldb, + strideB, + &beta, + output_ptr, + cublas_data_type, + ldc, + strideC, + m->num_q_heads + m->num_kv_heads + + m->num_kv_heads, + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + + // apply rotary emmmbedding for q and k + // step1 change the k, v to complex tensor + int num_tokens = bc->num_active_tokens(); + int parallelism = m->kProjSize * num_tokens * m->num_q_heads; + int q_block_size = m->qProjSize * num_tokens; + int k_block_size = m->kProjSize * num_tokens; + int q_array_size = m->qProjSize * num_tokens * m->num_q_heads; + // apply bias for q, k, v + if (*m->bias) { + apply_proj_bias_qkv<<>>(output_ptr, + bias_ptr, + shard_id, + num_tokens, + m->qProjSize, + m->kProjSize, + m->vProjSize, + m->global_num_q_heads, + m->global_num_kv_heads, + m->num_q_heads, + m->num_kv_heads, + *m->scaling_query, + m->scaling_factor); + } + if (*m->apply_rotary_embedding) { + /*q&k*/ + parallelism = + num_tokens * + (m->qProjSize * m->num_q_heads + m->kProjSize * m->num_kv_heads) / 2; + apply_rotary_embedding_hf<<>>(output_ptr, + m->complex_input, + m->token_infos, + m->qProjSize, + m->kProjSize, + m->num_q_heads, + num_tokens, + m->num_kv_heads, + q_block_size, + k_block_size, + q_array_size); + } +} + +template +void update_kv_cache_kernel(IncMultiHeadSelfAttentionMeta const *m, + BatchConfig const *bc, + cudaStream_t stream) { + int num_tokens = bc->num_active_tokens(); + if (num_tokens > 0) { + int parallelism = + (m->kProjSize + m->vProjSize) * num_tokens * m->num_kv_heads; + store_kv_cache<<>>(static_cast
<DT *>(m->devQKVProjArray), + static_cast
<DT *>(m->keyCache), + static_cast
<DT *>(m->valueCache), + m->token_infos, + m->qProjSize, + m->kProjSize, + m->vProjSize, + num_tokens, + m->num_q_heads, + m->num_kv_heads, + BatchConfig::MAX_SEQ_LENGTH); + } +} + +template <typename DT> +void pre_build_weight_kernel(IncMultiHeadSelfAttentionMeta const *m, + GenericTensorAccessorR const weight, + DataType data_type, + cudaStream_t stream) { + // additional processing for weight uploading + // Note that we update weight_ptr and bias_ptr when uploading weight and + // bias + if (m->quantization_type != DT_NONE) { + // copy weight_ptr to quantized_weight_ptr, do compression and store in + // m->weight_ptr + cudaMemcpyAsync(m->quantized_weight_ptr, + weight.get_byte_ptr(), + m->quantized_weightSize, + cudaMemcpyHostToDevice, + stream); + + if (m->quantization_type == DT_INT4) { + int parallelism = m->qProjSize * m->qSize * m->num_q_heads / 2; + decompress_int4_attention_weights<<<GET_BLOCKS(parallelism), min(CUDA_NUM_THREADS, parallelism), 0, stream>>>( + m->quantized_weight_ptr, + static_cast
<DT *>(m->weight_ptr), + m->qProjSize, + m->qSize, + m->num_q_heads); + } else { + assert(m->quantization_type == DT_INT8); + int parallelism = m->qProjSize * m->qSize * m->num_q_heads; + decompress_int8_attention_weights<<<GET_BLOCKS(parallelism), min(CUDA_NUM_THREADS, parallelism), 0, stream>>>( + m->quantized_weight_ptr, + static_cast
<DT *>(m->weight_ptr), + m->qProjSize, + m->qSize, + m->num_q_heads); + } + } else { + if (data_type == DT_FLOAT) { + cudaMemcpyAsync(m->weight_ptr, + weight.get_float_ptr(), + m->weightSize, + cudaMemcpyHostToDevice, + stream); + } else if (data_type == DT_HALF) { + cudaMemcpyAsync(m->weight_ptr, + weight.get_half_ptr(), + m->weightSize, + cudaMemcpyHostToDevice, + stream); + } else { + assert(false); + } + } +} + +template <typename DT> +void inference_kernel(IncMultiHeadSelfAttentionMeta const *m, + BatchConfig const *bc, + int shard_id, + DT const *input_ptr, + DT const *weight_ptr, + DT *output_ptr, + DT const *bias_ptr, + cudaStream_t stream) { + // here because we need the token position info for inference + + if (m->offload && m->biasSize > 0) { + cudaMemcpyAsync( + m->bias_ptr, bias_ptr, m->biasSize, cudaMemcpyHostToDevice, stream); + bias_ptr = static_cast
<DT *>(m->bias_ptr); + } + cudaMemcpyAsync(m->token_infos, + &(bc->tokensInfo), + bc->num_active_tokens() * sizeof(BatchConfig::PerTokenInfo), + cudaMemcpyHostToDevice, + stream); + // phase 1: compute QKV projections for the input tokens + compute_qkv_kernel(m, + bc, + shard_id, + input_ptr, + weight_ptr, + static_cast
<DT *>(m->devQKVProjArray), + bias_ptr, + stream); + + // phase 2: Update key/val cache + update_kv_cache_kernel<DT>
(m, bc, stream); + + // phase 3: Compute attention score + // 3 kernels for pahse 3: matmul1 - softmax - matmal2 + compute_attention_kernel( + m, bc, shard_id, output_ptr, bias_ptr, weight_ptr, stream); +} + +} // namespace IncMultiHeadAttention +} // namespace Kernels + +using namespace Kernels::IncMultiHeadAttention; + +template +__global__ void store_kv_cache(DT const *devQKVProjArray, + DT *kCache_ptr, + DT *vCache_ptr, + BatchConfig::PerTokenInfo const *tokenInfos, + int qProjSize, + int kProjSize, + int vProjSize, + int num_tokens, + int num_q_heads, + int num_kv_heads, + int max_seq_len) { + CUDA_KERNEL_LOOP(i, num_tokens * (kProjSize + vProjSize) * num_kv_heads) { + int q_array_size = qProjSize * num_tokens * num_q_heads; + int k_array_size = kProjSize * num_tokens * num_kv_heads; + + bool k_cache = i < k_array_size; + int real_i = k_cache ? i : i - k_array_size; + + int proj_size = k_cache ? kProjSize : vProjSize; + int head_idx = real_i / (num_tokens * proj_size); + int token_idx = (real_i - head_idx * (num_tokens * proj_size)) / proj_size; + int data_idx = real_i % proj_size; + + DT val = devQKVProjArray[q_array_size + (k_cache ? 0 : k_array_size) + + head_idx * proj_size * num_tokens + + token_idx * proj_size + data_idx]; + int const req_id = tokenInfos[token_idx].request_index; + int const tok_id = tokenInfos[token_idx].abs_depth_in_request; + + DT *cache_ptr = k_cache ? kCache_ptr : vCache_ptr; + cache_ptr[req_id * (num_kv_heads * max_seq_len * proj_size) + + head_idx * (max_seq_len * proj_size) + tok_id * proj_size + + data_idx] = val; + } +} + +template +__global__ void fill_entries_above_diagonal(DT *matrix, + size_t num_rows, + size_t num_cols, + size_t num_q_heads, + size_t entries_above_diagonal, + DT value) { + CUDA_KERNEL_LOOP(i, entries_above_diagonal * num_q_heads) { + size_t head_idx = i / entries_above_diagonal; + size_t entry_idx = i % entries_above_diagonal; + size_t y = (-1 + sqrt(8 * (float)entry_idx + 1)) / 2; + size_t x = entry_idx - y * (y + 1) / 2; + y += (num_cols - num_rows) + 1; + matrix[head_idx * num_rows * num_cols + num_cols * y + x] = value; + } +} + +template +void compute_attention_kernel(IncMultiHeadSelfAttentionMeta const *m, + BatchConfig const *bc, + int shard_id, + DT *output_ptr, + DT const *bias_ptr, + DT const *weight_ptr, + cudaStream_t stream) { + checkCUDA(cublasSetStream(m->handle.blas, stream)); + checkCUDNN(cudnnSetStream(m->handle.dnn, stream)); + cudaDataType_t cublas_data_type = ff_to_cuda_datatype(m->output_type[0]); + cudnnDataType_t cudnn_data_type = ff_to_cudnn_datatype(m->output_type[0]); + assert(data_type_size(m->output_type[0]) == sizeof(DT)); +#if CUDA_VERSION >= 11000 + // TODO: currently set the default to CUBLAS_COMPUTE_16F for best performance + cublasComputeType_t compute_type = CUBLAS_COMPUTE_16F; +#else + cudaDataType_t compute_type = cublas_data_type; +#endif + // int num_requests = bc->num_active_requests(); + int num_tokens = bc->num_active_tokens(); + int tokens_previous_requests = 0; + int q_block_size = m->qProjSize * num_tokens; + int kt_block_size = m->kProjSize * BatchConfig::MAX_SEQ_LENGTH; + int kt_req_block_size = kt_block_size * m->num_kv_heads; + int vt_block_size = m->vProjSize * BatchConfig::MAX_SEQ_LENGTH; + int vt_req_block_size = vt_block_size * m->num_kv_heads; + assert(m->qProjSize == m->kProjSize); + + for (int i = 0; i < bc->MAX_NUM_REQUESTS; i++) { + if (bc->request_completed[i]) { + continue; + } + int num_new_tokens = bc->requestsInfo[i].num_tokens_in_batch; + int total_tokens = 
bc->requestsInfo[i].token_start_offset + + bc->requestsInfo[i].num_tokens_in_batch; + // bc->token_last_available_idx[i] + 1; + // Compute (QK^T/sqrt(d_k)) + // a flag of using this scaling alpha + int m_ = num_new_tokens; + int n = total_tokens; + int k = m->qProjSize; + int lda = k, ldb = k, ldc = m_; + int strideA = q_block_size; + int strideB = kt_block_size; + int strideC = num_new_tokens * total_tokens; + DT alpha = 1.0f, beta = 0.0f; + if (*m->qk_prod_scaling) { + alpha = static_cast
<DT>(1.0f / sqrt(m->kProjSize)); + } + // To get A, skip over Q entries from previous requests (same head) + DT const *A = static_cast
<DT const *>(m->devQKVProjArray) + + tokens_previous_requests * m->qProjSize; + // To get B, skip over K entries from previous requests (all heads + + // padding) + DT const *B = static_cast
<DT const *>(m->keyCache) + i * kt_req_block_size; + // To get C, skip over QK^T products from previous requests + DT *C = static_cast<DT *>
(m->qk_prods); + if (m->num_kv_heads == m->num_q_heads) { + checkCUDA(cublasGemmStridedBatchedEx(m->handle.blas, + CUBLAS_OP_T, + CUBLAS_OP_N, + m_, + n, + k, + &alpha, + A, + cublas_data_type, + lda, + strideA, + B, + cublas_data_type, + ldb, + strideB, + &beta, + C, + cublas_data_type, + ldc, + strideC, + m->num_q_heads, + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + + } else { + strideB = 0; + // use cublasGemmStridedBatchedEx + int one_step_heads = m->num_q_heads / m->num_kv_heads; + m_ = num_new_tokens; + n = total_tokens; + k = m->qProjSize; + lda = k, ldb = k, ldc = m_; + for (int step = 0; step < m->num_kv_heads; step++) { + checkCUDA( + cublasGemmStridedBatchedEx(m->handle.blas, + CUBLAS_OP_T, + CUBLAS_OP_N, + m_, + n, + k, + &alpha, + A + step * strideA * one_step_heads, + cublas_data_type, + lda, + strideA, + B + step * kt_block_size, + cublas_data_type, + ldb, + strideB, + &beta, + C + step * strideC * one_step_heads, + cublas_data_type, + ldc, + strideC, + one_step_heads, + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + } + } + // Fill all elements above diagonal in qk prods with -inf to force + // causal attention. + assert(num_new_tokens <= total_tokens); + size_t entries_above_diagonal = num_new_tokens * (num_new_tokens - 1) / 2; + if (entries_above_diagonal > 0) { + size_t parallelism = m->num_q_heads * entries_above_diagonal; + fill_entries_above_diagonal<<>>(C, + num_new_tokens, + total_tokens, + m->num_q_heads, + entries_above_diagonal, + static_cast
<DT>(-INFINITY)); + } + // Compute Softmax(QK^T/sqrt(d_k)) + // Before modifying the parameters below, make sure to read the following + // description of the CUDNN_TENSOR_NCHW tensor layout, from + // https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnTensorFormat_t: + // This tensor format specifies that the data is laid out in the following + // order: batch size, feature maps, rows, columns. The strides are + // implicitly defined in such a way that the data are contiguous in memory + // with no padding between images, feature maps, rows, and columns; the + // columns are the inner dimension and the images are the outermost + // dimension. + int n_param = m->num_q_heads; + int c_param = total_tokens; + int h_param = 1; + int w_param = num_new_tokens; + checkCUDNN(cudnnSetTensor4dDescriptor(m->qk_tensor, + CUDNN_TENSOR_NCHW, + cudnn_data_type, + n_param, + c_param, + h_param, + w_param)); + float softmax_alpha = 1.0f, softmax_beta = 0.0f; + DT *C_softmax = static_cast
<DT *>(m->qk_prods_softmax); + // The softmax operation below is executed according to the + // CUDNN_SOFTMAX_MODE_CHANNEL, which is also described in the docs: The + // softmax operation is computed per spatial location (H,W) per image (N) + // across dimension C. + checkCUDNN(cudnnSoftmaxForward(m->handle.dnn, + CUDNN_SOFTMAX_ACCURATE, + CUDNN_SOFTMAX_MODE_CHANNEL, + &softmax_alpha, + m->qk_tensor, + C, + &softmax_beta, + m->qk_tensor, + C_softmax)); + // Matmul softmax(QK^T/sqrt(d_k)) by V + alpha = 1.0f, beta = 0.0f; + m_ = num_new_tokens; + n = m->vProjSize; + k = total_tokens; + lda = m_, ldb = n, ldc = m_; + strideA = num_new_tokens * total_tokens; + strideB = vt_block_size; + strideC = num_new_tokens * m->vProjSize; + // To get A, skip over softmax(QK^T/sqrt(d_k)) entries from previous + // requests (all heads) + A = C_softmax; + // To get B, skip over V^T entries from previous requests (all heads + + // padding) + B = static_cast
<DT const *>(m->valueCache) + i * vt_req_block_size; + // To get C, skip over softmax(QK^T/sqrt(d_k))V products from previous + // requests + C = static_cast<DT *>
(m->attn_heads) + + tokens_previous_requests * m->num_q_heads * m->vProjSize; + + if (m->num_q_heads == m->num_kv_heads) { + checkCUDA(cublasGemmStridedBatchedEx(m->handle.blas, + CUBLAS_OP_N, + CUBLAS_OP_T, + m_, + n, + k, + &alpha, + A, + cublas_data_type, + lda, + strideA, + B, + cublas_data_type, + ldb, + strideB, + &beta, + C, + cublas_data_type, + ldc, + strideC, + m->num_q_heads, + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + } else { + int one_step_heads = m->num_q_heads / m->num_kv_heads; + n = m->vProjSize; + lda = m_, ldb = n, ldc = m_; + strideA = num_new_tokens * total_tokens; + strideB = 0; + strideC = num_new_tokens * m->vProjSize; + for (int step = 0; step < m->num_kv_heads; step++) { + checkCUDA( + cublasGemmStridedBatchedEx(m->handle.blas, + CUBLAS_OP_N, + CUBLAS_OP_T, + m_, + n, + k, + &alpha, + A + step * one_step_heads * strideA, + cublas_data_type, + lda, + strideA, + B + step * vt_block_size, + cublas_data_type, + ldb, + strideB, + &beta, + C + step * one_step_heads * strideC, + cublas_data_type, + ldc, + strideC, + one_step_heads, + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + } + } + // Project to output, save result directly on output tensor + alpha = 1.0f, beta = 0.0f; + m_ = m->oProjSize; + k = m->vProjSize * m->num_q_heads; + n = num_new_tokens; + lda = k, ldb = n, ldc = m_; + A = weight_ptr + m->qSize * (m->qProjSize * m->num_q_heads + + m->kProjSize * m->num_kv_heads + + m->vProjSize * m->num_kv_heads); + B = C; + C = static_cast
(output_ptr) + tokens_previous_requests * m->oProjSize; + + checkCUDA(cublasGemmEx(m->handle.blas, + CUBLAS_OP_T, + CUBLAS_OP_T, + m_, + n, + k, + &alpha, + A, + cublas_data_type, + lda, + B, + cublas_data_type, + ldb, + &beta, + C, + cublas_data_type, + ldc, + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + tokens_previous_requests += num_new_tokens; + } + + if (*m->bias && shard_id == 0) { + int parallelism = m->oProjSize * num_tokens; + int qkv_weight_size = m->qProjSize * m->global_num_q_heads + + m->kProjSize * m->global_num_kv_heads + + m->vProjSize * m->global_num_kv_heads; + + apply_proj_bias_w<<>>( + output_ptr, bias_ptr, num_tokens, qkv_weight_size, m->oProjSize); + } + + assert(tokens_previous_requests == num_tokens); +} + +/*static*/ +void IncMultiHeadSelfAttention::inference_kernel_wrapper( + IncMultiHeadSelfAttentionMeta const *m, + BatchConfig const *bc, + int shard_id, + GenericTensorAccessorR const &input, + GenericTensorAccessorR const &weight, + GenericTensorAccessorW const &output, + GenericTensorAccessorR const &bias) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + bool use_bias = *m->bias; + + cudaEvent_t t_start, t_end; + if (m->profiling) { + cudaEventCreate(&t_start); + cudaEventCreate(&t_end); + cudaEventRecord(t_start, stream); + } + + // assert(input.data_type == weight.data_type); + assert(input.data_type == output.data_type); + if (use_bias) { + assert(input.data_type == bias.data_type); + } + + if (input.data_type == DT_HALF) { + if (m->offload) { + pre_build_weight_kernel(m, weight, input.data_type, stream); + } + half const *bias_ptr = + use_bias ? bias.get_half_ptr() : static_cast(nullptr); + Kernels::IncMultiHeadAttention::inference_kernel( + m, + bc, + shard_id, + input.get_half_ptr(), + m->offload ? static_cast(m->weight_ptr) : weight.get_half_ptr(), + output.get_half_ptr(), + bias_ptr, + stream); + } else if (input.data_type == DT_FLOAT) { + if (m->offload) { + pre_build_weight_kernel(m, weight, input.data_type, stream); + } + float const *bias_ptr = + use_bias ? bias.get_float_ptr() : static_cast(nullptr); + Kernels::IncMultiHeadAttention::inference_kernel( + m, + bc, + shard_id, + input.get_float_ptr(), + m->offload ? 
static_cast(m->weight_ptr) + : weight.get_float_ptr(), + output.get_float_ptr(), + bias_ptr, + stream); + } else { + assert(false && "Unspported data type"); + } + if (m->profiling) { + cudaEventRecord(t_end, stream); + checkCUDA(cudaEventSynchronize(t_end)); + float elapsed = 0; + checkCUDA(cudaEventElapsedTime(&elapsed, t_start, t_end)); + cudaEventDestroy(t_start); + cudaEventDestroy(t_end); + printf("IncMultiHeadSelfAttention forward time = %.2fms\n", elapsed); + // print_tensor<3, float>(acc_query.ptr, acc_query.rect, + // "[Attention:forward:query]"); print_tensor<3, float>(acc_output.ptr, + // acc_output.rect, "[Attention:forward:output]"); + } +} + +IncMultiHeadSelfAttentionMeta::IncMultiHeadSelfAttentionMeta( + FFHandler handler, + IncMultiHeadSelfAttention const *attn, + GenericTensorAccessorR const &weight, + MemoryAllocator &gpu_mem_allocator, + int num_samples, + int _num_q_heads, + int _num_kv_heads) + : IncMultiHeadSelfAttentionMeta(handler, + INC_DECODING_MODE, + attn, + attn->qSize, + attn->kSize, + attn->vSize, + attn->qProjSize, + attn->kProjSize, + attn->vProjSize, + attn->oProjSize, + attn->apply_rotary_embedding, + attn->bias, + attn->scaling_query, + attn->qk_prod_scaling, + attn->add_bias_kv, + attn->scaling_factor, + weight, + gpu_mem_allocator, + num_samples, + attn->num_q_heads, + attn->num_kv_heads, + _num_q_heads, + _num_kv_heads, + attn->quantization_type, + attn->offload) {} + +IncMultiHeadSelfAttentionMeta::IncMultiHeadSelfAttentionMeta( + FFHandler handler, + InferenceMode infer_mode, + Op const *attn, + int _qSize, + int _kSize, + int _vSize, + int _qProjSize, + int _kProjSize, + int _vProjSize, + int _oProjSize, + bool _apply_rotary_embedding, + bool _bias, + bool _scaling_query, + bool _qk_prod_scaling, + bool _add_bias_kv, + float _scaling_factor, + GenericTensorAccessorR const &weight, + MemoryAllocator &gpu_mem_allocator, + int num_samples, + int _global_num_q_heads, + int _global_num_kv_heads, + int _num_q_heads, + int _num_kv_heads, + DataType _quantization_type, + bool _offload) + : OpMeta(handler, attn), weight_ptr(nullptr), bias_ptr(nullptr) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + checkCUDNN(cudnnSetStream(handler.dnn, stream)); + checkCUDNN(cudnnCreateTensorDescriptor(&qk_tensor)); + qSize = _qSize; + kSize = _kSize; + vSize = _vSize; + // assume dimensions match for now + assert(qSize == kSize); + assert(kSize == vSize); + qProjSize = _qProjSize; + kProjSize = _kProjSize; + assert(qProjSize == kProjSize); // required for attention QK^T matmul + vProjSize = _vProjSize; + oProjSize = _oProjSize; + size_t size_of_dt = data_type_size(attn->data_type); + quantization_type = _quantization_type; + offload = _offload; + + global_num_q_heads = _global_num_q_heads; + global_num_kv_heads = _global_num_kv_heads; + num_q_heads = _num_q_heads; + num_kv_heads = _num_kv_heads; + + weightSize = + ((qSize * qProjSize + oProjSize * (vProjSize > 0 ? vProjSize : vSize)) * + num_q_heads + + (kSize * kProjSize + vSize * vProjSize) * num_kv_heads) * + size_of_dt; + if (quantization_type != DT_NONE) { + quantized_weightSize = get_quantization_to_byte_size( + attn->data_type, quantization_type, weightSize); + } + biasSize = _bias ? 
oProjSize * size_of_dt * 4 : 0; + // has_load_weights = (bool *)calloc(1, sizeof(bool)); + //*has_load_weights = false; + apply_rotary_embedding = (bool *)calloc(1, sizeof(bool)); + *apply_rotary_embedding = _apply_rotary_embedding; + bias = (bool *)calloc(1, sizeof(bool)); + *bias = _bias; + scaling_query = (bool *)calloc(1, sizeof(bool)); + *scaling_query = _scaling_query; + scaling_factor = _scaling_factor; + qk_prod_scaling = (bool *)calloc(1, sizeof(bool)); + *qk_prod_scaling = _qk_prod_scaling; + // Currently do not support adding bias to key/value projection + assert(!_add_bias_kv); + + // allocate weight and bias in the reserve space for cpu offloading + if (offload) { + weight_ptr = gpu_mem_allocator.allocate_reserved_untyped(weightSize); + bias_ptr = gpu_mem_allocator.allocate_reserved_untyped(biasSize); + } + +#ifdef INFERENCE_TESTS + kcache = (float *)calloc(kProjSize * BatchConfig::MAX_SEQ_LENGTH * + num_q_heads * BatchConfig::MAX_NUM_REQUESTS, + sizeof(float)); + vcache = (float *)calloc(vProjSize * BatchConfig::MAX_SEQ_LENGTH * + num_q_heads * BatchConfig::MAX_NUM_REQUESTS, + sizeof(float)); +#endif + + // allocate memory for the seqArray and reserve space + { + // size_t qkv_proj_dim = qProjSize + kProjSize + vProjSize; + // size_t qkv_max_proj_size = + // BatchConfig::MAX_NUM_TOKENS * qkv_proj_dim * num_q_heads; + + size_t qkv_max_proj_size = + BatchConfig::MAX_NUM_TOKENS * + (qProjSize * num_q_heads + kProjSize * num_kv_heads + + vProjSize * num_kv_heads); + // std::cout << "num_kv_heads: " << BatchConfig::MAX_NUM_TOKENS << ", " + // << qProjSize << ", " << kProjSize << ", " << vProjSize << ", " + // << num_q_heads << ", " << num_kv_heads << ", " << + // qkv_max_proj_size + // << std::endl; + // assert(false); + size_t key_cache_size = 0, value_cache_size = 0; + switch (infer_mode) { + case INC_DECODING_MODE: + case TREE_VERIFY_MODE: { + key_cache_size = num_kv_heads * kProjSize * + BatchConfig::MAX_NUM_REQUESTS * + BatchConfig::MAX_SEQ_LENGTH; + value_cache_size = num_kv_heads * vProjSize * + BatchConfig::MAX_NUM_REQUESTS * + BatchConfig::MAX_SEQ_LENGTH; + break; + } + case BEAM_SEARCH_MODE: { + key_cache_size = + num_kv_heads * kProjSize * BeamSearchBatchConfig::MAX_NUM_REQUESTS * + BatchConfig::MAX_SEQ_LENGTH * BeamSearchBatchConfig::MAX_BEAM_WIDTH; + value_cache_size = + num_kv_heads * vProjSize * BeamSearchBatchConfig::MAX_NUM_REQUESTS * + BatchConfig::MAX_SEQ_LENGTH * BeamSearchBatchConfig::MAX_BEAM_WIDTH; + break; + } + default: + assert(false && "Unkown inference mode"); + } + size_t tokeninfo_size = BatchConfig::MAX_NUM_TOKENS; + size_t qk_prod_size = + BatchConfig::MAX_NUM_TOKENS * BatchConfig::MAX_SEQ_LENGTH * num_q_heads; + size_t attn_heads_size = + BatchConfig::MAX_NUM_TOKENS * num_q_heads * vProjSize; + size_t W_out_block_size = oProjSize * (vProjSize > 0 ? vProjSize : vSize); + size_t W_out_contiguous_size = W_out_block_size * num_q_heads; + size_t complex_size = + (BatchConfig::MAX_NUM_TOKENS * + (qProjSize * num_q_heads + kProjSize * num_kv_heads)) / + 2; + size_t totalSize = + (qkv_max_proj_size + key_cache_size + value_cache_size + + 2 * qk_prod_size + attn_heads_size + W_out_contiguous_size) * + size_of_dt + + tokeninfo_size * sizeof(BatchConfig::PerTokenInfo) + + complex_size * sizeof(cuFloatComplex); // more components will + // be added here later + if (offload) { + // assert that we have enough reserved work space left + size_t totalSharedSize = + infer_mode == TREE_VERIFY_MODE + ? 
totalSize - + (key_cache_size + value_cache_size + qkv_max_proj_size) * + size_of_dt + : totalSize - (key_cache_size + value_cache_size) * size_of_dt; + + size_t instance_size = + size_of_dt * + (infer_mode == TREE_VERIFY_MODE + ? key_cache_size + value_cache_size + qkv_max_proj_size + : key_cache_size + value_cache_size); + + if (quantization_type != DT_NONE) { + totalSharedSize += quantized_weightSize; + } + assert(gpu_mem_allocator.reserved_total_size - + gpu_mem_allocator.reserved_allocated_size >= + totalSharedSize); + gpu_mem_allocator.create_legion_instance(reserveInst, instance_size); + } else { + gpu_mem_allocator.create_legion_instance(reserveInst, totalSize); + } + + // in tree_verify, enable devQKVProjArray; + if (!offload || infer_mode == TREE_VERIFY_MODE) { + devQKVProjArray = gpu_mem_allocator.allocate_instance_untyped( + qkv_max_proj_size * size_of_dt); + } else { + devQKVProjArray = gpu_mem_allocator.allocate_reserved_untyped( + qkv_max_proj_size * size_of_dt); + // offset += qkv_max_proj_size * size_of_dt; + } + + // use key value cache in all mode. + keyCache = gpu_mem_allocator.allocate_instance_untyped(key_cache_size * + size_of_dt); + valueCache = gpu_mem_allocator.allocate_instance_untyped(value_cache_size * + size_of_dt); + + if (offload) { + token_infos = + gpu_mem_allocator.allocate_reserved( + tokeninfo_size); + // offset += sizeof(BatchConfig::PerTokenInfo) * tokeninfo_size; + qk_prods = gpu_mem_allocator.allocate_reserved_untyped(qk_prod_size * + size_of_dt); + // offset += qk_prod_size * size_of_dt; + qk_prods_softmax = gpu_mem_allocator.allocate_reserved_untyped( + qk_prod_size * size_of_dt); + // offset += qk_prod_size * size_of_dt; + attn_heads = gpu_mem_allocator.allocate_reserved_untyped(attn_heads_size * + size_of_dt); + // offset += attn_heads_size * size_of_dt; + W_out_contiguous = gpu_mem_allocator.allocate_reserved_untyped( + W_out_contiguous_size * size_of_dt); + // offset += W_out_contiguous_size * size_of_dt; + complex_input = + gpu_mem_allocator.allocate_reserved(complex_size); + // offset += complex_size * sizeof(cuFloatComplex); + } else { + token_infos = + gpu_mem_allocator.allocate_instance( + tokeninfo_size); + qk_prods = gpu_mem_allocator.allocate_instance_untyped(qk_prod_size * + size_of_dt); + qk_prods_softmax = gpu_mem_allocator.allocate_instance_untyped( + qk_prod_size * size_of_dt); + attn_heads = gpu_mem_allocator.allocate_instance_untyped(attn_heads_size * + size_of_dt); + W_out_contiguous = gpu_mem_allocator.allocate_instance_untyped( + W_out_contiguous_size * size_of_dt); + complex_input = + gpu_mem_allocator.allocate_instance(complex_size); + } + + // allocate more size for quantization data + if (quantization_type != DT_NONE) { + assert(offload); + quantized_weight_ptr = + gpu_mem_allocator.allocate_reserved(quantized_weightSize); + } + if (!offload) { + assert(gpu_mem_allocator.reserved_total_size == + gpu_mem_allocator.reserved_allocated_size); + } + } + cudaStreamSynchronize(stream); +} + +IncMultiHeadSelfAttentionMeta::~IncMultiHeadSelfAttentionMeta(void) { + if (reserveInst != Realm::RegionInstance::NO_INST) { + reserveInst.destroy(); + } +#ifdef INFERENCE_TESTS + free(kcache); + free(vcache); +#endif +} + +template void Kernels::IncMultiHeadAttention::pre_build_weight_kernel( + IncMultiHeadSelfAttentionMeta const *m, + GenericTensorAccessorR const weight, + DataType data_type, + cudaStream_t stream); + +template void Kernels::IncMultiHeadAttention::pre_build_weight_kernel( + IncMultiHeadSelfAttentionMeta const *m, + 
GenericTensorAccessorR const weight, + DataType data_type, + cudaStream_t stream); + +}; // namespace FlexFlow diff --git a/src/ops/kernels/decompress_kernels.cu b/src/ops/kernels/decompress_kernels.cu new file mode 100644 index 0000000000..2e02ce1eec --- /dev/null +++ b/src/ops/kernels/decompress_kernels.cu @@ -0,0 +1,261 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +#include "flexflow/ffconst_utils.h" +#include "flexflow/ops/kernels/decompress_kernels.h" +#include "flexflow/utils/cuda_helper.h" + +namespace FlexFlow { + +// declare Legion names +using Legion::coord_t; +using Legion::Memory; + +namespace Kernels { + +template +__global__ void decompress_int4_general_weights(char const *input_weight_ptr, + DT *weight_ptr, + int in_dim, + int valueSize) { + // eg. in dim = 3072, out dim = 768 + CUDA_KERNEL_LOOP(i, valueSize / 2) { + size_t real_idx_first = i * 2; + size_t real_idx_second = i * 2 + 1; + size_t group_idx = + (real_idx_first / (in_dim * INT4_NUM_OF_ELEMENTS_PER_GROUP)) * in_dim + + real_idx_first % in_dim; + size_t idx = i; + size_t offset_idx = (valueSize / 2) + group_idx * sizeof(DT); + size_t scale_idx = offset_idx + sizeof(DT) * (valueSize / 32); + + weight_ptr[real_idx_first] = + static_cast
<DT>((input_weight_ptr[idx] >> 4) & 0xF) / + (*(DT *)(input_weight_ptr + scale_idx)) + + (*(DT *)(input_weight_ptr + offset_idx)); + weight_ptr[real_idx_second] = + static_cast
<DT>(input_weight_ptr[idx] & 0xF) / + (*(DT *)(input_weight_ptr + scale_idx + sizeof(DT))) + + (*(DT *)(input_weight_ptr + offset_idx + sizeof(DT))); + } +} + +template <typename DT> +__global__ void decompress_int8_general_weights(char const *input_weight_ptr, + DT *weight_ptr, + int in_dim, + int valueSize) { + CUDA_KERNEL_LOOP(i, valueSize) { + size_t idx = i; + size_t group_idx = + (idx / (in_dim * INT4_NUM_OF_ELEMENTS_PER_GROUP)) * in_dim + + idx % in_dim; + size_t offset_idx = valueSize + group_idx * sizeof(DT); + size_t scale_idx = offset_idx + sizeof(DT) * (valueSize / 32); + weight_ptr[idx] = static_cast
<DT>(input_weight_ptr[idx] & 0xFF) / + (*(DT *)(input_weight_ptr + scale_idx)) + + (*(DT *)(input_weight_ptr + offset_idx)); + } +} + +template <typename DT> +__global__ void decompress_int4_attention_weights(char *input_weight_ptr, + DT *weight_ptr, + int qProjSize, + int qSize, + int num_heads) { + // TODO this is because in top level function we assume q,k,v in same size + CUDA_KERNEL_LOOP(i, qProjSize * num_heads * qSize / 2) { + int q_block_size = (qProjSize * qSize) / 2; + int real_q_block_size = q_block_size * 2; + size_t qkvo_block_size = q_block_size * 4; + size_t real_qkvo_block_size = qkvo_block_size * 2; + + int group_idx = (i * 2 / (INT4_NUM_OF_ELEMENTS_PER_GROUP * qSize)) * qSize + + (i * 2) % qSize; + // i * 2 / (INT4_NUM_OF_ELEMENTS_PER_GROUP); + int head_idx = i / q_block_size; + int data_idx = i % q_block_size; + + size_t idx_q = head_idx * qkvo_block_size + data_idx; + size_t idx_k = idx_q + q_block_size; + size_t idx_v = idx_k + q_block_size; + size_t idx_o = idx_v + q_block_size; + + size_t real_idx_q_first = head_idx * real_qkvo_block_size + data_idx * 2; + size_t real_idx_q_second = real_idx_q_first + 1; + size_t real_idx_k_first = + head_idx * real_qkvo_block_size + real_q_block_size + data_idx * 2; + size_t real_idx_k_second = real_idx_k_first + 1; + size_t real_idx_v_first = + head_idx * real_qkvo_block_size + real_q_block_size * 2 + data_idx * 2; + size_t real_idx_v_second = real_idx_v_first + 1; + size_t real_idx_o_first = + head_idx * real_qkvo_block_size + real_q_block_size * 3 + data_idx * 2; + size_t real_idx_o_second = real_idx_o_first + 1; + + size_t meta_offset = num_heads * qkvo_block_size; + size_t one_meta_size = sizeof(DT) * (qProjSize * num_heads * qSize / 32); + size_t q_offset_idx = meta_offset + group_idx * sizeof(DT); + size_t q_scaling_idx = q_offset_idx + one_meta_size; + + size_t k_offset_idx = q_scaling_idx + one_meta_size; + size_t k_scaling_idx = k_offset_idx + one_meta_size; + + size_t v_offset_idx = k_scaling_idx + one_meta_size; + size_t v_scaling_idx = v_offset_idx + one_meta_size; + + size_t o_offset_idx = v_scaling_idx + one_meta_size; + size_t o_scaling_idx = o_offset_idx + one_meta_size; + + weight_ptr[real_idx_q_first] = + static_cast
<DT>((input_weight_ptr[idx_q] >> 4) & 0xF) / + (*(DT *)(input_weight_ptr + q_scaling_idx)) + + (*(DT *)(input_weight_ptr + q_offset_idx)); + weight_ptr[real_idx_q_second] = + static_cast
<DT>((input_weight_ptr[idx_q] & 0xF)) / + (*(DT *)(input_weight_ptr + q_scaling_idx + sizeof(DT))) + + (*(DT *)(input_weight_ptr + q_offset_idx + sizeof(DT))); + weight_ptr[real_idx_k_first] = + static_cast
<DT>((input_weight_ptr[idx_k] >> 4) & 0xF) / + (*(DT *)(input_weight_ptr + k_scaling_idx)) + + (*(DT *)(input_weight_ptr + k_offset_idx)); + weight_ptr[real_idx_k_second] = + static_cast
<DT>((input_weight_ptr[idx_k] & 0xF)) / + (*(DT *)(input_weight_ptr + k_scaling_idx + sizeof(DT))) + + (*(DT *)(input_weight_ptr + k_offset_idx + sizeof(DT))); + weight_ptr[real_idx_v_first] = + static_cast
<DT>((input_weight_ptr[idx_v] >> 4) & 0xF) / + (*(DT *)(input_weight_ptr + v_scaling_idx)) + + (*(DT *)(input_weight_ptr + v_offset_idx)); + weight_ptr[real_idx_v_second] = + static_cast
<DT>((input_weight_ptr[idx_v] & 0xF)) / + (*(DT *)(input_weight_ptr + v_scaling_idx + sizeof(DT))) + + (*(DT *)(input_weight_ptr + v_offset_idx + sizeof(DT))); + weight_ptr[real_idx_o_first] = + static_cast
<DT>((input_weight_ptr[idx_o] >> 4) & 0xF) / + (*(DT *)(input_weight_ptr + o_scaling_idx)) + + (*(DT *)(input_weight_ptr + o_offset_idx)); + weight_ptr[real_idx_o_second] = + static_cast
<DT>((input_weight_ptr[idx_o] & 0xF)) / + (*(DT *)(input_weight_ptr + o_scaling_idx + sizeof(DT))) + + (*(DT *)(input_weight_ptr + o_offset_idx + sizeof(DT))); + } +} + +template <typename DT> +__global__ void decompress_int8_attention_weights(char *input_weight_ptr, + DT *weight_ptr, + int qProjSize, + int qSize, + int num_heads) { + // TODO this is because in top level function we assume q,k,v in same size + CUDA_KERNEL_LOOP(i, qProjSize * num_heads * qSize) { + int q_block_size = qProjSize * qSize; + size_t qkvo_block_size = q_block_size * 4; + + int group_idx = + (i / (INT4_NUM_OF_ELEMENTS_PER_GROUP * qSize)) * qSize + i % qSize; + // i * 2 / (INT4_NUM_OF_ELEMENTS_PER_GROUP); + int head_idx = i / q_block_size; + int data_idx = i % q_block_size; + + size_t idx_q = head_idx * qkvo_block_size + data_idx; + size_t idx_k = idx_q + q_block_size; + size_t idx_v = idx_k + q_block_size; + size_t idx_o = idx_v + q_block_size; + + size_t meta_offset = num_heads * qkvo_block_size; + size_t one_meta_size = sizeof(DT) * (qProjSize * num_heads * qSize / 32); + size_t q_offset_idx = meta_offset + group_idx * sizeof(DT); + size_t q_scaling_idx = q_offset_idx + one_meta_size; + + size_t k_offset_idx = q_scaling_idx + one_meta_size; + size_t k_scaling_idx = k_offset_idx + one_meta_size; + + size_t v_offset_idx = k_scaling_idx + one_meta_size; + size_t v_scaling_idx = v_offset_idx + one_meta_size; + + size_t o_offset_idx = v_scaling_idx + one_meta_size; + size_t o_scaling_idx = o_offset_idx + one_meta_size; + + weight_ptr[idx_q] = static_cast
<DT>(input_weight_ptr[idx_q] & 0xFF) / + (*(DT *)(input_weight_ptr + q_scaling_idx)) + + (*(DT *)(input_weight_ptr + q_offset_idx)); + weight_ptr[idx_k] = static_cast
<DT>(input_weight_ptr[idx_k] & 0xFF) / + (*(DT *)(input_weight_ptr + k_scaling_idx)) + + (*(DT *)(input_weight_ptr + k_offset_idx)); + weight_ptr[idx_v] = static_cast
<DT>(input_weight_ptr[idx_v] & 0xFF) / + (*(DT *)(input_weight_ptr + v_scaling_idx)) + + (*(DT *)(input_weight_ptr + v_offset_idx)); + weight_ptr[idx_o] = static_cast<DT>
(input_weight_ptr[idx_o] & 0xFF) / + (*(DT *)(input_weight_ptr + o_scaling_idx)) + + (*(DT *)(input_weight_ptr + o_offset_idx)); + } +} + +template __global__ void decompress_int4_general_weights( + char const *input_weight_ptr, float *weight_ptr, int in_dim, int valueSize); +template __global__ void decompress_int4_general_weights( + char const *input_weight_ptr, half *weight_ptr, int in_dim, int valueSize); +template __global__ void decompress_int8_general_weights( + char const *input_weight_ptr, float *weight_ptr, int in_dim, int valueSize); +template __global__ void decompress_int8_general_weights( + char const *input_weight_ptr, half *weight_ptr, int in_dim, int valueSize); +template __global__ void + decompress_int4_attention_weights(char *input_weight_ptr, + float *weight_ptr, + int qProjSize, + int qSize, + int num_heads); + +template __global__ void + decompress_int4_attention_weights(char *input_weight_ptr, + half *weight_ptr, + int qProjSize, + int qSize, + int num_heads); + +template __global__ void + decompress_int8_attention_weights(char *input_weight_ptr, + float *weight_ptr, + int qProjSize, + int qSize, + int num_heads); + +template __global__ void + decompress_int8_attention_weights(char *input_weight_ptr, + half *weight_ptr, + int qProjSize, + int qSize, + int num_heads); +// template +// void decompress_weight_bias(T1 *input_weight_ptr, +// T2 *weight_ptr, +// T2 *params, +// int group_size, +// int tensor_size) { + +// // convert to DT, scaling, add offset; +// cudaStream_t stream; +// checkCUDA(get_legion_stream(&stream)); +// int parallelism = tensor_size; +// decompress_kernel<<>>( +// input_weight_ptr, weight_ptr, params, group_size); +// } +} // namespace Kernels +}; // namespace FlexFlow diff --git a/src/ops/kernels/element_binary_kernels.cpp b/src/ops/kernels/element_binary_kernels.cpp index 4cdc839b59..3aef875d1f 100644 --- a/src/ops/kernels/element_binary_kernels.cpp +++ b/src/ops/kernels/element_binary_kernels.cpp @@ -22,7 +22,8 @@ namespace FlexFlow { using Legion::coord_t; using Legion::Domain; -ElementBinaryMeta::ElementBinaryMeta(FFHandler handler) : OpMeta(handler) { +ElementBinaryMeta::ElementBinaryMeta(FFHandler handler, Op const *op) + : OpMeta(handler, op) { checkCUDNN(miopenCreateTensorDescriptor(&input1Tensor)); checkCUDNN(miopenCreateTensorDescriptor(&input2Tensor)); checkCUDNN(miopenCreateTensorDescriptor(&outputTensor)); @@ -67,9 +68,9 @@ void init_kernel(ElementBinaryMeta *m, /*static*/ void forward_kernel_wrapper(ElementBinaryMeta const *m, - float const *in1_ptr, - float const *in2_ptr, - float *out_ptr) { + GenericTensorAccessorR const &in1, + GenericTensorAccessorR const &in2, + GenericTensorAccessorW const &out) { hipStream_t stream; checkCUDA(get_legion_stream(&stream)); @@ -81,7 +82,8 @@ void forward_kernel_wrapper(ElementBinaryMeta const *m, } // print_tensor(in1_ptr, in1_domain.get_volume(), "input1:"); // print_tensor(in2_ptr, in2_domain.get_volume(), "input2:"); - Internal::forward_kernel(m, in1_ptr, in2_ptr, out_ptr, stream); + Internal::forward_kernel( + m, in1.get_float_ptr(), in2.get_float_ptr(), out.get_float_ptr(), stream); // print_tensor(out_ptr, in1_domain.get_volume(), "output:"); if (m->profiling) { hipEventRecord(t_end, stream); @@ -238,10 +240,11 @@ __global__ void elewise_binary_backward_kernel(coord_t volume, } /*static*/ +template void forward_kernel(ElementBinaryMeta const *m, - float const *in1_ptr, - float const *in2_ptr, - float *out_ptr, + DT const *in1_ptr, + DT const *in2_ptr, + DT *out_ptr, hipStream_t stream) 
{ checkCUDA(hipblasSetStream(m->handle.blas, stream)); checkCUDNN(miopenSetStream(m->handle.dnn, stream)); diff --git a/src/ops/kernels/element_binary_kernels.cu b/src/ops/kernels/element_binary_kernels.cu index cfa9f18279..6d30ae690a 100644 --- a/src/ops/kernels/element_binary_kernels.cu +++ b/src/ops/kernels/element_binary_kernels.cu @@ -21,7 +21,8 @@ namespace FlexFlow { using Legion::coord_t; using Legion::Domain; -ElementBinaryMeta::ElementBinaryMeta(FFHandler handler) : OpMeta(handler) { +ElementBinaryMeta::ElementBinaryMeta(FFHandler handler, Op const *op) + : OpMeta(handler, op) { checkCUDNN(cudnnCreateTensorDescriptor(&input1Tensor)); checkCUDNN(cudnnCreateTensorDescriptor(&input2Tensor)); checkCUDNN(cudnnCreateTensorDescriptor(&outputTensor)); @@ -61,27 +62,28 @@ void init_kernel(ElementBinaryMeta *m, default: assert(false); } + cudnnDataType_t cudnn_data_type = ff_to_cudnn_datatype(m->output_type[0]); checkCUDNN(cudnnSetOpTensorDescriptor( m->opDesc, mode, CUDNN_DATA_FLOAT, CUDNN_PROPAGATE_NAN)); checkCUDNN(cudnnSetReduceTensorDescriptor(m->reduceAddDesc, CUDNN_REDUCE_TENSOR_ADD, - CUDNN_DATA_FLOAT, + cudnn_data_type, CUDNN_PROPAGATE_NAN, CUDNN_REDUCE_TENSOR_NO_INDICES, CUDNN_32BIT_INDICES)); - checkCUDNN( - cudnnSetTensorDescriptorFromDomain(m->input1Tensor, input1_domain)); - checkCUDNN( - cudnnSetTensorDescriptorFromDomain(m->input2Tensor, input2_domain)); - checkCUDNN( - cudnnSetTensorDescriptorFromDomain(m->outputTensor, output_domain)); + checkCUDNN(cudnnSetTensorDescriptorFromDomain( + m->input1Tensor, input1_domain, m->input_type[0])); + checkCUDNN(cudnnSetTensorDescriptorFromDomain( + m->input2Tensor, input2_domain, m->input_type[1])); + checkCUDNN(cudnnSetTensorDescriptorFromDomain( + m->outputTensor, output_domain, m->output_type[0])); } /*static*/ void forward_kernel_wrapper(ElementBinaryMeta const *m, - float const *in1_ptr, - float const *in2_ptr, - float *out_ptr) { + GenericTensorAccessorR const &in1, + GenericTensorAccessorR const &in2, + GenericTensorAccessorW const &out) { cudaStream_t stream; checkCUDA(get_legion_stream(&stream)); @@ -91,7 +93,20 @@ void forward_kernel_wrapper(ElementBinaryMeta const *m, cudaEventCreate(&t_end); cudaEventRecord(t_start, stream); } - Internal::forward_kernel(m, in1_ptr, in2_ptr, out_ptr, stream); + assert(in1.data_type == in2.data_type); + assert(out.data_type == in1.data_type); + if (out.data_type == DT_HALF) { + Internal::forward_kernel( + m, in1.get_half_ptr(), in2.get_half_ptr(), out.get_half_ptr(), stream); + } else if (out.data_type == DT_FLOAT) { + Internal::forward_kernel(m, + in1.get_float_ptr(), + in2.get_float_ptr(), + out.get_float_ptr(), + stream); + } else { + assert(false && "Unsupported data type"); + } if (m->profiling) { cudaEventRecord(t_end, stream); checkCUDA(cudaEventSynchronize(t_end)); @@ -292,10 +307,11 @@ __global__ void elewise_binary_backward_kernel(coord_t volume, } /*static*/ +template void forward_kernel(ElementBinaryMeta const *m, - float const *in1_ptr, - float const *in2_ptr, - float *out_ptr, + DT const *in1_ptr, + DT const *in2_ptr, + DT *out_ptr, cudaStream_t stream) { checkCUDA(cublasSetStream(m->handle.blas, stream)); checkCUDNN(cudnnSetStream(m->handle.dnn, stream)); diff --git a/src/ops/kernels/embedding_kernels.cu b/src/ops/kernels/embedding_kernels.cu index 65f3089409..22d8161ff1 100644 --- a/src/ops/kernels/embedding_kernels.cu +++ b/src/ops/kernels/embedding_kernels.cu @@ -60,7 +60,7 @@ void forward_kernel_wrapper(EmbeddingMeta const *m, m->aggr, output.domain.get_volume(), stream); 
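+    // this branch dereferences double-precision pointers (get_double_ptr), so the data-type guard is corrected from DT_HALF to DT_DOUBLE below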
- } else if (weight.data_type == DT_HALF) { + } else if (weight.data_type == DT_DOUBLE) { Internal::forward_kernel(input.get_int32_ptr(), output.get_double_ptr(), weight.get_double_ptr(), diff --git a/src/ops/kernels/linear_kernels.cpp b/src/ops/kernels/linear_kernels.cpp index 8066ddc812..0d70e91d47 100644 --- a/src/ops/kernels/linear_kernels.cpp +++ b/src/ops/kernels/linear_kernels.cpp @@ -19,7 +19,12 @@ namespace FlexFlow { -LinearMeta::LinearMeta(FFHandler handler, int batch_size) : OpMeta(handler) { +LinearMeta::LinearMeta(FFHandler handler, + int batch_size, + Linear const *li, + MemoryAllocator gpu_mem_allocator, + int weightSize) + : OpMeta(handler, li) { // Allocate an all-one's vector float *dram_one_ptr = (float *)malloc(sizeof(float) * batch_size); for (int i = 0; i < batch_size; i++) { @@ -31,11 +36,12 @@ LinearMeta::LinearMeta(FFHandler handler, int batch_size) : OpMeta(handler) { dram_one_ptr, sizeof(float) * batch_size, hipMemcpyHostToDevice)); - one_ptr = (float const *)fb_one_ptr; + one_ptr = (void *)fb_one_ptr; // Allocate descriptors checkCUDNN(miopenCreateActivationDescriptor(&actiDesc)); checkCUDNN(miopenCreateTensorDescriptor(&outputTensor)); } +LinearMeta::~LinearMeta(void) {} namespace Kernels { namespace Linear { @@ -70,12 +76,13 @@ void Linear::init_kernel(LinearMeta *m, int batch_size, int channel) { assert(false); } checkCUDNN(miopenSetActivationDescriptor(m->actiDesc, mode, 0.0, 0.0, 0.0)); - checkCUDNN(miopenSet4dTensorDescriptor(m->outputTensor, - ff_to_cudnn_datatype(m->output_type), - batch_size, - channel, - 1, - 1)); + checkCUDNN( + miopenSet4dTensorDescriptor(m->outputTensor, + ff_to_cudnn_datatype(m->output_type[0]), + batch_size, + channel, + 1, + 1)); } } @@ -96,15 +103,28 @@ void forward_kernel_wrapper(LinearMeta const *m, hipEventCreate(&t_end); hipEventRecord(t_start, stream); } - Internal::forward_kernel(m, - input_ptr, - output_ptr, - weight_ptr, - bias_ptr, - in_dim, - out_dim, - batch_size, - stream); + + if (m->input_type[0] == DT_FLOAT) { + Internal::forward_kernel(m, + input_ptr, + output_ptr, + weight_ptr, + bias_ptr, + in_dim, + out_dim, + batch_size, + stream); + } else if (m->input_type[0] == DT_HALF) { + Internal::forward_kernel(m, + input_ptr, + output_ptr, + weight_ptr, + bias_ptr, + in_dim, + out_dim, + batch_size, + stream); + } if (m->profiling) { hipEventRecord(t_end, stream); @@ -143,18 +163,34 @@ void backward_kernel_wrapper(LinearMeta const *m, hipEventCreate(&t_end); hipEventRecord(t_start, stream); } - Internal::backward_kernel(m, - input_ptr, - input_grad_ptr, - output_ptr, - output_grad_ptr, - kernel_ptr, - kernel_grad_ptr, - bias_grad_ptr, - in_dim, - out_dim, - batch_size, - stream); + if (m->input_type[0] == DT_FLOAT) { + Internal::backward_kernel(m, + input_ptr, + input_grad_ptr, + output_ptr, + output_grad_ptr, + kernel_ptr, + kernel_grad_ptr, + bias_grad_ptr, + in_dim, + out_dim, + batch_size, + stream); + } else if (m->input_type[0] == DT_HALF) { + Internal::backward_kernel(m, + input_ptr, + input_grad_ptr, + output_ptr, + output_grad_ptr, + kernel_ptr, + kernel_grad_ptr, + bias_grad_ptr, + in_dim, + out_dim, + batch_size, + stream); + } + if (m->profiling) { hipEventRecord(t_end, stream); checkCUDA(hipEventSynchronize(t_end)); @@ -189,7 +225,7 @@ Parameter* Linear::get_parameter(int index) */ namespace Internal { - +template void forward_kernel(LinearMeta const *m, void const *input_ptr, void *output_ptr, @@ -201,15 +237,15 @@ void forward_kernel(LinearMeta const *m, hipStream_t stream) { 
checkCUDA(hipblasSetStream(m->handle.blas, stream)); checkCUDNN(miopenSetStream(m->handle.dnn, stream)); - float alpha = 1.0f, beta = 0.0f; - hipblasDatatype_t input_type = ff_to_cuda_datatype(m->input_type); - hipblasDatatype_t weight_type = ff_to_cuda_datatype(m->weight_type); - hipblasDatatype_t output_type = ff_to_cuda_datatype(m->output_type); + DT alpha = 1.0f, beta = 0.0f; + hipblasDatatype_t input_type = ff_to_cuda_datatype(m->input_type[0]); + hipblasDatatype_t weight_type = ff_to_cuda_datatype(m->weight_type[0]); + hipblasDatatype_t output_type = ff_to_cuda_datatype(m->output_type[0]); #if CUDA_VERSION >= 11000 // TODO: currently set the default to CUBLAS_COMPUTE_16F for best performance cublasComputeType_t compute_type = CUBLAS_COMPUTE_16F; #else - hipblasDatatype_t compute_type = HIPBLAS_R_32F; + hipblasDatatype_t compute_type = input_type; #endif checkCUDA(hipblasGemmEx(m->handle.blas, HIPBLAS_OP_T, @@ -242,8 +278,8 @@ void forward_kernel(LinearMeta const *m, bias_ptr, weight_type, 1, - m->one_ptr, - HIPBLAS_R_32F, + static_cast
(m->one_ptr), + weight_type, 1, &alpha, output_ptr, @@ -281,6 +317,7 @@ void forward_kernel(LinearMeta const *m, } } +template void backward_kernel(LinearMeta const *m, void const *input_ptr, void *input_grad_ptr, @@ -296,10 +333,10 @@ void backward_kernel(LinearMeta const *m, checkCUDA(hipblasSetStream(m->handle.blas, stream)); checkCUDNN(miopenSetStream(m->handle.dnn, stream)); - float alpha = 1.0f; - hipblasDatatype_t input_type = ff_to_cuda_datatype(m->input_type); - hipblasDatatype_t weight_type = ff_to_cuda_datatype(m->weight_type); - hipblasDatatype_t output_type = ff_to_cuda_datatype(m->output_type); + DT alpha = 1.0f; + hipblasDatatype_t input_type = ff_to_cuda_datatype(m->input_type[0]); + hipblasDatatype_t weight_type = ff_to_cuda_datatype(m->weight_type[0]); + hipblasDatatype_t output_type = ff_to_cuda_datatype(m->output_type[0]); #if CUDA_VERSION >= 11000 // TODO: currently set the default to CUBLAS_COMPUTE_16F for best performance cublasComputeType_t compute_type = CUBLAS_COMPUTE_16F; @@ -309,10 +346,10 @@ void backward_kernel(LinearMeta const *m, int output_size = out_dim * batch_size; if (m->activation == AC_MODE_RELU) { relu_backward_kernel( - m->output_type, output_grad_ptr, output_ptr, output_size, stream); + m->output_type[0], output_grad_ptr, output_ptr, output_size, stream); } else if (m->activation == AC_MODE_SIGMOID) { sigmoid_backward_kernel( - m->output_type, output_grad_ptr, output_ptr, output_size, stream); + m->output_type[0], output_grad_ptr, output_ptr, output_size, stream); } else { // TODO: only support relu and sigmoid for now assert(m->activation == AC_MODE_NONE); diff --git a/src/ops/kernels/linear_kernels.cu b/src/ops/kernels/linear_kernels.cu index 3f408c7cb0..8a93357dcf 100644 --- a/src/ops/kernels/linear_kernels.cu +++ b/src/ops/kernels/linear_kernels.cu @@ -13,29 +13,64 @@ * limitations under the License. 
*/ +#include "flexflow/ffconst_utils.h" +#include "flexflow/ops/kernels/decompress_kernels.h" #include "flexflow/ops/kernels/linear_kernels.h" #include "flexflow/utils/cuda_helper.h" namespace FlexFlow { -LinearMeta::LinearMeta(FFHandler handler, int batch_size) : OpMeta(handler) { +LinearMeta::LinearMeta(FFHandler handler, + int batch_size, + Linear const *li, + MemoryAllocator gpu_mem_allocator, + int weightSize) + : OpMeta(handler, li), weight_ptr(nullptr) { + DataType data_type = li->data_type; + // allocate weight and bias in the reserve space for cpu offloading + if (li->offload) { + weight_ptr = gpu_mem_allocator.allocate_reserved_untyped( + weightSize * data_type_size(data_type)); + if (li->quantization_type != DT_NONE) { + quantized_weightSize = get_quantization_to_byte_size( + data_type, li->quantization_type, weightSize); + quantized_weight_ptr = + gpu_mem_allocator.allocate_reserved(quantized_weightSize); + } + } // Allocate an all-one's vector - float *dram_one_ptr = (float *)malloc(sizeof(float) * batch_size); - for (int i = 0; i < batch_size; i++) { - dram_one_ptr[i] = 1.0f; + gpu_mem_allocator.create_legion_instance( + reserveInst, data_type_size(data_type) * batch_size); + one_ptr = gpu_mem_allocator.allocate_instance_untyped( + data_type_size(data_type) * batch_size); + int parallelism = batch_size; + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + if (data_type == DT_FLOAT) { + Kernels::Linear::Internal:: + build_one_ptr<<>>((float *)one_ptr, batch_size); + } else if (data_type == DT_HALF) { + Kernels::Linear::Internal:: + build_one_ptr<<>>((half *)one_ptr, batch_size); } - float *fb_one_ptr; - checkCUDA(cudaMalloc(&fb_one_ptr, sizeof(float) * batch_size)); - checkCUDA(cudaMemcpy(fb_one_ptr, - dram_one_ptr, - sizeof(float) * batch_size, - cudaMemcpyHostToDevice)); - one_ptr = (float const *)fb_one_ptr; + // Allocate descriptors checkCUDNN(cudnnCreateActivationDescriptor(&actiDesc)); checkCUDNN(cudnnCreateTensorDescriptor(&outputTensor)); } +LinearMeta::~LinearMeta(void) { + if (reserveInst != Realm::RegionInstance::NO_INST) { + reserveInst.destroy(); + } +} + namespace Kernels { namespace Linear { @@ -70,13 +105,14 @@ void init_kernel(LinearMeta *m, int batch_size, int channel) { } checkCUDNN(cudnnSetActivationDescriptor( m->actiDesc, mode, CUDNN_PROPAGATE_NAN, 0.0)); - checkCUDNN(cudnnSetTensor4dDescriptor(m->outputTensor, - CUDNN_TENSOR_NCHW, - ff_to_cudnn_datatype(m->output_type), - batch_size, - channel, - 1, - 1)); + checkCUDNN( + cudnnSetTensor4dDescriptor(m->outputTensor, + CUDNN_TENSOR_NCHW, + ff_to_cudnn_datatype(m->output_type[0]), + batch_size, + channel, + 1, + 1)); } } @@ -90,22 +126,33 @@ void forward_kernel_wrapper(LinearMeta const *m, int batch_size) { cudaStream_t stream; checkCUDA(get_legion_stream(&stream)); - cudaEvent_t t_start, t_end; if (m->profiling) { cudaEventCreate(&t_start); cudaEventCreate(&t_end); cudaEventRecord(t_start, stream); } - Internal::forward_kernel(m, - input_ptr, - output_ptr, - weight_ptr, - bias_ptr, - in_dim, - out_dim, - batch_size, - stream); + if (m->input_type[0] == DT_FLOAT) { + Internal::forward_kernel(m, + input_ptr, + output_ptr, + weight_ptr, + bias_ptr, + in_dim, + out_dim, + batch_size, + stream); + } else if (m->input_type[0] == DT_HALF) { + Internal::forward_kernel(m, + input_ptr, + output_ptr, + weight_ptr, + bias_ptr, + in_dim, + out_dim, + batch_size, + stream); + } if (m->profiling) { cudaEventRecord(t_end, stream); @@ -143,18 +190,34 @@ void backward_kernel_wrapper(LinearMeta const *m, 
cudaEventCreate(&t_end); cudaEventRecord(t_start, stream); } - Internal::backward_kernel(m, - input_ptr, - input_grad_ptr, - output_ptr, - output_grad_ptr, - kernel_ptr, - kernel_grad_ptr, - bias_grad_ptr, - in_dim, - out_dim, - batch_size, - stream); + if (m->input_type[0] == DT_FLOAT) { + Internal::backward_kernel(m, + input_ptr, + input_grad_ptr, + output_ptr, + output_grad_ptr, + kernel_ptr, + kernel_grad_ptr, + bias_grad_ptr, + in_dim, + out_dim, + batch_size, + stream); + } else if (m->input_type[0] == DT_HALF) { + Internal::backward_kernel(m, + input_ptr, + input_grad_ptr, + output_ptr, + output_grad_ptr, + kernel_ptr, + kernel_grad_ptr, + bias_grad_ptr, + in_dim, + out_dim, + batch_size, + stream); + } + if (m->profiling) { cudaEventRecord(t_end, stream); checkCUDA(cudaEventSynchronize(t_end)); @@ -189,6 +252,7 @@ Parameter* Linear::get_parameter(int index) */ namespace Internal { +template void forward_kernel(LinearMeta const *m, void const *input_ptr, void *output_ptr, @@ -198,17 +262,60 @@ void forward_kernel(LinearMeta const *m, int out_dim, int batch_size, ffStream_t stream) { + // additional processing for uploading weights + if (m->offload) { + // Note that we update weight_ptr when uploading weight + if (m->quantization_type != DT_NONE) { + cudaMemcpyAsync(m->quantized_weight_ptr, + weight_ptr, + m->quantized_weightSize, + cudaMemcpyHostToDevice, + stream); + if (m->quantization_type == DT_INT4) { + int parallelism = in_dim * out_dim / 2; + decompress_int4_general_weights
<DT> + <<<GET_BLOCKS(parallelism), min(CUDA_NUM_THREADS, parallelism), 0, stream>>>(m->quantized_weight_ptr, + static_cast<DT *>
(m->weight_ptr), + in_dim, + in_dim * out_dim); + } else { + assert(m->quantization_type == DT_INT8); + int parallelism = in_dim * out_dim; + decompress_int8_general_weights
<DT> + <<<GET_BLOCKS(parallelism), min(CUDA_NUM_THREADS, parallelism), 0, stream>>>(m->quantized_weight_ptr, + static_cast<DT *>
(m->weight_ptr), + in_dim, + in_dim * out_dim); + } + + } else { + cudaMemcpyAsync(m->weight_ptr, + weight_ptr, + in_dim * out_dim * sizeof(DT), + cudaMemcpyHostToDevice, + stream); + } + } checkCUDA(cublasSetStream(m->handle.blas, stream)); checkCUDNN(cudnnSetStream(m->handle.dnn, stream)); - float alpha = 1.0f, beta = 0.0f; - cudaDataType_t input_type = ff_to_cuda_datatype(m->input_type); - cudaDataType_t weight_type = ff_to_cuda_datatype(m->weight_type); - cudaDataType_t output_type = ff_to_cuda_datatype(m->output_type); + DT alpha = 1.0f, beta = 0.0f; + cudaDataType_t input_type = ff_to_cuda_datatype(m->input_type[0]); + cudaDataType_t weight_type = m->offload + ? ff_to_cuda_datatype(m->weight_ptr_type) + : ff_to_cuda_datatype(m->weight_type[0]); + cudaDataType_t output_type = ff_to_cuda_datatype(m->output_type[0]); + assert(input_type == weight_type && weight_type == output_type); #if CUDA_VERSION >= 11000 // TODO: currently set the default to CUBLAS_COMPUTE_16F for best performance cublasComputeType_t compute_type = CUBLAS_COMPUTE_16F; #else - cudaDataType_t compute_type = CUDA_R_32F; + cudaDataType_t compute_type = input_type; #endif checkCUDA(cublasGemmEx(m->handle.blas, CUBLAS_OP_T, @@ -217,7 +324,7 @@ void forward_kernel(LinearMeta const *m, batch_size, in_dim, &alpha, - weight_ptr, + m->offload ? m->weight_ptr : weight_ptr, weight_type, in_dim, input_ptr, @@ -241,8 +348,8 @@ void forward_kernel(LinearMeta const *m, bias_ptr, weight_type, 1, - m->one_ptr, - CUDA_R_32F, + static_cast
(m->one_ptr), + weight_type, 1, &alpha, output_ptr, @@ -273,6 +380,7 @@ void forward_kernel(LinearMeta const *m, } } +template void backward_kernel(LinearMeta const *m, void const *input_ptr, void *input_grad_ptr, @@ -288,10 +396,11 @@ void backward_kernel(LinearMeta const *m, checkCUDA(cublasSetStream(m->handle.blas, stream)); checkCUDNN(cudnnSetStream(m->handle.dnn, stream)); - float alpha = 1.0f; - cudaDataType_t input_type = ff_to_cuda_datatype(m->input_type); - cudaDataType_t weight_type = ff_to_cuda_datatype(m->weight_type); - cudaDataType_t output_type = ff_to_cuda_datatype(m->output_type); + DT alpha = 1.0f; + float sgeam_alpha = 1.0f; + cudaDataType_t input_type = ff_to_cuda_datatype(m->input_type[0]); + cudaDataType_t weight_type = ff_to_cuda_datatype(m->weight_type[0]); + cudaDataType_t output_type = ff_to_cuda_datatype(m->output_type[0]); #if CUDA_VERSION >= 11000 // TODO: currently set the default to CUBLAS_COMPUTE_16F for best performance cublasComputeType_t compute_type = CUBLAS_COMPUTE_16F; @@ -301,10 +410,10 @@ void backward_kernel(LinearMeta const *m, int output_size = out_dim * batch_size; if (m->activation == AC_MODE_RELU) { relu_backward_kernel( - m->output_type, output_grad_ptr, output_ptr, output_size, stream); + m->output_type[0], output_grad_ptr, output_ptr, output_size, stream); } else if (m->activation == AC_MODE_SIGMOID) { sigmoid_backward_kernel( - m->output_type, output_grad_ptr, output_ptr, output_size, stream); + m->output_type[0], output_grad_ptr, output_ptr, output_size, stream); } else { // TODO: only support relu and sigmoid for now assert(m->activation == AC_MODE_NONE); @@ -338,7 +447,7 @@ void backward_kernel(LinearMeta const *m, CUBLAS_OP_N, in_dim, out_dim, - &alpha, + &sgeam_alpha, (float *)kernel_grad_ptr, in_dim, &(m->kernel_reg_lambda), @@ -361,7 +470,7 @@ void backward_kernel(LinearMeta const *m, out_dim, batch_size, &alpha, - m->one_ptr, + static_cast
(m->one_ptr), CUDA_R_32F, 1, output_grad_ptr, @@ -399,6 +508,13 @@ void backward_kernel(LinearMeta const *m, } } +template +__global__ void build_one_ptr(DT *one_ptr, int batch_size) { + CUDA_KERNEL_LOOP(i, batch_size) { + one_ptr[i] = static_cast
(1.0f); + } +} + } // namespace Internal } // namespace Linear } // namespace Kernels diff --git a/src/ops/kernels/rms_norm_kernels.cpp b/src/ops/kernels/rms_norm_kernels.cpp new file mode 100644 index 0000000000..b2e2648785 --- /dev/null +++ b/src/ops/kernels/rms_norm_kernels.cpp @@ -0,0 +1,61 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "flexflow/ops/kernels/rms_norm_kernels.h" +#include "flexflow/ops/rms_norm.h" +#include "flexflow/utils/hip_helper.h" +#include + +namespace FlexFlow { +// declare Legion names +using Legion::coord_t; + +RMSNormMeta::RMSNormMeta(FFHandler handler, + RMSNorm const *rms, + MemoryAllocator &gpu_mem_allocator) + : OpMeta(handler, rms) {} +RMSNormMeta::~RMSNormMeta(void) {} +namespace Kernels { +namespace RMSNorm { + +void forward_kernel_wrapper(RMSNormMeta const *m, + GenericTensorAccessorR const &input, + GenericTensorAccessorR const &weight, + GenericTensorAccessorW const &output) { + hipStream_t stream; + checkCUDA(get_legion_stream(&stream)); + + hipEvent_t t_start, t_end; + if (m->profiling) { + hipEventCreate(&t_start); + hipEventCreate(&t_end); + hipEventRecord(t_start, stream); + } + + handle_unimplemented_hip_kernel(OP_RMS_NORM); + + if (m->profiling) { + hipEventRecord(t_end, stream); + checkCUDA(hipEventSynchronize(t_end)); + float elapsed = 0; + checkCUDA(hipEventElapsedTime(&elapsed, t_start, t_end)); + hipEventDestroy(t_start); + hipEventDestroy(t_end); + } +} + +} // namespace RMSNorm +} // namespace Kernels +} // namespace FlexFlow diff --git a/src/ops/kernels/rms_norm_kernels.cu b/src/ops/kernels/rms_norm_kernels.cu new file mode 100644 index 0000000000..234bf73150 --- /dev/null +++ b/src/ops/kernels/rms_norm_kernels.cu @@ -0,0 +1,207 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/ffconst_utils.h" +#include "flexflow/ops/kernels/rms_norm_kernels.h" +#include "flexflow/ops/rms_norm.h" +#include "flexflow/utils/cuda_helper.h" +#include + +namespace FlexFlow { +// declare Legion names +using Legion::coord_t; + +#define C10_WARP_SIZE 32 +constexpr int kCUDABlockReduceNumThreads = 512; +constexpr int kCUDANumThreads = 256; + +RMSNormMeta::RMSNormMeta(FFHandler handler, + RMSNorm const *rms, + MemoryAllocator &gpu_mem_allocator) + : OpMeta(handler, rms) { + eps = rms->eps; + alpha = 1.0f; + beta = 0.0f; + + in_dim = rms->data_dim; + batch_size = rms->effective_batch_size; + num_elements = in_dim * batch_size; + + DataType data_type = rms->weights[0]->data_type; + size_t rms_ptr_size = batch_size; + size_t norm_ptr_size = num_elements; + size_t totalSize = (rms_ptr_size + norm_ptr_size) * data_type_size(data_type); + gpu_mem_allocator.create_legion_instance(reserveInst, totalSize); + rms_ptr = gpu_mem_allocator.allocate_instance_untyped( + rms_ptr_size * data_type_size(data_type)); + norm_ptr = gpu_mem_allocator.allocate_instance_untyped( + norm_ptr_size * data_type_size(data_type)); +} +RMSNormMeta::~RMSNormMeta(void) { + if (reserveInst != Realm::RegionInstance::NO_INST) { + reserveInst.destroy(); + } +} + +namespace Kernels { +namespace RMSNorm { + +template +__device__ __forceinline__ T WARP_SHFL_DOWN(T value, + unsigned int delta, + int width = warpSize, + unsigned int mask = 0xffffffff) { +#ifndef __HIP_PLATFORM_HCC__ + return __shfl_down_sync(mask, value, delta, width); +#else + return __shfl_down(value, delta, width); +#endif +} + +template +__inline__ __device__ T WarpReduceSum(T val) { +#pragma unroll + for (int offset = (C10_WARP_SIZE >> 1); offset > 0; offset >>= 1) { + val += WARP_SHFL_DOWN(val, offset); + } + return val; +} + +template +__inline__ __device__ T BlockReduceSum(T val, T *shared) { + int const lid = threadIdx.x % C10_WARP_SIZE; + int const wid = threadIdx.x / C10_WARP_SIZE; + val = WarpReduceSum(val); + __syncthreads(); + if (lid == 0) { + shared[wid] = val; + } + __syncthreads(); + val = (threadIdx.x < (blockDim.x / C10_WARP_SIZE)) ? 
shared[lid] : T(0); + if (wid == 0) { + val = WarpReduceSum(val); + } + return val; +} + +template +__global__ void + RowwiseRootMeanSquareKernel(long long N, float eps, T const *X, T *rms) { + __shared__ float v_shared[C10_WARP_SIZE]; + long long const i = blockIdx.x; + float sum = 0.0f; + for (long long j = threadIdx.x; j < N; j += blockDim.x) { + long long const index = i * N + j; + sum += (static_cast(X[index]) * static_cast(X[index])); + } + sum = BlockReduceSum(sum, + v_shared); // use BlockReduceSum() to sum X_ij^2 + + if (threadIdx.x == 0) { + rms[i] = static_cast(rsqrt((sum / static_cast(N)) + eps)); + } +} + +template +__global__ void NormKernel(int64_t N, T const *X, T const *rstd, T *Y) { + using T_ACC = T; + const int64_t i = blockIdx.x; + for (int64_t j = threadIdx.x; j < N; j += blockDim.x) { + const int64_t index = i * N + j; + Y[index] = static_cast(X[index]) * static_cast(rstd[i]); + } +} + +template +__global__ void elewise_apply_weights(int64_t batch_size, + int64_t in_dim, + T const *norm, + T const *weights, + T *output) { + CUDA_KERNEL_LOOP(i, batch_size * in_dim) { + output[i] = norm[i] * weights[i % in_dim]; + } +} + +template +void forward_kernel(RMSNormMeta const *m, + T const *input_ptr, + T const *weight_ptr, + T *output_ptr, + cudaStream_t stream) { + int parallelism = m->batch_size * m->in_dim; + RowwiseRootMeanSquareKernel + <<batch_size, kCUDABlockReduceNumThreads, 0, stream>>>( + m->in_dim, m->eps, input_ptr, static_cast(m->rms_ptr)); + NormKernel<<batch_size, kCUDANumThreads, 0, stream>>>( + m->in_dim, + input_ptr, + static_cast(m->rms_ptr), + static_cast(m->norm_ptr)); + elewise_apply_weights<<>>(m->batch_size, + m->in_dim, + static_cast(m->norm_ptr), + weight_ptr, + output_ptr); +} + +void forward_kernel_wrapper(RMSNormMeta const *m, + GenericTensorAccessorR const &input, + GenericTensorAccessorR const &weight, + GenericTensorAccessorW const &output) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + cudaEvent_t t_start, t_end; + if (m->profiling) { + cudaEventCreate(&t_start); + cudaEventCreate(&t_end); + cudaEventRecord(t_start, stream); + } + + assert(output.data_type == input.data_type); + assert(weight.data_type == output.data_type); + if (output.data_type == DT_HALF) { + forward_kernel(m, + input.get_half_ptr(), + weight.get_half_ptr(), + output.get_half_ptr(), + stream); + } else if (output.data_type == DT_FLOAT) { + forward_kernel(m, + input.get_float_ptr(), + weight.get_float_ptr(), + output.get_float_ptr(), + stream); + } else { + assert(false && "Unsupported data type"); + } + + if (m->profiling) { + cudaEventRecord(t_end, stream); + checkCUDA(cudaEventSynchronize(t_end)); + float elapsed = 0; + checkCUDA(cudaEventElapsedTime(&elapsed, t_start, t_end)); + cudaEventDestroy(t_start); + cudaEventDestroy(t_end); + printf("[RMSNorm] forward time (CF) = %.2fms\n", elapsed); + } +} + +} // namespace RMSNorm +} // namespace Kernels +} // namespace FlexFlow diff --git a/src/ops/kernels/softmax.cpp b/src/ops/kernels/softmax.cpp index d63bd0edc5..d09a5aaf6d 100644 --- a/src/ops/kernels/softmax.cpp +++ b/src/ops/kernels/softmax.cpp @@ -36,9 +36,10 @@ SoftmaxMeta::SoftmaxMeta(FFHandler handler, namespace Kernels { namespace Softmax { +template void forward_kernel_wrapper(SoftmaxMeta const *m, - float const *input_ptr, - float *output_ptr) { + DT const *input_ptr, + DT *output_ptr) { hipStream_t stream; checkCUDA(get_legion_stream(&stream)); @@ -64,9 +65,10 @@ void forward_kernel_wrapper(SoftmaxMeta const *m, } } +template void 
backward_kernel_wrapper(SoftmaxMeta const *m, - float *input_grad_ptr, - float const *output_grad_ptr, + DT *input_grad_ptr, + DT const *output_grad_ptr, size_t num_elements) { hipStream_t stream; checkCUDA(get_legion_stream(&stream)); @@ -94,11 +96,27 @@ void backward_kernel_wrapper(SoftmaxMeta const *m, } } -namespace Internal { +template void forward_kernel_wrapper(SoftmaxMeta const *m, + float const *input_ptr, + float *output_ptr); +template void forward_kernel_wrapper(SoftmaxMeta const *m, + half const *input_ptr, + half *output_ptr); + +template void backward_kernel_wrapper(SoftmaxMeta const *m, + float *input_grad_ptr, + float const *output_grad_ptr, + size_t num_elements); +template void backward_kernel_wrapper(SoftmaxMeta const *m, + half *input_grad_ptr, + half const *output_grad_ptr, + size_t num_elements); +namespace Internal { +template void forward_kernel(SoftmaxMeta const *m, - float const *input_ptr, - float *output_ptr, + DT const *input_ptr, + DT *output_ptr, hipStream_t stream) { checkCUDNN(miopenSetStream(m->handle.dnn, stream)); @@ -114,13 +132,14 @@ void forward_kernel(SoftmaxMeta const *m, MIOPEN_SOFTMAX_MODE_CHANNEL)); } -void backward_kernel(float *input_grad_ptr, - float const *output_grad_ptr, +template +void backward_kernel(DT *input_grad_ptr, + DT const *output_grad_ptr, size_t num_elements, hipStream_t stream) { checkCUDA(hipMemcpyAsync(input_grad_ptr, output_grad_ptr, - num_elements * sizeof(float), + num_elements * sizeof(DT), hipMemcpyDeviceToDevice, stream)); } diff --git a/src/ops/kernels/softmax.cu b/src/ops/kernels/softmax.cu index d83d9952c9..15130c19a7 100644 --- a/src/ops/kernels/softmax.cu +++ b/src/ops/kernels/softmax.cu @@ -26,7 +26,8 @@ SoftmaxMeta::SoftmaxMeta(FFHandler handler, Domain const &input_domain) : OpMeta(handler) { checkCUDNN(cudnnCreateTensorDescriptor(&inputTensor)); - checkCUDNN(cudnnSetTensorDescriptorFromDomain(inputTensor, input_domain)); + checkCUDNN(cudnnSetTensorDescriptorFromDomain4SoftMax( + inputTensor, input_domain, softmax->data_type)); dim = softmax->dim; profiling = softmax->profiling; std::strcpy(op_name, softmax->name); @@ -35,9 +36,10 @@ SoftmaxMeta::SoftmaxMeta(FFHandler handler, namespace Kernels { namespace Softmax { +template void forward_kernel_wrapper(SoftmaxMeta const *m, - float const *input_ptr, - float *output_ptr) { + DT const *input_ptr, + DT *output_ptr) { cudaStream_t stream; checkCUDA(get_legion_stream(&stream)); @@ -63,9 +65,10 @@ void forward_kernel_wrapper(SoftmaxMeta const *m, } } +template void backward_kernel_wrapper(SoftmaxMeta const *m, - float *input_grad_ptr, - float const *output_grad_ptr, + DT *input_grad_ptr, + DT const *output_grad_ptr, size_t num_elements) { cudaStream_t stream; checkCUDA(get_legion_stream(&stream)); @@ -93,11 +96,26 @@ void backward_kernel_wrapper(SoftmaxMeta const *m, } } -namespace Internal { +template void forward_kernel_wrapper(SoftmaxMeta const *m, + float const *input_ptr, + float *output_ptr); +template void forward_kernel_wrapper(SoftmaxMeta const *m, + half const *input_ptr, + half *output_ptr); +template void backward_kernel_wrapper(SoftmaxMeta const *m, + float *input_grad_ptr, + float const *output_grad_ptr, + size_t num_elements); +template void backward_kernel_wrapper(SoftmaxMeta const *m, + half *input_grad_ptr, + half const *output_grad_ptr, + size_t num_elements); +namespace Internal { +template void forward_kernel(SoftmaxMeta const *m, - float const *input_ptr, - float *output_ptr, + DT const *input_ptr, + DT *output_ptr, cudaStream_t stream) { 
checkCUDNN(cudnnSetStream(m->handle.dnn, stream)); @@ -113,13 +131,14 @@ void forward_kernel(SoftmaxMeta const *m, output_ptr)); } -void backward_kernel(float *input_grad_ptr, - float const *output_grad_ptr, +template +void backward_kernel(DT *input_grad_ptr, + DT const *output_grad_ptr, size_t num_elements, cudaStream_t stream) { checkCUDA(cudaMemcpyAsync(input_grad_ptr, output_grad_ptr, - num_elements * sizeof(float), + num_elements * sizeof(DT), cudaMemcpyDeviceToDevice, stream)); } diff --git a/src/ops/layer_norm.cc b/src/ops/layer_norm.cc index 4ad7fb4447..2dca38578f 100644 --- a/src/ops/layer_norm.cc +++ b/src/ops/layer_norm.cc @@ -61,10 +61,27 @@ Tensor FFModel::layer_norm(const Tensor input, std::vector const &axes, bool elementwise_affine, float eps, + DataType data_type, char const *name) { - // FIXME: currently disable elementwise_affine - elementwise_affine = false; - // axes must be the last axes.size() dimensions + // In PyTorch, axes must be the sizes of the last axes.size() dimensions of + // the input tensor. However, since the tensor dimensions are reversed in + // FlexFlow (batch size is the last dimension), we require that axes must be + // the sizes of the FIRST axes.size() dimensions of the input tensor. + + // Another difference is that in PyTorch, the axes vector should contain the + // sizes of the dimensions with respect to which you want to compute the + // layernorm. In FlexFlow, instead, axes should contain the INDICES of the + // dimensions in question. We do this because the size of a dimension might be + // different when splitting a tensor in model parallelism. + assert( + axes.size() <= input->num_dims && + "number of axes must be less than tensor dimensions"); // input does not + // have replica + // dimension here + for (int i = 0; i < axes.size(); i++) { + assert(axes[i] == i && "axes must be the first axes.size() dimensions"); + } +#ifdef DEADCODE for (int i = 0; i < axes.size(); i++) { bool found = false; for (int j = 0; j < axes.size(); j++) { @@ -76,15 +93,33 @@ Tensor FFModel::layer_norm(const Tensor input, assert(false && "axes must be the last axes.size() dimensions"); } } +#endif + if (data_type == DT_NONE) { + data_type = input->data_type; + } int num_weights = elementwise_affine ? 
2 : 0; - Layer *ln = new Layer(this, - OP_LAYERNORM, - DT_FLOAT, - name, - 1 /*inputs*/, - num_weights, - 1 /*outputs*/, - input); + Layer *ln = nullptr; + if (data_type != input->data_type) { + Tensor casted_input = cast(input, data_type, "type cast for layer_norm"); + ln = new Layer(this, + OP_LAYERNORM, + data_type, + name, + 1 /*inputs*/, + num_weights, + 1 /*outputs*/, + casted_input); + } else { + ln = new Layer(this, + OP_LAYERNORM, + data_type, + name, + 1 /*inputs*/, + num_weights, + 1 /*outputs*/, + input); + } + ln->outputs[0] = create_tensor_legion_ordering(input->num_dims, input->dims, input->data_type, @@ -92,19 +127,19 @@ Tensor FFModel::layer_norm(const Tensor input, 0, true /*create_grad*/); if (num_weights == 2) { - int M = 1; - for (int i = 0; i < axes.size(); i++) { - M *= input->dims[input->num_dims - 1 - axes[i]]; + int numdims = axes.size(); + int dims[numdims]; + for (int i = 0; i < numdims; i++) { + dims[i] = input->dims[axes[i]]; } - int dims[1] = {M}; - ln->weights[0] = create_weight_legion_ordering(1, + ln->weights[0] = create_weight_legion_ordering(numdims, dims, input->data_type, ln, true /*create_grad*/, nullptr, CHOSEN_SYNC_TYPE); - ln->weights[1] = create_weight_legion_ordering(1, + ln->weights[1] = create_weight_legion_ordering(numdims, dims, input->data_type, ln, @@ -179,19 +214,93 @@ LayerNorm::LayerNorm(FFModel &model, ParallelDim output_dims[MAX_TENSOR_DIM]; int M = 1; for (int i = 0; i < axes.size(); i++) { - M *= inputs[0]->dims[inputs[0]->num_dims - 1 - axes[i]].size; + M *= inputs[0]->dims[axes[i]].size; + } + int num_replicas = 1; + for (int i = 0; i < inputs[0]->num_dims; i++) { + if (inputs[0]->dims[i].is_replica_dim) { + num_replicas *= inputs[0]->dims[i].size; + } } effective_num_elements = M; - effective_batch_size = inputs[0]->get_volume() / M; + effective_batch_size = (inputs[0]->get_volume() / num_replicas) / M; + assert(elementwise_affine == (numWeights == 2)); if (numWeights > 0 && allocate_weights) { - int kernel_dims = 2; - assert(false); - // weights[0] = model.create_parallel_weight_legion_ordering( - // kernel_dims, - } else { - // do nothing + ParallelTensorShape beta_gamma_shape = _input->get_shape(); + for (int i = axes.size(); i < beta_gamma_shape.num_dims - 1; i++) { + beta_gamma_shape.dims[i].size = 1; + } + int seed = std::rand(); + Initializer *gamma_initializer = new UniformInitializer(seed, 1.0f, 1.0f); + Initializer *beta_initializer = new UniformInitializer(seed, 0.0f, 0.0f); + weights[0] = model.create_parallel_weight_legion_ordering( + beta_gamma_shape.num_dims, // axes.size(), + beta_gamma_shape.dims, + _input->data_type, + NULL /*owner_op*/, + true /*create_grad*/, + gamma_initializer, + CHOSEN_SYNC_TYPE); + weights[1] = model.create_parallel_weight_legion_ordering( + beta_gamma_shape.num_dims, //.size(), + beta_gamma_shape.dims, + _input->data_type, + NULL /*owner_op*/, + true /*create_grad*/, + beta_initializer, + CHOSEN_SYNC_TYPE); } - return; +} + +void LayerNorm::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(LAYERNORM_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(LayerNorm)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(1, FID_DATA); + if (elementwise_affine) { + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(2, FID_DATA); + launcher.add_region_requirement(RegionRequirement(weights[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[1]->region)); + launcher.add_field(3, FID_DATA); + } + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); } void LayerNorm::init(FFModel const &ff) { @@ -221,6 +330,20 @@ void LayerNorm::init(FFModel const &ff) { EXCLUSIVE, inputs[0]->region)); launcher.add_field(1, FID_DATA); + if (elementwise_affine) { + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(2, FID_DATA); + launcher.add_region_requirement(RegionRequirement(weights[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[1]->region)); + launcher.add_field(3, FID_DATA); + } FutureMap fm = runtime->execute_index_space(ctx, launcher); fm.wait_all_results(); set_opmeta_from_futuremap(ff, fm); @@ -232,7 +355,14 @@ OpMeta *LayerNorm::init_task(Task const *task, Runtime *runtime) { LayerNorm *ln = (LayerNorm *)task->args; FFHandler handle = *((FFHandler const *)task->local_args); - LayerNormMeta *meta = new LayerNormMeta(handle, ln); + Memory gpu_mem = Machine::MemoryQuery(Machine::get_machine()) + .only_kind(Memory::GPU_FB_MEM) + .best_affinity_to(task->target_proc) + .first(); + MemoryAllocator gpu_mem_allocator(gpu_mem); + LayerNormMeta *meta = new LayerNormMeta(handle, ln, gpu_mem_allocator); + meta->input_type[0] = ln->inputs[0]->data_type; + meta->output_type[0] = ln->outputs[0]->data_type; return meta; } @@ -264,13 +394,13 @@ void LayerNorm::forward(FFModel const &ff) { if (elementwise_affine) { launcher.add_region_requirement(RegionRequirement(weights[0]->part, 0 /*projection id*/, - READ_WRITE, + READ_ONLY, EXCLUSIVE, weights[0]->region)); launcher.add_field(2, FID_DATA); launcher.add_region_requirement(RegionRequirement(weights[1]->part, 0 /*projection id*/, - READ_WRITE, + READ_ONLY, EXCLUSIVE, weights[1]->region)); launcher.add_field(3, FID_DATA); @@ -278,6 +408,57 @@ void LayerNorm::forward(FFModel const &ff) { runtime->execute_index_space(ctx, launcher); } +FutureMap LayerNorm::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + /* std::cout << "LayerNorm op machine_view: " << *(MachineView const *)mv + << std::endl; */ + IndexLauncher launcher(LAYERNORM_FWD_TASK_ID, + parallel_is, + TaskArgument(NULL, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + if (elementwise_affine) { + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(2, FID_DATA); + launcher.add_region_requirement(RegionRequirement(weights[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[1]->region)); + launcher.add_field(3, FID_DATA); + } + return runtime->execute_index_space(ctx, launcher); +} + /* regions[0](I): input regions[1](O): output @@ -292,14 +473,21 @@ void LayerNorm::forward_task(Task const *task, assert(task->regions.size() == regions.size()); float const *in_ptr = NULL; float *out_ptr = NULL, *gamma_ptr = NULL, *beta_ptr = NULL; + GenericTensorAccessorR in, gamma, beta; + GenericTensorAccessorW out; + Domain in_domain = runtime->get_index_space_domain( ctx, task->regions[0].region.get_index_space()); - in_ptr = helperGetTensorPointerRO( - regions[0], task->regions[0], FID_DATA, ctx, runtime); + // in_ptr = helperGetTensorPointerRO( + // regions[0], task->regions[0], FID_DATA, ctx, runtime); + in = helperGetGenericTensorAccessorRO( + m->input_type[0], regions[0], task->regions[0], FID_DATA, ctx, runtime); Domain out_domain = runtime->get_index_space_domain( ctx, task->regions[1].region.get_index_space()); - out_ptr = helperGetTensorPointerWO( - regions[1], task->regions[1], FID_DATA, ctx, runtime); + // out_ptr = helperGetTensorPointerWO( + // regions[1], task->regions[1], FID_DATA, ctx, runtime); + out = helperGetGenericTensorAccessorWO( + m->output_type[0], regions[1], task->regions[1], FID_DATA, ctx, runtime); assert(in_domain == out_domain); assert(in_domain.get_volume() == m->effective_num_elements * m->effective_batch_size); @@ -307,20 +495,32 @@ void LayerNorm::forward_task(Task const *task, assert(regions.size() == 4); Domain gamma_domain = runtime->get_index_space_domain( ctx, task->regions[2].region.get_index_space()); - gamma_ptr = helperGetTensorPointerRW( - regions[2], task->regions[2], FID_DATA, ctx, runtime); + // gamma_ptr = helperGetTensorPointerRW( + // regions[2], task->regions[2], FID_DATA, ctx, runtime); + gamma = helperGetGenericTensorAccessorRO( + m->input_type[0], regions[2], task->regions[2], FID_DATA, ctx, runtime); Domain beta_domain = runtime->get_index_space_domain( ctx, task->regions[3].region.get_index_space()); - beta_ptr = helperGetTensorPointerRW( - regions[3], task->regions[3], FID_DATA, ctx, runtime); + // beta_ptr = helperGetTensorPointerRW( + // regions[3], task->regions[3], FID_DATA, ctx, runtime); + beta = helperGetGenericTensorAccessorRO( + m->input_type[0], regions[3], task->regions[3], FID_DATA, ctx, runtime); assert(gamma_domain == beta_domain); assert(gamma_domain.get_volume() == 
m->effective_num_elements); + int numdims = gamma_domain.get_dim(); + size_t vol = 1; + int i = 0; + while (vol < gamma_domain.get_volume()) { + int g_d = gamma_domain.hi()[i] - gamma_domain.lo()[i] + 1; + int in_d = in_domain.hi()[i] - in_domain.lo()[i] + 1; + assert(g_d == in_d); + vol *= g_d; + i++; + } } else { assert(regions.size() == 2); } - - LayerNorm::forward_kernel_wrapper( - m, in_ptr, out_ptr, gamma_ptr, beta_ptr); + LayerNorm::forward_kernel_wrapper(m, in, out, gamma, beta); } void LayerNorm::backward(FFModel const &ff) { @@ -454,19 +654,26 @@ bool LayerNorm::measure_operator_cost(Simulator *sim, if (!inputs[0]->get_sub_tensor(mv, sub_input)) { return false; } - LayerNormMeta *m = new LayerNormMeta(sim->handler, this); + Domain input_domain = sub_input.get_domain(); + Domain output_domain = sub_output.get_domain(); + LayerNormMeta *m = sim->layernorm_meta; sim->free_all(); float *in_ptr = (float *)sim->allocate(sub_input.get_volume(), DT_FLOAT); assert(in_ptr != NULL); + GenericTensorAccessorR input1_acc(inputs[0]->data_type, input_domain, in_ptr); cost_metrics.inputs_memory += cost_metrics.total_mem_diff_from(sim->offset); float *out_ptr = (float *)sim->allocate(sub_output.get_volume(), DT_FLOAT); assert(out_ptr != NULL); + GenericTensorAccessorW output_acc( + outputs[0]->data_type, output_domain, out_ptr); cost_metrics.outputs_memory += cost_metrics.total_mem_diff_from(sim->offset); // FIXME please add gamma_ptr and beta_ptr after finish the implementation float *gamma_ptr = NULL, *beta_ptr = NULL; + GenericTensorAccessorW gamma_acc; + GenericTensorAccessorW beta_acc; bool out_of_memory = (in_ptr == NULL) || (out_ptr == NULL) || @@ -479,7 +686,7 @@ bool LayerNorm::measure_operator_cost(Simulator *sim, std::function forward, backward; forward = [&] { - forward_kernel_wrapper(m, in_ptr, out_ptr, gamma_ptr, beta_ptr); + forward_kernel_wrapper(m, input1_acc, output_acc, gamma_acc, beta_acc); }; if (sim->computationMode == COMP_MODE_TRAINING) { @@ -538,6 +745,7 @@ bool LayerNorm::measure_operator_cost(Simulator *sim, void LayerNorm::serialize(Legion::Serializer &sez) const { sez.serialize(this->layer_guid.id); + sez.serialize(this->layer_guid.transformer_layer_id); sez.serialize(this->axes.size()); for (size_t i = 0; i < this->axes.size(); i++) { sez.serialize(this->axes[i]); @@ -557,9 +765,10 @@ Node LayerNorm::deserialize(FFModel &ff, std::vector axes; bool elementwise_affine; float eps; - size_t id; + size_t id, transformer_layer_id; dez.deserialize(id); - LayerID layer_guid(id); + dez.deserialize(transformer_layer_id); + LayerID layer_guid(id, transformer_layer_id); dez.deserialize(num_axes); for (size_t i = 0; i < num_axes; i++) { int axis_idx; diff --git a/src/ops/layer_norm.cpp b/src/ops/layer_norm.cpp index c3030e20b4..855f7296e8 100644 --- a/src/ops/layer_norm.cpp +++ b/src/ops/layer_norm.cpp @@ -24,7 +24,9 @@ constexpr int kCUDABlockReduceNumThreads = 512; constexpr int kCUDANumThreads = 256; constexpr int kColwiseReduceTileSize = 32; -LayerNormMeta::LayerNormMeta(FFHandler handle, LayerNorm const *ln) +LayerNormMeta::LayerNormMeta(FFHandler handle, + LayerNorm const *ln, + MemoryAllocator &gpu_mem_allocator) : OpMeta(handle) { elementwise_affine = ln->elementwise_affine; effective_batch_size = ln->effective_batch_size; @@ -38,6 +40,8 @@ LayerNormMeta::LayerNormMeta(FFHandler handle, LayerNorm const *ln) checkCUDA(hipMalloc(&bias_ptr, sizeof(float) * effective_batch_size)); } +LayerNormMeta::~LayerNormMeta(void) {} + template __device__ __forceinline__ T 
WARP_SHFL_DOWN(T value, unsigned int delta, @@ -79,26 +83,26 @@ __inline__ __device__ T BlockReduceSum(T val, T *shared) { } template -__global__ void - RowwiseMomentsCUDAKernel(int64_t N, T eps, T const *X, T *mean, T *rstd) { - __shared__ T m_shared[C10_WARP_SIZE]; - __shared__ T v_shared[C10_WARP_SIZE]; +__global__ void RowwiseMomentsCUDAKernel( + int64_t N, float eps, T const *X, T *mean, T *rstd) { + __shared__ float m_shared[C10_WARP_SIZE]; + __shared__ float v_shared[C10_WARP_SIZE]; const int64_t i = blockIdx.x; - T sum1 = 0; - T sum2 = 0; + float sum1 = 0.0f; + float sum2 = 0.0f; for (int64_t j = threadIdx.x; j < N; j += blockDim.x) { const int64_t index = i * N + j; - sum1 += static_cast(X[index]); - sum2 += static_cast(X[index]) * static_cast(X[index]); + sum1 += static_cast(X[index]); + sum2 += static_cast(X[index]) * static_cast(X[index]); } - sum1 = BlockReduceSum(sum1, m_shared); - sum2 = BlockReduceSum(sum2, v_shared); + sum1 = BlockReduceSum(sum1, m_shared); + sum2 = BlockReduceSum(sum2, v_shared); if (threadIdx.x == 0) { - const T scale = T(1) / static_cast(N); + float const scale = float(1) / static_cast(N); sum1 *= scale; - sum2 = max(sum2 * scale - sum1 * sum1, T(0)); - mean[i] = sum1; - rstd[i] = rsqrt(sum2 + static_cast(eps)); + sum2 = max(sum2 * scale - sum1 * sum1, float(0)); + mean[i] = static_cast(sum1); + rstd[i] = static_cast(rsqrt(sum2 + eps)); } } @@ -129,10 +133,10 @@ template void LayerNorm::forward_kernel(LayerNormMeta const *m, T const *in_ptr, T *out_ptr, - T *gamma_ptr, - T *beta_ptr, + T const *gamma_ptr, + T const *beta_ptr, hipStream_t stream) { - hipLaunchKernelGGL(HIP_KERNEL_NAME(RowwiseMomentsCUDAKernel), + hipLaunchKernelGGL(HIP_KERNEL_NAME(RowwiseMomentsCUDAKernel), m->effective_batch_size, kCUDABlockReduceNumThreads, 0, @@ -140,33 +144,47 @@ void LayerNorm::forward_kernel(LayerNormMeta const *m, m->effective_num_elements, m->eps, in_ptr, - m->mean_ptr, - m->rstd_ptr); - hipLaunchKernelGGL(HIP_KERNEL_NAME(LayerNormForwardCUDAKernel), + static_cast(m->mean_ptr), + static_cast(m->rstd_ptr)); + hipLaunchKernelGGL(HIP_KERNEL_NAME(LayerNormForwardCUDAKernel), m->effective_batch_size, kCUDANumThreads, 0, stream, m->effective_num_elements, in_ptr, - m->mean_ptr, - m->rstd_ptr, + static_cast(m->mean_ptr), + static_cast(m->rstd_ptr), gamma_ptr, beta_ptr, out_ptr); } /*static*/ -template void LayerNorm::forward_kernel_wrapper(LayerNormMeta const *m, - T const *in_ptr, - T *out_ptr, - T *gamma_ptr, - T *beta_ptr) { + GenericTensorAccessorR const &input, + GenericTensorAccessorW &output, + GenericTensorAccessorR const &gamma, + GenericTensorAccessorR const &beta) { hipStream_t stream; checkCUDA(get_legion_stream(&stream)); - LayerNorm::forward_kernel( - m, in_ptr, out_ptr, gamma_ptr, beta_ptr, stream); + if (m->input_type[0] == DT_FLOAT) { + LayerNorm::forward_kernel(m, + input.get_float_ptr(), + output.get_float_ptr(), + gamma.get_float_ptr(), + beta.get_float_ptr(), + stream); + } else if (m->input_type[0] == DT_HALF) { + LayerNorm::forward_kernel(m, + input.get_half_ptr(), + output.get_half_ptr(), + gamma.get_half_ptr(), + beta.get_half_ptr(), + stream); + } else { + assert(false && "unsupport datatype in layernorm"); + } } template @@ -367,8 +385,8 @@ void LayerNorm::backward_kernel(LayerNormMeta const *m, output_grad_ptr, input_ptr, gamma_ptr, - m->ds_ptr, - m->db_ptr); + static_cast(m->ds_ptr), + static_cast(m->db_ptr)); const int64_t B = (M + kCUDANumThreads - 1) / kCUDANumThreads; 
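The rewritten `RowwiseMomentsCUDAKernel` accumulates each row's sum and sum of squares in `float` even when `T` is half precision, and only casts back to `T` when storing `mean` and `rstd`, which keeps the reduction accurate over long rows. A minimal host-side C++ sketch of the same arithmetic (the function name `rowwise_moments` is illustrative, not FlexFlow code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Per-row mean and reciprocal standard deviation with float accumulators,
// mirroring the kernel's numerics: accumulate in float, cast back to T.
template <typename T>
void rowwise_moments(int64_t rows, int64_t N, float eps,
                     T const *X, T *mean, T *rstd) {
  for (int64_t i = 0; i < rows; ++i) {
    float sum1 = 0.0f, sum2 = 0.0f; // stays float even when T is half
    for (int64_t j = 0; j < N; ++j) {
      float x = static_cast<float>(X[i * N + j]);
      sum1 += x;
      sum2 += x * x;
    }
    float const scale = 1.0f / static_cast<float>(N);
    float m = sum1 * scale;
    float var = std::max(sum2 * scale - m * m, 0.0f);
    mean[i] = static_cast<T>(m);
    rstd[i] = static_cast<T>(1.0f / std::sqrt(var + eps));
  }
}
```

The surrounding wrapper then dispatches the real device kernels on `DT_FLOAT` or `DT_HALF`, so the only place half precision appears in the statistics path is at the final store.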
hipLaunchKernelGGL(HIP_KERNEL_NAME(ComputeGradientFusedParamsCUDAKernel), B, @@ -377,12 +395,12 @@ void LayerNorm::backward_kernel(LayerNormMeta const *m, stream, M, N, - m->mean_ptr, - m->rstd_ptr, - m->ds_ptr, - m->db_ptr, - m->scale_ptr, - m->bias_ptr); + static_cast(m->mean_ptr), + static_cast(m->rstd_ptr), + static_cast(m->ds_ptr), + static_cast(m->db_ptr), + static_cast(m->scale_ptr), + static_cast(m->bias_ptr)); if (gamma_grad_ptr != NULL || beta_grad_ptr != NULL) { if (M < 512) { // For small batch size, do colwise reduce directly @@ -396,8 +414,8 @@ void LayerNorm::backward_kernel(LayerNormMeta const *m, N, output_grad_ptr, input_ptr, - m->mean_ptr, - m->rstd_ptr, + static_cast(m->mean_ptr), + static_cast(m->rstd_ptr), gamma_grad_ptr, beta_grad_ptr); } else { @@ -414,8 +432,8 @@ void LayerNorm::backward_kernel(LayerNormMeta const *m, N, output_grad_ptr, input_ptr, - m->mean_ptr, - m->rstd_ptr, + static_cast(m->mean_ptr), + static_cast(m->rstd_ptr), gamma_grad_ptr, beta_grad_ptr); } @@ -443,11 +461,6 @@ void LayerNorm::backward_kernel_wrapper(LayerNormMeta const *m, stream); } -template void LayerNorm::forward_kernel_wrapper(LayerNormMeta const *m, - float const *in_ptr, - float *out_ptr, - float *gamma_ptr, - float *beta_ptr); template void LayerNorm::backward_kernel_wrapper(LayerNormMeta const *m, float const *output_grad_ptr, diff --git a/src/ops/layer_norm.cu b/src/ops/layer_norm.cu index ac477ba2ad..f594f8f7a8 100644 --- a/src/ops/layer_norm.cu +++ b/src/ops/layer_norm.cu @@ -13,6 +13,7 @@ * limitations under the License. */ +#include "flexflow/ffconst_utils.h" #include "flexflow/ops/layer_norm.h" #include "flexflow/utils/cuda_helper.h" @@ -23,19 +24,36 @@ constexpr int kCUDABlockReduceNumThreads = 512; constexpr int kCUDANumThreads = 256; constexpr int kColwiseReduceTileSize = 32; -LayerNormMeta::LayerNormMeta(FFHandler handle, LayerNorm const *ln) +LayerNormMeta::LayerNormMeta(FFHandler handle, + LayerNorm const *ln, + MemoryAllocator &gpu_mem_allocator) : OpMeta(handle) { elementwise_affine = ln->elementwise_affine; effective_batch_size = ln->effective_batch_size; effective_num_elements = ln->effective_num_elements; profiling = ln->profiling; eps = ln->eps; - checkCUDA(cudaMalloc(&mean_ptr, sizeof(float) * effective_batch_size)); - checkCUDA(cudaMalloc(&rstd_ptr, sizeof(float) * effective_batch_size)); - checkCUDA(cudaMalloc(&ds_ptr, sizeof(float) * effective_batch_size)); - checkCUDA(cudaMalloc(&db_ptr, sizeof(float) * effective_batch_size)); - checkCUDA(cudaMalloc(&scale_ptr, sizeof(float) * effective_batch_size)); - checkCUDA(cudaMalloc(&bias_ptr, sizeof(float) * effective_batch_size)); + DataType data_type = ln->data_type; + size_t totalSize = effective_batch_size * data_type_size(data_type) * 6; + gpu_mem_allocator.create_legion_instance(reserveInst, totalSize); + mean_ptr = gpu_mem_allocator.allocate_instance_untyped( + data_type_size(data_type) * effective_batch_size); + rstd_ptr = gpu_mem_allocator.allocate_instance_untyped( + data_type_size(data_type) * effective_batch_size); + ds_ptr = gpu_mem_allocator.allocate_instance_untyped( + data_type_size(data_type) * effective_batch_size); + db_ptr = gpu_mem_allocator.allocate_instance_untyped( + data_type_size(data_type) * effective_batch_size); + scale_ptr = gpu_mem_allocator.allocate_instance_untyped( + data_type_size(data_type) * effective_batch_size); + bias_ptr = gpu_mem_allocator.allocate_instance_untyped( + data_type_size(data_type) * effective_batch_size); +} + +LayerNormMeta::~LayerNormMeta(void) { + if 
(reserveInst != Realm::RegionInstance::NO_INST) { + reserveInst.destroy(); + } } template @@ -77,26 +95,26 @@ __inline__ __device__ T BlockReduceSum(T val, T *shared) { } template -__global__ void - RowwiseMomentsCUDAKernel(int64_t N, T eps, T const *X, T *mean, T *rstd) { - __shared__ T m_shared[C10_WARP_SIZE]; - __shared__ T v_shared[C10_WARP_SIZE]; +__global__ void RowwiseMomentsCUDAKernel( + int64_t N, float eps, T const *X, T *mean, T *rstd) { + __shared__ float m_shared[C10_WARP_SIZE]; + __shared__ float v_shared[C10_WARP_SIZE]; const int64_t i = blockIdx.x; - T sum1 = 0; - T sum2 = 0; + float sum1 = 0.0f; + float sum2 = 0.0f; for (int64_t j = threadIdx.x; j < N; j += blockDim.x) { const int64_t index = i * N + j; - sum1 += static_cast(X[index]); - sum2 += static_cast(X[index]) * static_cast(X[index]); + sum1 += static_cast(X[index]); + sum2 += static_cast(X[index]) * static_cast(X[index]); } - sum1 = BlockReduceSum(sum1, m_shared); - sum2 = BlockReduceSum(sum2, v_shared); + sum1 = BlockReduceSum(sum1, m_shared); + sum2 = BlockReduceSum(sum2, v_shared); if (threadIdx.x == 0) { - const T scale = T(1) / static_cast(N); + float const scale = float(1) / static_cast(N); sum1 *= scale; - sum2 = max(sum2 * scale - sum1 * sum1, T(0)); - mean[i] = sum1; - rstd[i] = rsqrt(sum2 + static_cast(eps)); + sum2 = max(sum2 * scale - sum1 * sum1, float(0)); + mean[i] = static_cast(sum1); + rstd[i] = static_cast(rsqrt(sum2 + eps)); } } @@ -127,30 +145,33 @@ template void LayerNorm::forward_kernel(LayerNormMeta const *m, T const *in_ptr, T *out_ptr, - T *gamma_ptr, - T *beta_ptr, + T const *gamma_ptr, + T const *beta_ptr, cudaStream_t stream) { - RowwiseMomentsCUDAKernel + RowwiseMomentsCUDAKernel <<effective_batch_size, kCUDABlockReduceNumThreads, 0, stream>>>( - m->effective_num_elements, m->eps, in_ptr, m->mean_ptr, m->rstd_ptr); - LayerNormForwardCUDAKernel + m->effective_num_elements, + m->eps, + in_ptr, + static_cast(m->mean_ptr), + static_cast(m->rstd_ptr)); + LayerNormForwardCUDAKernel <<effective_batch_size, kCUDANumThreads, 0, stream>>>( m->effective_num_elements, in_ptr, - m->mean_ptr, - m->rstd_ptr, + static_cast(m->mean_ptr), + static_cast(m->rstd_ptr), gamma_ptr, beta_ptr, out_ptr); } /*static*/ -template void LayerNorm::forward_kernel_wrapper(LayerNormMeta const *m, - T const *in_ptr, - T *out_ptr, - T *gamma_ptr, - T *beta_ptr) { + GenericTensorAccessorR const &input, + GenericTensorAccessorW &output, + GenericTensorAccessorR const &gamma, + GenericTensorAccessorR const &beta) { cudaStream_t stream; checkCUDA(get_legion_stream(&stream)); @@ -160,8 +181,24 @@ void LayerNorm::forward_kernel_wrapper(LayerNormMeta const *m, cudaEventCreate(&t_end); cudaEventRecord(t_start, stream); } - LayerNorm::forward_kernel( - m, in_ptr, out_ptr, gamma_ptr, beta_ptr, stream); + if (m->input_type[0] == DT_FLOAT) { + LayerNorm::forward_kernel(m, + input.get_float_ptr(), + output.get_float_ptr(), + gamma.get_float_ptr(), + beta.get_float_ptr(), + stream); + } else if (m->input_type[0] == DT_HALF) { + LayerNorm::forward_kernel(m, + input.get_half_ptr(), + output.get_half_ptr(), + gamma.get_half_ptr(), + beta.get_half_ptr(), + stream); + } else { + assert(false && "unsupport datatype in layernorm"); + } + if (m->profiling) { cudaEventRecord(t_end, stream); checkCUDA(cudaEventSynchronize(t_end)); @@ -170,8 +207,8 @@ void LayerNorm::forward_kernel_wrapper(LayerNormMeta const *m, cudaEventDestroy(t_start); cudaEventDestroy(t_end); printf("[LayerNorm] forward time (CF) = %.2fms\n", elapsed); - print_tensor(in_ptr, 
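The CUDA and HIP `LayerNormMeta` constructors now reserve a single Legion instance of `effective_batch_size * data_type_size(data_type) * 6` bytes and carve the six statistics buffers out of it, instead of issuing six separate `cudaMalloc` calls; the destructor releases everything by destroying that one instance. The pattern is essentially a bump allocator over one reservation. A framework-free sketch under that assumption (`ReservedArena` is an illustrative type, not FlexFlow's `MemoryAllocator` API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal bump allocator: one up-front reservation, sub-buffers handed out
// sequentially, everything released when the arena is destroyed.
class ReservedArena {
public:
  explicit ReservedArena(size_t total_bytes)
      : storage_(total_bytes), offset_(0) {}

  void *allocate(size_t bytes) {
    assert(offset_ + bytes <= storage_.size() && "arena over-subscribed");
    void *ptr = storage_.data() + offset_;
    offset_ += bytes;
    return ptr;
  }

private:
  std::vector<std::byte> storage_; // stands in for the reserved GPU instance
  size_t offset_;
};

int main() {
  size_t const batch = 512, elem = sizeof(float);
  ReservedArena arena(batch * elem * 6); // six per-row statistic buffers
  void *mean  = arena.allocate(batch * elem);
  void *rstd  = arena.allocate(batch * elem);
  void *ds    = arena.allocate(batch * elem);
  void *db    = arena.allocate(batch * elem);
  void *scale = arena.allocate(batch * elem);
  void *bias  = arena.allocate(batch * elem);
  (void)mean; (void)rstd; (void)ds; (void)db; (void)scale; (void)bias;
}
```

One reservation keeps the buffers contiguous and makes teardown a single instance destroy rather than six frees.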
32, "[LayerNorm:forward:input]"); - print_tensor(out_ptr, 32, "[LayerNorm:forward:output]"); + // print_tensor(in_ptr, 32, "[LayerNorm:forward:input]"); + // print_tensor(out_ptr, 32, "[LayerNorm:forward:output]"); } } @@ -366,17 +403,22 @@ void LayerNorm::backward_kernel(LayerNormMeta const *m, const int64_t N = m->effective_num_elements; ComputeInternalGradientsCUDAKernel <<>>( - N, output_grad_ptr, input_ptr, gamma_ptr, m->ds_ptr, m->db_ptr); + N, + output_grad_ptr, + input_ptr, + gamma_ptr, + static_cast(m->ds_ptr), + static_cast(m->db_ptr)); const int64_t B = (M + kCUDANumThreads - 1) / kCUDANumThreads; ComputeGradientFusedParamsCUDAKernel <<>>(M, N, - m->mean_ptr, - m->rstd_ptr, - m->ds_ptr, - m->db_ptr, - m->scale_ptr, - m->bias_ptr); + static_cast(m->mean_ptr), + static_cast(m->rstd_ptr), + static_cast(m->ds_ptr), + static_cast(m->db_ptr), + static_cast(m->scale_ptr), + static_cast(m->bias_ptr)); if (gamma_grad_ptr != NULL || beta_grad_ptr != NULL) { if (M < 512) { // For small batch size, do colwise reduce directly @@ -386,8 +428,8 @@ void LayerNorm::backward_kernel(LayerNormMeta const *m, N, output_grad_ptr, input_ptr, - m->mean_ptr, - m->rstd_ptr, + static_cast(m->mean_ptr), + static_cast(m->rstd_ptr), gamma_grad_ptr, beta_grad_ptr); } else { @@ -396,14 +438,15 @@ void LayerNorm::backward_kernel(LayerNormMeta const *m, constexpr int kThreadX = kColwiseReduceTileSize; constexpr int kThreadY = kColwiseReduceTileSize / 2; GammaBetaBackwardCUDAKernel - <<>>(M, - N, - output_grad_ptr, - input_ptr, - m->mean_ptr, - m->rstd_ptr, - gamma_grad_ptr, - beta_grad_ptr); + <<>>( + M, + N, + output_grad_ptr, + input_ptr, + static_cast(m->mean_ptr), + static_cast(m->rstd_ptr), + gamma_grad_ptr, + beta_grad_ptr); } } } @@ -419,21 +462,28 @@ void LayerNorm::backward_kernel_wrapper(LayerNormMeta const *m, T *beta_grad_ptr) { cudaStream_t stream; checkCUDA(get_legion_stream(&stream)); - LayerNorm::backward_kernel(m, - output_grad_ptr, - input_ptr, - input_grad_ptr, - gamma_ptr, - gamma_grad_ptr, - beta_grad_ptr, - stream); + if (m->output_type[0] == DT_FLOAT) { + LayerNorm::backward_kernel(m, + output_grad_ptr, + input_ptr, + input_grad_ptr, + gamma_ptr, + gamma_grad_ptr, + beta_grad_ptr, + stream); + } + // }else if(m->output_type[0] == DT_HALF){ + // LayerNorm::backward_kernel(m, + // output_grad_ptr, + // input_ptr, + // input_grad_ptr, + // gamma_ptr, + // gamma_grad_ptr, + // beta_grad_ptr, + // stream); + // } } -template void LayerNorm::forward_kernel_wrapper(LayerNormMeta const *m, - float const *in_ptr, - float *out_ptr, - float *gamma_ptr, - float *beta_ptr); template void LayerNorm::backward_kernel_wrapper(LayerNormMeta const *m, float const *output_grad_ptr, diff --git a/src/ops/linear.cc b/src/ops/linear.cc index 4c9bb8dc51..8eb3db2869 100644 --- a/src/ops/linear.cc +++ b/src/ops/linear.cc @@ -1,4 +1,5 @@ #include "flexflow/ops/linear.h" +#include "flexflow/ffconst_utils.h" #include "flexflow/layer.h" #include "flexflow/model.h" #include "flexflow/ops/kernels/linear_kernels.h" @@ -12,9 +13,12 @@ using Legion::ArgumentMap; using Legion::Context; using Legion::coord_t; using Legion::Domain; +using Legion::Future; using Legion::FutureMap; using Legion::IndexLauncher; using Legion::InlineLauncher; +using Legion::Machine; +using Legion::Memory; using Legion::PhysicalRegion; using Legion::Predicate; using Legion::Rect; @@ -40,14 +44,33 @@ Tensor FFModel::dense(const Tensor input, RegularizerMode kernel_reg_type, float kernel_reg_lambda, char const *name) { - Layer *li = new Layer(this, - 
OP_LINEAR, - data_type, - name, - 1 /*inputs*/, - use_bias ? 2 : 1 /*weights*/, - 1 /*outputs*/, - input); + if (data_type == DT_NONE) { + data_type = input->data_type; + } + DataType quantization_type = cpu_offload ? config.quantization_type : DT_NONE; + bool offload = cpu_offload; + Layer *li = nullptr; + if (data_type != input->data_type) { + Tensor casted_input = cast(input, data_type, "type cast for dense"); + li = new Layer(this, + OP_LINEAR, + data_type, + name, + 1 /*inputs*/, + use_bias ? 2 : 1 /*weights*/, + 1 /*outputs*/, + casted_input); + } else { + li = new Layer(this, + OP_LINEAR, + data_type, + name, + 1 /*inputs*/, + use_bias ? 2 : 1 /*weights*/, + 1 /*outputs*/, + input); + } + { int numdims = input->num_dims; int dims[MAX_TENSOR_DIM]; @@ -60,14 +83,18 @@ Tensor FFModel::dense(const Tensor input, } { int dims[2] = {input->dims[0], outDim}; - li->weights[KERNEL_IDX] = - create_weight_legion_ordering(2, - dims, - data_type, - li, - true /*create_grad*/, - kernel_initializer, - CHOSEN_SYNC_TYPE); + if (quantization_type != DT_NONE) { + dims[0] = + get_quantization_to_byte_size(data_type, quantization_type, dims[0]); + } + li->weights[KERNEL_IDX] = create_weight_legion_ordering( + 2, + dims, + quantization_type == DT_NONE ? data_type : quantization_type, + li, + true /*create_grad*/, + kernel_initializer, + CHOSEN_SYNC_TYPE); } if (use_bias) { int dims[1] = {outDim}; @@ -84,6 +111,8 @@ Tensor FFModel::dense(const Tensor input, li->add_int_property("activation", activation); li->add_int_property("kernel_reg_type", kernel_reg_type); li->add_float_property("kernel_reg_lambda", kernel_reg_lambda); + li->add_int_property("quantization_type", quantization_type); + li->add_int_property("offload", offload); layers.push_back(li); return li->outputs[0]; } @@ -103,6 +132,10 @@ Op *Linear::create_operator_from_layer( RegularizerMode kernel_reg_type = (RegularizerMode)value; float kernel_reg_lambda; layer->get_float_property("kernel_reg_lambda", kernel_reg_lambda); + layer->get_int_property("quantization_type", value); + DataType quantization_type = (DataType)value; + layer->get_int_property("offload", value); + bool offload = (bool)value; return new Linear(model, layer->layer_guid, inputs[0], @@ -112,6 +145,8 @@ Op *Linear::create_operator_from_layer( kernel_reg_lambda, use_bias, layer->data_type, + quantization_type, + offload, false /*allocate_weights*/, layer->name); } @@ -133,6 +168,8 @@ Linear::Linear(FFModel &model, other.kernel_reg_lambda, other.use_bias, other.data_type, + other.quantization_type, + other.offload, allocate_weights, other.name) {} @@ -150,6 +187,8 @@ Linear::Linear(FFModel &model, params.kernel_reg_lambda, params.use_bias, params.data_type, + params.quantization_type, + params.offload, allocate_weights, name) {} @@ -162,6 +201,8 @@ Linear::Linear(FFModel &model, float _kernel_reg_lambda, bool _use_bias, DataType _data_type, + DataType _quantization_type, + bool _offload, bool allocate_weights, char const *name) : Op(model, @@ -175,6 +216,7 @@ Linear::Linear(FFModel &model, _input), out_channels(out_dim), activation(_activation), use_bias(_use_bias), kernel_reg_type(_kernel_reg_type), kernel_reg_lambda(_kernel_reg_lambda), + quantization_type(_quantization_type), offload(_offload), replica(ParallelTensorBase::NO_TENSOR) { // overwrite layer_guid layer_guid = _layer_guid; @@ -189,18 +231,37 @@ Linear::Linear(FFModel &model, LinearParams params = this->get_params(); params.construct_mappings(*this->parallel_dims_mapping, input_shape); params.solve_dims(input_shape, 
output_shape, kernel_shape, bias_shape); + kernel_shape.dims[0].size = this->in_channels; + bias_shape.dims[0].degree = _input->dims[_input->num_dims - 1].degree; + bias_shape.dims[0].parallel_idx = + _input->dims[_input->num_dims - 1].parallel_idx; + bias_shape.dims[1].size = bias_shape.dims[1].degree = 1; + bias_shape.dims[1].parallel_idx = -1; + bias_shape.dims[bias_shape.num_dims - 1].size = + bias_shape.dims[bias_shape.num_dims - 1].degree = 1; + for (int i = 0; i < input_shape.num_dims - 1; i++) { + if (_input->dims[i].degree > 1) { + bias_shape.dims[bias_shape.num_dims - 1].size *= _input->dims[i].degree; + bias_shape.dims[bias_shape.num_dims - 1].degree *= _input->dims[i].degree; + bias_shape.dims[bias_shape.num_dims - 1].parallel_idx = + _input->dims[i].parallel_idx; + } + } if (allocate_weights) { Initializer *kernel_initializer = new GlorotUniform(std::rand() /*seed*/); - - weights[KERNEL_IDX] = - model.create_parallel_weight_legion_ordering(kernel_shape.num_dims, - kernel_shape.dims, - _data_type, - NULL /*owner_op*/, - true /*create_grad*/, - kernel_initializer, - CHOSEN_SYNC_TYPE); + if (quantization_type != DT_NONE) { + kernel_shape.dims[0].size = get_quantization_to_byte_size( + data_type, quantization_type, kernel_shape.dims[0].size); + } + weights[KERNEL_IDX] = model.create_parallel_weight_legion_ordering( + kernel_shape.num_dims, + kernel_shape.dims, + quantization_type == DT_NONE ? _data_type : quantization_type, + NULL /*owner_op*/, + true /*create_grad*/, + kernel_initializer, + CHOSEN_SYNC_TYPE); if (use_bias) { Initializer *bias_initializer = new ZeroInitializer(); @@ -213,6 +274,7 @@ Linear::Linear(FFModel &model, true /*create_grad*/, bias_initializer, CHOSEN_SYNC_TYPE); + add_bias_only_once = _input->dims[0].degree > 1; } } @@ -220,7 +282,7 @@ Linear::Linear(FFModel &model, outputs[0] = model.create_parallel_tensor_legion_ordering( output_shape.num_dims, output_shape.dims, _data_type, this); - assert(check_output_input_weight_parallel_dims(allocate_weights)); + // assert(check_output_input_weight_parallel_dims(allocate_weights)); } void Linear::init(FFModel const &ff) { @@ -243,18 +305,24 @@ void Linear::init(FFModel const &ff) { // RegionRequirement(input_lps[0], 0/*projection id*/, // READ_ONLY, EXCLUSIVE, inputs[0]->region)); // launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(inputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + inputs[0]->region)); + launcher.add_field(0, FID_DATA); launcher.add_region_requirement(RegionRequirement(outputs[0]->part, 0 /*projection id*/, WRITE_ONLY, EXCLUSIVE, outputs[0]->region)); - launcher.add_field(0, FID_DATA); + launcher.add_field(1, FID_DATA); launcher.add_region_requirement(RegionRequirement(weights[0]->part, 0 /*projection id*/, READ_ONLY, EXCLUSIVE, weights[0]->region)); - launcher.add_field(1, FID_DATA); + launcher.add_field(2, FID_DATA); // launcher.add_region_requirement( // RegionRequirement(weights[1]->part, 0/*projection id*/, // READ_ONLY, EXCLUSIVE, weights[1]->region)); @@ -271,6 +339,67 @@ void Linear::init(FFModel const &ff) { set_opmeta_from_futuremap(ff, fm); } +void Linear::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + // assert(check_output_input_weight_same_machine_view()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = 
ff.config.lg_hlr; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(LINEAR_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(Linear)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + // launcher.add_region_requirement( + // RegionRequirement(input_lps[0], 0/*projection id*/, + // READ_ONLY, EXCLUSIVE, inputs[0]->region)); + // launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement( + RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region, + ff.cpu_offload ? MAP_TO_ZC_MEMORY : 0)); + launcher.add_field(2, FID_DATA); + // launcher.add_region_requirement( + // RegionRequirement(weights[1]->part, 0/*projection id*/, + // READ_ONLY, EXCLUSIVE, weights[1]->region)); + // launcher.add_field(3, FID_DATA); + if (ff.config.computationMode == COMP_MODE_TRAINING) { + // Add inputs[0].region_grad to avoid Legion warning + // launcher.add_region_requirement( + // RegionRequirement(input_grad_lps[0], 0/*projection id*/, + // WRITE_ONLY, EXCLUSIVE, inputs[0].region_grad)); + // launcher.add_field(2, FID_DATA); + } + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + /* regions[0](O): output regions[1](I): kernel @@ -280,12 +409,37 @@ OpMeta *Linear::init_task(Task const *task, std::vector const ®ions, Context ctx, Runtime *runtime) { - Domain out_domain = runtime->get_index_space_domain( - ctx, task->regions[0].region.get_index_space()); - switch (out_domain.get_dim()) { + Linear const *linear = (Linear *)task->args; + FFHandler handle = *((FFHandler const *)task->local_args); + GenericTensorAccessorW output = + helperGetGenericTensorAccessorWO(linear->inputs[0]->data_type, + regions[0], + task->regions[0], + FID_DATA, + ctx, + runtime); + switch (output.domain.get_dim()) { #define DIMFUNC(DIM) \ case DIM: \ - return init_task_with_dim(task, regions, ctx, runtime); + if (output.data_type == DT_HALF) { \ + if (linear->quantization_type != DT_NONE) { \ + return init_task_with_dim( \ + task, regions, ctx, runtime); \ + } else { \ + return init_task_with_dim( \ + task, regions, ctx, runtime); \ + } \ + } else if (output.data_type == DT_FLOAT) { \ + if (linear->quantization_type != DT_NONE) { \ + return init_task_with_dim( \ + task, regions, ctx, runtime); \ + } else { \ + return init_task_with_dim( \ + task, regions, ctx, runtime); \ + } \ + } else { \ + assert(false && "Unsupported data type"); \ + } LEGION_FOREACH_N(DIMFUNC) #undef DIMFUNC default: @@ -294,7 +448,7 @@ OpMeta *Linear::init_task(Task const *task, return NULL; } -template +template OpMeta *Linear::init_task_with_dim(Task const *task, std::vector const ®ions, Context ctx, @@ -305,38 +459,55 @@ OpMeta *Linear::init_task_with_dim(Task const *task, FFHandler handle = *((FFHandler const *)task->local_args); // TensorAccessorR acc_input( // regions[0], task->regions[0], FID_DATA, ctx, runtime); - 
TensorAccessorW acc_output(regions[0], - task->regions[0], - FID_DATA, - ctx, - runtime, - false /*readOutput*/); - TensorAccessorW acc_kernel(regions[1], - task->regions[1], - FID_DATA, - ctx, - runtime, - false /*readOutput*/); + TensorAccessorR acc_input( + regions[0], task->regions[0], FID_DATA, ctx, runtime); + TensorAccessorW acc_output(regions[1], + task->regions[1], + FID_DATA, + ctx, + runtime, + false /*readOutput*/); + TensorAccessorW acc_kernel(regions[2], + task->regions[2], + FID_DATA, + ctx, + runtime, + false /*readOutput*/); + // TensorAccessorR acc_bias( // regions[3], task->regions[3], FID_DATA, ctx, runtime); - // int in_dim = acc_input.rect.hi[0] - acc_input.rect.lo[0] + 1; - int in_dim = acc_kernel.rect.hi[0] - acc_kernel.rect.lo[0] + 1; + int in_dim = acc_input.rect.hi[0] - acc_input.rect.lo[0] + 1; + // int in_dim = acc_kernel.rect.hi[0] - acc_kernel.rect.lo[0] + 1; int out_dim = acc_output.rect.hi[0] - acc_output.rect.lo[0] + 1; int batch_size = acc_output.rect.volume() / out_dim; - printf("init linear (input): in_dim(%d) out_dim(%d) batch_size(%d)\n", - in_dim, - out_dim, - batch_size); - LinearMeta *m = new LinearMeta(handle, batch_size); + // printf("init linear (input): in_dim(%d) out_dim(%d) batch_size(%d)\n", + // in_dim, + // out_dim, + // batch_size); + Memory gpu_mem = Machine::MemoryQuery(Machine::get_machine()) + .only_kind(Memory::GPU_FB_MEM) + .best_affinity_to(task->target_proc) + .first(); + MemoryAllocator gpu_mem_allocator(gpu_mem); + if (linear->offload) { + // cpu-offload enabled + // use offload_reserved_space + gpu_mem_allocator.register_reserved_work_space( + handle.offload_reserve_space, handle.offload_reserve_space_size); + } + + LinearMeta *m = new LinearMeta( + handle, batch_size, linear, gpu_mem_allocator, in_dim * out_dim); m->activation = linear->activation; m->kernel_reg_type = linear->kernel_reg_type; m->kernel_reg_lambda = linear->kernel_reg_lambda; m->use_bias = linear->use_bias; + m->add_bias_only_once = linear->add_bias_only_once; m->profiling = linear->profiling; m->trainableInputs[0] = linear->trainableInputs[0]; - m->input_type = linear->inputs[0]->data_type; - m->weight_type = linear->weights[0]->data_type; - m->output_type = linear->outputs[0]->data_type; + m->weight_ptr_type = m->input_type[0]; + m->quantization_type = linear->quantization_type; + m->offload = linear->offload; std::strcpy(m->op_name, linear->name); init_kernel(m, batch_size, out_dim); @@ -386,16 +557,139 @@ void Linear::forward(FFModel const &ff) { runtime->execute_index_space(ctx, launcher); } +FutureMap Linear::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + /* std::cout << "Linear op machine_view: " << *(MachineView const *)mv + << std::endl; */ + IndexLauncher launcher(LINEAR_INF_TASK_ID, + parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_future(bc); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement( + RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region, + ff.cpu_offload ? MAP_TO_ZC_MEMORY : 0)); + launcher.add_field(2, FID_DATA); + if (use_bias) { + launcher.add_region_requirement(RegionRequirement(weights[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[1]->region)); + launcher.add_field(3, FID_DATA); + } + return runtime->execute_index_space(ctx, launcher); +} + +void Linear::inference_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + Domain input_domain = runtime->get_index_space_domain( + ctx, task->regions[0].region.get_index_space()); + LinearMeta const *m = *((LinearMeta **)task->local_args); + BatchConfig const *bc = BatchConfig::from_future(task->futures[0]); + assert(regions.size() == (3 + static_cast(m->use_bias))); + assert(task->regions.size() == (3 + static_cast(m->use_bias))); + if (m->quantization_type == DT_NONE) { + assert(m->input_type[0] == m->weight_type[0]); + } + assert(m->input_type[0] == m->output_type[0]); + + GenericTensorAccessorR input = helperGetGenericTensorAccessorRO( + m->input_type[0], regions[0], task->regions[0], FID_DATA, ctx, runtime); + GenericTensorAccessorW output = helperGetGenericTensorAccessorWO( + m->output_type[0], regions[1], task->regions[1], FID_DATA, ctx, runtime); + GenericTensorAccessorR weight = helperGetGenericTensorAccessorRO( + m->weight_type[0], regions[2], task->regions[2], FID_DATA, ctx, runtime); + int in_dim = input.domain.hi()[0] - input.domain.lo()[0] + 1; + int out_dim = output.domain.hi()[0] - output.domain.lo()[0] + 1; + + int batch_size = bc->num_active_tokens(); + GenericTensorAccessorR bias; + if (m->use_bias && + !(m->add_bias_only_once && task->index_point.point_data[0] != 0)) { + bias = helperGetGenericTensorAccessorRO(m->weight_type[1], + regions[3], + task->regions[3], + FID_DATA, + ctx, + runtime); + assert(bias.domain.get_volume() == static_cast(out_dim)); + } + forward_kernel_wrapper(m, + input.ptr, + output.ptr, + weight.ptr, + bias.ptr, + in_dim, + out_dim, + batch_size); +} + void Linear::forward_task(Task const *task, std::vector const ®ions, Context ctx, Runtime *runtime) { - Domain in_domain = runtime->get_index_space_domain( + Domain input_domain = runtime->get_index_space_domain( ctx, task->regions[0].region.get_index_space()); - switch (in_domain.get_dim()) { + LinearMeta const *m = *((LinearMeta **)task->local_args); + if (m->quantization_type == DT_NONE) { + assert(m->input_type[0] == m->weight_type[0]); + } + assert(m->input_type[0] == m->output_type[0]); + switch (input_domain.get_dim()) { #define DIMFUNC(DIM) \ case DIM: \ - return 
forward_task_with_dim(task, regions, ctx, runtime); + if (m->output_type[0] == DT_HALF) { \ + if (m->quantization_type != DT_NONE) { \ + return forward_task_with_dim( \ + task, regions, ctx, runtime); \ + } else { \ + return forward_task_with_dim( \ + task, regions, ctx, runtime); \ + } \ + } else if (m->output_type[0] == DT_FLOAT) { \ + if (m->quantization_type != DT_NONE) { \ + return forward_task_with_dim( \ + task, regions, ctx, runtime); \ + } else { \ + return forward_task_with_dim( \ + task, regions, ctx, runtime); \ + } \ + } else { \ + assert(false && "Unsupported data type"); \ + } LEGION_FOREACH_N(DIMFUNC) #undef DIMFUNC default: @@ -409,7 +703,7 @@ void Linear::forward_task(Task const *task, regions[2](I): kernel regions[3](I): bias */ -template +template void Linear::forward_task_with_dim(Task const *task, std::vector const ®ions, Context ctx, @@ -419,25 +713,26 @@ void Linear::forward_task_with_dim(Task const *task, assert(regions.size() == (3 + static_cast(m->use_bias))); assert(task->regions.size() == (3 + static_cast(m->use_bias))); - TensorAccessorR acc_input( + TensorAccessorR acc_input( regions[0], task->regions[0], FID_DATA, ctx, runtime); - TensorAccessorW acc_output(regions[1], - task->regions[1], - FID_DATA, - ctx, - runtime, - false /*readOutput*/); - TensorAccessorR acc_kernel( + TensorAccessorW acc_output(regions[1], + task->regions[1], + FID_DATA, + ctx, + runtime, + false /*readOutput*/); + TensorAccessorR acc_kernel( regions[2], task->regions[2], FID_DATA, ctx, runtime); int in_dim = acc_input.rect.hi[0] - acc_input.rect.lo[0] + 1; int out_dim = acc_output.rect.hi[0] - acc_output.rect.lo[0] + 1; int batch_size = acc_output.rect.volume() / out_dim; assert(acc_output.rect.volume() == static_cast(out_dim * batch_size)); assert(acc_input.rect.volume() == static_cast(in_dim * batch_size)); - assert(acc_kernel.rect.volume() == static_cast(in_dim * out_dim)); - float const *acc_bias_ptr = NULL; - if (m->use_bias) { - TensorAccessorR acc_bias( + // assert(acc_kernel.rect.volume() == static_cast(in_dim * out_dim)); + DT const *acc_bias_ptr = nullptr; + if (m->use_bias && + !(m->add_bias_only_once && task->index_point.point_data[0] != 0)) { + TensorAccessorR acc_bias( regions[3], task->regions[3], FID_DATA, ctx, runtime); assert(acc_bias.rect.volume() == static_cast(out_dim)); acc_bias_ptr = acc_bias.ptr; @@ -535,10 +830,21 @@ void Linear::backward_task(Task const *task, Runtime *runtime) { Domain in_domain = runtime->get_index_space_domain( ctx, task->regions[0].region.get_index_space()); + LinearMeta const *m = *((LinearMeta **)task->local_args); + if (m->quantization_type == DT_NONE) { + assert(m->input_type[0] == m->weight_type[0]); + } + assert(m->input_type[0] == m->output_type[0]); switch (in_domain.get_dim()) { #define DIMFUNC(DIM) \ case DIM: \ - return backward_task_with_dim(task, regions, ctx, runtime); + if (m->output_type[0] == DT_HALF) { \ + return backward_task_with_dim(task, regions, ctx, runtime); \ + } else if (m->output_type[0] == DT_FLOAT) { \ + return backward_task_with_dim(task, regions, ctx, runtime); \ + } else { \ + assert(false && "Unsupported data type"); \ + } LEGION_FOREACH_N(DIMFUNC) #undef DIMFUNC default: @@ -555,7 +861,7 @@ void Linear::backward_task(Task const *task, regions[5](I/O): filter_grad regions[6](I/O): bias_grad */ -template +template void Linear::backward_task_with_dim(Task const *task, std::vector const ®ions, Context ctx, @@ -567,9 +873,9 @@ void Linear::backward_task_with_dim(Task const *task, assert(task->regions.size() 
== (5 + static_cast(m->trainableInputs[0]) + static_cast(m->use_bias))); - float *input_grad = NULL; + DT *input_grad = nullptr; size_t rid = 0; - TensorAccessorR acc_input( + TensorAccessorR acc_input( regions[rid], task->regions[rid], FID_DATA, ctx, runtime); rid++; if (m->trainableInputs[0]) { @@ -577,39 +883,39 @@ void Linear::backward_task_with_dim(Task const *task, ctx, task->regions[rid].region.get_index_space()); if (domain.get_dim() == NDIM + 1) { assert(domain.get_volume() == acc_input.rect.volume()); - input_grad = helperGetTensorPointerWO( + input_grad = helperGetTensorPointerWO
( regions[rid], task->regions[rid], FID_DATA, ctx, runtime); } else { - TensorAccessorW acc_replica_grad(regions[rid], - task->regions[rid], - FID_DATA, - ctx, - runtime, - true /*readOutput*/); + TensorAccessorW acc_replica_grad(regions[rid], + task->regions[rid], + FID_DATA, + ctx, + runtime, + true /*readOutput*/); assert(acc_replica_grad.rect.volume() == acc_input.rect.volume()); input_grad = acc_replica_grad.ptr; } rid++; } - TensorAccessorR acc_output( + TensorAccessorR acc_output( regions[rid], task->regions[rid], FID_DATA, ctx, runtime); rid++; - TensorAccessorW acc_output_grad(regions[rid], - task->regions[rid], - FID_DATA, - ctx, - runtime, - true /*readOutput*/); + TensorAccessorW acc_output_grad(regions[rid], + task->regions[rid], + FID_DATA, + ctx, + runtime, + true /*readOutput*/); rid++; - TensorAccessorR acc_kernel( + TensorAccessorR acc_kernel( regions[rid], task->regions[rid], FID_DATA, ctx, runtime); rid++; - TensorAccessorW acc_kernel_grad(regions[rid], - task->regions[rid], - FID_DATA, - ctx, - runtime, - true /*readOutput*/); + TensorAccessorW acc_kernel_grad(regions[rid], + task->regions[rid], + FID_DATA, + ctx, + runtime, + true /*readOutput*/); rid++; // make sure the sizes match int in_dim = acc_input.rect.hi[0] - acc_input.rect.lo[0] + 1; @@ -621,17 +927,17 @@ void Linear::backward_task_with_dim(Task const *task, assert(acc_kernel.rect.volume() == static_cast(in_dim * out_dim)); assert(acc_kernel_grad.rect.volume() == static_cast(in_dim * out_dim)); - float *acc_bias_grad_ptr = NULL; + DT *acc_bias_grad_ptr = nullptr; if (m->use_bias) { - TensorAccessorW acc_bias_grad(regions[rid], - task->regions[rid], - FID_DATA, - ctx, - runtime, - true /*readOutput*/); + TensorAccessorW acc_bias_grad(regions[rid], + task->regions[rid], + FID_DATA, + ctx, + runtime, + true /*readOutput*/); rid++; assert(acc_bias_grad.rect.volume() == static_cast(out_dim)); - acc_bias_grad_ptr = static_cast(acc_bias_grad.ptr); + acc_bias_grad_ptr = static_cast
(acc_bias_grad.ptr); } assert(rid == regions.size()); @@ -807,9 +1113,9 @@ bool Linear::measure_operator_cost(Simulator *sim, m->activation = activation; m->kernel_reg_type = kernel_reg_type; m->kernel_reg_lambda = kernel_reg_lambda; - m->input_type = inputs[0]->data_type; - m->weight_type = this->data_type; - m->output_type = outputs[0]->data_type; + m->input_type[0] = inputs[0]->data_type; + m->weight_type[0] = this->data_type; + m->output_type[0] = outputs[0]->data_type; assert(m->profiling == false); init_kernel(m, output_n, output_c); @@ -925,12 +1231,15 @@ bool operator==(LinearParams const &lhs, LinearParams const &rhs) { void Linear::serialize(Legion::Serializer &sez) const { sez.serialize(this->layer_guid.id); + sez.serialize(this->layer_guid.transformer_layer_id); sez.serialize(this->out_channels); sez.serialize(this->activation); sez.serialize(this->kernel_reg_type); sez.serialize(this->kernel_reg_lambda); sez.serialize(this->use_bias); sez.serialize(this->data_type); + sez.serialize(this->quantization_type); + sez.serialize(this->offload); } /* static */ @@ -946,15 +1255,20 @@ Node Linear::deserialize(FFModel &ff, float kernel_reg_lambda; bool use_bias; DataType data_type; - size_t id; + DataType quantization_type; + bool offload; + size_t id, transformer_layer_id; dez.deserialize(id); - LayerID layer_guid(id); + dez.deserialize(transformer_layer_id); + LayerID layer_guid(id, transformer_layer_id); dez.deserialize(out_channels); dez.deserialize(activation); dez.deserialize(kernel_reg_type); dez.deserialize(kernel_reg_lambda); dez.deserialize(use_bias); dez.deserialize(data_type); + dez.deserialize(quantization_type); + dez.deserialize(offload); LinearParams params; params.activation = activation; @@ -964,6 +1278,8 @@ Node Linear::deserialize(FFModel &ff, params.use_bias = use_bias; params.data_type = data_type; params.layer_guid = layer_guid; + params.quantization_type = quantization_type; + params.offload = offload; return ff.get_or_create_node(inputs[0], params); } @@ -976,6 +1292,8 @@ LinearParams Linear::get_params() const { params.activation = this->activation; params.kernel_reg_type = this->kernel_reg_type; params.kernel_reg_lambda = this->kernel_reg_lambda; + params.quantization_type = this->quantization_type; + params.offload = this->offload; return params; } @@ -999,6 +1317,11 @@ bool LinearParams::is_valid(ParallelTensorShape const &input_shape) const { return is_valid; } +/** @brief A wrapper around the main version of the solve_dims function. + * + * It takes a the input tensor as a parameter, instead of the input's + * ParallelTensorShape. + */ void LinearParams::solve_dims(const ParallelTensor input, ParallelDim output_dims[MAX_TENSOR_DIM], int *output_ndims, @@ -1015,6 +1338,13 @@ void LinearParams::solve_dims(const ParallelTensor input, bias_ndims); } +/** @brief A wrapper around the main version of the solve_dims function. + * + * For each of the output, weights, and bias tensors, it takes a + * ParallelTensorShape argument, instead of a pointer to an integer variable to + * record the number of dimensions, plus a ParallelDim array to record all the + * information regarding each dimension. 
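When quantization is enabled, the dense kernel weight is created with a byte-typed first dimension whose size comes from `get_quantization_to_byte_size`, and the forward/backward dispatch selects template instantiations based on both the activation data type and whether the weights are stored as quantized bytes. As a rough illustration only, and not FlexFlow's actual conversion rule, a 4-bit scheme packs two values per byte while an 8-bit scheme needs one byte per value (`quantized_byte_size` below is hypothetical and ignores group-wise scales):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

enum class QuantType { None, Int4, Int8 };

// Hypothetical byte-size helper: bytes needed to hold `count` packed values.
size_t quantized_byte_size(QuantType q, size_t count) {
  switch (q) {
    case QuantType::Int4:
      assert(count % 2 == 0 && "4-bit packing pairs two values per byte");
      return count / 2;
    case QuantType::Int8:
      return count;
    case QuantType::None:
    default:
      return count * sizeof(float); // unquantized fp32 weights
  }
}

int main() {
  size_t const in_dim = 4096;
  std::printf("fp32: %zu bytes per output column\n",
              quantized_byte_size(QuantType::None, in_dim));
  std::printf("int8: %zu bytes per output column\n",
              quantized_byte_size(QuantType::Int8, in_dim));
  std::printf("int4: %zu bytes per output column\n",
              quantized_byte_size(QuantType::Int4, in_dim));
}
```

Serializing `quantization_type` and `offload` alongside the other Linear parameters ensures a deserialized graph recreates weights with the same byte-sized layout.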
+ */ void LinearParams::solve_dims(ParallelTensorShape const &input_shape, ParallelTensorShape &output_shape, ParallelTensorShape &kernel_shape, @@ -1041,11 +1371,14 @@ void LinearParams::solve_dims(ParallelTensorShape const &input_shape, std::vector mapping; this->construct_mappings(mapping, input_shape); + // sets the is_replica_dim field to true for the dimensions that are used to + // record the number of replicas this->mark_replica_dims(input_shape, output_dims, kernel_dims, bias_dims); solve_parallel_dim_mappings( mapping, {input_shape.dims}, {kernel_dims, bias_dims}, {output_dims}); + // sets the dimension sizes of the output, weights, and bias tensors this->calculate_nonreplica_dim_sizes(input_shape, output_dims, output_ndims, @@ -1055,6 +1388,34 @@ void LinearParams::solve_dims(ParallelTensorShape const &input_shape, bias_ndims); } +/** @brief Create a map between each of a tensor's dimension name and its + * corresponding index + * + * The tensor dimension names are defined as follows. For the input tensor, the + * first dimension is called INPUT_CHANNEL, and generally corresponds to number + * of floats needed to store a single element from the input dataset. For + * example, when each element in the dataset is a flattened MNIST image, the + * INPUT_CHANNEL dimension will have a size of 28x28=784. The second to last and + * last dimensions in the input tensor are, respectively, the INPUT_SAMPLE and + * INPUT_REPLICA dimensions. The size of the INPUT_SAMPLE dimension generally + * corresponds to the batch size used for training. The size of the + * INPUT_REPLICA tells us how many replicas of the tensors have been created. + * The dimensions of the output tensor are named analogously: the first + * dimension is OUTPUT_CHANNEL, the second to last is OUTPUT_SAMPLE, and the + * last one is OUTPUT_REPLICA. Both the input and output tensor may have + * additional dimensions, without a name, between {INPUT,OUTPUT}_CHANNEL and + * {INPUT,OUTPUT}_SAMPLE. For instance, when the input data comes in textual + * form, it is common to have an additional dimension representing the sequence + * length. When it comes to the weights, the dimensions are named simply as + * KERNEL_CHANNEL_IN (first dimension of a weight's tensor), KERNEL_CHANNEL_OUT + * (second dimension) and BIAS_CHANNEL_OUT (first dimension of the bias tensor) + * + * @param[in] input_shape A ParallelTensorShape object representing the shape + * of the ParallelTensor used for the input to the operator + * @return dimension_names A map from each LinearParams::NamedDimensions to the + * index corresponding to that dimension in the input, weight, (bias), or output + * tensor. + */ std::unordered_map LinearParams::get_dimension_names( ParallelTensorShape const &input_shape) const { @@ -1071,6 +1432,43 @@ std::unordered_map {BIAS_CHANNEL_OUT, 0}}; } +/** @brief Sets the size field of ParallelDim objects passed as arguments to + * the expected (non-replica) dimensions of the output, weights, and bias + * tensors. In addition, it sets the output_ndims, kernel_ndims and bias_ndims + * variables to the number of dimensions (including the replica dimensions) of, + * respectively, the ouput, weights, and bias tensors. + * + * The number of dimensions, and dimension sizes of the output, weights, and + * bias dimensions are set as follows. The number of dimensions of all three + * tensors are copied from the dimensions of the input tensor. The replica + * dimensions are not subtracted or otherwise excluded. 
The size of the output + * tensor dimensions are also copied from the input tensor, with the exception + * of the last dimension (replica dimension), which is not set, and the first + * dimension, whose size is set equal to the out_channels member of the + * LinearParams struct, which in turn is set by the outDim parameter of the + * FModel::dense function. When it comes to the size of the weights dimensions, + * the first dimension is set to have size equal to the quotient of the size of + * the INPUT_CHANNEL dimension of the input (first dimension) and the degree + * (number of partitions) of the same input dimension. The second dimension of + * the the weights tensor is set equal to out_channels, just like the first + * dimension of the output tensor. Finally, the size of the first dimension of + * the bias tensor is also set equal to the value of out_channels. + * + * @param[in] input_shape A required argument recording the dimensions of + * the input tensor + * @param[out] output_dims An array of ParallelDim objects representing the + * dimensions of the output tensor + * @param[out] output_ndims The number of dimensions (including the replica + * dimension(s)) of the output tensor + * @param[out] kernel_dims An array of ParallelDim objects representing the + * dimensions of the weights tensor + * @param[out] kernel_ndims The number of dimensions (including the replica + * dimension(s)) of the weights tensor + * @param[out] bias_dims An array of ParallelDim objects representing the + * dimensions of the bias tensor + * @param[out] bias_ndims The number of dimensions (including the replica + * dimension(s)) of the bias tensor + */ void LinearParams::calculate_nonreplica_dim_sizes( ParallelTensorShape const &input_shape, ParallelDim output_dims[MAX_TENSOR_DIM], @@ -1103,6 +1501,20 @@ void LinearParams::calculate_nonreplica_dim_sizes( } } +/** @brief Switch the is_replica_dim field to true in each ParallelDim of + * the output, weight and bias tensor, if the corresponding dimension + * is used to keep track of the number of replicas + * + * @param[in] input_shape A required argument recording the dimensions of + * the input tensor + * @param[out] output_dims An array of ParallelDim objects representing the + * dimensions of the output tensor + * @param[out] kernel_dims An array of ParallelDim objects representing the + * dimensions of the weights tensor + * @param[out] bias_dims An array of ParallelDim objects representing the + * dimensions of the bias tensor + * + */ void LinearParams::mark_replica_dims( ParallelTensorShape const &input_shape, ParallelDim output_dims[MAX_TENSOR_DIM], @@ -1179,6 +1591,8 @@ size_t hash::operator()( hash_combine(key, params.activation); hash_combine(key, params.kernel_reg_type); hash_combine(key, params.kernel_reg_lambda); + hash_combine(key, params.quantization_type); + hash_combine(key, params.offload); return key; } }; // namespace std diff --git a/src/ops/noop.cc b/src/ops/noop.cc index 94fff30553..da2d4922e3 100644 --- a/src/ops/noop.cc +++ b/src/ops/noop.cc @@ -24,6 +24,7 @@ using Legion::coord_t; using Legion::Domain; using Legion::FutureMap; using Legion::IndexLauncher; +using Legion::IndexSpace; using Legion::InlineLauncher; using Legion::LogicalPartition; using Legion::LogicalRegion; @@ -94,8 +95,93 @@ OpMeta *NoOp::init_task(Task const *task, return m; } +void NoOp::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + parallel_is = batch_outputs[0]->parallel_is; + 
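To make the sizing rules documented above concrete, here is a small worked example with made-up numbers: a flattened 28x28 input (INPUT_CHANNEL = 784) partitioned with degree 2 and projected to 512 output channels. Only the non-replica sizes described in the comment are computed; all identifiers below are illustrative:

```cpp
#include <cstdio>

int main() {
  int const input_channel = 784;        // size of the INPUT_CHANNEL dimension
  int const input_channel_degree = 2;   // partitions of that dimension
  int const out_channels = 512;         // outDim passed to dense()

  int const kernel_dim0 = input_channel / input_channel_degree; // 392
  int const kernel_dim1 = out_channels;                         // 512
  int const bias_dim0 = out_channels;                           // 512
  int const output_dim0 = out_channels;                         // 512

  std::printf("kernel: %d x %d, bias: %d, output channel: %d\n",
              kernel_dim0, kernel_dim1, bias_dim0, output_dim0);
}
```

The remaining output dimensions keep the sizes of the corresponding input dimensions, and the replica dimensions are marked separately by `mark_replica_dims`.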
assert(parallel_is != IndexSpace::NO_SPACE); + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + if (op_type == OP_INPUT && batch_outputs[0]->initializer != nullptr) { + ConstantInitializer *initializer = + (ConstantInitializer *)batch_outputs[0]->initializer; + Runtime *runtime = ff.config.lg_hlr; + Context ctx = ff.config.lg_ctx; + ArgumentMap argmap; + IndexLauncher launcher( + CONSTANT_INIT_TASK_ID, + parallel_is, + TaskArgument(initializer, sizeof(ConstantInitializer)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement( + RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(0, FID_DATA); + runtime->execute_index_space(ctx, launcher); + } else if (op_type == OP_INPUT) { + // For OP_INPUT, initialize tensor to zero + assert(batch_outputs[0]->region != LogicalRegion::NO_REGION); + if (batch_outputs[0]->part == LogicalPartition::NO_PART) { + return; + } + ConstantInitializer *initializer = NULL; + if (batch_outputs[0]->data_type == DT_FLOAT) { + initializer = new ConstantInitializer(0.0f); + } else if (batch_outputs[0]->data_type == DT_INT64) { + initializer = new ConstantInitializer((int64_t)0); + } else if (batch_outputs[0]->data_type == DT_INT32) { + initializer = new ConstantInitializer((int)0); + } + Runtime *runtime = ff.config.lg_hlr; + Context ctx = ff.config.lg_ctx; + ArgumentMap argmap; + IndexLauncher launcher( + CONSTANT_INIT_TASK_ID, + parallel_is, + TaskArgument(initializer, sizeof(ConstantInitializer)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement( + RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(0, FID_DATA); + runtime->execute_index_space(ctx, launcher); + } else if (op_type == OP_WEIGHT) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(NOOP_INIT_TASK_ID, + parallel_is, + TaskArgument(NULL, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); + } +} + void NoOp::init(FFModel const &ff) { parallel_is = outputs[0]->parallel_is; + assert(parallel_is != IndexSpace::NO_SPACE); if (op_type == OP_INPUT && outputs[0]->initializer != nullptr) { ConstantInitializer *initializer = (ConstantInitializer *)outputs[0]->initializer; @@ -172,6 +258,15 @@ void NoOp::init(FFModel const &ff) { void NoOp::forward(FFModel const &ff) {} +FutureMap NoOp::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + FutureMap empty; + return empty; +} + void NoOp::backward(FFModel const &ff) {} bool NoOp::measure_operator_cost(Simulator *sim, diff --git a/src/ops/reduce.cc b/src/ops/reduce.cc index e25d810c6c..6c999c8858 100644 --- a/src/ops/reduce.cc +++ b/src/ops/reduce.cc @@ -374,6 +374,7 @@ void Reduce::serialize(Legion::Serializer &sez) const { } sez.serialize(params.keepdims); sez.serialize(this->layer_guid.id); + 
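The serialization hunks in this patch consistently write `layer_guid.id` followed by the new `layer_guid.transformer_layer_id`, and the matching `deserialize` functions read the two fields back in the same order before reconstructing the `LayerID`. The order sensitivity is the whole contract; a self-contained sketch with a toy byte stream (`ByteStream` and `ToyLayerID` are illustrative stand-ins, not FlexFlow or Legion types):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy byte-stream serializer: fields must be read back in exactly the order
// they were written, which is why id and transformer_layer_id always travel
// as an ordered pair.
struct ByteStream {
  std::vector<uint8_t> buf;
  size_t cursor = 0;

  template <typename T> void write(T const &v) {
    uint8_t const *p = reinterpret_cast<uint8_t const *>(&v);
    buf.insert(buf.end(), p, p + sizeof(T));
  }
  template <typename T> void read(T &v) {
    assert(cursor + sizeof(T) <= buf.size());
    std::memcpy(&v, buf.data() + cursor, sizeof(T));
    cursor += sizeof(T);
  }
};

struct ToyLayerID {
  size_t id;
  size_t transformer_layer_id;
};

int main() {
  ByteStream s;
  ToyLayerID guid{42, 7};
  s.write(guid.id);
  s.write(guid.transformer_layer_id); // newly added field

  size_t id = 0, transformer_layer_id = 0;
  s.read(id);                         // same order on the way out
  s.read(transformer_layer_id);
  ToyLayerID restored{id, transformer_layer_id};
  assert(restored.id == 42 && restored.transformer_layer_id == 7);
}
```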
sez.serialize(this->layer_guid.transformer_layer_id); } using PCG::Node; @@ -392,9 +393,10 @@ Node Reduce::deserialize(FFModel &ff, axes.push_back(dim_idx); } dez.deserialize(keepdims); - size_t id; + size_t id, transformer_layer_id; dez.deserialize(id); - LayerID layer_guid(id); + dez.deserialize(transformer_layer_id); + LayerID layer_guid(id, transformer_layer_id); return ff.get_or_create_node(inputs[0], {axes, keepdims, layer_guid}); } diff --git a/src/ops/reshape.cc b/src/ops/reshape.cc index 2b8a60bf21..41c3fcdbf1 100644 --- a/src/ops/reshape.cc +++ b/src/ops/reshape.cc @@ -410,6 +410,7 @@ void Reshape::serialize(Legion::Serializer &sez) const { sez.serialize(this->shape_array[i]); } sez.serialize(this->layer_guid.id); + sez.serialize(this->layer_guid.transformer_layer_id); } using PCG::Node; @@ -427,9 +428,10 @@ Node Reshape::deserialize(FFModel &ff, dez.deserialize(value); shape.push_back(value); } - size_t id; + size_t id, transformer_layer_id; dez.deserialize(id); - LayerID layer_guid(id); + dez.deserialize(transformer_layer_id); + LayerID layer_guid(id, transformer_layer_id); ReshapeParams params; params.shape = shape; diff --git a/src/ops/rms_norm.cc b/src/ops/rms_norm.cc new file mode 100644 index 0000000000..1f21591130 --- /dev/null +++ b/src/ops/rms_norm.cc @@ -0,0 +1,455 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
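The new `rms_norm.cc` below wires up the operator, its weight of size `dim`, and the init/forward/inference launchers; the arithmetic itself lives in the RMSNorm kernels. Assuming the operator follows the usual RMS normalization used in LLaMA-style models, y_i = w_i * x_i / sqrt(mean(x^2) + eps), a plain C++ reference of that formula looks like this (a sketch for intuition, not the GPU implementation):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Reference RMS normalization over the last dimension:
//   y[i] = w[i] * x[i] / sqrt(mean(x^2) + eps)
std::vector<float> rms_norm_ref(std::vector<float> const &x,
                                std::vector<float> const &w, float eps) {
  float sum_sq = 0.0f;
  for (float v : x) {
    sum_sq += v * v;
  }
  float const inv_rms =
      1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
  std::vector<float> y(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    y[i] = w[i] * x[i] * inv_rms;
  }
  return y;
}

int main() {
  std::vector<float> x = {1.0f, -2.0f, 3.0f, -4.0f};
  std::vector<float> w = {1.0f, 1.0f, 1.0f, 1.0f};
  for (float v : rms_norm_ref(x, w, 1e-6f)) {
    std::printf("%f\n", v);
  }
}
```

Unlike LayerNorm there is no mean subtraction and no bias, which is why the operator carries a single learned weight tensor of size `dim`.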
+ */ + +#include "flexflow/ops/rms_norm.h" +#include "flexflow/model.h" +#include "flexflow/ops/kernels/rms_norm_kernels.h" +#include "flexflow/utils/hash_utils.h" +#include "legion/legion_utilities.h" + +namespace FlexFlow { + +// declare Legion names +using Legion::ArgumentMap; +using Legion::Context; +using Legion::Domain; +using Legion::FutureMap; +using Legion::IndexLauncher; +using Legion::Machine; +using Legion::Memory; +using Legion::PhysicalRegion; +using Legion::Predicate; +using Legion::Rect; +using Legion::RegionRequirement; +using Legion::Runtime; +using Legion::Task; +using Legion::TaskArgument; +using Legion::TaskLauncher; + +using namespace FlexFlow::Kernels::RMSNorm; + +bool operator==(RMSNormParams const &lhs, RMSNormParams const &rhs) { + return lhs.layer_guid == rhs.layer_guid && lhs.eps == rhs.eps; +} + +bool RMSNormParams::is_valid(ParallelTensorShape const &input) const { + return input.is_valid(); +} + +RMSNormParams RMSNorm::get_params() const { + RMSNormParams params; + params.layer_guid = this->layer_guid; + params.eps = this->eps; + params.dim = this->dim; + return params; +} + +Tensor FFModel::rms_norm(const Tensor input, + float eps, + int dim, + DataType data_type, + char const *name) { + if (data_type == DT_NONE) { + data_type = input->data_type; + } + Layer *rm = nullptr; + if (data_type != input->data_type) { + Tensor casted_input = cast(input, data_type, "type cast for rms_norm"); + rm = new Layer(this, + OP_RMS_NORM, + data_type, + name, + 1 /*inputs*/, + 1 /*weights*/, + 1 /*outputs*/, + casted_input); + } else { + rm = new Layer(this, + OP_RMS_NORM, + data_type, + name, + 1 /*inputs*/, + 1 /*weights*/, + 1 /*outputs*/, + input); + } + rm->outputs[0] = create_tensor_legion_ordering( + input->num_dims, input->dims, data_type, rm, 0, true /*create_grad*/); + + // weights + int weight_dims[1] = {dim}; + rm->weights[0] = create_weight_legion_ordering(1, + weight_dims, + data_type, + rm, + true /*create_grad*/, + nullptr, + CHOSEN_SYNC_TYPE); + + rm->add_float_property("eps", eps); + rm->add_int_property("dim", dim); + layers.push_back(rm); + return rm->outputs[0]; +} + +Op *RMSNorm::create_operator_from_layer( + FFModel &model, + Layer const *layer, + std::vector const &inputs) { + float eps; + layer->get_float_property("eps", eps); + long long value; + layer->get_int_property("dim", value); + int dim = value; + + return new RMSNorm( + model, layer->layer_guid, inputs[0], eps, dim, false, layer->name); +} + +RMSNorm::RMSNorm(FFModel &model, + RMSNormParams const ¶ms, + ParallelTensor const input, + bool allocate_weights = false, + char const *name) + : RMSNorm(model, + params.layer_guid, + input, + params.eps, + params.dim, + allocate_weights, + name) {} + +RMSNorm::RMSNorm(FFModel &model, + RMSNorm const &other, + const ParallelTensor input, + bool allocate_weights) + : RMSNorm(model, + other.layer_guid, + input, + other.eps, + other.dim, + allocate_weights, + other.name) {} +RMSNorm::RMSNorm(FFModel &model, + LayerID const &_layer_guid, + const ParallelTensor _input, + float _eps, + int dim, + bool allocate_weights, + char const *name) + : Op(model, + OP_RMS_NORM, + _input->data_type, + name, + 1 /*num of inputs tensor */, + 1 /*num of weights tensor */, + 1 /*onum of utputs tensor */, + _input) { + eps = _eps; + inputs[0] = _input; + layer_guid = _layer_guid; + int num_dims = _input->num_dims; + this->dim = dim; + data_dim = _input->dims[0].size; + effective_batch_size = 1; + for (int i = 1; i <= num_dims - 2; i++) { + effective_batch_size *= 
_input->dims[i].size; + } + // Currently assert that all non-replica dims are not parallelized + // We only support parallelism along the replica dim now + for (int i = 0; i < _input->num_dims - 1; i++) { + assert(_input->dims[i].degree == 1); + } + // output has the same parallel dims as input + ParallelDim output_dims[MAX_TENSOR_DIM]; + for (int i = 0; i < _input->num_dims; i++) { + output_dims[i] = _input->dims[i]; + } + outputs[0] = model.create_parallel_tensor_legion_ordering( + _input->num_dims, output_dims, _input->data_type, this); + if (allocate_weights) { + // weights should have the shape of (data_dim, data_dim) + ParallelDim new_weight_dims[MAX_TENSOR_DIM]; + + new_weight_dims[0].size = dim; + new_weight_dims[0].degree = 1; + new_weight_dims[0].parallel_idx = -1; + new_weight_dims[1] = _input->dims[_input->num_dims - 1]; // replica dim + + // weights + Initializer *kernel_initializer = new GlorotUniform(std::rand() /*seed*/); + weights[0] = + model.create_parallel_weight_legion_ordering(2, + new_weight_dims, + _input->data_type, + nullptr /*owner_op*/, + false /*create_grad*/, + kernel_initializer, + CHOSEN_SYNC_TYPE); + } +} + +void RMSNorm::init(FFModel const &ff) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + set_argumentmap_for_init(ff, argmap); + IndexLauncher launcher(RMSNROM_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(RMSNorm)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + outputs[0]->machine_view.hash()); + launcher.add_region_requirement(RegionRequirement(outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + outputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + inputs[0]->region)); + launcher.add_field(1, FID_DATA); + + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(2, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap(ff, fm); +} + +void RMSNorm::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + + IndexLauncher launcher(RMSNROM_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(RMSNorm)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(1, FID_DATA); + + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(2, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + +OpMeta *RMSNorm::init_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + RMSNorm *rn = (RMSNorm *)task->args; + FFHandler handle = *((FFHandler const *)task->local_args); + Memory gpu_mem = Machine::MemoryQuery(Machine::get_machine()) + .only_kind(Memory::GPU_FB_MEM) + .best_affinity_to(task->target_proc) + .first(); + MemoryAllocator gpu_mem_allocator(gpu_mem); + RMSNormMeta *meta = new RMSNormMeta(handle, rn, gpu_mem_allocator); + return meta; +} + +void RMSNorm::forward(FFModel const &ff) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + set_argumentmap_for_forward(ff, argmap); + IndexLauncher launcher(RMSNROM_FWD_TASK_ID, + parallel_is, + TaskArgument(NULL, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + outputs[0]->machine_view.hash()); + launcher.add_region_requirement(RegionRequirement(inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + outputs[0]->region)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(2, FID_DATA); + runtime->execute_index_space(ctx, launcher); +} + +FutureMap RMSNorm::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + + IndexLauncher launcher(RMSNROM_FWD_TASK_ID, + parallel_is, + TaskArgument(NULL, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(2, FID_DATA); + return runtime->execute_index_space(ctx, launcher); +} + +/* + regions[0](I): input + regions[1](O): output + regions[2](I/O): weight +*/ +void RMSNorm::forward_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(task->regions.size() == 3); + assert(regions.size() == 3); + RMSNormMeta const *m = *((RMSNormMeta **)task->local_args); + GenericTensorAccessorR input = helperGetGenericTensorAccessorRO( + m->input_type[0], regions[0], task->regions[0], FID_DATA, ctx, runtime); + GenericTensorAccessorW output = helperGetGenericTensorAccessorWO( + m->output_type[0], regions[1], task->regions[1], FID_DATA, ctx, runtime); + GenericTensorAccessorR weight = helperGetGenericTensorAccessorRO( + m->weight_type[0], regions[2], task->regions[2], FID_DATA, ctx, runtime); + forward_kernel_wrapper(m, input, weight, output); +} + +void RMSNorm::serialize(Legion::Serializer &sez) const { + sez.serialize(this->layer_guid.id); + sez.serialize(this->layer_guid.transformer_layer_id); + sez.serialize(this->eps); + sez.serialize(this->dim); +} + +using PCG::Node; +/*static*/ +Node RMSNorm::deserialize(FFModel &ff, + Legion::Deserializer &dez, + ParallelTensor inputs[], + int num_inputs) { + assert(num_inputs == 1); + float eps; + size_t id, transformer_layer_id; + int dim; + dez.deserialize(id); + dez.deserialize(transformer_layer_id); + + LayerID layer_guid(id, transformer_layer_id); + dez.deserialize(eps); + dez.deserialize(dim); + RMSNormParams params; + params.layer_guid = layer_guid; + params.eps = eps; + params.dim = dim; + return ff.get_or_create_node(inputs[0], params); +} + +Op *RMSNorm::materialize(FFModel &ff, + ParallelTensor inputs[], + int num_inputs) const { + RMSNormParams params = get_params(); + return new RMSNorm(ff, params, inputs[0], this->name); +} + +void RMSNorm::backward(FFModel const &ff) {} + +bool RMSNorm::measure_operator_cost(Simulator *sim, + MachineView const &mv, + CostMetrics &cost_metrics) const { + return false; +} + +} // namespace FlexFlow +namespace std { +size_t hash::operator()( + FlexFlow::RMSNormParams const ¶ms) const { + size_t key = 0; + hash_combine(key, params.eps); + hash_combine(key, params.layer_guid.id); + hash_combine(key, params.dim); + return key; +} +}; // namespace std diff --git a/src/ops/sampling.cc b/src/ops/sampling.cc new file mode 100644 index 0000000000..6eb62b2933 --- /dev/null +++ b/src/ops/sampling.cc @@ -0,0 +1,354 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "flexflow/ops/sampling.h" +#include "flexflow/model.h" +#include "flexflow/utils/hash_utils.h" +#include "legion/legion_utilities.h" +#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) +#include "flexflow/utils/cuda_helper.h" +#else +#include "flexflow/utils/hip_helper.h" +#endif + +namespace FlexFlow { +// declare Legion names +using Legion::ArgumentMap; +using Legion::Context; +using Legion::coord_t; +using Legion::Domain; +using Legion::FutureMap; +using Legion::IndexLauncher; +using Legion::InlineLauncher; +using Legion::Machine; +using Legion::Memory; +using Legion::PhysicalRegion; +using Legion::Predicate; +using Legion::Rect; +using Legion::RegionRequirement; +using Legion::Runtime; +using Legion::Task; +using Legion::TaskArgument; +using Legion::TaskLauncher; +using PCG::Node; + +// For an input tensor, computes the top k entries in each row +// (resp. vector along the last dimension). Thus, +// values.shape = indices.shape = input.shape[:-1] + [k] +Tensor FFModel::sampling(const Tensor input, float top_p, char const *name) { + Layer *li = new Layer(this, + OP_SAMPLING, + input->data_type, + name, + 1 /*inputs*/, + 0 /*weights*/, + 1 /*outputs*/, + input); + { + int numdims = input->num_dims; + int dims[MAX_TENSOR_DIM]; + for (int i = 0; i < numdims; i++) { + dims[i] = input->dims[i]; + } + // now just support 1 output + dims[0] = 1; + // li->outputs[0] = create_tensor_legion_ordering( + // numdims, dims, input->data_type, li, 0, true /*create_grad*/); + li->outputs[0] = create_tensor_legion_ordering( + numdims, dims, DT_INT32, li, 0, false /*create_grad*/); + } + layers.push_back(li); + li->add_float_property("top_p", top_p); + // outputs[0] = li->outputs[0]; + // outputs[1] = li->outputs[1]; + return li->outputs[0]; +} + +Op *Sampling::create_operator_from_layer( + FFModel &model, + Layer const *layer, + std::vector const &inputs) { + float top_p; + layer->get_float_property("top_p", top_p); + return new Sampling(model, inputs[0], top_p, layer->name); +} + +SamplingParams Sampling::get_params() const { + SamplingParams params; + params.top_p = this->top_p; + return params; +} + +bool SamplingParams::is_valid(ParallelTensorShape const &) const { + return true; +} + +bool operator==(SamplingParams const &lhs, SamplingParams const &rhs) { + return lhs.top_p == rhs.top_p; +} + +Sampling::Sampling(FFModel &model, + const ParallelTensor _input, + float _top_p, + char const *name) + : Op(model, + OP_SAMPLING, + _input->data_type, + name, + 1 /*inputs*/, + 0 /*weights*/, + 1 /*outputs*/, + _input), + top_p(_top_p) { + int numdim = inputs[0]->num_dims; + ParallelDim dims[MAX_TENSOR_DIM]; + for (int i = 0; i < numdim; i++) { + dims[i] = inputs[0]->dims[i]; + } + dims[0].size = 1; + std::cout << "degree: " << inputs[0]->dims[0].degree << "\n"; + assert(inputs[0]->dims[0].degree == 1); + assert(inputs[0]->dims[0].parallel_idx == -1); + // outputs[0] = model.create_parallel_tensor_legion_ordering( + // numdim, dims, _input->data_type, this, 0 /*owner_idx*/); + outputs[0] = model.create_parallel_tensor_legion_ordering( + numdim, dims, DT_INT32, this, 0 
/*owner_idx*/); +} + +Sampling::Sampling(FFModel &model, + Sampling const &other, + const ParallelTensor input) + : Sampling(model, input, other.top_p, other.name) {} + +Sampling::Sampling(FFModel &model, + SamplingParams const ¶ms, + const ParallelTensor input, + char const *name) + : Sampling(model, input, params.top_p, name) {} + +void Sampling::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(SAMPLING_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(Sampling)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + +void Sampling::init(FFModel const &ff) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + set_argumentmap_for_init(ff, argmap); + IndexLauncher launcher(SAMPLING_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(Sampling)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + outputs[0]->machine_view.hash()); + launcher.add_region_requirement(RegionRequirement(inputs[0]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + outputs[0]->region)); + launcher.add_field(1, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap(ff, fm); +} + +OpMeta *Sampling::init_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + Sampling *s = (Sampling *)task->args; + FFHandler handle = *((FFHandler *)task->local_args); + GenericTensorAccessorW acc_input = + helperGetGenericTensorAccessorRW(s->inputs[0]->data_type, + regions[0], + task->regions[0], + FID_DATA, + ctx, + runtime); + + int length = acc_input.domain.hi()[0] - acc_input.domain.lo()[0] + 1; + int batch_size = acc_input.domain.get_volume() / length; + Memory gpu_mem = Machine::MemoryQuery(Machine::get_machine()) + .only_kind(Memory::GPU_FB_MEM) + .best_affinity_to(task->target_proc) + .first(); + MemoryAllocator gpu_mem_allocator(gpu_mem); + SamplingMeta *m = new SamplingMeta( + handle, s, batch_size, length * batch_size, acc_input, gpu_mem_allocator); + m->profiling = s->profiling; + m->top_p = s->top_p; + return m; +} + +void Sampling::forward(FFModel const &ff) { + // Sampling does not support forward + assert(false); +} + 
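// -----------------------------------------------------------------------------
// Illustrative reference (not part of this change, not called by the runtime):
// `top_p` is the nucleus-sampling threshold. Conceptually, the CUDA path in
// src/ops/sampling.cu sorts each request's distribution in descending order,
// draws a uniform random number scaled into the top-p mass, and returns the
// first position whose running cumulative value reaches it. A minimal CPU
// sketch of that selection rule, assuming `sorted_probs` is a descending,
// normalized distribution of length `n` and `r01` is uniform in [0, 1)
// (the function name and parameters here are hypothetical):
static inline int sample_top_p_reference(float const *sorted_probs,
                                         int n,
                                         float top_p,
                                         float r01) {
  float target = r01 * top_p; // restrict sampling to the leading top_p mass
  float cumulative = 0.0f;
  for (int i = 0; i < n; i++) {
    cumulative += sorted_probs[i];
    if (cumulative >= target) {
      return i; // index into the sorted order; map back via sorted_idx
    }
  }
  return n - 1; // fallback if the scanned mass never reaches the target
}
// The GPU kernel (sampling_topp_kernel in sampling.cu) computes the same
// cumulative sum with cub::BlockScan and an atomicMin over candidate positions.
// -----------------------------------------------------------------------------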
+FutureMap Sampling::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + /* std::cout << "Sampling op machine_view: " << *(MachineView const *)mv + << std::endl; */ + IndexLauncher launcher(SAMPLING_INF_TASK_ID, + parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_future(bc); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + return runtime->execute_index_space(ctx, launcher); +} + +InferenceResult + Sampling::inference_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(regions.size() == 2); + assert(task->regions.size() == 2); + BatchConfig const *bc = BatchConfig::from_future(task->futures[0]); + // BatchConfig const *bc = (BatchConfig *)task->args; + SamplingMeta const *m = *((SamplingMeta **)task->local_args); + if (bc->num_tokens == 0) { + // Directly return for empty batch config + InferenceResult ir; + return ir; + } + + GenericTensorAccessorW input = helperGetGenericTensorAccessorRW( + m->input_type[0], regions[0], task->regions[0], FID_DATA, ctx, runtime); + GenericTensorAccessorW indices = helperGetGenericTensorAccessorWO( + DT_INT32, regions[1], task->regions[1], FID_DATA, ctx, runtime); + + int batch_size = bc->num_active_tokens(); + Sampling::forward_kernel_wrapper(m, input, indices, batch_size); + + InferenceResult ir; + download_tensor( + indices.get_int32_ptr(), ir.token_ids, batch_size); + return ir; +} + +void Sampling::backward(FFModel const &ff) { + // Sampling does not support backward + assert(false); +} + +void Sampling::serialize(Legion::Serializer &sez) const { + sez.serialize(this->top_p); +} + +Node Sampling::deserialize(FFModel &ff, + Legion::Deserializer &dez, + ParallelTensor inputs[], + int num_inputs) { + assert(num_inputs == 1); + float top_p; + dez.deserialize(top_p); + SamplingParams params; + params.top_p = top_p; + return ff.get_or_create_node(inputs[0], params); +} + +Op *Sampling::materialize(FFModel &ff, + ParallelTensor inputs[], + int num_inputs) const { + SamplingParams params = get_params(); + return new Sampling(ff, params, inputs[0], this->name); +} + +bool Sampling::measure_operator_cost(Simulator *sim, + MachineView const &mv, + CostMetrics &cost_metrics) const { + return false; +} + +}; // namespace FlexFlow + +namespace std { +size_t hash::operator()( + FlexFlow::SamplingParams const ¶ms) const { + size_t key = 0; + hash_combine(key, params.top_p); + return key; +} +}; // namespace std diff --git a/src/ops/sampling.cpp b/src/ops/sampling.cpp new file mode 100644 index 0000000000..56f3f604d5 --- /dev/null +++ b/src/ops/sampling.cpp @@ -0,0 +1,69 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the 
Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "flexflow/ops/sampling.h" +#include "flexflow/ffconst_utils.h" +#include "flexflow/utils/hip_helper.h" +#include + +namespace FlexFlow { + +/*static*/ +template +void Sampling::forward_kernel(SamplingMeta const *m, + DT *input_ptr, + int *indices_ptr, + float const top_p, + int const length, + int const batch_size, + hipStream_t stream) {} + +/*static*/ +void Sampling::forward_kernel_wrapper(SamplingMeta const *m, + GenericTensorAccessorW const &input, + GenericTensorAccessorW const &indices, + int batch_size) { + hipStream_t stream; + checkCUDA(get_legion_stream(&stream)); + + hipEvent_t t_start, t_end; + if (m->profiling) { + hipEventCreate(&t_start); + hipEventCreate(&t_end); + hipEventRecord(t_start, stream); + } + + handle_unimplemented_hip_kernel(OP_RMS_NORM); + + if (m->profiling) { + hipEventRecord(t_end, stream); + checkCUDA(hipEventSynchronize(t_end)); + float elapsed = 0; + checkCUDA(hipEventElapsedTime(&elapsed, t_start, t_end)); + hipEventDestroy(t_start); + hipEventDestroy(t_end); + } +} + +SamplingMeta::SamplingMeta(FFHandler handler, + Op const *op, + int batch_size, + int total_ele, + GenericTensorAccessorW input, + MemoryAllocator &gpu_mem_allocator) + : OpMeta(handler, op) {} + +SamplingMeta::~SamplingMeta(void) {} +}; // namespace FlexFlow \ No newline at end of file diff --git a/src/ops/sampling.cu b/src/ops/sampling.cu new file mode 100644 index 0000000000..461d72ec71 --- /dev/null +++ b/src/ops/sampling.cu @@ -0,0 +1,287 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "cub/cub.cuh" +#include "flexflow/ffconst_utils.h" +#include "flexflow/ops/sampling.h" +#include "flexflow/utils/cuda_helper.h" +#include +#include + +namespace FlexFlow { + +constexpr int SamplingNumThreads = 1024; +struct BlockPrefixCallbackOp { + // Running prefix + float running_total; + // Constructor + __device__ BlockPrefixCallbackOp(float running_total) + : running_total(running_total) {} + // Callback operator to be entered by the first warp of threads in the block. + // Thread-0 is responsible for returning a value for seeding the block-wide + // scan. 
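  // Illustrative note (not part of the original change): cub::BlockScan invokes
  // this functor once per InclusiveSum call when a block scans a long vector in
  // tiles of blockDim.x elements. It receives the current tile's aggregate and
  // returns the prefix to add to that tile, accumulating the running total for
  // the next tile. For example, with tile aggregates 0.40, 0.35, 0.25:
  //   call 1: block_aggregate = 0.40 -> returns 0.00, running_total = 0.40
  //   call 2: block_aggregate = 0.35 -> returns 0.40, running_total = 0.75
  //   call 3: block_aggregate = 0.25 -> returns 0.75, running_total = 1.00
  // This is how sampling_topp_kernel turns per-tile inclusive sums into a
  // cumulative sum over the whole vocabulary.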
+ __device__ float operator()(float block_aggregate) { + float old_prefix = running_total; + running_total += block_aggregate; + return old_prefix; + } +}; + +__global__ void init_idxs(int batch_size, + int vocab_size, + int total_eles, + int *idx, + int *begin_offset, + int *end_offset) { + CUDA_KERNEL_LOOP(i, total_eles) { + idx[i] = i % vocab_size; + if (i % vocab_size == 0) { + begin_offset[i / vocab_size] = i; + end_offset[i / vocab_size] = i; + } + } +} + +__global__ void + init_random_kernel(curandState *state, int batch_size, long rand) { + CUDA_KERNEL_LOOP(i, batch_size) { + curand_init(rand, i, 0, &state[i]); + } +} + +// multinominal and gather +template +__global__ void sampling_topp_kernel(int batch_size, + int const vocab_size, + curandState *state, + DT *sorted_logits, + int *sorted_idx, + int *indices_ptr, + float topp) { + // int const vocab_id = threadIdx.x; + int const batch_idx = blockIdx.x; + __shared__ float random_n; + __shared__ long long result_idx; + + // random num + if (threadIdx.x == 0) { + // number must < topp + random_n = curand_uniform(state + batch_idx) * topp; + // printf("batch idx: %d, random num%f\n", batch_idx, random_n); + } + + __syncthreads(); + + // cumsum; + typedef cub::BlockScan BlockScan; + __shared__ typename BlockScan::TempStorage temp_storage; + + int offset = batch_idx * vocab_size; + float prefix_sum = 0.0f; + BlockPrefixCallbackOp prefix_op(0); + result_idx = vocab_size - 1; + + for (long long j = threadIdx.x; j < vocab_size; j += blockDim.x) { + float logit = (float)(sorted_logits[offset + j]); + BlockScan(temp_storage).InclusiveSum(logit, prefix_sum, prefix_op); + prefix_sum /= topp; + if (prefix_sum >= random_n) { + atomicMin(&result_idx, j); + } + } + indices_ptr[batch_idx] = sorted_idx[offset + result_idx]; + + // if (threadIdx.x == 0) { + // printf("selected idx: %d, %d\n", blockIdx.x, result_idx); + // } +} + +/*static*/ +template +void Sampling::forward_kernel(SamplingMeta const *m, + DT *input_ptr, + int *indices_ptr, + float const top_p, + int const length, + int const batch_size, + cudaStream_t stream) { + // 1. sort + size_t temp_storage_bytes = m->temp_storage_bytes; + checkCUDA(cub::DeviceSegmentedRadixSort::SortPairsDescending( + m->d_temp_storage, + temp_storage_bytes, + input_ptr, + static_cast
(m->sorted_logits), + m->idx, + m->sorted_idx, + length * batch_size, + batch_size, + m->begin_offset, + m->end_offset + 1, + 0, // begin_bit + sizeof(DT) * 8, // end_bit = sizeof(KeyT) * 8 + stream)); + int parallelism = batch_size; + init_random_kernel<<>>(m->state, batch_size, rand()); + // sampling + sampling_topp_kernel + <<>>( + batch_size, + length, + m->state, + static_cast
(m->sorted_logits), + m->sorted_idx, + indices_ptr, + top_p); +} + +/*static*/ +void Sampling::forward_kernel_wrapper(SamplingMeta const *m, + GenericTensorAccessorW const &input, + GenericTensorAccessorW const &indices, + int batch_size) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + + cudaEvent_t t_start, t_end; + if (m->profiling) { + cudaEventCreate(&t_start); + cudaEventCreate(&t_end); + cudaEventRecord(t_start, stream); + } + int length = input.domain.hi()[0] - input.domain.lo()[0] + 1; + + if (input.data_type == DT_HALF) { + Sampling::forward_kernel(m, + input.get_half_ptr(), + indices.get_int32_ptr(), + m->top_p, + length, + batch_size, + stream); + } else if (input.data_type == DT_FLOAT) { + Sampling::forward_kernel(m, + input.get_float_ptr(), + indices.get_int32_ptr(), + m->top_p, + length, + batch_size, + stream); + } else { + assert(false && "Unsupported data type"); + } + + if (m->profiling) { + cudaEventRecord(t_end, stream); + checkCUDA(cudaEventSynchronize(t_end)); + float elapsed = 0; + checkCUDA(cudaEventElapsedTime(&elapsed, t_start, t_end)); + cudaEventDestroy(t_start); + cudaEventDestroy(t_end); + printf("[Sampling] forward time = %.2lfms\n", elapsed); + } +} + +SamplingMeta::SamplingMeta(FFHandler handler, + Op const *op, + int batch_size, + int total_ele, + GenericTensorAccessorW input, + MemoryAllocator &gpu_mem_allocator) + : OpMeta(handler, op) { + DataType data_type = op->data_type; + + size_t begin_offset_size, end_offset_size; + begin_offset_size = end_offset_size = batch_size + 1; + size_t idx_size, sorted_idx_size, sorted_logits_size; + idx_size = sorted_idx_size = sorted_logits_size = total_ele; + size_t state_size = batch_size; + + size_t totalSize = sizeof(int) * (begin_offset_size + end_offset_size + + idx_size + sorted_idx_size) + + data_type_size(data_type) * sorted_logits_size + + sizeof(curandState) * state_size; + gpu_mem_allocator.create_legion_instance(reserveInst, totalSize); + begin_offset = gpu_mem_allocator.allocate_instance(begin_offset_size); + end_offset = gpu_mem_allocator.allocate_instance(end_offset_size); + idx = gpu_mem_allocator.allocate_instance(idx_size); + sorted_idx = gpu_mem_allocator.allocate_instance(sorted_idx_size); + sorted_logits = gpu_mem_allocator.allocate_instance_untyped( + sorted_logits_size * data_type_size(data_type)); + state = gpu_mem_allocator.allocate_instance(state_size); + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + + // init offset + int parallelism = total_ele; + init_idxs<<>>(batch_size, + total_ele / batch_size, + total_ele, + idx, + begin_offset, + end_offset); + + // init sort function + if (data_type == DT_FLOAT) { + checkCUDA(cub::DeviceSegmentedRadixSort::SortPairsDescending( + d_temp_storage, + temp_storage_bytes, + input.get_float_ptr(), + input.get_float_ptr(), + idx, + idx, + total_ele, + batch_size, + begin_offset, + end_offset + 1, + 0, // begin_bit + data_type_size(data_type) * 8, // end_bit = sizeof(KeyT) * 8 + stream)); + } else if (data_type == DT_HALF) { + checkCUDA(cub::DeviceSegmentedRadixSort::SortPairsDescending( + d_temp_storage, + temp_storage_bytes, + input.get_half_ptr(), + input.get_half_ptr(), + idx, + idx, + total_ele, + batch_size, + begin_offset, + end_offset + 1, + 0, // begin_bit + data_type_size(data_type) * 8, // end_bit = sizeof(KeyT) * 8 + stream)); + } else { + assert(false && "input type in float and half"); + } + + gpu_mem_allocator.create_legion_instance(reserveInst, temp_storage_bytes); + d_temp_storage = + 
gpu_mem_allocator.allocate_instance_untyped(temp_storage_bytes); +} + +SamplingMeta::~SamplingMeta(void) { + if (reserveInst != Realm::RegionInstance::NO_INST) { + reserveInst.destroy(); + } +} +}; // namespace FlexFlow \ No newline at end of file diff --git a/src/ops/softmax.cc b/src/ops/softmax.cc index 029b20afd1..450f7c009a 100644 --- a/src/ops/softmax.cc +++ b/src/ops/softmax.cc @@ -52,10 +52,16 @@ SoftmaxParams Softmax::get_params() const { return params; } -Tensor FFModel::softmax(const Tensor _input, int dim, char const *name) { +Tensor FFModel::softmax(const Tensor _input, + int dim, + DataType data_type, + char const *name) { + if (data_type = DT_NONE) { + data_type = _input->data_type; + } Layer *sm = new Layer(this, OP_SOFTMAX, - DT_FLOAT, + data_type, name, 1 /*inputs*/, 0 /*weights*/, @@ -67,7 +73,7 @@ Tensor FFModel::softmax(const Tensor _input, int dim, char const *name) { dims[i] = _input->dims[i]; } sm->outputs[0] = create_tensor_legion_ordering( - numdims, dims, DT_FLOAT, sm, 0, true /*create_grad*/); + numdims, dims, data_type, sm, 0, true /*create_grad*/); sm->add_int_property("softmax_dim", dim); layers.push_back(sm); return sm->outputs[0]; @@ -106,7 +112,7 @@ Softmax::Softmax(FFModel &model, for (int i = 0; i < numdim; i++) { dims[i] = _input->dims[numdim - 1 - i]; } - outputs[0] = model.create_parallel_tensor(numdim, dims, DT_FLOAT, this); + outputs[0] = model.create_parallel_tensor(numdim, dims, data_type, this); } Softmax::Softmax(FFModel &model, @@ -115,6 +121,43 @@ Softmax::Softmax(FFModel &model, char const *name) : Softmax(model, input, params.dim, name) {} +void Softmax::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(SOFTMAX_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(Softmax)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_DISCARD, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + void Softmax::init(FFModel const &ff) { assert(check_output_input_weight_same_parallel_is()); parallel_is = outputs[0]->parallel_is; @@ -184,10 +227,49 @@ OpMeta *Softmax::init_task(Task const *task, domain = input_domain; } SoftmaxMeta *m = new SoftmaxMeta(handle, softmax, domain); + m->input_type = softmax->inputs[0]->data_type; + m->output_type = softmax->outputs[0]->data_type; // checkCUDNN(cudnnCreateTensorDescriptor(&m->outputTensor)); return m; } +FutureMap Softmax::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + /* std::cout << "Softmax op machine_view: " << *(MachineView const *)mv + << std::endl; */ + IndexLauncher launcher(SOFTMAX_INF_TASK_ID, + parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + return runtime->execute_index_space(ctx, launcher); +} + void Softmax::forward(FFModel const &ff) { ArgumentMap argmap; Context ctx = ff.config.lg_ctx; @@ -222,10 +304,17 @@ void Softmax::forward_task(Task const *task, Runtime *runtime) { Domain in_domain = runtime->get_index_space_domain( ctx, task->regions[0].region.get_index_space()); + SoftmaxMeta const *m = *((SoftmaxMeta **)task->local_args); switch (in_domain.get_dim()) { #define DIMFUNC(DIM) \ case DIM: \ - return forward_task_with_dim(task, regions, ctx, runtime); + if (m->output_type == DT_HALF) { \ + return forward_task_with_dim(task, regions, ctx, runtime); \ + } else if (m->output_type == DT_FLOAT) { \ + return forward_task_with_dim(task, regions, ctx, runtime); \ + } else { \ + assert(false && "Unsupported data type"); \ + } LEGION_FOREACH_N(DIMFUNC) #undef DIMFUNC default: @@ -237,7 +326,7 @@ void Softmax::forward_task(Task const *task, regions[0](I): input regions[1](O): output */ -template +template void Softmax::forward_task_with_dim(Task const *task, std::vector const ®ions, 
Context ctx, @@ -246,15 +335,14 @@ void Softmax::forward_task_with_dim(Task const *task, assert(task->regions.size() == 2); // const Softmax* softmax = (Softmax*) task->args; SoftmaxMeta const *m = *((SoftmaxMeta **)task->local_args); - TensorAccessorR acc_input( + TensorAccessorR acc_input( regions[0], task->regions[0], FID_DATA, ctx, runtime); - TensorAccessorW acc_output(regions[1], - task->regions[1], - FID_DATA, - ctx, - runtime, - false /*readOutput*/); - + TensorAccessorW acc_output(regions[1], + task->regions[1], + FID_DATA, + ctx, + runtime, + false /*readOutput*/); forward_kernel_wrapper(m, acc_input.ptr, acc_output.ptr); } @@ -292,10 +380,17 @@ void Softmax::backward_task(Task const *task, Runtime *runtime) { Domain in_domain = runtime->get_index_space_domain( ctx, task->regions[0].region.get_index_space()); + SoftmaxMeta const *m = *((SoftmaxMeta **)task->local_args); switch (in_domain.get_dim()) { #define DIMFUNC(DIM) \ case DIM: \ - return backward_task_with_dim(task, regions, ctx, runtime); + if (m->output_type == DT_HALF) { \ + return backward_task_with_dim(task, regions, ctx, runtime); \ + } else if (m->output_type == DT_FLOAT) { \ + return backward_task_with_dim(task, regions, ctx, runtime); \ + } else { \ + assert(false && "Unsupported data type"); \ + } LEGION_FOREACH_N(DIMFUNC) #undef DIMFUNC default: @@ -310,7 +405,7 @@ void Softmax::backward_task(Task const *task, // Note that the backward task of softmax is actually a no op (i.e., input_grad // = output_grad) since the upstream cross_entropy_loss function computes // performs softmax_cross_entropy_loss to avoid intermediate zeros -template +template void Softmax::backward_task_with_dim(Task const *task, std::vector const ®ions, Context ctx, @@ -319,13 +414,13 @@ void Softmax::backward_task_with_dim(Task const *task, assert(task->regions.size() == 2); // const Softmax* softmax = (Softmax*) task->args; SoftmaxMeta const *m = *((SoftmaxMeta **)task->local_args); - TensorAccessorW acc_input_grad(regions[0], - task->regions[0], - FID_DATA, - ctx, - runtime, - true /*readOutput*/); - TensorAccessorR acc_output_grad( + TensorAccessorW acc_input_grad(regions[0], + task->regions[0], + FID_DATA, + ctx, + runtime, + true /*readOutput*/); + TensorAccessorR acc_output_grad( regions[1], task->regions[1], FID_DATA, ctx, runtime); // make sure the image indices match! 
assert(acc_input_grad.rect == acc_output_grad.rect); @@ -334,6 +429,36 @@ void Softmax::backward_task_with_dim(Task const *task, m, acc_input_grad.ptr, acc_output_grad.ptr, acc_input_grad.rect.volume()); } +InferenceResult + Softmax::inference_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + Domain in_domain = runtime->get_index_space_domain( + ctx, task->regions[0].region.get_index_space()); + SoftmaxMeta const *m = *((SoftmaxMeta **)task->local_args); + switch (in_domain.get_dim()) { +#define DIMFUNC(DIM) \ + case DIM: \ + if (m->output_type == DT_HALF) { \ + forward_task_with_dim(task, regions, ctx, runtime); \ + break; \ + } else if (m->output_type == DT_FLOAT) { \ + forward_task_with_dim(task, regions, ctx, runtime); \ + break; \ + } else { \ + assert(false && "Unsupported data type"); \ + } + LEGION_FOREACH_N(DIMFUNC) +#undef DIMFUNC + default: + assert(false); + } + // FIXME: replace this with actual result + InferenceResult ir; + return ir; +} + bool Softmax::get_int_parameter(PMParameter para, int *value) const { switch (para) { case PM_SOFTMAX_DIM: diff --git a/src/ops/spec_inc_multihead_self_attention.cc b/src/ops/spec_inc_multihead_self_attention.cc new file mode 100644 index 0000000000..9395c9aab4 --- /dev/null +++ b/src/ops/spec_inc_multihead_self_attention.cc @@ -0,0 +1,860 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/ops/spec_inc_multihead_self_attention.h" +#include "flexflow/ffconst_utils.h" +#include "flexflow/model.h" +#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) +#include "flexflow/utils/cuda_helper.h" +#else +#include "flexflow/utils/hip_helper.h" +#endif +#include "flexflow/utils/hash_utils.h" +#include "legion/legion_utilities.h" +#ifdef INFERENCE_TESTS +#include +using namespace at::indexing; +#endif + +namespace FlexFlow { + +// declare Legion names +using Legion::ArgumentMap; +using Legion::Context; +using Legion::coord_t; +using Legion::Domain; +using Legion::Future; +using Legion::FutureMap; +using Legion::IndexLauncher; +using Legion::Machine; +using Legion::Memory; +using Legion::PhysicalRegion; +using Legion::Predicate; +using Legion::Rect; +using Legion::RegionRequirement; +using Legion::Runtime; +using Legion::Task; +using Legion::TaskArgument; +using Legion::TaskLauncher; +using PCG::Node; + +bool SpecIncMultiHeadSelfAttentionParams::is_valid( + ParallelTensorShape const &input) const { + bool is_valid = input.is_valid(); + return is_valid; +} + +Tensor + FFModel::spec_inc_multihead_self_attention(const Tensor input, + int embed_dim, + int num_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + DataType data_type, + Initializer *kernel_initializer, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name) { + return spec_inc_multiquery_self_attention(input, + embed_dim, + num_heads, + num_heads, + kdim, + vdim, + dropout, + bias, + add_bias_kv, + add_zero_attn, + data_type, + kernel_initializer, + apply_rotary_embedding, + scaling_query, + scaling_factor, + qk_prod_scaling, + name); +} + +Tensor + FFModel::spec_inc_multiquery_self_attention(const Tensor input, + int embed_dim, + int num_q_heads, + int num_kv_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + DataType data_type, + Initializer *kernel_initializer, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name) { + if (data_type == DT_NONE) { + data_type = input->data_type; + } + Layer *li = nullptr; + int weight_num = bias ? 2 : 1; + if (data_type != input->data_type) { + Tensor casted_input = cast(input, data_type, "type cast for IncMHA"); + li = new Layer(this, + OP_SPEC_INC_MULTIHEAD_SELF_ATTENTION, + data_type, + name, + 1 /*inputs*/, + weight_num /*weights*/, + 1 /*outputs*/, + casted_input); + } else { + li = new Layer(this, + OP_SPEC_INC_MULTIHEAD_SELF_ATTENTION, + data_type, + name, + 1 /*inputs*/, + weight_num /*weights*/, + 1 /*outputs*/, + input); + } + { + int numdims = input->num_dims; + int dims[MAX_TENSOR_DIM]; + for (int i = 0; i < numdims; i++) { + dims[i] = input->dims[i]; + } + dims[0] = embed_dim; + li->outputs[0] = create_tensor_legion_ordering( + numdims, dims, data_type, li, 0, true /*create_grad*/); + } + // Compute weight size + int qProjSize = kdim, kProjSize = kdim, vProjSize = kdim, + oProjSize = embed_dim; + int qSize = input->dims[0], kSize = input->dims[0], vSize = input->dims[0]; + int qParas = qProjSize * qSize; + int kParas = kProjSize * kSize; + int vParas = vProjSize * vSize; + int oParas = oProjSize * (vProjSize > 0 ? 
vProjSize : vSize); + int weight_size = qParas * num_q_heads + kParas * num_kv_heads + + vParas * num_kv_heads + oParas * num_q_heads; + { + int dims[1] = {weight_size}; + li->weights[0] = create_weight_legion_ordering(1, + dims, + data_type, + li, + true /*create_grad*/, + kernel_initializer, + CHOSEN_SYNC_TYPE); + } + if (bias) { + // q, k, v, o + int dims[1] = {qProjSize * num_q_heads + + (kProjSize + vProjSize) * num_kv_heads + oProjSize}; + li->weights[1] = create_weight_legion_ordering(1, + dims, + data_type, + li, + true /*create_grad*/, + kernel_initializer, + CHOSEN_SYNC_TYPE); + } + li->data_type = data_type; + li->add_int_property("embed_dim", embed_dim); + li->add_int_property("num_q_heads", num_q_heads); + li->add_int_property("num_kv_heads", num_kv_heads); + li->add_int_property("kdim", kdim); + li->add_int_property("vdim", vdim); + li->add_int_property("bias", bias); + li->add_int_property("add_bias_kv", add_bias_kv); + li->add_int_property("add_zero_attn", add_zero_attn); + li->add_float_property("dropout", dropout); + li->add_int_property("apply_rotary_embedding", apply_rotary_embedding); + li->add_int_property("scaling_query", scaling_query); + li->add_float_property("scaling_factor", scaling_factor); + li->add_int_property("qk_prod_scaling", qk_prod_scaling); + layers.push_back(li); + return li->outputs[0]; +} + +Op *SpecIncMultiHeadSelfAttention::create_operator_from_layer( + FFModel &model, + Layer const *layer, + std::vector const &inputs) { + + std::cout << "spec create operator: " << layer->name << "\n"; + long long value; + layer->get_int_property("embed_dim", value); + int embed_dim = value; + layer->get_int_property("num_q_heads", value); + int num_q_heads = value; + layer->get_int_property("num_kv_heads", value); + int num_kv_heads = value; + layer->get_int_property("kdim", value); + int kdim = value; + layer->get_int_property("vdim", value); + int vdim = value; + float dropout; + layer->get_float_property("dropout", dropout); + layer->get_int_property("bias", value); + bool bias = (bool)value; + layer->get_int_property("add_bias_kv", value); + bool add_bias_kv = (bool)value; + layer->get_int_property("add_zero_attn", value); + bool add_zero_attn = (bool)value; + layer->get_int_property("apply_rotary_embedding", value); + bool apply_rotary_embedding = (bool)value; + layer->get_int_property("scaling_query", value); + bool scaling_query = (bool)value; + float scaling_factor; + layer->get_float_property("scaling_factor", scaling_factor); + layer->get_int_property("qk_prod_scaling", value); + bool qk_prod_scaling = (bool)value; + return new SpecIncMultiHeadSelfAttention(model, + layer->layer_guid, + inputs[0], + embed_dim, + num_q_heads, + num_kv_heads, + kdim, + vdim, + dropout, + bias, + add_bias_kv, + add_zero_attn, + apply_rotary_embedding, + scaling_query, + scaling_factor, + qk_prod_scaling, + false /*allocate_weights*/, + layer->name); +} + +SpecIncMultiHeadSelfAttention::SpecIncMultiHeadSelfAttention( + FFModel &model, + LayerID const &_layer_guid, + const ParallelTensor _input, + int _embed_dim, + int _num_q_heads, + int _num_kv_heads, + int _kdim, + int _vdim, + float _dropout, + bool _bias, + bool _add_bias_kv, + bool _add_zero_attn, + bool _apply_rotary_embedding, + bool _scaling_query, + float _scaling_factor, + bool _qk_prod_scaling, + bool allocate_weights, + char const *name) + // Initializer* _bias_initializer) + : Op(model, + OP_SPEC_INC_MULTIHEAD_SELF_ATTENTION, + _input->data_type, + name, + 1 /*inputs*/, + (_bias ? 
2 : 1) /*weights*/, + 1 /*outputs*/, + _input), + num_q_heads(_num_q_heads), num_kv_heads(_num_kv_heads), dropout(_dropout), + bias(_bias), add_bias_kv(_add_bias_kv), add_zero_attn(_add_zero_attn), + apply_rotary_embedding(_apply_rotary_embedding), + qSize(_input->dims[0].size), kSize(_input->dims[0].size), + vSize(_input->dims[0].size), qProjSize(_kdim), kProjSize(_kdim), + vProjSize(_vdim), oProjSize(_embed_dim), + qoSeqLength(_input->dims[1].size), kvSeqLength(_input->dims[1].size), + scaling_query(_scaling_query), scaling_factor(_scaling_factor), + qk_prod_scaling(_qk_prod_scaling) { + // overwrite layer_guid + layer_guid = _layer_guid; + + numOutputs = 1; + int numdim = _input->num_dims; + ParallelDim dims[MAX_TENSOR_DIM]; + for (int i = 0; i < numdim; i++) { + dims[i] = _input->dims[i]; + } + dims[0].size = _embed_dim; + // Currently require no parallelism along this dim + assert(dims[0].degree == 1); + if (allocate_weights) { + // Create weight tensor + int num_dims = inputs[0]->num_dims; + // Compute weight size + int qParas = this->qProjSize * this->qSize; + int kParas = this->kProjSize * this->kSize; + int vParas = this->vProjSize * this->vSize; + int oParas = + this->oProjSize * (this->vProjSize > 0 ? this->vProjSize : this->vSize); + ParallelDim dims[2]; + dims[0] = inputs[0]->dims[num_dims - 2]; + dims[0].size = dims[0].degree; + dims[1] = inputs[0]->dims[num_dims - 1]; + dims[1].size = this->num_q_heads * (qParas + oParas) + + this->num_kv_heads * (kParas + vParas); + dims[1].is_replica_dim = false; + int seed = std::rand(); + Initializer *initializer = new GlorotUniform(seed); + weights[0] = model.create_parallel_weight<2>(dims, + this->data_type, + NULL /*owner_op*/, + true /*create_grad*/, + initializer, + CHOSEN_SYNC_TYPE); + if (bias) { + ParallelTensorShape bias_shape = _input->get_shape(); + bias_shape.dims[0].size = qProjSize * num_q_heads + + (kProjSize + vProjSize) * num_kv_heads + + oProjSize; + bias_shape.dims[1].size = bias_shape.dims[2].size = 1; + weights[1] = + model.create_parallel_weight_legion_ordering(bias_shape.num_dims, + bias_shape.dims, + this->data_type, + nullptr /*owner_op*/, + true /*create_grad*/, + initializer, + CHOSEN_SYNC_TYPE); + } + } + + outputs[0] = model.create_parallel_tensor_legion_ordering( + _input->num_dims, dims, this->data_type, this); + /* for (int i = 0; i < numdim; i++) { */ + /* register_output_input_parallel_dims(outputs[0], i, inputs[0], i); */ + /* } */ + /* // Check correctness */ + /* assert(check_output_input_weight_parallel_dims()); */ +} + +SpecIncMultiHeadSelfAttention::SpecIncMultiHeadSelfAttention( + FFModel &model, + const ParallelTensor _input, + const ParallelTensor _weight, + int _embed_dim, + int _num_q_heads, + int _num_kv_heads, + int _kdim, + int _vdim, + float _dropout, + bool _bias, + bool _add_bias_kv, + bool _add_zero_attn, + bool _apply_rotary_embedding, + bool _scaling_query, + float _scaling_factor, + bool _qk_prod_scaling, + bool allocate_weights, + char const *name) + // Initializer* _bias_initializer) + : Op(model, + OP_SPEC_INC_MULTIHEAD_SELF_ATTENTION, + _input->data_type, + name, + 1 /*inputs*/, + (_bias ? 
2 : 1) /*weights*/, + 1 /*outputs*/, + _input, + _weight), + num_q_heads(_num_q_heads), num_kv_heads(_num_kv_heads), dropout(_dropout), + bias(_bias), add_bias_kv(_add_bias_kv), add_zero_attn(_add_zero_attn), + apply_rotary_embedding(_apply_rotary_embedding), + qSize(_input->dims[0].size), kSize(_input->dims[0].size), + vSize(_input->dims[0].size), qProjSize(_kdim), kProjSize(_kdim), + vProjSize(_vdim), oProjSize(_embed_dim), + qoSeqLength(_input->dims[1].size), kvSeqLength(_input->dims[1].size), + scaling_query(_scaling_query), scaling_factor(_scaling_factor), + qk_prod_scaling(_qk_prod_scaling) +// bias_initializer(_bias_initializer) +{ + numOutputs = 1; + int numdim = _input->num_dims; + ParallelDim dims[MAX_TENSOR_DIM]; + for (int i = 0; i < numdim; i++) { + dims[i] = _input->dims[i]; + } + dims[0].size = _embed_dim; + // Currently require no parallelism along this dim + assert(dims[0].degree == 1); + if (allocate_weights) { + // Create weight tensor + int num_dims = inputs[0]->num_dims; + // Compute weight size + int qParas = this->qProjSize * this->qSize; + int kParas = this->kProjSize * this->kSize; + int vParas = this->vProjSize * this->vSize; + int oParas = + this->oProjSize * (this->vProjSize > 0 ? this->vProjSize : this->vSize); + ParallelDim dims[2]; + dims[0] = inputs[0]->dims[num_dims - 2]; + dims[0].size = dims[0].degree; + dims[1] = inputs[0]->dims[num_dims - 1]; + dims[1].size = this->num_q_heads * (qParas + oParas) + + this->num_kv_heads * (kParas + vParas); + dims[1].is_replica_dim = false; + // dims[2].size = qParas + kParas + vParas + oParas; + int seed = std::rand(); + Initializer *initializer = new GlorotUniform(seed); + weights[0] = model.create_parallel_weight<2>(dims, + this->data_type, + NULL /*owner_op*/, + true /*create_grad*/, + initializer, + CHOSEN_SYNC_TYPE); + if (bias) { + ParallelTensorShape bias_shape = _input->get_shape(); + bias_shape.dims[0].size = qProjSize * num_q_heads + + (kProjSize + vProjSize) * num_kv_heads + + oProjSize; + bias_shape.dims[1].size = bias_shape.dims[2].size = 1; + weights[1] = + model.create_parallel_weight_legion_ordering(bias_shape.num_dims, + bias_shape.dims, + this->data_type, + nullptr /*owner_op*/, + true /*create_grad*/, + initializer, + CHOSEN_SYNC_TYPE); + } + } + + outputs[0] = model.create_parallel_tensor_legion_ordering( + _input->num_dims, dims, this->data_type, this); + + /* for (int i = 0; i < numdim; i++) { */ + /* register_output_input_parallel_dims(outputs[0], i, inputs[0], i); */ + /* } */ + /* register_output_weight_parallel_dims(outputs[0], numdim-1, _weight, 1); */ + /* register_output_weight_parallel_dims(outputs[0], numdim-2, _weight, 2); */ + // Check correctness + /* assert(check_output_input_weight_parallel_dims()); */ +} + +SpecIncMultiHeadSelfAttention::SpecIncMultiHeadSelfAttention( + FFModel &model, + SpecIncMultiHeadSelfAttention const &other, + const ParallelTensor input, + bool allocate_weights) + : SpecIncMultiHeadSelfAttention(model, + other.layer_guid, + input, + other.oProjSize, + other.num_q_heads, + other.num_kv_heads, + other.qProjSize, + other.vProjSize, + other.dropout, + other.bias, + other.add_bias_kv, + other.add_zero_attn, + other.apply_rotary_embedding, + other.scaling_query, + other.scaling_factor, + other.qk_prod_scaling, + allocate_weights, + other.name) {} + +SpecIncMultiHeadSelfAttention::SpecIncMultiHeadSelfAttention( + FFModel &model, + SpecIncMultiHeadSelfAttentionParams const ¶ms, + ParallelTensor const &input, + bool allocate_weights, + char const *name) + : 
SpecIncMultiHeadSelfAttention(model, + params.layer_guid, + input, + params.embed_dim, + params.num_q_heads, + params.num_kv_heads, + params.kdim, + params.vdim, + params.dropout, + params.bias, + params.add_bias_kv, + params.add_zero_attn, + params.apply_rotary_embedding, + params.scaling_query, + params.scaling_factor, + params.qk_prod_scaling, + allocate_weights, + name) {} + +void SpecIncMultiHeadSelfAttention::init_inference( + FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher( + SPEC_INC_MULTIHEAD_SELF_ATTENTION_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(SpecIncMultiHeadSelfAttention)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(2, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + +void SpecIncMultiHeadSelfAttention::init(FFModel const &ff) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + set_argumentmap_for_init(ff, argmap); + IndexLauncher launcher( + SPEC_INC_MULTIHEAD_SELF_ATTENTION_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(SpecIncMultiHeadSelfAttention)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + outputs[0]->machine_view.hash()); + launcher.add_region_requirement(RegionRequirement(inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement(RegionRequirement(outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + outputs[0]->region)); + launcher.add_field(2, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap(ff, fm); +} + +/* + regions[0](I): input + regions[1](I): weight + regions[2](O): output +*/ +OpMeta *SpecIncMultiHeadSelfAttention::init_task( + Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + SpecIncMultiHeadSelfAttention const *attn = + (SpecIncMultiHeadSelfAttention *)task->args; + FFHandler handle = *((FFHandler const *)task->local_args); + + GenericTensorAccessorR input = + 
helperGetGenericTensorAccessorRO(attn->inputs[0]->data_type, + regions[0], + task->regions[0], + FID_DATA, + ctx, + runtime); + GenericTensorAccessorR weight = + helperGetGenericTensorAccessorRO(attn->weights[0]->data_type, + regions[1], + task->regions[1], + FID_DATA, + ctx, + runtime); + GenericTensorAccessorW output = + helperGetGenericTensorAccessorWO(attn->outputs[0]->data_type, + regions[2], + task->regions[2], + FID_DATA, + ctx, + runtime); + + int num_samples = input.domain.hi()[2] - input.domain.lo()[2] + 1; + assert(attn->qoSeqLength == input.domain.hi()[1] - input.domain.lo()[1] + 1); + assert(attn->kvSeqLength == input.domain.hi()[1] - input.domain.lo()[1] + 1); + int num_q_heads = attn->num_q_heads; + int num_kv_heads = attn->num_kv_heads; + assert(attn->oProjSize == output.domain.hi()[0] - output.domain.lo()[0] + 1); + + Memory gpu_mem = Machine::MemoryQuery(Machine::get_machine()) + .only_kind(Memory::GPU_FB_MEM) + .best_affinity_to(task->target_proc) + .first(); + MemoryAllocator gpu_mem_allocator(gpu_mem); + // We don't do offloading for SSMs (small speculative models) + SpecIncMultiHeadSelfAttentionMeta *m = + new SpecIncMultiHeadSelfAttentionMeta(handle, + attn, + weight, + gpu_mem_allocator, + num_samples, + num_q_heads, + num_kv_heads); + // assert that we didn't over allocate memory + assert(gpu_mem_allocator.instance_allocated_size == + gpu_mem_allocator.instance_total_size); + m->profiling = attn->profiling; + assert(weight.domain.get_volume() * data_type_size(weight.data_type) == + m->weightSize); + return m; +} + +void SpecIncMultiHeadSelfAttention::forward(FFModel const &ff) { + // SpecIncMultiHeadSelfAttention doesn't support forward + assert(false); +} + +FutureMap SpecIncMultiHeadSelfAttention::inference( + FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + int idx = 0; + IndexLauncher launcher(SPEC_INC_MULTIHEAD_SELF_ATTENTION_INF_TASK_ID, + parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_future(bc); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(idx++, FID_DATA); + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(idx++, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(idx++, FID_DATA); + + if (bias) { + launcher.add_region_requirement(RegionRequirement(weights[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[1]->region)); + launcher.add_field(idx++, FID_DATA); + } + return runtime->execute_index_space(ctx, launcher); +} + +/* + regions[0](I): input + regions[3](I): weight + regions[4](O): output +*/ +void SpecIncMultiHeadSelfAttention::inference_task( + Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(task->regions.size() == regions.size()); + + // BeamSearchBatchConfig const *bc = (BeamSearchBatchConfig *)task->args; + BeamSearchBatchConfig const &bc = + Future(task->futures[0]).get_result(); + if (bc.num_tokens == 0) { + return; + } + + SpecIncMultiHeadSelfAttentionMeta const *m = + *((SpecIncMultiHeadSelfAttentionMeta **)task->local_args); + assert((*m->bias ? 
regions.size() == 4 : regions.size() == 3)); + + GenericTensorAccessorR input = helperGetGenericTensorAccessorRO( + m->input_type[0], regions[0], task->regions[0], FID_DATA, ctx, runtime); + GenericTensorAccessorR weight = helperGetGenericTensorAccessorRO( + m->weight_type[0], regions[1], task->regions[1], FID_DATA, ctx, runtime); + GenericTensorAccessorW output = helperGetGenericTensorAccessorWO( + m->output_type[0], regions[2], task->regions[2], FID_DATA, ctx, runtime); + GenericTensorAccessorR biases; + if (*m->bias) { + biases = helperGetGenericTensorAccessorRO(m->weight_type[1], + regions[3], + task->regions[3], + FID_DATA, + ctx, + runtime); + Domain bias_domain = runtime->get_index_space_domain( + ctx, task->regions[3].region.get_index_space()); + assert(bias_domain.get_dim() == 4); + } + Domain input_domain = runtime->get_index_space_domain( + ctx, task->regions[0].region.get_index_space()); + Domain weight_domain = runtime->get_index_space_domain( + ctx, task->regions[1].region.get_index_space()); + Domain output_domain = runtime->get_index_space_domain( + ctx, task->regions[2].region.get_index_space()); + + assert(input_domain.get_dim() == 4); + assert(weight_domain.get_dim() == 2); + assert(output_domain.get_dim() == 4); + + assert(task->index_point.get_dim() == 1); + SpecIncMultiHeadSelfAttention::inference_kernel_wrapper( + m, &bc, task->index_point.point_data[0], input, weight, output, biases); + + // print_tensor(input.get_float_ptr(), 20, "attention input"); + // print_tensor(output.get_float_ptr(), 20, "attention output"); + // if(bc.beam_slots.at(0).current_depth == 1){ + // print_beam_tensor(input.get_float_ptr(), 50, 4096, 40, "mha topk + // input"); print_beam_tensor(output.get_float_ptr(), 50, 4096, 40, + // "mha topk output"); + // } +} + +void SpecIncMultiHeadSelfAttention::backward(FFModel const &ff) { + // SpecIncMultiHeadSelfAttention does not support backward + assert(false); +} + +bool SpecIncMultiHeadSelfAttention::get_int_parameter(PMParameter para, + int *value) const { + switch (para) { + case PM_NUM_HEADS: + *value = num_q_heads; + return true; + default: + return Op::get_int_parameter(para, value); + } +} + +Op *SpecIncMultiHeadSelfAttention::materialize(FFModel &ff, + ParallelTensor inputs[], + int num_inputs) const { + SpecIncMultiHeadSelfAttentionParams params = get_params(); + return new SpecIncMultiHeadSelfAttention( + ff, params, inputs[0], true, this->name); +} + +bool SpecIncMultiHeadSelfAttention::measure_operator_cost( + Simulator *sim, MachineView const &mv, CostMetrics &cost_metrics) const { + return false; +} + +bool operator==(SpecIncMultiHeadSelfAttentionParams const &lhs, + SpecIncMultiHeadSelfAttentionParams const &rhs) { + return lhs.layer_guid == rhs.layer_guid && lhs.embed_dim == rhs.embed_dim && + lhs.num_q_heads == rhs.num_q_heads && lhs.kdim == rhs.kdim && + lhs.vdim == rhs.vdim && lhs.dropout == rhs.dropout && + lhs.bias == rhs.bias && lhs.add_bias_kv == rhs.add_bias_kv && + lhs.add_zero_attn == rhs.add_zero_attn && + lhs.apply_rotary_embedding == rhs.apply_rotary_embedding && + lhs.scaling_query == rhs.scaling_query && + lhs.scaling_factor == rhs.scaling_factor && + lhs.qk_prod_scaling == rhs.qk_prod_scaling; +} + +SpecIncMultiHeadSelfAttentionParams + SpecIncMultiHeadSelfAttention::get_params() const { + SpecIncMultiHeadSelfAttentionParams params; + params.layer_guid = this->layer_guid; + params.embed_dim = this->oProjSize; + params.num_q_heads = this->num_q_heads; + params.num_kv_heads = this->num_kv_heads; + params.kdim = 
this->kProjSize; + params.vdim = this->vProjSize; + params.dropout = this->dropout; + params.bias = this->bias; + params.add_bias_kv = this->add_bias_kv; + params.add_zero_attn = this->add_zero_attn; + params.apply_rotary_embedding = this->apply_rotary_embedding; + params.scaling_query = this->scaling_query; + params.scaling_factor = this->scaling_factor; + params.qk_prod_scaling = this->qk_prod_scaling; + + return params; +} + +}; // namespace FlexFlow + +namespace std { +size_t hash::operator()( + FlexFlow::SpecIncMultiHeadSelfAttentionParams const ¶ms) const { + size_t key = 0; + hash_combine(key, params.layer_guid.id); + hash_combine(key, params.embed_dim); + hash_combine(key, params.num_q_heads); + hash_combine(key, params.num_kv_heads); + hash_combine(key, params.kdim); + hash_combine(key, params.vdim); + hash_combine(key, params.dropout); + hash_combine(key, params.bias); + hash_combine(key, params.add_bias_kv); + hash_combine(key, params.add_zero_attn); + hash_combine(key, params.apply_rotary_embedding); + hash_combine(key, params.scaling_query); + hash_combine(key, params.scaling_factor); + hash_combine(key, params.qk_prod_scaling); + return key; +} +}; // namespace std diff --git a/src/ops/spec_inc_multihead_self_attention.cpp b/src/ops/spec_inc_multihead_self_attention.cpp new file mode 100644 index 0000000000..09198c5751 --- /dev/null +++ b/src/ops/spec_inc_multihead_self_attention.cpp @@ -0,0 +1,101 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/ops/spec_inc_multihead_self_attention.h" +#include "flexflow/utils/hip_helper.h" +#include + +namespace FlexFlow { + +// declare Legion names +using Legion::coord_t; +using Legion::Memory; + +/*static*/ +void SpecIncMultiHeadSelfAttention::inference_kernel_wrapper( + SpecIncMultiHeadSelfAttentionMeta const *m, + BeamSearchBatchConfig const *bc, + int shard_id, + GenericTensorAccessorR const &input, + GenericTensorAccessorR const &weight, + GenericTensorAccessorW const &output, + GenericTensorAccessorR const &bias) { + hipStream_t stream; + checkCUDA(get_legion_stream(&stream)); + + hipEvent_t t_start, t_end; + if (m->profiling) { + hipEventCreate(&t_start); + hipEventCreate(&t_end); + hipEventRecord(t_start, stream); + } + + handle_unimplemented_hip_kernel(OP_SPEC_INC_MULTIHEAD_SELF_ATTENTION); + + if (m->profiling) { + hipEventRecord(t_end, stream); + checkCUDA(hipEventSynchronize(t_end)); + float elapsed = 0; + checkCUDA(hipEventElapsedTime(&elapsed, t_start, t_end)); + hipEventDestroy(t_start); + hipEventDestroy(t_end); + printf("SpecIncMultiHeadSelfAttention forward time = %.2fms\n", elapsed); + // print_tensor<3, float>(acc_query.ptr, acc_query.rect, + // "[Attention:forward:query]"); print_tensor<3, float>(acc_output.ptr, + // acc_output.rect, "[Attention:forward:output]"); + } +} + +SpecIncMultiHeadSelfAttentionMeta::SpecIncMultiHeadSelfAttentionMeta( + FFHandler handler, + SpecIncMultiHeadSelfAttention const *attn, + GenericTensorAccessorR const &weight, + MemoryAllocator &gpu_mem_allocator, + int num_samples, + int _num_q_heads, + int _num_kv_heads) + : IncMultiHeadSelfAttentionMeta(handler, + BEAM_SEARCH_MODE, + attn, + attn->qSize, + attn->kSize, + attn->vSize, + attn->qProjSize, + attn->kProjSize, + attn->vProjSize, + attn->oProjSize, + attn->apply_rotary_embedding, + attn->bias, + attn->scaling_query, + attn->qk_prod_scaling, + attn->add_bias_kv, + attn->scaling_factor, + weight, + gpu_mem_allocator, + num_samples, + attn->num_q_heads, + attn->num_kv_heads, + _num_q_heads, + _num_kv_heads, + DT_NONE, + false) { + hipStream_t stream; + checkCUDA(get_legion_stream(&stream)); + checkCUDNN(miopenSetStream(handler.dnn, stream)); +} + +SpecIncMultiHeadSelfAttentionMeta::~SpecIncMultiHeadSelfAttentionMeta(void) {} + +}; // namespace FlexFlow diff --git a/src/ops/spec_inc_multihead_self_attention.cu b/src/ops/spec_inc_multihead_self_attention.cu new file mode 100644 index 0000000000..d1faba9c68 --- /dev/null +++ b/src/ops/spec_inc_multihead_self_attention.cu @@ -0,0 +1,740 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) +#include "cuComplex.h" +#endif +#include "flexflow/ffconst_utils.h" +#include "flexflow/ops/kernels/inc_multihead_self_attention_kernels.h" +#include "flexflow/ops/spec_inc_multihead_self_attention.h" +#include "flexflow/utils/cuda_helper.h" + +namespace FlexFlow { + +// declare Legion names +using Legion::coord_t; +using Legion::Memory; +using namespace Kernels::IncMultiHeadAttention; + +namespace Kernels { +namespace SpecIncMultiHeadAttention { + +template +__global__ void spec_store_kv_cache( + DT const *devQKVProjArray, + DT *kCache_ptr, + DT *vCache_ptr, + BatchConfig::PerTokenInfo *tokenInfos, + BatchConfig::PerRequestInfo *requestInfo, + BeamSearchBatchConfig::BeamSearchPerTokenInfo *beamTokenInfos, + BeamSearchBatchConfig::BeamSearchPerRequestInfo *beamRequestInfos, + int qProjSize, + int kProjSize, + int vProjSize, + int num_tokens, + int num_q_heads, + int num_kv_heads, + int max_seq_len, + int max_beam_width, + bool is_root) { + CUDA_KERNEL_LOOP(i, num_tokens * (kProjSize + vProjSize) * num_kv_heads) { + int q_array_size = qProjSize * num_tokens * num_q_heads; + int k_array_size = kProjSize * num_tokens * num_kv_heads; + + bool k_cache = i < k_array_size; + int real_i = k_cache ? i : i - k_array_size; + + int proj_size = k_cache ? kProjSize : vProjSize; + int head_idx = real_i / (num_tokens * proj_size); + int token_idx = (real_i - head_idx * (num_tokens * proj_size)) / proj_size; + int data_idx = real_i % proj_size; + + // above no need to be changed + // int const req_id = id_map[token_idx].request_index; + // int const tok_id = id_map[token_idx].token_position; + // int const sub_req_id = id_map[token_idx].sub_request_index; + // int const parent_id = id_map[token_idx].parent_id; + // int const beam_depth = id_map[token_idx].beam_depth; + // int const beam_width = id_map[token_idx].beam_width; + + DT val = devQKVProjArray[q_array_size + (k_cache ? 0 : k_array_size) + + head_idx * proj_size * num_tokens + + token_idx * proj_size + data_idx]; + + int const req_id = tokenInfos[token_idx].request_index; + int const tok_id = tokenInfos[token_idx].abs_depth_in_request; + int const sub_req_id = beamTokenInfos[token_idx].sub_request_index; + int const parent_id = beamRequestInfos[req_id].parent_id[sub_req_id]; + int const beam_depth = beamRequestInfos[req_id].current_depth; + int const beam_width = beamRequestInfos[req_id].beam_size; + + // new token + int new_token_cache_idx = (req_id * max_beam_width + sub_req_id) * + (num_kv_heads * max_seq_len * proj_size) + + head_idx * (max_seq_len * proj_size) + + tok_id * proj_size + data_idx; + + DT *cache_ptr = k_cache ? 
kCache_ptr : vCache_ptr; + cache_ptr[new_token_cache_idx] = val; + + // replica in the root iteration + if (beam_depth == 1) { + for (int i = 1; i < beam_width; i++) { + cache_ptr[(req_id * max_beam_width + i) * + (num_kv_heads * max_seq_len * proj_size) + + head_idx * (max_seq_len * proj_size) + tok_id * proj_size + + data_idx] = val; + } + } + + // if (head_idx == 0 && beam_depth == 0 && token_idx == 8 && k_cache) { + // // printf("token idx %d\n", token_idx); + // printf("data idx: %d, tok_id %d, new_token_cache_idx %d, parent_id %d, + // " + // "sub_req_id %d, num_tokens %d, kProjSize %d, num_kv_heads %d, + // val " + // "%f, beam_width %d\n", + // data_idx, + // tok_id, + // new_token_cache_idx, + // parent_id, + // sub_req_id, + // num_tokens, + // kProjSize, + // num_kv_heads, + // val, + // beam_width); + // } + + // naive cache stealing + if (sub_req_id != parent_id) { + if (data_idx == 0 && head_idx == 0 && k_cache) { + printf("cache stealing!, depth %d req_id %d sub_req_id %d, parentid " + "%d, tok_id %d\n", + beam_depth, + req_id, + sub_req_id, + parent_id, + tok_id); + } + + for (int depth = 0; depth < beam_depth; depth++) { + int steal_token_idx = tok_id - beam_depth + depth; + int steal_from_idx = (req_id * max_beam_width + parent_id) * + (num_kv_heads * max_seq_len * proj_size) + + head_idx * (max_seq_len * proj_size) + + steal_token_idx * proj_size + data_idx; + int steal_to_idx = (req_id * max_beam_width + sub_req_id) * + (num_kv_heads * max_seq_len * proj_size) + + head_idx * (max_seq_len * proj_size) + + steal_token_idx * proj_size + data_idx; + cache_ptr[steal_to_idx] = cache_ptr[steal_from_idx]; + + // if(data_idx == 0 && head_idx == 0 && k_cache && req_id == 1){ + // printf("cache stealing kernel!, steal_token_idx %d\n", + // steal_token_idx); + // } + } + } + + // parallel cache stealing not yet implemented + // logic shld be + // launch spec_store_kv_cache with parallelism * current depth + // from the i here, get depth index + // if depth index not the current one, check if we need to steal + // steal if needed + + // cache stealing theory + // identify which sub request does this token come from + // for initial token, 0 + // for other, may 0,0,1/ 0,1,2/ 1,1,1 to get which cache to be reuse and + // which to be delete copy beam_size bunch of blocks when sub_req_id == + // parent_id : like 0 -> 0, 1->1, 2->2, do nothing, just append the new k/v + } +} + +template +void update_kv_cache_kernel(SpecIncMultiHeadSelfAttentionMeta const *m, + BeamSearchBatchConfig const *bc, + cudaStream_t stream) { + int num_tokens = bc->num_active_tokens(); + int curr_depth = bc->beamRequestsInfo[0].current_depth; + // printf("curr depth: %d\n", curr_depth); + // assert(curr_depth < 3); + if (num_tokens > 0) { + int parallelism = + (m->kProjSize + m->vProjSize) * num_tokens * m->num_kv_heads; + spec_store_kv_cache<<>>(static_cast
<DT *>(m->devQKVProjArray),
+                                      static_cast
<DT *>(m->keyCache),
+                                      static_cast<DT *>
(m->valueCache), + m->token_infos, + m->request_infos, + m->beam_token_infos, + m->beam_request_infos, + m->qProjSize, + m->kProjSize, + m->vProjSize, + num_tokens, + m->num_q_heads, + m->num_kv_heads, + BatchConfig::MAX_SEQ_LENGTH, + BeamSearchBatchConfig::MAX_BEAM_WIDTH, + /*root*/ curr_depth == 0); + } +} + +template +__global__ void spec_fill_entries_above_diagonal(DT *matrix, + size_t new_tokens, + size_t total_tokens_in_request, + size_t num_q_heads, + DT value) { + CUDA_KERNEL_LOOP(i, new_tokens * total_tokens_in_request * num_q_heads) { + // size_t head_idx = i / (new_tokens * total_tokens_in_request); + size_t src_idx = (i / new_tokens) % total_tokens_in_request; + size_t dst_idx = i % new_tokens + total_tokens_in_request - new_tokens; + // Casual Mask + if (src_idx > dst_idx) { + matrix[i] = value; + } + } +} + +template +void compute_attention_kernel(SpecIncMultiHeadSelfAttentionMeta const *m, + BeamSearchBatchConfig const *bc, + int shard_id, + DT *output_ptr, + DT const *bias_ptr, + DT const *weight_ptr, + cudaStream_t stream) { + checkCUDA(cublasSetStream(m->handle.blas, stream)); + checkCUDNN(cudnnSetStream(m->handle.dnn, stream)); + cudaDataType_t cublas_data_type = ff_to_cuda_datatype(m->output_type[0]); + cudnnDataType_t cudnn_data_type = ff_to_cudnn_datatype(m->output_type[0]); + assert(data_type_size(m->output_type[0]) == sizeof(DT)); +#if CUDA_VERSION >= 11000 + // TODO: currently set the default to CUBLAS_COMPUTE_16F for best performance + cublasComputeType_t compute_type = CUBLAS_COMPUTE_16F; +#else + cudaDataType_t compute_type = cublas_data_type; +#endif + // int num_requests = bc->num_active_requests(); + int num_tokens = bc->num_active_tokens(); + int tokens_previous_requests = 0; + int tokens_prev_requests_squares = 0; + // int qkv_block_size = + // (m->qProjSize + m->kProjSize + m->vProjSize) * num_tokens; + int q_block_size = m->qProjSize * num_tokens; + + int kt_block_size = m->kProjSize * BatchConfig::MAX_SEQ_LENGTH; + int kt_req_block_size = kt_block_size * m->num_kv_heads; + int vt_block_size = m->vProjSize * BatchConfig::MAX_SEQ_LENGTH; + int vt_req_block_size = vt_block_size * m->num_kv_heads; + assert(m->qProjSize == m->kProjSize); + + for (int i = 0; i < bc->MAX_NUM_REQUESTS; i++) { + if (bc->request_completed[i]) { + continue; + } + for (int sub_req_id = 0; sub_req_id < bc->sub_requests[i]; sub_req_id++) { + + // int num_new_tokens = bc->num_processing_tokens[i]; + // int total_tokens = bc->token_last_available_idx[i] + 1; + + int num_new_tokens = bc->requestsInfo[i].num_tokens_in_batch; + int total_tokens = bc->requestsInfo[i].token_start_offset + + bc->requestsInfo[i].num_tokens_in_batch; + // Compute (QK^T/sqrt(d_k)) + int m_ = num_new_tokens; + int n = total_tokens; + int k = m->qProjSize; + int lda = k, ldb = k, ldc = m_; + int strideA = q_block_size; + int strideB = kt_block_size; + int strideC = num_new_tokens * total_tokens; + + // a flag of using this scaling alpha + DT alpha = 1.0f, beta = 0.0f; + if (*m->qk_prod_scaling) { + alpha = static_cast
<DT>(1.0f / sqrt(m->kProjSize));
+      }
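+      // The three pointers set up below feed a strided batched GEMM that
+      // computes QK^T for this (request, sub-request) pair: A walks the query
+      // projections written by compute_qkv_kernel, B walks this beam slot's
+      // key cache, and C receives one num_new_tokens x total_tokens score
+      // matrix per query head.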
+      // To get A, skip over Q entries from previous requests (same head)
+      DT const *A = static_cast<DT *>(m->devQKVProjArray) +
+                     tokens_previous_requests * m->qProjSize;
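+      // Each (request, beam sub-request) pair owns a contiguous block of
+      // kt_req_block_size = kProjSize * MAX_SEQ_LENGTH * num_kv_heads values
+      // in the key cache, so B is offset by (i * MAX_BEAM_WIDTH + sub_req_id)
+      // such blocks.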
+      // To get B, skip over K entries from previous requests (all heads +
+      // padding)
+      DT const *B = static_cast<DT *>(m->keyCache) +
+                     (i * bc->MAX_BEAM_WIDTH + sub_req_id) * kt_req_block_size;
+
+      // if (i == 0 && sub_req_id == 0 &&
+      //     bc->beam_slots.at(0).current_depth == 1) {
+      //   int offset = (float *)B - m->keyCache;
+      //   printf("key cache offset %d\n", kt_req_block_size);
+      // }
+      // To get C, skip over QK^T products from previous requests
+      DT *C = static_cast<DT *>
(m->qk_prods) + + m->num_q_heads * tokens_prev_requests_squares; + + if (m->num_q_heads == m->num_kv_heads) { + checkCUDA(cublasGemmStridedBatchedEx(m->handle.blas, + CUBLAS_OP_T, + CUBLAS_OP_N, + m_, + n, + k, + &alpha, + A, + cublas_data_type, + lda, + strideA, + B, + cublas_data_type, + ldb, + strideB, + &beta, + C, + cublas_data_type, + ldc, + strideC, + m->num_q_heads, + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + } else { + strideB = 0; + int one_step_heads = m->num_q_heads / m->num_kv_heads; + m_ = num_new_tokens; + n = total_tokens; + k = m->qProjSize; + lda = k, ldb = k, ldc = m_; + for (int step = 0; step < m->num_kv_heads; step++) { + checkCUDA( + cublasGemmStridedBatchedEx(m->handle.blas, + CUBLAS_OP_T, + CUBLAS_OP_N, + m_, + n, + k, + &alpha, + A + step * strideA * one_step_heads, + cublas_data_type, + lda, + strideA, + B + step * kt_block_size, + cublas_data_type, + ldb, + strideB, + &beta, + C + step * strideC * one_step_heads, + cublas_data_type, + ldc, + strideC, + one_step_heads, + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + } + } + + // Fill all elements above diagonal in qk prods with -inf to force + // causal attention. + assert(num_new_tokens <= total_tokens); + if (num_new_tokens > 1) { + size_t parallelism = m->num_q_heads * num_new_tokens * total_tokens; + spec_fill_entries_above_diagonal<<>>( + C, + num_new_tokens, + total_tokens, + m->num_q_heads, + static_cast
<DT>(-INFINITY));
+      }
+      // Compute Softmax(QK^T/sqrt(d_k))
+      // Before modifying the parameters below, make sure to read the following
+      // description of the CUDNN_TENSOR_NCHW tensor layout, from
+      // https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnTensorFormat_t:
+      // This tensor format specifies that the data is laid out in the following
+      // order: batch size, feature maps, rows, columns. The strides are
+      // implicitly defined in such a way that the data are contiguous in memory
+      // with no padding between images, feature maps, rows, and columns; the
+      // columns are the inner dimension and the images are the outermost
+      // dimension.
+      int n_param = m->num_q_heads;
+      int c_param = total_tokens;
+      int h_param = 1;
+      int w_param = num_new_tokens;
+      checkCUDNN(cudnnSetTensor4dDescriptor(m->qk_tensor,
+                                            CUDNN_TENSOR_NCHW,
+                                            cudnn_data_type,
+                                            n_param,
+                                            c_param,
+                                            h_param,
+                                            w_param));
+      float softmax_alpha = 1.0f, softmax_beta = 0.0f;
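+      // With the NCHW descriptor above (N = num_q_heads, C = total_tokens,
+      // H = 1, W = num_new_tokens), CUDNN_SOFTMAX_MODE_CHANNEL normalizes
+      // each query position's scores over the total_tokens dimension.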
+      DT *C_softmax = static_cast<DT *>(m->qk_prods_softmax) +
+                        m->num_q_heads * tokens_prev_requests_squares;
+      // The softmax operation below is executed according to the
+      // CUDNN_SOFTMAX_MODE_CHANNEL, which is also described in the docs: The
+      // softmax operation is computed per spatial location (H,W) per image (N)
+      // across dimension C.
+      checkCUDNN(cudnnSoftmaxForward(m->handle.dnn,
+                                     CUDNN_SOFTMAX_ACCURATE,
+                                     CUDNN_SOFTMAX_MODE_CHANNEL,
+                                     &softmax_alpha,
+                                     m->qk_tensor,
+                                     C,
+                                     &softmax_beta,
+                                     m->qk_tensor,
+                                     C_softmax));
+      // Matmul softmax(QK^T/sqrt(d_k)) by V
+      alpha = 1.0f, beta = 0.0f;
+      m_ = num_new_tokens;
+      n = m->vProjSize;
+      k = total_tokens;
+      lda = m_, ldb = n, ldc = m_;
+      strideA = num_new_tokens * total_tokens;
+      strideB = vt_block_size;
+      strideC = num_new_tokens * m->vProjSize;
+      // To get A, skip over softmax(QK^T/sqrt(d_k)) entries from previous
+      // requests (all heads)
+      A = C_softmax;
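+      // The second batched GEMM multiplies the softmax-ed scores by this beam
+      // slot's value cache and writes one num_new_tokens x vProjSize block per
+      // head into m->attn_heads, which the output projection below maps back
+      // to oProjSize.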
+      // To get B, skip over V^T entries from previous requests (all heads +
+      // padding)
+      B = static_cast<DT *>(m->valueCache) +
+          (i * bc->MAX_BEAM_WIDTH + sub_req_id) * vt_req_block_size;
+      // To get C, skip over softmax(QK^T/sqrt(d_k))V products from previous
+      // requests
+      C = static_cast<DT *>
(m->attn_heads) + + tokens_previous_requests * m->num_q_heads * m->vProjSize; + + if (m->num_q_heads == m->num_kv_heads) { + checkCUDA(cublasGemmStridedBatchedEx(m->handle.blas, + CUBLAS_OP_N, + CUBLAS_OP_T, + m_, + n, + k, + &alpha, + A, + cublas_data_type, + lda, + strideA, + B, + cublas_data_type, + ldb, + strideB, + &beta, + C, + cublas_data_type, + ldc, + strideC, + m->num_q_heads, + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + } else { + int one_step_heads = m->num_q_heads / m->num_kv_heads; + n = m->vProjSize; + lda = m_, ldb = n, ldc = m_; + strideA = num_new_tokens * total_tokens; + strideB = 0; + strideC = num_new_tokens * m->vProjSize; + for (int step = 0; step < m->num_kv_heads; step++) { + checkCUDA( + cublasGemmStridedBatchedEx(m->handle.blas, + CUBLAS_OP_N, + CUBLAS_OP_T, + m_, + n, + k, + &alpha, + A + step * one_step_heads * strideA, + cublas_data_type, + lda, + strideA, + B + step * vt_block_size, + cublas_data_type, + ldb, + strideB, + &beta, + C + step * one_step_heads, + cublas_data_type, + ldc, + strideC, + one_step_heads, + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + } + } + + // Project to output, save result directly on output tensor + alpha = 1.0f, beta = 0.0f; + m_ = m->oProjSize; + k = m->vProjSize * m->num_q_heads; + n = num_new_tokens; + lda = k, ldb = n, ldc = m_; + A = weight_ptr + m->qSize * (m->qProjSize * m->num_q_heads + + m->kProjSize * m->num_kv_heads + + m->vProjSize * m->num_kv_heads); + B = C; + C = static_cast
(output_ptr) + + tokens_previous_requests * m->oProjSize; + + checkCUDA(cublasGemmEx(m->handle.blas, + CUBLAS_OP_T, + CUBLAS_OP_T, + m_, + n, + k, + &alpha, + A, + cublas_data_type, + lda, + B, + cublas_data_type, + ldb, + &beta, + C, + cublas_data_type, + ldc, + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + tokens_previous_requests += num_new_tokens; + tokens_prev_requests_squares += num_new_tokens * total_tokens; + } + } + if (*m->bias && shard_id == 0) { + int parallelism = m->oProjSize * num_tokens; + int qkv_weight_size = m->qProjSize * m->global_num_q_heads + + m->kProjSize * m->global_num_kv_heads + + m->vProjSize * m->global_num_kv_heads; + apply_proj_bias_w<<>>( + output_ptr, bias_ptr, num_tokens, qkv_weight_size, m->oProjSize); + } + + assert(tokens_previous_requests == num_tokens); +} + +template +void inference_kernel(SpecIncMultiHeadSelfAttentionMeta const *m, + BeamSearchBatchConfig const *bc, + int shard_id, + DT const *input_ptr, + DT const *weight_ptr, + DT *output_ptr, + DT const *bias_ptr, + cudaStream_t stream) { + // here because we need postion info in infernece 1 + cudaMemcpyAsync(m->token_infos, + &(bc->tokensInfo), + bc->MAX_NUM_TOKENS * sizeof(BatchConfig::PerTokenInfo), + cudaMemcpyHostToDevice, + stream); + cudaMemcpyAsync(m->request_infos, + &(bc->requestsInfo), + bc->MAX_NUM_REQUESTS * sizeof(BatchConfig::PerRequestInfo), + cudaMemcpyHostToDevice, + stream); + cudaMemcpyAsync(m->beam_token_infos, + &(bc->beamTokenInfo), + bc->MAX_NUM_TOKENS * bc->MAX_BEAM_WIDTH * + sizeof(BeamSearchBatchConfig::BeamSearchPerTokenInfo), + cudaMemcpyHostToDevice, + stream); + cudaMemcpyAsync(m->beam_request_infos, + &(bc->beamRequestsInfo), + bc->MAX_NUM_REQUESTS * + sizeof(BeamSearchBatchConfig::BeamSearchPerRequestInfo), + cudaMemcpyHostToDevice, + stream); + // phase 1: Implement kernel to compute KQV for input tokens + compute_qkv_kernel(m, + bc, + shard_id, + input_ptr, + weight_ptr, + static_cast
<DT *>(m->devQKVProjArray),
+                     bias_ptr,
+                     stream);
+  // phase 2: Update key/val cache
+  update_kv_cache_kernel<DT>
(m, bc, stream); + + // phase 3: Compute attention score + // 3 kernels for pahse 3: matmul1 - softmax - matmal2 + compute_attention_kernel( + m, bc, shard_id, output_ptr, bias_ptr, weight_ptr, stream); +} + +} // namespace SpecIncMultiHeadAttention +} // namespace Kernels + +/*static*/ +void SpecIncMultiHeadSelfAttention::inference_kernel_wrapper( + SpecIncMultiHeadSelfAttentionMeta const *m, + BeamSearchBatchConfig const *bc, + int shard_id, + GenericTensorAccessorR const &input, + GenericTensorAccessorR const &weight, + GenericTensorAccessorW const &output, + GenericTensorAccessorR const &bias) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + bool use_bias = *m->bias; + + cudaEvent_t t_start, t_end; + if (m->profiling) { + cudaEventCreate(&t_start); + cudaEventCreate(&t_end); + cudaEventRecord(t_start, stream); + } + + assert(input.data_type == weight.data_type); + assert(input.data_type == output.data_type); + if (use_bias) { + assert(input.data_type == bias.data_type); + } + + if (input.data_type == DT_HALF) { + half const *bias_ptr = + use_bias ? bias.get_half_ptr() : static_cast(nullptr); + Kernels::SpecIncMultiHeadAttention::inference_kernel(m, + bc, + shard_id, + input.get_half_ptr(), + weight.get_half_ptr(), + output.get_half_ptr(), + bias_ptr, + stream); + } else if (input.data_type == DT_FLOAT) { + float const *bias_ptr = + use_bias ? bias.get_float_ptr() : static_cast(nullptr); + Kernels::SpecIncMultiHeadAttention::inference_kernel(m, + bc, + shard_id, + input.get_float_ptr(), + weight.get_float_ptr(), + output.get_float_ptr(), + bias_ptr, + stream); + } else { + assert(false && "Unspported data type"); + } + + if (m->profiling) { + cudaEventRecord(t_end, stream); + checkCUDA(cudaEventSynchronize(t_end)); + float elapsed = 0; + checkCUDA(cudaEventElapsedTime(&elapsed, t_start, t_end)); + cudaEventDestroy(t_start); + cudaEventDestroy(t_end); + printf("SpecIncMultiHeadSelfAttention forward time = %.2fms\n", elapsed); + // print_tensor<3, float>(acc_query.ptr, acc_query.rect, + // "[Attention:forward:query]"); print_tensor<3, float>(acc_output.ptr, + // acc_output.rect, "[Attention:forward:output]"); + } +} + +SpecIncMultiHeadSelfAttentionMeta::SpecIncMultiHeadSelfAttentionMeta( + FFHandler handler, + SpecIncMultiHeadSelfAttention const *attn, + GenericTensorAccessorR const &weight, + MemoryAllocator &gpu_mem_allocator, + int num_samples, + int _num_q_heads, + int _num_kv_heads) + : IncMultiHeadSelfAttentionMeta(handler, + BEAM_SEARCH_MODE, + attn, + attn->qSize, + attn->kSize, + attn->vSize, + attn->qProjSize, + attn->kProjSize, + attn->vProjSize, + attn->oProjSize, + attn->apply_rotary_embedding, + attn->bias, + attn->scaling_query, + attn->qk_prod_scaling, + attn->add_bias_kv, + attn->scaling_factor, + weight, + gpu_mem_allocator, + num_samples, + attn->num_q_heads, + attn->num_kv_heads, + _num_q_heads, + _num_kv_heads, + DT_NONE, + false) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + checkCUDNN(cudnnSetStream(handler.dnn, stream)); + + // allocate memory for the seqArray and reserve space + { + size_t beam_tokeninfo_size = BeamSearchBatchConfig::MAX_NUM_TOKENS * + BeamSearchBatchConfig::MAX_BEAM_WIDTH; + size_t requestinfo_size = BeamSearchBatchConfig::MAX_NUM_REQUESTS; + size_t beam_requestinfo_size = BeamSearchBatchConfig::MAX_NUM_REQUESTS; + size_t total_size = + requestinfo_size * sizeof(BatchConfig::PerRequestInfo) + + beam_tokeninfo_size * + sizeof(BeamSearchBatchConfig::BeamSearchPerTokenInfo) + + beam_requestinfo_size * + 
sizeof(BeamSearchBatchConfig:: + BeamSearchPerRequestInfo); // more components will + // be added here later + + // We always directly allocate memory for small speculative models + gpu_mem_allocator.create_legion_instance(beam_search_reserve_inst, + total_size); + beam_token_infos = + gpu_mem_allocator + .allocate_instance( + beam_tokeninfo_size); + // offset += beam_tokeninfo_size * + // sizeof(BeamSearchBatchConfig::BeamSearchPerTokenInfo); + request_infos = + gpu_mem_allocator.allocate_instance( + requestinfo_size); + // offset += requestinfo_size * sizeof(BatchConfig::PerRequestInfo); + beam_request_infos = + gpu_mem_allocator + .allocate_instance( + beam_requestinfo_size); + // offset += beam_requestinfo_size * + // sizeof(BeamSearchBatchConfig::BeamSearchPerRequestInfo); + // assert(offset == total_size); + assert(gpu_mem_allocator.instance_total_size == + gpu_mem_allocator.instance_allocated_size); + } + + cudaStreamSynchronize(stream); +} + +SpecIncMultiHeadSelfAttentionMeta::~SpecIncMultiHeadSelfAttentionMeta(void) { + if (beam_search_reserve_inst != Realm::RegionInstance::NO_INST) { + beam_search_reserve_inst.destroy(); + } +} + +}; // namespace FlexFlow diff --git a/src/ops/split.cc b/src/ops/split.cc index 4f60cb96f0..9298850a99 100644 --- a/src/ops/split.cc +++ b/src/ops/split.cc @@ -170,6 +170,47 @@ void Split::init(FFModel const &ff) { runtime->execute_index_space(ctx, launcher); } +void Split::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + + IndexLauncher launcher(SPLIT_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(Split)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + for (int i = 0; i < numOutputs; i++) { + launcher.add_region_requirement( + RegionRequirement(batch_outputs[i]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[i]->region)); + launcher.add_field(i + 1, FID_DATA); + } + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + OpMeta *Split::init_task(Task const *task, std::vector const ®ions, Context ctx, @@ -205,6 +246,45 @@ void Split::forward(FFModel const &ff) { } runtime->execute_index_space(ctx, launcher); } +FutureMap Split::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + + IndexLauncher launcher(SPLIT_FWD_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(Split)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + for (int i = 0; i < numOutputs; i++) { + launcher.add_region_requirement( + RegionRequirement(batch_outputs[i]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[i]->region)); + launcher.add_field(i + 1, FID_DATA); + } + return runtime->execute_index_space(ctx, launcher); +} void calc_block_size(coord_t &num_blks, coord_t &blk_size, diff --git a/src/ops/topk.cc b/src/ops/topk.cc index 1a87c6c80c..d76ad75167 100644 --- a/src/ops/topk.cc +++ b/src/ops/topk.cc @@ -136,6 +136,49 @@ TopK::TopK(FFModel &model, char const *name) : TopK(model, input, params.k, params.sorted, name) {} +void TopK::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(TOPK_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(TopK)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[1]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[1]->region)); + launcher.add_field(2, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + void TopK::init(FFModel const &ff) { assert(check_output_input_weight_same_parallel_is()); parallel_is = outputs[0]->parallel_is; @@ -220,6 +263,49 @@ void TopK::forward(FFModel const &ff) { runtime->execute_index_space(ctx, launcher); } +FutureMap TopK::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + /* std::cout << "TopK op machine_view: " << *(MachineView const *)mv + << std::endl; */ + IndexLauncher launcher(TOPK_FWD_TASK_ID, + parallel_is, + TaskArgument(NULL, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[1]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[1]->region)); + launcher.add_field(2, FID_DATA); + return runtime->execute_index_space(ctx, launcher); +} + void TopK::forward_task(Task const *task, std::vector const ®ions, Context ctx, diff --git a/src/ops/tree_inc_multihead_self_attention.cc b/src/ops/tree_inc_multihead_self_attention.cc new file mode 100644 index 0000000000..875f38c77a --- /dev/null +++ b/src/ops/tree_inc_multihead_self_attention.cc @@ -0,0 +1,1706 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/ops/tree_inc_multihead_self_attention.h" +#include "flexflow/ffconst_utils.h" +#include "flexflow/model.h" +#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) +#include "flexflow/utils/cuda_helper.h" +#else +#include "flexflow/utils/hip_helper.h" +#endif +#include "flexflow/utils/hash_utils.h" +#include "legion/legion_utilities.h" +#ifdef INFERENCE_TESTS +#include +using namespace at::indexing; +#endif + +namespace FlexFlow { + +// declare Legion names +using Legion::ArgumentMap; +using Legion::Context; +using Legion::coord_t; +using Legion::Domain; +using Legion::Future; +using Legion::FutureMap; +using Legion::IndexLauncher; +using Legion::Machine; +using Legion::Memory; +using Legion::PhysicalRegion; +using Legion::Predicate; +using Legion::Rect; +using Legion::RegionRequirement; +using Legion::Runtime; +using Legion::Task; +using Legion::TaskArgument; +using Legion::TaskLauncher; +using PCG::Node; + +LegionRuntime::Logger::Category log_tree_verify("TreeVerifyIncMHA"); + +bool TreeIncMultiHeadSelfAttentionParams::is_valid( + ParallelTensorShape const &input) const { + bool is_valid = input.is_valid(); + return is_valid; +} + +Tensor FFModel::inc_multihead_self_attention_verify( + const Tensor input, + int embed_dim, + int num_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + DataType data_type, + Initializer *kernel_initializer, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name) { + return inc_multiquery_self_attention_verify(input, + embed_dim, + num_heads, + num_heads, + kdim, + vdim, + dropout, + bias, + add_bias_kv, + add_zero_attn, + data_type, + kernel_initializer, + apply_rotary_embedding, + scaling_query, + scaling_factor, + qk_prod_scaling, + name); +} + +Tensor FFModel::inc_multiquery_self_attention_verify( + const Tensor input, + int embed_dim, + int num_q_heads, + int num_kv_heads, + int kdim, + int vdim, + float dropout, + bool bias, + bool add_bias_kv, + bool add_zero_attn, + DataType data_type, + Initializer *kernel_initializer, + bool apply_rotary_embedding, + bool scaling_query, + float scaling_factor, + bool qk_prod_scaling, + char const *name) { + if (data_type == DT_NONE) { + data_type = input->data_type; + } + DataType quantization_type = cpu_offload ? config.quantization_type : DT_NONE; + bool offload = cpu_offload; + Layer *li = nullptr; + int weight_num = bias ? 2 : 1; + if (data_type != input->data_type) { + Tensor casted_input = cast(input, data_type, "type cast for IncMHA"); + li = new Layer(this, + OP_TREE_INC_MULTIHEAD_SELF_ATTENTION, + data_type, + name, + 1 /*inputs*/, + weight_num /*weights*/, + 1 /*outputs*/, + casted_input); + } else { + li = new Layer(this, + OP_TREE_INC_MULTIHEAD_SELF_ATTENTION, + data_type, + name, + 1 /*inputs*/, + weight_num /*weights*/, + 1 /*outputs*/, + input); + } + { + int numdims = input->num_dims; + int dims[MAX_TENSOR_DIM]; + for (int i = 0; i < numdims; i++) { + dims[i] = input->dims[i]; + } + dims[0] = embed_dim; + li->outputs[0] = create_tensor_legion_ordering( + numdims, dims, data_type, li, 0, true /*create_grad*/); + } + // Compute weight size + int qProjSize = kdim, kProjSize = kdim, vProjSize = kdim, + oProjSize = embed_dim; + int qSize = input->dims[0], kSize = input->dims[0], vSize = input->dims[0]; + int qParas = qProjSize * qSize; + int kParas = kProjSize * kSize; + int vParas = vProjSize * vSize; + int oParas = oProjSize * (vProjSize > 0 ? 
vProjSize : vSize); + int one_head_size = qParas + kParas + vParas + oParas; + int weight_size = qParas * num_q_heads + kParas * num_kv_heads + + vParas * num_kv_heads + oParas * num_q_heads; + { + // compress the weight size if quantization. + if (quantization_type != DT_NONE) { + one_head_size = get_quantization_to_byte_size( + data_type, quantization_type, one_head_size); + } + + int dims[1] = {weight_size}; + li->weights[0] = create_weight_legion_ordering( + 1, + dims, + quantization_type == DT_NONE ? data_type : quantization_type, + li, + true /*create_grad*/, + kernel_initializer, + CHOSEN_SYNC_TYPE); + } + if (bias) { + // q, k, v, o + int dims[1] = {qProjSize * num_q_heads + + (kProjSize + vProjSize) * num_kv_heads + oProjSize}; + li->weights[1] = create_weight_legion_ordering(1, + dims, + data_type, + li, + true /*create_grad*/, + kernel_initializer, + CHOSEN_SYNC_TYPE); + } + li->data_type = data_type; + li->add_int_property("embed_dim", embed_dim); + li->add_int_property("num_q_heads", num_q_heads); + li->add_int_property("num_kv_heads", num_kv_heads); + li->add_int_property("kdim", kdim); + li->add_int_property("vdim", vdim); + li->add_int_property("bias", bias); + li->add_int_property("add_bias_kv", add_bias_kv); + li->add_int_property("add_zero_attn", add_zero_attn); + li->add_float_property("dropout", dropout); + li->add_int_property("apply_rotary_embedding", apply_rotary_embedding); + li->add_int_property("scaling_query", scaling_query); + li->add_float_property("scaling_factor", scaling_factor); + li->add_int_property("qk_prod_scaling", qk_prod_scaling); + li->add_int_property("quantization_type", quantization_type); + li->add_int_property("offload", offload); + li->add_int_property("tensor_parallelism_degree", + config.tensor_parallelism_degree); + layers.push_back(li); + return li->outputs[0]; +} + +Op *TreeIncMultiHeadSelfAttention::create_operator_from_layer( + FFModel &model, + Layer const *layer, + std::vector const &inputs) { + long long value; + layer->get_int_property("embed_dim", value); + int embed_dim = value; + layer->get_int_property("num_q_heads", value); + int num_q_heads = value; + layer->get_int_property("num_kv_heads", value); + int num_kv_heads = value; + layer->get_int_property("kdim", value); + int kdim = value; + layer->get_int_property("vdim", value); + int vdim = value; + float dropout; + layer->get_float_property("dropout", dropout); + layer->get_int_property("bias", value); + bool bias = (bool)value; + layer->get_int_property("add_bias_kv", value); + bool add_bias_kv = (bool)value; + layer->get_int_property("add_zero_attn", value); + bool add_zero_attn = (bool)value; + layer->get_int_property("apply_rotary_embedding", value); + bool apply_rotary_embedding = (bool)value; + layer->get_int_property("scaling_query", value); + bool scaling_query = (bool)value; + float scaling_factor; + layer->get_float_property("scaling_factor", scaling_factor); + layer->get_int_property("qk_prod_scaling", value); + bool qk_prod_scaling = (bool)value; + layer->get_int_property("quantization_type", value); + DataType quantization_type = (DataType)value; + layer->get_int_property("offload", value); + bool offload = (bool)value; + layer->get_int_property("tensor_parallelism_degree", value); + int tensor_parallelism_degree = (int)value; + return new TreeIncMultiHeadSelfAttention(model, + layer->layer_guid, + inputs[0], + embed_dim, + num_q_heads, + num_kv_heads, + kdim, + vdim, + dropout, + bias, + add_bias_kv, + add_zero_attn, + apply_rotary_embedding, + scaling_query, + 
scaling_factor, + qk_prod_scaling, + false /*allocate_weights*/, + quantization_type, + offload, + tensor_parallelism_degree, + layer->name); +} + +TreeIncMultiHeadSelfAttention::TreeIncMultiHeadSelfAttention( + FFModel &model, + LayerID const &_layer_guid, + const ParallelTensor _input, + int _embed_dim, + int _num_q_heads, + int _num_kv_heads, + int _kdim, + int _vdim, + float _dropout, + bool _bias, + bool _add_bias_kv, + bool _add_zero_attn, + bool _apply_rotary_embedding, + bool _scaling_query, + float _scaling_factor, + bool _qk_prod_scaling, + bool allocate_weights, + DataType _quantization_type, + bool _offload, + int _tensor_parallelism_degree, + char const *name) + // Initializer* _bias_initializer) + : Op(model, + OP_TREE_INC_MULTIHEAD_SELF_ATTENTION, + _input->data_type, + name, + 1 /*inputs*/, + (_bias ? 2 : 1) /*weights*/, + 1 /*outputs*/, + _input), + num_q_heads(_num_q_heads), num_kv_heads(_num_kv_heads), dropout(_dropout), + bias(_bias), add_bias_kv(_add_bias_kv), add_zero_attn(_add_zero_attn), + apply_rotary_embedding(_apply_rotary_embedding), + qSize(_input->dims[0].size), kSize(_input->dims[0].size), + vSize(_input->dims[0].size), qProjSize(_kdim), kProjSize(_kdim), + vProjSize(_vdim), oProjSize(_embed_dim), + qoSeqLength(_input->dims[1].size), kvSeqLength(_input->dims[1].size), + scaling_query(_scaling_query), scaling_factor(_scaling_factor), + qk_prod_scaling(_qk_prod_scaling), quantization_type(_quantization_type), + offload(_offload), tensor_parallelism_degree(_tensor_parallelism_degree) { + // overwrite layer_guid + layer_guid = _layer_guid; + + numOutputs = 1; + int numdim = _input->num_dims; + ParallelDim dims[MAX_TENSOR_DIM]; + for (int i = 0; i < numdim; i++) { + dims[i] = _input->dims[i]; + } + dims[0].size = _embed_dim; + // Currently require no parallelism along this dim + assert(dims[0].degree == 1); + if (allocate_weights) { + // Create weight tensor + int num_dims = inputs[0]->num_dims; + // Compute weight size + int qParas = this->qProjSize * this->qSize; + int kParas = this->kProjSize * this->kSize; + int vParas = this->vProjSize * this->vSize; + int oParas = + this->oProjSize * (this->vProjSize > 0 ? this->vProjSize : this->vSize); + ParallelDim dims[2]; + dims[0] = inputs[0]->dims[num_dims - 2]; + dims[0].size = dims[0].degree; + dims[1] = inputs[0]->dims[num_dims - 1]; + dims[1].size = this->num_q_heads * (qParas + oParas) + + this->num_kv_heads * (kParas + vParas); + dims[1].is_replica_dim = false; + // dims[2].size = qParas + kParas + vParas + oParas; + if (quantization_type != DT_NONE) { + dims[1].size = get_quantization_to_byte_size( + data_type, quantization_type, dims[2].size); + } + // dims[2].degree = 1; + // dims[2].parallel_idx = -1; + int seed = std::rand(); + Initializer *initializer = new GlorotUniform(seed); + weights[0] = model.create_parallel_weight<2>( + dims, + quantization_type == DT_NONE ? 
this->data_type : quantization_type, + NULL /*owner_op*/, + true /*create_grad*/, + initializer, + CHOSEN_SYNC_TYPE); + if (bias) { + ParallelTensorShape bias_shape = _input->get_shape(); + bias_shape.dims[0].size = qProjSize * num_q_heads + + (kProjSize + vProjSize) * num_kv_heads + + oProjSize; + bias_shape.dims[1].size = bias_shape.dims[2].size = 1; + weights[1] = + model.create_parallel_weight_legion_ordering(bias_shape.num_dims, + bias_shape.dims, + this->data_type, + nullptr /*owner_op*/, + true /*create_grad*/, + initializer, + CHOSEN_SYNC_TYPE); + } + } + + outputs[0] = model.create_parallel_tensor_legion_ordering( + _input->num_dims, dims, this->data_type, this); + /* for (int i = 0; i < numdim; i++) { */ + /* register_output_input_parallel_dims(outputs[0], i, inputs[0], i); */ + /* } */ + /* // Check correctness */ + /* assert(check_output_input_weight_parallel_dims()); */ +} + +TreeIncMultiHeadSelfAttention::TreeIncMultiHeadSelfAttention( + FFModel &model, + const ParallelTensor _input, + const ParallelTensor _weight, + int _embed_dim, + int _num_q_heads, + int _num_kv_heads, + int _kdim, + int _vdim, + float _dropout, + bool _bias, + bool _add_bias_kv, + bool _add_zero_attn, + bool _apply_rotary_embedding, + bool _scaling_query, + float _scaling_factor, + bool _qk_prod_scaling, + bool allocate_weights, + DataType _quantization_type, + bool _offload, + int _tensor_parallelism_degree, + char const *name) + // Initializer* _bias_initializer) + : Op(model, + OP_TREE_INC_MULTIHEAD_SELF_ATTENTION, + _input->data_type, + name, + 1 /*inputs*/, + (_bias ? 2 : 1) /*weights*/, + 1 /*outputs*/, + _input, + _weight), + num_q_heads(_num_q_heads), num_kv_heads(_num_kv_heads), dropout(_dropout), + bias(_bias), add_bias_kv(_add_bias_kv), add_zero_attn(_add_zero_attn), + apply_rotary_embedding(_apply_rotary_embedding), + qSize(_input->dims[0].size), kSize(_input->dims[0].size), + vSize(_input->dims[0].size), qProjSize(_kdim), kProjSize(_kdim), + vProjSize(_vdim), oProjSize(_embed_dim), + qoSeqLength(_input->dims[1].size), kvSeqLength(_input->dims[1].size), + scaling_query(_scaling_query), scaling_factor(_scaling_factor), + qk_prod_scaling(_qk_prod_scaling), quantization_type(_quantization_type), + offload(_offload), tensor_parallelism_degree(_tensor_parallelism_degree) +// bias_initializer(_bias_initializer) +{ + numOutputs = 1; + int numdim = _input->num_dims; + ParallelDim dims[MAX_TENSOR_DIM]; + for (int i = 0; i < numdim; i++) { + dims[i] = _input->dims[i]; + } + dims[0].size = _embed_dim; + // Currently require no parallelism along this dim + assert(dims[0].degree == 1); + if (allocate_weights) { + // Create weight tensor + int num_dims = inputs[0]->num_dims; + // Compute weight size + int qParas = this->qProjSize * this->qSize; + int kParas = this->kProjSize * this->kSize; + int vParas = this->vProjSize * this->vSize; + int oParas = + this->oProjSize * (this->vProjSize > 0 ? 
this->vProjSize : this->vSize); + ParallelDim dims[2]; + dims[0] = inputs[0]->dims[num_dims - 2]; + dims[0].size = dims[0].degree; + dims[1] = inputs[0]->dims[num_dims - 1]; + dims[1].size = this->num_q_heads * (qParas + oParas) + + this->num_kv_heads * (kParas + vParas); + dims[1].is_replica_dim = false; + // dims[2].size = qParas + kParas + vParas + oParas; + if (quantization_type != DT_NONE) { + dims[1].size = get_quantization_to_byte_size( + data_type, quantization_type, dims[2].size); + } + int seed = std::rand(); + Initializer *initializer = new GlorotUniform(seed); + weights[0] = model.create_parallel_weight<2>( + dims, + quantization_type == DT_NONE ? this->data_type : quantization_type, + NULL /*owner_op*/, + true /*create_grad*/, + initializer, + CHOSEN_SYNC_TYPE); + if (bias) { + ParallelTensorShape bias_shape = _input->get_shape(); + bias_shape.dims[0].size = qProjSize * num_q_heads + + (kProjSize + vProjSize) * num_kv_heads + + oProjSize; + bias_shape.dims[1].size = bias_shape.dims[2].size = 1; + weights[1] = + model.create_parallel_weight_legion_ordering(bias_shape.num_dims, + bias_shape.dims, + this->data_type, + nullptr /*owner_op*/, + true /*create_grad*/, + initializer, + CHOSEN_SYNC_TYPE); + } + } + + outputs[0] = model.create_parallel_tensor_legion_ordering( + _input->num_dims, dims, this->data_type, this); + + /* for (int i = 0; i < numdim; i++) { */ + /* register_output_input_parallel_dims(outputs[0], i, inputs[0], i); */ + /* } */ + /* register_output_weight_parallel_dims(outputs[0], numdim-1, _weight, 1); */ + /* register_output_weight_parallel_dims(outputs[0], numdim-2, _weight, 2); */ + // Check correctness + /* assert(check_output_input_weight_parallel_dims()); */ +} + +TreeIncMultiHeadSelfAttention::TreeIncMultiHeadSelfAttention( + FFModel &model, + TreeIncMultiHeadSelfAttention const &other, + const ParallelTensor input, + bool allocate_weights) + : TreeIncMultiHeadSelfAttention(model, + other.layer_guid, + input, + other.oProjSize, + other.num_q_heads, + other.num_kv_heads, + other.qProjSize, + other.vProjSize, + other.dropout, + other.bias, + other.add_bias_kv, + other.add_zero_attn, + other.apply_rotary_embedding, + other.scaling_query, + other.scaling_factor, + other.qk_prod_scaling, + allocate_weights, + other.quantization_type, + other.offload, + other.tensor_parallelism_degree, + other.name) {} + +TreeIncMultiHeadSelfAttention::TreeIncMultiHeadSelfAttention( + FFModel &model, + TreeIncMultiHeadSelfAttentionParams const ¶ms, + ParallelTensor const &input, + bool allocate_weights, + char const *name) + : TreeIncMultiHeadSelfAttention(model, + params.layer_guid, + input, + params.embed_dim, + params.num_q_heads, + params.num_kv_heads, + params.kdim, + params.vdim, + params.dropout, + params.bias, + params.add_bias_kv, + params.add_zero_attn, + params.apply_rotary_embedding, + params.scaling_query, + params.scaling_factor, + params.qk_prod_scaling, + allocate_weights, + params.quantization_type, + params.offload, + params.tensor_parallelism_degree, + name) {} + +void TreeIncMultiHeadSelfAttention::init_inference( + FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + MachineView const *view = mv ? 
mv : &batch_outputs[0]->machine_view; + size_t machine_view_hash = view->hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher( + TREE_INC_MULTIHEAD_SELF_ATTENTION_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(TreeIncMultiHeadSelfAttention)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement( + RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region, + ff.cpu_offload ? MAP_TO_ZC_MEMORY : 0)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(2, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + +void TreeIncMultiHeadSelfAttention::init(FFModel const &ff) { + assert(check_output_input_weight_same_parallel_is()); + parallel_is = outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + set_argumentmap_for_init(ff, argmap); + IndexLauncher launcher( + TREE_INC_MULTIHEAD_SELF_ATTENTION_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(TreeIncMultiHeadSelfAttention)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + outputs[0]->machine_view.hash()); + launcher.add_region_requirement(RegionRequirement(inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region)); + launcher.add_field(1, FID_DATA); + launcher.add_region_requirement(RegionRequirement(outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + outputs[0]->region)); + launcher.add_field(2, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap(ff, fm); +} + +/* + regions[0](I): input + regions[1](I): weight + regions[2](O): output +*/ +OpMeta *TreeIncMultiHeadSelfAttention::init_task( + Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + TreeIncMultiHeadSelfAttention const *attn = + (TreeIncMultiHeadSelfAttention *)task->args; + FFHandler handle = *((FFHandler const *)task->local_args); + + GenericTensorAccessorR input = + helperGetGenericTensorAccessorRO(attn->inputs[0]->data_type, + regions[0], + task->regions[0], + FID_DATA, + ctx, + runtime); + GenericTensorAccessorR weight = + helperGetGenericTensorAccessorRO(attn->weights[0]->data_type, + regions[1], + task->regions[1], + FID_DATA, + ctx, + runtime); + GenericTensorAccessorW output = + helperGetGenericTensorAccessorWO(attn->outputs[0]->data_type, + regions[2], + task->regions[2], + FID_DATA, + ctx, + runtime); + + int num_samples = input.domain.hi()[2] - input.domain.lo()[2] + 1; + assert(attn->qoSeqLength == input.domain.hi()[1] - input.domain.lo()[1] + 1); + assert(attn->kvSeqLength == input.domain.hi()[1] - input.domain.lo()[1] + 1); + // int num_q_heads = weight.domain.hi()[1] - weight.domain.lo()[1] + 1; + int num_q_heads = 
attn->num_q_heads / attn->tensor_parallelism_degree; + int num_kv_heads = attn->num_kv_heads / attn->tensor_parallelism_degree; + + assert(attn->oProjSize == output.domain.hi()[0] - output.domain.lo()[0] + 1); + + Memory gpu_mem = Machine::MemoryQuery(Machine::get_machine()) + .only_kind(Memory::GPU_FB_MEM) + .best_affinity_to(task->target_proc) + .first(); + MemoryAllocator gpu_mem_allocator(gpu_mem); + if (attn->offload) { + // cpu-offload enabled + // use offload_reserved_space + gpu_mem_allocator.register_reserved_work_space( + handle.offload_reserve_space, handle.offload_reserve_space_size); + } + TreeIncMultiHeadSelfAttentionMeta *m = + new TreeIncMultiHeadSelfAttentionMeta(handle, + attn, + weight, + gpu_mem_allocator, + num_samples, + num_q_heads, + num_kv_heads); + if (!attn->offload) { + // assert that we didn't over allocate memory + assert(gpu_mem_allocator.reserved_allocated_size == + gpu_mem_allocator.reserved_total_size); + } + m->profiling = attn->profiling; + + if (attn->quantization_type == DT_NONE) { + assert(weight.domain.get_volume() * data_type_size(weight.data_type) == + m->weightSize); + } + return m; +} + +void TreeIncMultiHeadSelfAttention::forward(FFModel const &ff) { + // TreeIncMultiHeadSelfAttention doesn't support forward + assert(false); +} + +FutureMap TreeIncMultiHeadSelfAttention::inference( + FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + MachineView const *view = mv ? mv : &batch_outputs[0]->machine_view; + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + size_t machine_view_hash = view->hash(); + int idx = 0; + IndexLauncher launcher(TREE_INC_MULTIHEAD_SELF_ATTENTION_INF_TASK_ID, + parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_future(bc); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(idx++, FID_DATA); + launcher.add_region_requirement( + RegionRequirement(weights[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[0]->region, + ff.cpu_offload ? MAP_TO_ZC_MEMORY : 0)); + launcher.add_field(idx++, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(idx++, FID_DATA); + if (bias) { + launcher.add_region_requirement( + RegionRequirement(weights[1]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + weights[1]->region, + ff.cpu_offload ? 
MAP_TO_ZC_MEMORY : 0)); + launcher.add_field(idx++, FID_DATA); + } + return runtime->execute_index_space(ctx, launcher); +} + +/* + regions[0](I): input + regions[3](I): weight + regions[4](O): output +*/ +void TreeIncMultiHeadSelfAttention::inference_task( + Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(task->regions.size() == regions.size()); + + // TreeVerifyBatchConfig const *bc = (TreeVerifyBatchConfig *)task->args; + TreeVerifyBatchConfig const &bc = + Future(task->futures[0]).get_result(); + log_tree_verify.debug( + "TreeVerifyBatchConfig, num_tokens: %d, num_requests: %d", + bc.num_tokens, + bc.num_active_requests()); + if (bc.num_tokens == 0) { + return; + } + + TreeIncMultiHeadSelfAttentionMeta *m = + *((TreeIncMultiHeadSelfAttentionMeta **)task->local_args); + assert((*m->bias ? regions.size() == 4 : regions.size() == 3)); + + GenericTensorAccessorR input = helperGetGenericTensorAccessorRO( + m->input_type[0], regions[0], task->regions[0], FID_DATA, ctx, runtime); + GenericTensorAccessorR weight = helperGetGenericTensorAccessorRO( + m->weight_type[0], regions[1], task->regions[1], FID_DATA, ctx, runtime); + GenericTensorAccessorW output = helperGetGenericTensorAccessorWO( + m->output_type[0], regions[2], task->regions[2], FID_DATA, ctx, runtime); + GenericTensorAccessorR biases; + if (*m->bias) { + biases = helperGetGenericTensorAccessorRO(m->weight_type[1], + regions[3], + task->regions[3], + FID_DATA, + ctx, + runtime); + Domain bias_domain = runtime->get_index_space_domain( + ctx, task->regions[3].region.get_index_space()); + assert(bias_domain.get_dim() == 4); + } + + Domain input_domain = runtime->get_index_space_domain( + ctx, task->regions[0].region.get_index_space()); + Domain weight_domain = runtime->get_index_space_domain( + ctx, task->regions[1].region.get_index_space()); + Domain output_domain = runtime->get_index_space_domain( + ctx, task->regions[2].region.get_index_space()); + + assert(input_domain.get_dim() == 4); + assert(weight_domain.get_dim() == 2); + assert(output_domain.get_dim() == 4); + + /* print_tensor(input.get_float_ptr(), + input_domain.get_volume(), + "[Attention:forward:query]"); */ + + assert(task->index_point.get_dim() == 1); + + TreeIncMultiHeadSelfAttention::inference_kernel_wrapper( + m, &bc, task->index_point.point_data[0], input, weight, output, biases); +#ifdef INFERENCE_TESTS + printf("Checking TreeIncMultiHeadSelfAttention computations...\n"); + + // ============================================================================= + // Define helper functions to handle row-major arrays + // ============================================================================= + + auto set_value_row_major = [](float *arr, + std::vector const &shape, + std::vector const &indices, + float value) -> void { + int offset = 0; + for (int i = 0; i < shape.size(); i++) { + int index = indices[i]; + int stride = 1; + for (int j = i + 1; j < shape.size(); j++) { + stride *= shape[j]; + } + offset += index * stride; + } + *(arr + offset) = value; + }; + + // ============================================================================= + // Load input/output/weights and parse general configs + // ============================================================================= + + float *input_cpu = + download_tensor(input.get_float_ptr(), input_domain.get_volume()); + assert(input_cpu != nullptr); + float *weight_cpu = download_tensor(weight.get_float_ptr(), + weight_domain.get_volume()); + assert(weight_cpu != nullptr); + 
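+ // Note: download_tensor copies each device buffer into host memory that is
+ // presumably allocated with cudaMallocHost (the cleanup section releases these
+ // pointers with cudaFreeHost), so the torch-based reference computation below
+ // can read the data directly on the CPU.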
float *output_cpu = download_tensor(output.get_float_ptr(), + output_domain.get_volume()); + assert(output_cpu != nullptr); + + // Input tensor dimensions + coord_t data_dim = input_domain.hi()[0] - input_domain.lo()[0] + 1; + coord_t max_sequence_length = input_domain.hi()[1] - input_domain.lo()[1] + 1; + coord_t batch_size = input_domain.hi()[2] - input_domain.lo()[2] + 1; + coord_t replica_dim = input_domain.hi()[3] - input_domain.lo()[3] + 1; + assert(replica_dim == 1); + + size_t effective_batch_size = max_sequence_length * batch_size; + float inputs_arr[data_dim][effective_batch_size] = {0}; + for (size_t i = 0; i < data_dim * bc.num_active_tokens(); i++) { + size_t data_index = i % data_dim; + size_t token_index = i / data_dim; + assert(data_index < data_dim); + assert(token_index < effective_batch_size); + inputs_arr[data_index][token_index] = input_cpu[i]; + } + torch::Tensor torch_input = torch::from_blob( + inputs_arr, {data_dim, (long int)effective_batch_size}, torch::kFloat32); + + // Weight tensor dimensions + coord_t all_weight_params = weight_domain.hi()[0] - weight_domain.lo()[0] + 1; + coord_t num_q_heads = weight_domain.hi()[1] - weight_domain.lo()[1] + 1; + replica_dim = weight_domain.hi()[2] - weight_domain.lo()[2] + 1; + size_t qParas = m->qProjSize * m->qSize; + size_t kParas = m->kProjSize * m->kSize; + size_t vParas = m->vProjSize * m->vSize; + size_t oParas = m->oProjSize * (m->vProjSize > 0 ? m->vProjSize : m->vSize); + + assert(all_weight_params == qParas + kParas + vParas + oParas); + assert(num_q_heads == m->num_q_heads); + assert(replica_dim == 1); + + assert(m->qSize == m->kSize && m->kSize == m->vSize); + // printf("m->qSize: %i\n", m->qSize); + // keep things simple for now + assert(m->qProjSize == m->kProjSize && m->kProjSize == m->vProjSize); + long int proj_sum = m->qProjSize + m->kProjSize + m->vProjSize; + // load weight manually because Torch can't easily read a tensor serialized in + // column-major order. 
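+ // Layout recap for the conversion below: weight_cpu packs one head after
+ // another (all_weight_params entries per head); within a head, the Q block is
+ // followed by the K block (offset qProjSize * qSize) and the V block (offset
+ // 2 * qProjSize * qSize), and each block is flattened column-major, so element
+ // (row, col) sits at qSize * col + row within its block.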
+ + // printf("m->kProjSize: %i, TreeVerifyBatchConfig::MAX_NUM_TOKENS: %i, " + // "bc.num_active_tokens(): %i, num_q_heads: %lli, + // TreeVerifyBatchConfig::MAX_NUM_REQUESTS: %i, " + // "bc.num_active_requests(): %i\n", m->kProjSize, + // TreeVerifyBatchConfig::MAX_NUM_TOKENS, bc.num_active_tokens(), + // num_q_heads, TreeVerifyBatchConfig::MAX_NUM_REQUESTS, + // bc.num_active_requests()); + // for (int t=0; t < bc.num_active_tokens(); t++) { + // printf("token %i has request_index: %li and token_position: %li\n", + // t, bc.token2ids.token_indexes[t].request_index, + // bc.token2ids.token_indexes[t].token_position); + // } + + // ============================================================================= + // Load the output tensor (with CUDA results), and create a Torch tensor + // ============================================================================= + + float output_cuda[m->oProjSize][effective_batch_size] = {0}; + for (int i = 0; i < m->oProjSize * effective_batch_size; i++) { + int row_idx = i % m->oProjSize; + int col_idx = i / m->oProjSize; + assert(row_idx < m->oProjSize && col_idx < effective_batch_size); + output_cuda[row_idx][col_idx] = output_cpu[i]; + } + torch::Tensor torch_out_cuda = + torch::from_blob(output_cuda, + {m->oProjSize, (int64_t)effective_batch_size}, + torch::kFloat32); + + // ============================================================================= + // Load the Q/K/V projection weights, and create a Torch tensor + // ============================================================================= + std::vector w_qkv_shape = {m->qSize, m->qProjSize, 3, (int)num_q_heads}; + float *w_qkv = + (float *)calloc(m->qSize * m->qProjSize * 3 * num_q_heads, sizeof(float)); + assert(w_qkv[0] == 0.0f); + + for (int h = 0; h < num_q_heads; h++) { + for (size_t i = 0; i < m->qProjSize * m->qSize; i++) { + int row_index = i % m->qSize; + int column_index = i / m->qSize; + // Q + set_value_row_major(w_qkv, + w_qkv_shape, + {row_index, column_index, 0, h}, + weight_cpu[all_weight_params * h + + m->qSize * column_index + row_index]); + // K + set_value_row_major( + w_qkv, + w_qkv_shape, + {row_index, column_index, 1, h}, + weight_cpu[all_weight_params * h + m->qProjSize * m->qSize + + m->qSize * column_index + row_index]); + // V + set_value_row_major( + w_qkv, + w_qkv_shape, + {row_index, column_index, 2, h}, + weight_cpu[all_weight_params * h + 2 * m->qProjSize * m->qSize + + m->qSize * column_index + row_index]); + } + } + // convert weights to torch tensor + torch::Tensor torch_w_qkv = torch::from_blob( + w_qkv, {m->qSize, m->qProjSize, 3, (int)num_q_heads}, torch::kFloat32); + + /* std::cout << "Torch projection weights size: " << torch_w_qkv.sizes() + << std::endl; + std::cout << "Torch input size: " << torch_input.sizes() << std::endl; + std::cout << "Number of active tokens: " << bc.num_active_tokens() + << std::endl; */ + // std::cout << "torch_w_qkv:" << std::endl << torch_w_qkv << std::endl; + + // ============================================================================= + // Compute the Q/K/V projections, and compare the results with CUDA + // ============================================================================= + + // ----------------------- C++ computations & checks ------------------------ + torch::Tensor qkv_projs = torch::einsum( + "ijkl,im->jmkl", + {torch_w_qkv, + torch_input.index({Slice(), Slice(0, bc.num_active_tokens())})}); + // std::cout << "qkv_projs size: " << qkv_projs.sizes() << std::endl; + assert(qkv_projs.sizes()[0] == 
m->qProjSize); + assert(qkv_projs.sizes()[1] == bc.num_active_tokens() && + qkv_projs.sizes()[1] <= effective_batch_size); + assert(qkv_projs.sizes()[2] == 3); + assert(qkv_projs.sizes()[3] == num_q_heads); + free(w_qkv); + + // ----------------------- Loading CUDA results for this step --------------- + float *QKVProjArray_cpu = download_tensor( + m->devQKVProjArray, + TreeVerifyBatchConfig::MAX_NUM_TOKENS * proj_sum * m->num_q_heads); + assert(QKVProjArray_cpu != nullptr); + + std::vector QKVProjArray_converted_shape = { + m->qProjSize, bc.num_active_tokens(), 3, (int)num_q_heads}; + float *QKVProjArray_converted = (float *)calloc( + m->qProjSize * bc.num_active_tokens() * 3 * num_q_heads, sizeof(float)); + + // skip over padding at the end of QKVProjArray_cpu + // convert from column order to 3D matrix because torch cannot automatically + // import matrices flattened in column order + for (size_t i = 0; i < proj_sum * bc.num_active_tokens() * num_q_heads; i++) { + int proj_size_index = i % m->qProjSize; + int head_index = i / (proj_sum * bc.num_active_tokens()); + int token_index = + ((i - head_index * proj_sum * bc.num_active_tokens()) / m->qProjSize) % + bc.num_active_tokens(); + int qkv_offset = (i - head_index * proj_sum * bc.num_active_tokens()) / + (m->qProjSize * bc.num_active_tokens()); + assert(proj_size_index < proj_sum); + assert(head_index < num_q_heads); + assert(token_index < bc.num_active_tokens()); + assert(qkv_offset < 3); + set_value_row_major(QKVProjArray_converted, + QKVProjArray_converted_shape, + {proj_size_index, token_index, qkv_offset, head_index}, + QKVProjArray_cpu[i]); + } + torch::Tensor QKVProjArray_torch = + torch::from_blob(QKVProjArray_converted, + {m->qProjSize, bc.num_active_tokens(), 3, num_q_heads}, + torch::kFloat32); + + // ----------------------- Comparing C++ & CUDA results --------------------- + // std::cout << "QKVProjArray_torch" << std::endl; + // for (int i=0; ikProjSize; d++) { + size_t kcache_idx = d * MAX_SEQ_LEN * m->num_q_heads * + TreeVerifyBatchConfig::MAX_NUM_REQUESTS + + bc.tokensInfo[t].abs_depth_in_request * + m->num_q_heads * + TreeVerifyBatchConfig::MAX_NUM_REQUESTS + + h * TreeVerifyBatchConfig::MAX_NUM_REQUESTS + + bc.tokensInfo[t].request_index; + m->kcache[kcache_idx] = + qkv_projs.index({(int64_t)d, (int64_t)t, 1, (int64_t)h}) + .item(); + } + for (size_t d = 0; d < m->vProjSize; d++) { + size_t vcache_idx = d * MAX_SEQ_LEN * m->num_q_heads * + TreeVerifyBatchConfig::MAX_NUM_REQUESTS + + bc.tokensInfo[t].abs_depth_in_request * + m->num_q_heads * + TreeVerifyBatchConfig::MAX_NUM_REQUESTS + + h * TreeVerifyBatchConfig::MAX_NUM_REQUESTS + + bc.tokensInfo[t].request_index; + m->vcache[vcache_idx] = + qkv_projs.index({(int64_t)d, (int64_t)t, 2, (int64_t)h}) + .item(); + } + } + } + // Create torch tensors from the arrays + torch::Tensor K_t = + torch::from_blob(m->kcache, + {m->kProjSize, + MAX_SEQ_LEN, + num_q_heads, + TreeVerifyBatchConfig::MAX_NUM_REQUESTS}, + torch::kFloat32); + torch::Tensor V_t = + torch::from_blob(m->vcache, + {m->vProjSize, + MAX_SEQ_LEN, + num_q_heads, + TreeVerifyBatchConfig::MAX_NUM_REQUESTS}, + torch::kFloat32); + + // Compute useful indices + std::vector req_idxs; + std::vector r_first_idx; + std::vector r_num_tokens; + for (size_t t = 0; t < bc.num_active_tokens(); t++) { + size_t rid = bc.tokensInfo[t].request_index; + if (req_idxs.size() == 0 || req_idxs[req_idxs.size() - 1] != rid) { + req_idxs.push_back(rid); + r_first_idx.push_back(t); + r_num_tokens.push_back(1); + } else { + 
r_num_tokens[r_num_tokens.size() - 1]++; + } + assert(req_idxs.size() == r_first_idx.size() && + r_first_idx.size() == r_num_tokens.size()); + } + assert(req_idxs.size() == bc.num_active_requests()); + assert(std::accumulate(r_num_tokens.begin(), + r_num_tokens.end(), + decltype(r_num_tokens)::value_type(0)) == + bc.num_active_tokens()); + + // ----------------------- Loading CUDA results for this step --------------- + float *keyCache_cpu = download_tensor( + m->keyCache, + m->num_q_heads * m->kProjSize * TreeVerifyBatchConfig::MAX_NUM_REQUESTS * + MAX_SEQ_LEN); + float *valueCache_cpu = download_tensor( + m->valueCache, + m->num_q_heads * m->vProjSize * TreeVerifyBatchConfig::MAX_NUM_REQUESTS * + MAX_SEQ_LEN); + assert(keyCache_cpu != nullptr); + assert(valueCache_cpu != nullptr); + + float *kcache_cuda = + (float *)calloc(m->kProjSize * MAX_SEQ_LEN * m->num_q_heads * + TreeVerifyBatchConfig::MAX_NUM_REQUESTS, + sizeof(float)); + float *vcache_cuda = + (float *)calloc(m->vProjSize * MAX_SEQ_LEN * m->num_q_heads * + TreeVerifyBatchConfig::MAX_NUM_REQUESTS, + sizeof(float)); + int index = 0; + for (int i = 0; i < m->kProjSize; i++) { + for (int j = 0; j < MAX_SEQ_LEN; j++) { + for (int k = 0; k < m->num_q_heads; k++) { + for (int l = 0; l < TreeVerifyBatchConfig::MAX_NUM_REQUESTS; l++) { + int col_major_index = + l * m->kProjSize * MAX_SEQ_LEN * m->num_q_heads + + k * m->kProjSize * MAX_SEQ_LEN + j * m->kProjSize + i; + kcache_cuda[index++] = keyCache_cpu[col_major_index]; + } + } + } + } + index = 0; + for (int i = 0; i < m->vProjSize; i++) { + for (int j = 0; j < MAX_SEQ_LEN; j++) { + for (int k = 0; k < m->num_q_heads; k++) { + for (int l = 0; l < TreeVerifyBatchConfig::MAX_NUM_REQUESTS; l++) { + int col_major_index = + l * m->vProjSize * MAX_SEQ_LEN * m->num_q_heads + + k * m->vProjSize * MAX_SEQ_LEN + j * m->vProjSize + i; + vcache_cuda[index++] = valueCache_cpu[col_major_index]; + } + } + } + } + torch::Tensor K_t_cuda = + torch::from_blob(kcache_cuda, + {m->kProjSize, + MAX_SEQ_LEN, + num_q_heads, + TreeVerifyBatchConfig::MAX_NUM_REQUESTS}, + torch::kFloat32); + torch::Tensor V_t_cuda = + torch::from_blob(vcache_cuda, + {m->vProjSize, + MAX_SEQ_LEN, + num_q_heads, + TreeVerifyBatchConfig::MAX_NUM_REQUESTS}, + torch::kFloat32); + + // ----------------------- Comparing C++ & CUDA results --------------------- + + // std::cout << "kcache differences:" << std::endl; + // for (int i=0; i < bc.num_active_requests() + 1; i++) { + // for (int j=0; j < num_q_heads; j++) { + // for (int l=0; l < m->kProjSize; l++) { + // for (int k=0; k < MAX_SEQ_LEN; k++) { + // size_t kcache_idx = + // l * MAX_SEQ_LEN * num_q_heads * + // TreeVerifyBatchConfig::MAX_NUM_REQUESTS + k * num_q_heads * + // TreeVerifyBatchConfig::MAX_NUM_REQUESTS + j * + // TreeVerifyBatchConfig::MAX_NUM_REQUESTS + i; if ( + // abs(m->kcache[kcache_idx] - keyCache_cpu[ + // i * m->kProjSize * MAX_SEQ_LEN * num_q_heads + + // j * m->kProjSize * MAX_SEQ_LEN + + // k * m->kProjSize + + // l + // ]) > 0.00001) { + // printf("req: %i (rid: %i), head: %i, data_dim: %i, token_pos: + // %i\n", + // i, req_idxs[i], j, l, k); + // } + // } + // } + // } + // } + + // std::cout << "keyCache from CUDA:" << std::endl; + // for (int i=0; ikProjSize; l++) { + // for (int k=0; k< MAX_SEQ_LEN; k++) { + // printf("%f ", + // keyCache_cpu[i * m->kProjSize * MAX_SEQ_LEN * num_q_heads + + // j * m->kProjSize * MAX_SEQ_LEN + + // k * m->kProjSize + + // l + // ]); + // } + // printf("\n"); + // } + // printf("\n"); + // } + // printf("\n"); + // } + 
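+ // The commented-out dumps above and below are only for inspecting individual
+ // cache entries when a mismatch occurs; the actual verification is the pair of
+ // torch::allclose checks on K_t / V_t further down.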
+ // std::cout << "valueCache from CUDA:" << std::endl; + // for (int i=0; ivProjSize; l++) { + // for (int k=0; k< MAX_SEQ_LEN; k++) { + // printf("%f ", + // valueCache_cpu[ + // i * m->vProjSize * MAX_SEQ_LEN * num_q_heads + + // j * m->vProjSize * MAX_SEQ_LEN + + // k * m->vProjSize + + // l]); + // } + // printf("\n"); + // } + // printf("\n"); + // } + // printf("\n"); + // } + + // printf("\n"); + + // std::cout << "C++ kcache:" << std::endl; + // for (int i=0; ikProjSize; l++) { + // for (int k=0; k < MAX_SEQ_LEN; k++) { + // size_t kcache_idx = + // l * MAX_SEQ_LEN * num_q_heads * + // TreeVerifyBatchConfig::MAX_NUM_REQUESTS + k * num_q_heads * + // TreeVerifyBatchConfig::MAX_NUM_REQUESTS + j * + // TreeVerifyBatchConfig::MAX_NUM_REQUESTS + i; + // printf("%f ", m->kcache[kcache_idx]); + // } + // printf("\n"); + // } + // printf("\n"); + // } + // printf("\n"); + // } + + // std::cout << "C++ vcache:" << std::endl; + // for (int i=0; ivProjSize; l++) { + // for (int k=0; k< MAX_SEQ_LEN; k++) { + // size_t vcache_idx = + // l * MAX_SEQ_LEN * num_q_heads * + // TreeVerifyBatchConfig::MAX_NUM_REQUESTS + k * num_q_heads * + // TreeVerifyBatchConfig::MAX_NUM_REQUESTS + j * + // TreeVerifyBatchConfig::MAX_NUM_REQUESTS + i; + // printf("%f ", m->vcache[vcache_idx]); + // } + // printf("\n"); + // } + // printf("\n"); + // } + // printf("\n"); + // } + + assert(torch::allclose(K_t_cuda, K_t, 1e-05, 1e-05)); + assert(torch::allclose(V_t_cuda, V_t, 1e-05, 1e-05)); + free(kcache_cuda); + free(vcache_cuda); + + // ============================================================================= + // Load the W_out projection weights + // ============================================================================= + + // ----------------------- C++ operations & checks -------------------------- + float *w_out = (float *)calloc(m->vProjSize * m->num_q_heads * m->oProjSize, + sizeof(float)); + std::vector w_out_shape = {m->vProjSize, m->num_q_heads, m->oProjSize}; + assert(m->qProjSize == m->kProjSize && m->kProjSize == m->vProjSize); + for (int h = 0; h < num_q_heads; h++) { + for (int v = 0; v < m->vProjSize; v++) { + for (int o = 0; o < m->oProjSize; o++) { + set_value_row_major( + w_out, + w_out_shape, + {v, h, o}, + weight_cpu[all_weight_params * h + 3 * m->qProjSize * m->qSize + + m->vProjSize * o + v]); + } + } + } + // convert weights to torch tensor + torch::Tensor torch_w_out = torch::from_blob( + w_out, {m->vProjSize, m->num_q_heads, m->oProjSize}, torch::kFloat32); + + // ----------------------- Loading CUDA results for this step --------------- + float *w_out_cuda = download_tensor( + m->W_out_contiguous, m->vProjSize * m->oProjSize * m->num_q_heads); + assert(w_out_cuda != nullptr); + float *converted_wout_tensor = (float *)calloc( + m->vProjSize * m->num_q_heads * m->oProjSize, sizeof(float)); + std::vector converted_wout_tensor_shape = { + m->vProjSize, m->num_q_heads, m->oProjSize}; + + for (int i = 0; i < m->vProjSize * m->num_q_heads * m->oProjSize; i++) { + int v_idx = i % m->vProjSize; + int h_idx = (i / m->vProjSize) % m->num_q_heads; + int o_idx = i / (m->vProjSize * m->num_q_heads); + assert(v_idx < m->vProjSize && h_idx < m->num_q_heads && + o_idx < m->oProjSize); + set_value_row_major(converted_wout_tensor, + converted_wout_tensor_shape, + {v_idx, h_idx, o_idx}, + w_out_cuda[i]); + } + torch::Tensor w_out_cuda_tensor = + torch::from_blob(converted_wout_tensor, + {m->vProjSize, m->num_q_heads, m->oProjSize}, + torch::kFloat32); + + // ----------------------- Comparing C++ 
& CUDA results --------------------- + assert(torch::allclose(w_out_cuda_tensor, torch_w_out, 1e-05, 1e-05)); + free(converted_wout_tensor); + + // ============================================================================= + // Compute the softmax(QK^T/sqrt(d_k))V product, request by request + // ============================================================================= + + // ----------------------- C++ initialization steps ------------------------- + torch::Tensor Q_projs = qkv_projs.index({Slice(), Slice(), 0, Slice()}) + .reshape({qkv_projs.sizes()[0], + qkv_projs.sizes()[1], + qkv_projs.sizes()[3]}); + + torch::Tensor qk_products[bc.num_active_requests()]; + torch::Tensor qk_softmax[bc.num_active_requests()]; + torch::Tensor attn_heads[bc.num_active_requests()]; + + torch::Tensor cpp_output = + torch::zeros({m->oProjSize, bc.num_active_tokens()}); + + // ----------------------- Loading CUDA results for this step --------------- + float *qk_prods_cpu = download_tensor( + m->qk_prods, + TreeVerifyBatchConfig::MAX_NUM_TOKENS * + TreeVerifyBatchConfig::MAX_NUM_TOKENS * num_q_heads); + assert(qk_prods_cpu != nullptr); + + float *qk_prods_softmax_cpu = download_tensor( + m->qk_prods_softmax, + TreeVerifyBatchConfig::MAX_NUM_TOKENS * + TreeVerifyBatchConfig::MAX_NUM_TOKENS * num_q_heads); + assert(qk_prods_softmax_cpu != nullptr); + + float *attn_heads_cpu = download_tensor( + m->attn_heads, + TreeVerifyBatchConfig::MAX_NUM_TOKENS * m->num_q_heads * m->vProjSize); + assert(attn_heads_cpu != nullptr); + + // ----------------------- Main loop (request by request) ------------------- + size_t qk_prods_cpu_offset = 0; + + for (size_t r = 0; r < bc.num_active_requests(); r++) { + // Compute pre-request parameters + size_t num_new_tokens = r_num_tokens[r]; + int64_t rid = (int64_t)(req_idxs[r]); + int64_t num_tokens_received_so_far = + (int64_t)(bc.requestsInfo[rid].token_start_offset + + bc.requestsInfo[rid].num_tokens_in_batch); + assert(num_new_tokens == bc.requestsInfo[rid].num_tokens_in_batch); + assert(num_tokens_received_so_far >= (int64_t)num_new_tokens); + + // ----------------------- C++ computations ------------------------------- + // Get the slice of the Q projection tensor with the tokens in the current + // request + torch::Tensor Q_req = + Q_projs.index({Slice(), + Slice(r_first_idx[r], r_first_idx[r] + num_new_tokens), + Slice()}); + // std::cout << "Q_req.sizes(): " << Q_req.sizes() << std::endl; + assert(Q_req.sizes()[0] == m->qProjSize); + assert(Q_req.sizes()[1] == num_new_tokens); + assert(Q_req.sizes()[2] == num_q_heads); + + /*printf("\n------------ QK multiplication (C++) -------------\n"); + printf("Request r=%lu. num_new_tokens: %lu, num_tokens_received_so_far: %li, + rid: %li, Qproj slice: (%i, %i)\n", r, num_new_tokens, + num_tokens_received_so_far, rid, r_first_idx[r], r_first_idx[r] + + num_new_tokens); + + std::cout << "Q_req matrix (idk dims):" << std::endl << + Q_req.index({Slice(), Slice(), 0}) << std::endl << std::endl; std::cout << + "K_t matrix (ilk dims):" << std::endl << K_t.index({Slice(), Slice(0, + num_tokens_received_so_far), 0, rid}) << std::endl << std::endl; std::cout + << "C++ alpha: " << (1.0f / sqrt(m->kProjSize)) << std::endl;*/ + + // Compute (Q*K^T)/sqrt(d_k) matmul + qk_products[r] = + torch::einsum("ijk,ilk->jlk", + {Q_req, + K_t.index({Slice(), + Slice(0, num_tokens_received_so_far), + Slice(), + rid})}) * + (1.0f / sqrt(m->kProjSize)); + + // Set entries above diagonal to -inf to make attention causal. 
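+ // Only the trailing num_new_tokens key positions (the tokens speculated in
+ // this batch) need masking: each new token may attend to every token received
+ // so far plus the new tokens at or before its own position, so the loop below
+ // keeps the lower triangle of that trailing block and adds an upper-triangular
+ // block of -inf with the diagonal zeroed.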
+ for (int h = 0; h < num_q_heads; h++) { + qk_products[r].index( + {Slice(), Slice(num_tokens_received_so_far - num_new_tokens), h}) = + qk_products[r] + .index({Slice(), + Slice(num_tokens_received_so_far - num_new_tokens), + h}) + .tril() + + torch::full({(int64_t)num_new_tokens, (int64_t)num_new_tokens}, + -INFINITY) + .triu() + .fill_diagonal_(0); + } + // Compute softmax for each request block + qk_softmax[r] = torch::softmax(qk_products[r], -2); + assert(qk_softmax[r].sizes()[0] == num_new_tokens); + assert(qk_softmax[r].sizes()[1] == num_tokens_received_so_far); + assert(qk_softmax[r].sizes()[2] == m->num_q_heads); + + // ------------------- Loading CUDA results for this step --------------- + float *converted_qk_prod = (float *)calloc( + num_new_tokens * num_tokens_received_so_far * num_q_heads, + sizeof(float)); + float *converted_qk_prod_softmax = (float *)calloc( + num_new_tokens * num_tokens_received_so_far * num_q_heads, + sizeof(float)); + std::vector converted_qk_prod_shape = { + (int)num_new_tokens, (int)num_tokens_received_so_far, (int)num_q_heads}; + + for (size_t i = 0; + i < num_new_tokens * num_tokens_received_so_far * num_q_heads; + i++) { + size_t new_t_idx = i % num_new_tokens; + size_t all_t_idx = (i / num_new_tokens) % num_tokens_received_so_far; + size_t head_idx = i / (num_new_tokens * num_tokens_received_so_far); + assert(new_t_idx < num_new_tokens && + all_t_idx < num_tokens_received_so_far && head_idx < num_q_heads); + set_value_row_major(converted_qk_prod, + converted_qk_prod_shape, + {(int)new_t_idx, (int)all_t_idx, (int)head_idx}, + qk_prods_cpu[i + qk_prods_cpu_offset]); + set_value_row_major(converted_qk_prod_softmax, + converted_qk_prod_shape, + {(int)new_t_idx, (int)all_t_idx, (int)head_idx}, + qk_prods_softmax_cpu[i + qk_prods_cpu_offset]); + } + torch::Tensor qk_prods_cuda = torch::from_blob( + converted_qk_prod, + {(int64_t)num_new_tokens, num_tokens_received_so_far, num_q_heads}, + torch::kFloat32); + torch::Tensor qk_prods_softmax_cuda = torch::from_blob( + converted_qk_prod_softmax, + {(int64_t)num_new_tokens, num_tokens_received_so_far, num_q_heads}, + torch::kFloat32); + + // ------------------- Comparing C++ & CUDA results ------------------ + /* std::cout << "C++:" <vProjSize); + assert( + V_t.index({Slice(), Slice(0, num_tokens_received_so_far), Slice(), rid}) + .sizes()[1] == num_tokens_received_so_far); + assert( + V_t.index({Slice(), Slice(0, num_tokens_received_so_far), Slice(), rid}) + .sizes()[2] == m->num_q_heads); + attn_heads[r] = torch::einsum( + "ijk,ljk->ilk", + {qk_softmax[r], + V_t.index( + {Slice(), Slice(0, num_tokens_received_so_far), Slice(), rid})}); + assert(attn_heads[r].sizes()[0] == num_new_tokens); + assert(attn_heads[r].sizes()[1] == m->vProjSize); + assert(attn_heads[r].sizes()[2] == m->num_q_heads); + + // ------------------- Loading CUDA results for this step --------------- + float converted_attn_heads_cpu[num_new_tokens][m->vProjSize] + [m->num_q_heads] = {0}; + for (int i = 0; i < num_new_tokens * m->vProjSize * m->num_q_heads; i++) { + int token_ix = i % num_new_tokens; + int vproj_idx = (i / num_new_tokens) % m->vProjSize; + int head_idx = i / (num_new_tokens * m->vProjSize); + assert(token_ix < num_new_tokens && vproj_idx < m->vProjSize && + head_idx < m->num_q_heads); + converted_attn_heads_cpu[token_ix][vproj_idx][head_idx] = + attn_heads_cpu[r_first_idx[r] * m->vProjSize * m->num_q_heads + i]; + } + torch::Tensor converted_attn_heads_cuda = torch::from_blob( + converted_attn_heads_cpu, + 
{(int64_t)num_new_tokens, m->vProjSize, m->num_q_heads}, + torch::kFloat32); + + // -------------------- Comparing C++ & CUDA results ------------------- + /* std::cout << "CUDA attn head for req " << r << ":" <num_q_heads; h++) { + std::cout << converted_attn_heads_cuda.index({Slice(), Slice(), h}) << + std::endl; + } + std::cout << "C++ attn head for req " << r << ":" <num_q_heads; h++) { + std::cout << attn_heads[r].index({Slice(), Slice(), h}) << std::endl; + } */ + assert(torch::allclose( + converted_attn_heads_cuda, attn_heads[r], 1e-05, 1e-05)); + + // ----------------------- C++ computations ---------------------------- + // Compute output values by projecting all heads to output space + cpp_output.index( + {Slice(), + Slice(r_first_idx[r], r_first_idx[r] + (int64_t)num_new_tokens)}) = + torch::einsum("jkl,ijk->li", {torch_w_out, attn_heads[r]}); + + // increment main loop's auxiliary index + qk_prods_cpu_offset += + num_new_tokens * num_tokens_received_so_far * num_q_heads; + } + + // ----------------------- Comparing C++ & CUDA results --------------------- + /* std::cout << "C++:" <oProjSize; i++) { + std::cout << cpp_output.index({i, Slice()}) << std::endl; + } + std::cout << "CUDA:" <oProjSize; i++) { + std::cout << torch_out_cuda.index({i, Slice(0, + (int64_t)bc.num_active_tokens())}) << std::endl; + } */ + + assert( + torch::allclose(torch_out_cuda.index( + {Slice(), Slice(0, (int64_t)bc.num_active_tokens())}), + cpp_output, + 1e-05, + 1e-05)); + + // ============================================================================= + // Cleanup + // ============================================================================= + free(w_out); + checkCUDA(cudaFreeHost(input_cpu)); + checkCUDA(cudaFreeHost(weight_cpu)); + checkCUDA(cudaFreeHost(output_cpu)); + checkCUDA(cudaFreeHost(QKVProjArray_cpu)); + checkCUDA(cudaFreeHost(keyCache_cpu)); + checkCUDA(cudaFreeHost(valueCache_cpu)); + checkCUDA(cudaFreeHost(qk_prods_cpu)); + checkCUDA(cudaFreeHost(qk_prods_softmax_cpu)); + checkCUDA(cudaFreeHost(attn_heads_cpu)); + checkCUDA(cudaFreeHost(w_out_cuda)); + // assert(false && "All good if you see this assert failure! 
:)"); +#endif + // Done with INFERENCE_TESTS block +} + +void TreeIncMultiHeadSelfAttention::backward(FFModel const &ff) { + // TreeIncMultiHeadSelfAttention does not support backward + assert(false); +} + +bool TreeIncMultiHeadSelfAttention::get_int_parameter(PMParameter para, + int *value) const { + switch (para) { + case PM_NUM_HEADS: + *value = num_q_heads; + return true; + default: + return Op::get_int_parameter(para, value); + } +} + +bool TreeIncMultiHeadSelfAttention::measure_operator_cost( + Simulator *sim, MachineView const &mv, CostMetrics &cost_metrics) const { + return false; +} + +bool operator==(TreeIncMultiHeadSelfAttentionParams const &lhs, + TreeIncMultiHeadSelfAttentionParams const &rhs) { + return lhs.layer_guid == rhs.layer_guid && lhs.embed_dim == rhs.embed_dim && + lhs.num_q_heads == rhs.num_q_heads && lhs.kdim == rhs.kdim && + lhs.vdim == rhs.vdim && lhs.dropout == rhs.dropout && + lhs.bias == rhs.bias && lhs.add_bias_kv == rhs.add_bias_kv && + lhs.add_zero_attn == rhs.add_zero_attn && + lhs.apply_rotary_embedding == rhs.apply_rotary_embedding && + lhs.scaling_query == rhs.scaling_query && + lhs.scaling_factor == rhs.scaling_factor && + lhs.qk_prod_scaling == rhs.qk_prod_scaling; +} + +TreeIncMultiHeadSelfAttentionParams + TreeIncMultiHeadSelfAttention::get_params() const { + TreeIncMultiHeadSelfAttentionParams params; + params.layer_guid = this->layer_guid; + params.embed_dim = this->oProjSize; + params.num_q_heads = this->num_q_heads; + params.num_kv_heads = this->num_kv_heads; + params.kdim = this->kProjSize; + params.vdim = this->vProjSize; + params.dropout = this->dropout; + params.bias = this->bias; + params.add_bias_kv = this->add_bias_kv; + params.add_zero_attn = this->add_zero_attn; + params.apply_rotary_embedding = this->apply_rotary_embedding; + params.scaling_query = this->scaling_query; + params.scaling_factor = this->scaling_factor; + params.qk_prod_scaling = this->qk_prod_scaling; + params.tensor_parallelism_degree = this->tensor_parallelism_degree; + return params; +} + +}; // namespace FlexFlow + +namespace std { +size_t hash::operator()( + FlexFlow::TreeIncMultiHeadSelfAttentionParams const ¶ms) const { + size_t key = 0; + hash_combine(key, params.layer_guid.id); + hash_combine(key, params.embed_dim); + hash_combine(key, params.num_q_heads); + hash_combine(key, params.num_kv_heads); + hash_combine(key, params.kdim); + hash_combine(key, params.vdim); + hash_combine(key, params.dropout); + hash_combine(key, params.bias); + hash_combine(key, params.add_bias_kv); + hash_combine(key, params.add_zero_attn); + hash_combine(key, params.apply_rotary_embedding); + hash_combine(key, params.scaling_query); + hash_combine(key, params.scaling_factor); + hash_combine(key, params.qk_prod_scaling); + hash_combine(key, params.quantization_type); + hash_combine(key, params.offload); + hash_combine(key, params.tensor_parallelism_degree); + return key; +} +}; // namespace std diff --git a/src/ops/tree_inc_multihead_self_attention.cpp b/src/ops/tree_inc_multihead_self_attention.cpp new file mode 100644 index 0000000000..a20077efb4 --- /dev/null +++ b/src/ops/tree_inc_multihead_self_attention.cpp @@ -0,0 +1,102 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "flexflow/ops/tree_inc_multihead_self_attention.h" +#include "flexflow/utils/hip_helper.h" +#include + +namespace FlexFlow { + +// declare Legion names +using Legion::coord_t; +using Legion::Memory; + +/*static*/ +void TreeIncMultiHeadSelfAttention::inference_kernel_wrapper( + TreeIncMultiHeadSelfAttentionMeta *m, + TreeVerifyBatchConfig const *bc, + int shard_id, + GenericTensorAccessorR const &input, + GenericTensorAccessorR const &weight, + GenericTensorAccessorW const &output, + GenericTensorAccessorR const &bias) { + hipStream_t stream; + checkCUDA(get_legion_stream(&stream)); + + hipEvent_t t_start, t_end; + if (m->profiling) { + hipEventCreate(&t_start); + hipEventCreate(&t_end); + hipEventRecord(t_start, stream); + } + + handle_unimplemented_hip_kernel(OP_TREE_INC_MULTIHEAD_SELF_ATTENTION); + + if (m->profiling) { + hipEventRecord(t_end, stream); + checkCUDA(hipEventSynchronize(t_end)); + float elapsed = 0; + checkCUDA(hipEventElapsedTime(&elapsed, t_start, t_end)); + hipEventDestroy(t_start); + hipEventDestroy(t_end); + printf("TreeIncMultiHeadSelfAttention forward time = %.2fms\n", elapsed); + // print_tensor<3, float>(acc_query.ptr, acc_query.rect, + // "[Attention:forward:query]"); print_tensor<3, float>(acc_output.ptr, + // acc_output.rect, "[Attention:forward:output]"); + } +} + +TreeIncMultiHeadSelfAttentionMeta::TreeIncMultiHeadSelfAttentionMeta( + FFHandler handler, + TreeIncMultiHeadSelfAttention const *attn, + GenericTensorAccessorR const &weight, + MemoryAllocator &gpu_mem_allocator, + int num_samples, + int _num_q_heads, + int _num_kv_heads) + : IncMultiHeadSelfAttentionMeta(handler, + TREE_VERIFY_MODE, + attn, + attn->qSize, + attn->kSize, + attn->vSize, + attn->qProjSize, + attn->kProjSize, + attn->vProjSize, + attn->oProjSize, + attn->apply_rotary_embedding, + attn->bias, + attn->scaling_query, + attn->qk_prod_scaling, + attn->add_bias_kv, + attn->scaling_factor, + weight, + gpu_mem_allocator, + num_samples, + attn->num_q_heads, + attn->num_kv_heads, + _num_q_heads, + _num_kv_heads, + attn->quantization_type, + attn->offload), + num_active_tokens(0) { + hipStream_t stream; + checkCUDA(get_legion_stream(&stream)); + checkCUDNN(miopenSetStream(handler.dnn, stream)); +} + +TreeIncMultiHeadSelfAttentionMeta::~TreeIncMultiHeadSelfAttentionMeta(void) {} + +}; // namespace FlexFlow diff --git a/src/ops/tree_inc_multihead_self_attention.cu b/src/ops/tree_inc_multihead_self_attention.cu new file mode 100644 index 0000000000..69f085d3eb --- /dev/null +++ b/src/ops/tree_inc_multihead_self_attention.cu @@ -0,0 +1,729 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) +#include "cuComplex.h" +#endif +#include "flexflow/ffconst_utils.h" +#include "flexflow/ops/kernels/inc_multihead_self_attention_kernels.h" +#include "flexflow/ops/tree_inc_multihead_self_attention.h" +#include "flexflow/utils/cuda_helper.h" + +namespace FlexFlow { + +// declare Legion names +using Legion::coord_t; +using Legion::Memory; + +using namespace Kernels::IncMultiHeadAttention; + +namespace Kernels { +namespace TreeIncMultiHeadAttention { + +template +__global__ void commit_tokens_kernel( + DT const *devQKVProjArray, + DT *kCache_ptr, + DT *vCache_ptr, + TreeVerifyBatchConfig::CommittedTokensInfo const *committedTokenInfos, + int qProjSize, + int kProjSize, + int vProjSize, + int num_tokens_to_commit, + int num_active_tokens_in_last_batch, + int num_q_heads, + int num_kv_heads, + int max_seq_len) { + + CUDA_KERNEL_LOOP( + i, num_tokens_to_commit * (kProjSize + vProjSize) * num_kv_heads) { + bool k_cache = i < (num_tokens_to_commit * kProjSize * num_kv_heads); + int real_i = + k_cache ? i : i - (num_tokens_to_commit * kProjSize * num_kv_heads); + + int proj_size = k_cache ? kProjSize : vProjSize; + int data_idx = real_i % proj_size; + int head_idx = real_i / (num_tokens_to_commit * proj_size); + int token_pos = + (real_i - head_idx * (num_tokens_to_commit * proj_size)) / proj_size; + int token_idx_in_last_batch = committedTokenInfos[token_pos].token_index; + assert(token_idx_in_last_batch < num_active_tokens_in_last_batch); + + int q_array_size = + qProjSize * num_active_tokens_in_last_batch * num_q_heads; + int k_array_size = + kProjSize * num_active_tokens_in_last_batch * num_kv_heads; + + DT val = + devQKVProjArray[q_array_size + (k_cache ? 0 : k_array_size) + + head_idx * proj_size * num_active_tokens_in_last_batch + + token_idx_in_last_batch * proj_size + data_idx]; + int const req_id = committedTokenInfos[token_pos].request_index; + int const tok_id = committedTokenInfos[token_pos].token_depth; + + DT *cache_ptr = k_cache ? kCache_ptr : vCache_ptr; + cache_ptr[req_id * (num_kv_heads * max_seq_len * proj_size) + + head_idx * (max_seq_len * proj_size) + tok_id * proj_size + + data_idx] = val; + } +} + +template +void commit_tokens(TreeIncMultiHeadSelfAttentionMeta const *m, + TreeVerifyBatchConfig const *bc, + cudaStream_t stream) { + int num_tokens_to_commit = bc->num_tokens_to_commit; + if (num_tokens_to_commit > 0) { + int parallelism = + (m->kProjSize + m->vProjSize) * num_tokens_to_commit * m->num_kv_heads; + commit_tokens_kernel<<>>( + static_cast
<DT *>(m->devQKVProjArray), + static_cast<DT *>(m->keyCache), + static_cast<DT *>
(m->valueCache), + m->committed_token_infos, + m->qProjSize, + m->kProjSize, + m->vProjSize, + num_tokens_to_commit, + m->num_active_tokens, // number of active tokens in previous batch + m->num_q_heads, + m->num_kv_heads, + BatchConfig::MAX_SEQ_LENGTH); + } +} + +template +__global__ void update_tree_branch_kv_cache( + DT const *devQKVProjArray, + DT *kCache_ptr, + DT *vCache_ptr, + TreeVerifyBatchConfig::PerTokenInfo const *tokenInfos, + int qProjSize, + int kProjSize, + int vProjSize, + int num_tokens_in_branch, + int processed_tokens_in_batch, + int total_tokens_in_batch, + int num_q_heads, + int num_kv_heads, + int max_seq_len) { + CUDA_KERNEL_LOOP( + i, num_tokens_in_branch * (kProjSize + vProjSize) * num_kv_heads) { + + int q_array_size = qProjSize * total_tokens_in_batch * num_q_heads; + int k_array_size = kProjSize * total_tokens_in_batch * num_kv_heads; + + bool k_cache = i < (num_tokens_in_branch * kProjSize * num_kv_heads); + int real_i = + k_cache ? i : i - (num_tokens_in_branch * kProjSize * num_kv_heads); + + int proj_size = k_cache ? kProjSize : vProjSize; + int data_idx = real_i % proj_size; + int token_idx = + (real_i / proj_size) % num_tokens_in_branch; // index in the tree branch + int head_idx = real_i / (proj_size * num_tokens_in_branch); + + token_idx += processed_tokens_in_batch; // get index in the whole batch + DT val = devQKVProjArray[q_array_size + (k_cache ? 0 : k_array_size) + + head_idx * proj_size * total_tokens_in_batch + + token_idx * proj_size + data_idx]; + + int const req_id = tokenInfos[token_idx].request_index; + int const tok_id = tokenInfos[token_idx].abs_depth_in_request; + DT *cache_ptr = k_cache ? kCache_ptr : vCache_ptr; + + cache_ptr[req_id * (num_kv_heads * max_seq_len * proj_size) + + head_idx * (max_seq_len * proj_size) + tok_id * proj_size + + data_idx] = val; + } +} + +template +__global__ void tree_fill_entries_above_diagonal(DT *matrix, + size_t new_tokens, + size_t total_tokens_in_request, + size_t num_q_heads, + DT value) { + CUDA_KERNEL_LOOP(i, new_tokens * total_tokens_in_request * num_q_heads) { + // size_t head_idx = i / (new_tokens * total_tokens_in_request); + size_t src_idx = (i / new_tokens) % total_tokens_in_request; + size_t dst_idx = i % new_tokens + total_tokens_in_request - new_tokens; + // Casual Mask + if (src_idx > dst_idx) { + matrix[i] = value; + } + } +} + +template +void compute_attention_kernel(TreeIncMultiHeadSelfAttentionMeta const *m, + TreeVerifyBatchConfig const *bc, + int shard_id, + DT *output_ptr, + DT const *bias_ptr, + DT const *weight_ptr, + cudaStream_t stream) { + checkCUDA(cublasSetStream(m->handle.blas, stream)); + checkCUDNN(cudnnSetStream(m->handle.dnn, stream)); + cudaDataType_t cublas_data_type = ff_to_cuda_datatype(m->output_type[0]); + cudnnDataType_t cudnn_data_type = ff_to_cudnn_datatype(m->output_type[0]); + assert(data_type_size(m->output_type[0]) == sizeof(DT)); +#if CUDA_VERSION >= 11000 + // TODO: currently set the default to CUBLAS_COMPUTE_16F for best performance + cublasComputeType_t compute_type = CUBLAS_COMPUTE_16F; +#else + cudaDataType_t compute_type = cublas_data_type; +#endif + // int num_requests = bc->num_active_requests(); + int processed_tokens_in_batch = 0; + // int qkv_block_size = + // (m->qProjSize + m->kProjSize + m->vProjSize) * bc->num_active_tokens(); + int q_block_size = m->qProjSize * bc->num_active_tokens(); + int kt_block_size = m->kProjSize * BatchConfig::MAX_SEQ_LENGTH; + int kt_req_block_size = kt_block_size * m->num_kv_heads; + int vt_block_size = 
m->vProjSize * BatchConfig::MAX_SEQ_LENGTH; + int vt_req_block_size = vt_block_size * m->num_kv_heads; + assert(m->qProjSize == m->kProjSize); + + for (int i = 0; i < bc->MAX_NUM_REQUESTS; i++) { + if (bc->request_completed[i]) { + continue; + } + int last_token_idx_of_the_request = + processed_tokens_in_batch + bc->requestsInfo[i].num_tokens_in_batch - 1; + while (processed_tokens_in_batch <= last_token_idx_of_the_request) { + int num_new_tokens = 1; + int j = processed_tokens_in_batch; + while ((j + 1 <= last_token_idx_of_the_request) && + (bc->tokensInfo[j].abs_depth_in_request + 1 == + bc->tokensInfo[j + 1].abs_depth_in_request)) { + j++; + num_new_tokens++; + } + + int total_tokens_in_request = bc->tokensInfo[j].abs_depth_in_request + 1; + assert(num_new_tokens >= 1 && total_tokens_in_request >= num_new_tokens); + { + // update K-V cache + int parallelism = + (m->kProjSize + m->vProjSize) * num_new_tokens * m->num_kv_heads; + update_tree_branch_kv_cache<<>>( + static_cast
<DT *>(m->devQKVProjArray), + static_cast<DT *>(m->keyCache), + static_cast<DT *>
(m->valueCache), + m->token_infos, + m->qProjSize, + m->kProjSize, + m->vProjSize, + num_new_tokens, // num_tokens_in_branch + processed_tokens_in_batch, // num_processed_tokens_in_batch + m->num_active_tokens, // total_tokens_in_batch + m->num_q_heads, + m->num_kv_heads, + BatchConfig::MAX_SEQ_LENGTH); + } + + // bc->token_last_available_idx[i] + 1; + // Compute (QK^T/sqrt(d_k)) + int m_ = num_new_tokens; + int n = total_tokens_in_request; + int k = m->qProjSize; + int lda = k, ldb = k, ldc = m_; + int strideA = q_block_size; + int strideB = kt_block_size; + int strideC = num_new_tokens * total_tokens_in_request; + + // a flag of using this scaling alpha + DT alpha = 1.0f, beta = 0.0f; + if (*m->qk_prod_scaling) { + alpha = static_cast
<DT>(1.0f / sqrt(m->kProjSize)); + } + // To get A, skip over Q entries from previous requests (same head) + DT const *A = static_cast<DT *>
(m->devQKVProjArray) + + processed_tokens_in_batch * m->qProjSize; + // To get B, skip over K entries from previous requests (all heads + + // padding) + DT const *B = static_cast<DT *>
(m->keyCache) + i * kt_req_block_size; + // To get C, skip over QK^T products from previous requests + DT *C = static_cast<DT *>
(m->qk_prods); + + if (m->num_q_heads == m->num_kv_heads) { + checkCUDA(cublasGemmStridedBatchedEx(m->handle.blas, + CUBLAS_OP_T, + CUBLAS_OP_N, + m_, + n, + k, + &alpha, + A, + cublas_data_type, + lda, + strideA, + B, + cublas_data_type, + ldb, + strideB, + &beta, + C, + cublas_data_type, + ldc, + strideC, + m->num_q_heads, + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + } else { + strideB = 0; + int one_step_heads = m->num_q_heads / m->num_kv_heads; + for (int step = 0; step < m->num_kv_heads; step++) { + checkCUDA( + cublasGemmStridedBatchedEx(m->handle.blas, + CUBLAS_OP_T, + CUBLAS_OP_N, + m_, + n, + k, + &alpha, + A + step * strideA * one_step_heads, + cublas_data_type, + lda, + strideA, + B + step * kt_block_size, + cublas_data_type, + ldb, + strideB, + &beta, + C + step * strideC * one_step_heads, + cublas_data_type, + ldc, + strideC, + one_step_heads, + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + } + } + + // Fill all elements above diagonal in qk prods with -inf to force + // causal attention. + assert(num_new_tokens <= total_tokens_in_request); + if (num_new_tokens > 1) { + size_t parallelism = + m->num_q_heads * num_new_tokens * total_tokens_in_request; + tree_fill_entries_above_diagonal<<>>( + C, + num_new_tokens, + total_tokens_in_request, + m->num_q_heads, + static_cast
<DT>(-INFINITY)); + // Compute Softmax(QK^T/sqrt(d_k)) + // Before modifying the parameters below, make sure to read the following + // description of the CUDNN_TENSOR_NCHW tensor layout, from + // https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnTensorFormat_t: + // This tensor format specifies that the data is laid out in the following + // order: batch size, feature maps, rows, columns. The strides are + // implicitly defined in such a way that the data are contiguous in memory + // with no padding between images, feature maps, rows, and columns; the + // columns are the inner dimension and the images are the outermost + // dimension. + int n_param = m->num_q_heads; + int c_param = total_tokens_in_request; + int h_param = 1; + int w_param = num_new_tokens; + checkCUDNN(cudnnSetTensor4dDescriptor(m->qk_tensor, + CUDNN_TENSOR_NCHW, + cudnn_data_type, + n_param, + c_param, + h_param, + w_param)); + float softmax_alpha = 1.0f, softmax_beta = 0.0f; + DT *C_softmax = static_cast<DT *>
(m->qk_prods_softmax); + // The softmax operation below is executed according to the + // CUDNN_SOFTMAX_MODE_CHANNEL, which is also described in the docs: The + // softmax operation is computed per spatial location (H,W) per image (N) + // across dimension C. + checkCUDNN(cudnnSoftmaxForward(m->handle.dnn, + CUDNN_SOFTMAX_ACCURATE, + CUDNN_SOFTMAX_MODE_CHANNEL, + &softmax_alpha, + m->qk_tensor, + C, + &softmax_beta, + m->qk_tensor, + C_softmax)); + // Matmul softmax(QK^T/sqrt(d_k)) by V + alpha = 1.0f, beta = 0.0f; + m_ = num_new_tokens; + n = m->vProjSize; + k = total_tokens_in_request; + lda = m_, ldb = n, ldc = m_; + strideA = num_new_tokens * total_tokens_in_request; + strideB = vt_block_size; + strideC = num_new_tokens * m->vProjSize; + // To get A, skip over softmax(QK^T/sqrt(d_k)) entries from previous + // requests (all heads) + A = C_softmax; + // To get B, skip over V^T entries from previous requests (all heads + + // padding) + B = static_cast
<DT *>(m->valueCache) + i * vt_req_block_size; + // To get C, skip over softmax(QK^T/sqrt(d_k))V products from previous + // requests + C = static_cast<DT *>
(m->attn_heads) + + processed_tokens_in_batch * m->num_q_heads * m->vProjSize; + + if (m->num_q_heads == m->num_kv_heads) { + checkCUDA(cublasGemmStridedBatchedEx(m->handle.blas, + CUBLAS_OP_N, + CUBLAS_OP_T, + m_, + n, + k, + &alpha, + A, + cublas_data_type, + lda, + strideA, + B, + cublas_data_type, + ldb, + strideB, + &beta, + C, + cublas_data_type, + ldc, + strideC, + m->num_q_heads, + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + } else { + int one_step_heads = m->num_q_heads / m->num_kv_heads; + strideB = 0; + for (int step = 0; step < m->num_kv_heads; step++) { + checkCUDA( + cublasGemmStridedBatchedEx(m->handle.blas, + CUBLAS_OP_N, + CUBLAS_OP_T, + m_, + n, + k, + &alpha, + A + step * one_step_heads * strideA, + cublas_data_type, + lda, + strideA, + B + step * vt_block_size, + cublas_data_type, + ldb, + strideB, + &beta, + C + step * one_step_heads * strideC, + cublas_data_type, + ldc, + strideC, + one_step_heads, + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + } + } + + // Project to output, save result directly on output tensor + alpha = 1.0f, beta = 0.0f; + m_ = m->oProjSize; + k = m->vProjSize * m->num_q_heads; + n = num_new_tokens; + lda = k, ldb = n, ldc = m_; + A = weight_ptr + m->qSize * (m->qProjSize * m->num_q_heads + + m->kProjSize * m->num_kv_heads + + m->vProjSize * m->num_kv_heads); + B = C; + C = static_cast
(output_ptr) + + processed_tokens_in_batch * m->oProjSize; + + checkCUDA(cublasGemmEx(m->handle.blas, + CUBLAS_OP_T, + CUBLAS_OP_T, + m_, + n, + k, + &alpha, + A, + cublas_data_type, + lda, + B, + cublas_data_type, + ldb, + &beta, + C, + cublas_data_type, + ldc, + compute_type, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); + processed_tokens_in_batch += num_new_tokens; + } + // Before moving to the next request + // check that we have finished all tokens of the request + assert(last_token_idx_of_the_request + 1 == processed_tokens_in_batch); + } + if (*m->bias && shard_id == 0) { + int parallelism = m->oProjSize * processed_tokens_in_batch; + int qkv_weight_size = m->qProjSize * m->global_num_q_heads + + m->kProjSize * m->global_num_kv_heads + + m->vProjSize * m->global_num_kv_heads; + apply_proj_bias_w<<>>(output_ptr, + bias_ptr, + processed_tokens_in_batch, + qkv_weight_size, + m->oProjSize); + } + + assert(processed_tokens_in_batch == bc->num_active_tokens()); +} + +template +void inference_kernel(TreeIncMultiHeadSelfAttentionMeta *m, + TreeVerifyBatchConfig const *bc, + int shard_id, + DT const *input_ptr, + DT const *weight_ptr, + DT *output_ptr, + DT const *bias_ptr, + cudaStream_t stream) { + // additional processing for weight uploading + if (m->handle.offload_reserve_space != nullptr) { + // Note that we update weight_ptr and bias_ptr when uploading weight and + // bias + cudaMemcpyAsync(m->weight_ptr, + weight_ptr, + m->weightSize, + cudaMemcpyHostToDevice, + stream); + weight_ptr = static_cast
<DT *>(m->weight_ptr); + if (m->biasSize > 0) { + cudaMemcpyAsync( + m->bias_ptr, bias_ptr, m->biasSize, cudaMemcpyHostToDevice, stream); + bias_ptr = static_cast<DT *>
(m->bias_ptr); + } + } + // copy committed tokens info to GPU for the commit_tokens kernel + // Note that m->num_active_tokens stores the number of active + // tokens in the previous batch, which is needed for committing + // keys/values to the key-value cache + cudaMemcpyAsync(m->committed_token_infos, + &(bc->committed_tokens), + bc->num_tokens_to_commit * + sizeof(TreeVerifyBatchConfig::CommittedTokensInfo), + cudaMemcpyHostToDevice, + stream); + commit_tokens<DT>
(m, bc, stream); + + // After commit, we update m->num_active_tokens to be the number of active + // tokens for the current batch + m->num_active_tokens = bc->num_active_tokens(); + + // here because we need position info in inference + if (m->offload && m->biasSize > 0) { + cudaMemcpyAsync( + m->bias_ptr, bias_ptr, m->biasSize, cudaMemcpyHostToDevice, stream); + bias_ptr = static_cast<DT *>
(m->bias_ptr); + } + cudaMemcpyAsync(m->token_infos, + &(bc->tokensInfo), + bc->MAX_NUM_TOKENS * + sizeof(TreeVerifyBatchConfig::PerTokenInfo), + cudaMemcpyHostToDevice, + stream); + // phase 1: compute the QKV projections for the input tokens + compute_qkv_kernel(m, + bc, + shard_id, + input_ptr, + weight_ptr, + static_cast<DT *>
(m->devQKVProjArray), + bias_ptr, + stream); + + // phase 2: No need to update key/val cache + // IncMultiHeadSelfAttention::update_kv_cache_kernel( + // m, bc, stream); + + // phase 3: Compute attention score + // 3 kernels for pahse 3: matmul1 - softmax - matmal2 + compute_attention_kernel( + m, bc, shard_id, output_ptr, bias_ptr, weight_ptr, stream); +} + +} // namespace TreeIncMultiHeadAttention +} // namespace Kernels + +/*static*/ +void TreeIncMultiHeadSelfAttention::inference_kernel_wrapper( + TreeIncMultiHeadSelfAttentionMeta *m, + TreeVerifyBatchConfig const *bc, + int shard_id, + GenericTensorAccessorR const &input, + GenericTensorAccessorR const &weight, + GenericTensorAccessorW const &output, + GenericTensorAccessorR const &bias) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + bool use_bias = *m->bias; + + cudaEvent_t t_start, t_end; + if (m->profiling) { + cudaEventCreate(&t_start); + cudaEventCreate(&t_end); + cudaEventRecord(t_start, stream); + } + + // assert(input.data_type == weight.data_type); + assert(input.data_type == output.data_type); + if (use_bias) { + assert(input.data_type == bias.data_type); + } + + if (input.data_type == DT_HALF) { + if (m->offload) { + pre_build_weight_kernel(m, weight, input.data_type, stream); + } + + half const *bias_ptr = + use_bias ? bias.get_half_ptr() : static_cast(nullptr); + Kernels::TreeIncMultiHeadAttention::inference_kernel( + m, + bc, + shard_id, + input.get_half_ptr(), + m->offload ? static_cast(m->weight_ptr) : weight.get_half_ptr(), + output.get_half_ptr(), + bias_ptr, + stream); + } else if (input.data_type == DT_FLOAT) { + if (m->offload) { + pre_build_weight_kernel(m, weight, input.data_type, stream); + } + float const *bias_ptr = + use_bias ? bias.get_float_ptr() : static_cast(nullptr); + Kernels::TreeIncMultiHeadAttention::inference_kernel( + m, + bc, + shard_id, + input.get_float_ptr(), + m->offload ? 
static_cast(m->weight_ptr) + : weight.get_float_ptr(), + output.get_float_ptr(), + bias_ptr, + stream); + } else { + assert(false && "Unspported data type"); + } + + if (m->profiling) { + cudaEventRecord(t_end, stream); + checkCUDA(cudaEventSynchronize(t_end)); + float elapsed = 0; + checkCUDA(cudaEventElapsedTime(&elapsed, t_start, t_end)); + cudaEventDestroy(t_start); + cudaEventDestroy(t_end); + printf("TreeIncMultiHeadSelfAttention forward time = %.2fms\n", elapsed); + // print_tensor<3, float>(acc_query.ptr, acc_query.rect, + // "[Attention:forward:query]"); print_tensor<3, float>(acc_output.ptr, + // acc_output.rect, "[Attention:forward:output]"); + } +} + +TreeIncMultiHeadSelfAttentionMeta::TreeIncMultiHeadSelfAttentionMeta( + FFHandler handler, + TreeIncMultiHeadSelfAttention const *attn, + GenericTensorAccessorR const &weight, + MemoryAllocator &gpu_mem_allocator, + int num_samples, + int _num_q_heads, + int _num_kv_heads) + : IncMultiHeadSelfAttentionMeta(handler, + TREE_VERIFY_MODE, + attn, + attn->qSize, + attn->kSize, + attn->vSize, + attn->qProjSize, + attn->kProjSize, + attn->vProjSize, + attn->oProjSize, + attn->apply_rotary_embedding, + attn->bias, + attn->scaling_query, + attn->qk_prod_scaling, + attn->add_bias_kv, + attn->scaling_factor, + weight, + gpu_mem_allocator, + num_samples, + attn->num_q_heads, + attn->num_kv_heads, + _num_q_heads, + _num_kv_heads, + attn->quantization_type, + attn->offload), + num_active_tokens(0) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + checkCUDNN(cudnnSetStream(handler.dnn, stream)); + + // allocate memory for the seqArray and reserve space + { + size_t committed_tokeninfo_size = TreeVerifyBatchConfig::MAX_NUM_TOKENS; + size_t total_size = committed_tokeninfo_size * + sizeof(TreeVerifyBatchConfig::CommittedTokensInfo); + if (offload) { + // assert that we have enough reserved work space left + assert(gpu_mem_allocator.reserved_total_size - + gpu_mem_allocator.reserved_allocated_size >= + total_size); + committed_token_infos = + gpu_mem_allocator + .allocate_reserved( + committed_tokeninfo_size); + } else { + gpu_mem_allocator.create_legion_instance(committed_token_reserve_inst, + total_size); + committed_token_infos = + gpu_mem_allocator + .allocate_instance( + committed_tokeninfo_size); + } + } + + cudaStreamSynchronize(stream); +} + +TreeIncMultiHeadSelfAttentionMeta::~TreeIncMultiHeadSelfAttentionMeta(void) { + if (committed_token_reserve_inst != Realm::RegionInstance::NO_INST) { + committed_token_reserve_inst.destroy(); + } +} + +}; // namespace FlexFlow diff --git a/src/parallel_ops/allreduce.cc b/src/parallel_ops/allreduce.cc new file mode 100644 index 0000000000..027d15c929 --- /dev/null +++ b/src/parallel_ops/allreduce.cc @@ -0,0 +1,384 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/parallel_ops/allreduce.h" +#include "flexflow/ffconst_utils.h" +#include "flexflow/model.h" +#include "flexflow/parallel_ops/kernels/allreduce_kernels.h" +#include "flexflow/utils/hash_utils.h" + +namespace FlexFlow { +// declare Legion names +using Legion::ArgumentMap; +using Legion::Context; +using Legion::coord_t; +using Legion::Domain; +using Legion::Future; +using Legion::FutureMap; +using Legion::IndexLauncher; +using Legion::LogicalPartition; +using Legion::LogicalRegion; +using Legion::Machine; +using Legion::Memory; +using Legion::PhysicalRegion; +using Legion::Predicate; +using Legion::Rect; +using Legion::RegionRequirement; +using Legion::Runtime; +using Legion::Task; +using Legion::TaskArgument; +using Legion::TaskLauncher; + +using namespace FlexFlow::Kernels::AllReduce; + +/* Params */ +bool operator==(AllReduceParams const &lhs, AllReduceParams const &rhs) { + return lhs.allreduce_legion_dim == rhs.allreduce_legion_dim; +} + +bool AllReduceParams::is_valid(ParallelTensorShape const &input) const { + return input.is_valid(); +} + +AllReduceParams AllReduce::get_params() const { + AllReduceParams params; + params.allreduce_legion_dim = this->allreduce_dim; + return params; +} + +AllReduce::AllReduce(FFModel &model, + const ParallelTensor _input, + int _allreduce_legion_dim, + char const *name) + : ParallelOp(model, OP_ALLREDUCE, name, _input), + allreduce_dim(_allreduce_legion_dim) { + int numdim = _input->num_dims; + ParallelDim dims[MAX_TENSOR_DIM]; + for (int i = 0; i < numdim; i++) { + dims[i] = _input->dims[i]; + } + assert(dims[allreduce_dim].degree > 1); + // ParallelTensorBase::update_parallel_ids(numdim, dims); + outputs[0] = model.create_parallel_tensor_legion_ordering( + numdim, dims, _input->data_type, this); +} + +AllReduce::AllReduce(FFModel &model, + AllReduceParams const ¶ms, + ParallelTensor const input, + char const *name) + : AllReduce(model, input, params.allreduce_legion_dim, name) {} + +void AllReduce::create_input_partition(FFModel &ff) { + // Do nothing + return; +} + +void AllReduce::create_input_partition_inference( + FFModel &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs) { + assert(ff.config.computationMode == COMP_MODE_INFERENCE); + assert(batch_outputs[0]->part != LogicalPartition::NO_PART); + assert(batch_inputs[0]->part != LogicalPartition::NO_PART); + // Do nothing + return; +} + +OpMeta *AllReduce::init_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + AllReduce *ar = (AllReduce *)task->args; + FFHandler handle = *((FFHandler const *)task->local_args); + AllReduceMeta *meta = new AllReduceMeta(handle, ar); + meta->input_type[0] = ar->inputs[0]->data_type; + meta->output_type[0] = ar->outputs[0]->data_type; + assert(meta->input_type[0] == meta->output_type[0]); + return meta; +} + +void AllReduce::init(FFModel const &ff) { + ArgumentMap argmap; + parallel_is = outputs[0]->parallel_is; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + assert(numOutputs == 1); + assert(numInputs == 1); + set_argumentmap_for_init(ff, argmap); + IndexLauncher launcher(ALLREDUCE_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(AllReduce)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + outputs[0]->machine_view.hash()); + launcher.add_region_requirement(RegionRequirement(inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + inputs[0]->region)); + launcher.add_field(0, FID_DATA); + 
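+  // region 1: the output parallel tensor (write-only)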
launcher.add_region_requirement(RegionRequirement(outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + outputs[0]->region)); + launcher.add_field(1, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap(ff, fm); +} + +void AllReduce::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + parallel_is = batch_outputs[0]->parallel_is; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + assert(numOutputs == 1); + assert(numInputs == 1); + size_t machine_view_hash = + mv ? mv->hash() : batch_outputs[0]->machine_view.hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(ALLREDUCE_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(AllReduce)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + +FutureMap AllReduce::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + assert(numOutputs == 1); + assert(numInputs == 1); + assert(batch_inputs[0]->data_type == batch_outputs[0]->data_type); + DataType data_type = batch_inputs[0]->data_type; + size_t machine_view_hash = + mv ? 
mv->hash() : batch_outputs[0]->machine_view.hash(); + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(ALLREDUCE_INF_TASK_ID, + batch_outputs[0]->parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_future(bc); + launcher.add_region_requirement(RegionRequirement(batch_inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + return runtime->execute_index_space(ctx, launcher); +} + +void AllReduce::forward(FFModel const &ff) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = outputs[0]->parallel_is; + assert(numOutputs == 1); + assert(numInputs == 1); + set_argumentmap_for_forward(ff, argmap); + IndexLauncher launcher(ALLREDUCE_FWD_TASK_ID, + outputs[0]->parallel_is, + TaskArgument(NULL, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + outputs[0]->machine_view.hash()); + launcher.add_region_requirement(RegionRequirement(inputs[0]->part, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + outputs[0]->region)); + launcher.add_field(1, FID_DATA); + runtime->execute_index_space(ctx, launcher); +} + +void AllReduce::backward(FFModel const &ff) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + assert(numOutputs == 1); + assert(numInputs == 1); + IndexLauncher launcher(ALLREDUCE_BWD_TASK_ID, + inputs[0]->parallel_is, + TaskArgument(NULL, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + inputs[0]->machine_view.hash()); + launcher.add_region_requirement(RegionRequirement(inputs[0]->part_grad, + 0 /*projection id*/, + READ_WRITE, + EXCLUSIVE, + inputs[0]->region_grad)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(outputs[0]->part_grad, + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + outputs[0]->region_grad)); + launcher.add_field(1, FID_DATA); + runtime->execute_index_space(ctx, launcher); +} + +bool AllReduce::measure_operator_cost(Simulator *sim, + MachineView const &pc, + CostMetrics &cost_metrics) const { + cost_metrics = CostMetrics(); + cost_metrics.forward_time = 0.0f; + cost_metrics.backward_time = 0.0f; + + cost_metrics.sync_time = 0; + cost_metrics.inputs_memory = 0; + cost_metrics.outputs_memory = 0; + cost_metrics.weights_memory = 0; + return true; +} + +bool AllReduce::get_int_parameter(PMParameter para, int *value) const { + switch (para) { + case PM_ALLREDUCE_DIM: + *value = allreduce_dim; + return true; + default: + return Op::get_int_parameter(para, value); + } +} + +bool AllReduce::append_parallel_op_info( + std::vector ¶llel_ops) const { + ParallelOpInfo ret; + ret.op_type = op_type; + ret.parallel_dim = allreduce_dim; + ret.parallel_degree = -1; // AllReduce does not affect parallel degree + parallel_ops.push_back(ret); + return true; +} + +/*static*/ +void AllReduce::inference_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(regions.size() == 2); + 
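+  // the inference task expects exactly two regions: the read-only input and the write-only output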
assert(task->regions.size() == 2); + + AllReduceMeta const *m = *((AllReduceMeta **)task->local_args); + BatchConfig const *bc = BatchConfig::from_future(task->futures[0]); + + GenericTensorAccessorR input = helperGetGenericTensorAccessorRO( + m->input_type[0], regions[0], task->regions[0], FID_DATA, ctx, runtime); + GenericTensorAccessorW output = helperGetGenericTensorAccessorWO( + m->output_type[0], regions[1], task->regions[1], FID_DATA, ctx, runtime); + + assert(input.data_type == output.data_type); + inference_kernel_wrapper(m, bc, input, output); +} + +/*static*/ +void AllReduce::forward_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(regions.size() == 2); + assert(task->regions.size() == 2); + + AllReduceMeta const *m = *((AllReduceMeta **)task->local_args); + + GenericTensorAccessorR input = helperGetGenericTensorAccessorRO( + m->input_type[0], regions[0], task->regions[0], FID_DATA, ctx, runtime); + GenericTensorAccessorW output = helperGetGenericTensorAccessorWO( + m->output_type[0], regions[1], task->regions[1], FID_DATA, ctx, runtime); + + assert(input.data_type == output.data_type); + forward_kernel_wrapper(m, input, output); +} + +void AllReduce::backward_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(regions.size() == 2); + assert(task->regions.size() == 2); + AllReduceMeta const *m = *((AllReduceMeta **)task->local_args); + + GenericTensorAccessorW input_grad = helperGetGenericTensorAccessorRW( + m->input_type[0], regions[0], task->regions[0], FID_DATA, ctx, runtime); + GenericTensorAccessorR output_grad = helperGetGenericTensorAccessorRO( + m->output_type[0], regions[1], task->regions[1], FID_DATA, ctx, runtime); + + assert(input_grad.data_type == output_grad.data_type); + backward_kernel_wrapper(m, input_grad, output_grad); +} + +}; // namespace FlexFlow + +namespace std { +size_t hash::operator()( + FlexFlow::AllReduceParams const ¶ms) const { + size_t key = 0; + hash_combine(key, params.allreduce_legion_dim); + return key; +} + +} // namespace std diff --git a/src/parallel_ops/combine.cc b/src/parallel_ops/combine.cc index a4169ea306..7c266c5392 100644 --- a/src/parallel_ops/combine.cc +++ b/src/parallel_ops/combine.cc @@ -88,7 +88,7 @@ Combine::Combine(FFModel &model, dims[combine_dim].degree /= combine_degree; ParallelTensorBase::update_parallel_ids(numdim, dims); outputs[0] = model.create_parallel_tensor_legion_ordering( - numdim, dims, DT_FLOAT, this); + numdim, dims, _input->data_type, this); // inputs[0]->print("Combine::input"); // outputs[0]->print("Combine::output"); } @@ -97,11 +97,13 @@ OpMeta *Combine::init_task(Task const *task, std::vector const ®ions, Context ctx, Runtime *runtime) { - Combine *rep = (Combine *)task->args; - // FFHandler handle = *((FFHandler *)task->local_args); - // CombineMeta* m = new CombineMeta(handle); - // m->data_type = rep->outputs[0]->data_type; - return nullptr; + Combine *cmb = (Combine *)task->args; + FFHandler handle = *((FFHandler *)task->local_args); + CombineMeta *m = new CombineMeta(handle); + m->input_type[0] = cmb->inputs[0]->data_type; + m->output_type[0] = cmb->outputs[0]->data_type; + assert(m->input_type[0] == m->output_type[0]); + return m; } void Combine::init(FFModel const &ff) { @@ -111,6 +113,7 @@ void Combine::init(FFModel const &ff) { Runtime *runtime = ff.config.lg_hlr; assert(numOutputs == 1); assert(numInputs == 1); + set_argumentmap_for_init(ff, argmap); IndexLauncher launcher(COMBINE_INIT_TASK_ID, 
parallel_is, TaskArgument(this, sizeof(Combine)), @@ -130,6 +133,48 @@ void Combine::init(FFModel const &ff) { launcher.add_field(1, FID_DATA); FutureMap fm = runtime->execute_index_space(ctx, launcher); fm.wait_all_results(); + set_opmeta_from_futuremap(ff, fm); +} + +void Combine::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + parallel_is = batch_outputs[0]->parallel_is; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + assert(numOutputs == 1); + assert(numInputs == 1); + size_t machine_view_hash = + mv ? mv->hash() : batch_outputs[0]->machine_view.hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(COMBINE_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(Combine)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + assert(inference_input_lps.find(batch_inputs[0]) != + inference_input_lps.end()); + launcher.add_region_requirement( + RegionRequirement(inference_input_lps[batch_inputs[0]], + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); } void Combine::create_input_partition(FFModel &ff) { @@ -147,6 +192,61 @@ void Combine::create_input_partition(FFModel &ff) { output_grad_lp); } +void Combine::create_input_partition_inference( + FFModel &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs) { + assert(ff.config.computationMode == COMP_MODE_INFERENCE); + assert(batch_outputs[0]->part != LogicalPartition::NO_PART); + assert(batch_inputs[0]->part != LogicalPartition::NO_PART); + // input_lp is a disjoint partition + ff.create_disjoint_partition(batch_outputs[0]->num_dims, + batch_outputs[0]->dims, + batch_outputs[0]->parallel_is, + batch_inputs[0]->region, + inference_input_lps[batch_inputs[0]]); +} + +FutureMap Combine::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + assert(numOutputs == 1); + assert(numInputs == 1); + assert(batch_inputs[0]->data_type == batch_outputs[0]->data_type); + DataType data_type = batch_inputs[0]->data_type; + size_t machine_view_hash = + mv ? 
mv->hash() : batch_outputs[0]->machine_view.hash(); + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(COMBINE_FWD_TASK_ID, + batch_outputs[0]->parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement( + RegionRequirement(inference_input_lps[batch_inputs[0]], + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + return runtime->execute_index_space(ctx, launcher); +} + void Combine::forward(FFModel const &ff) { ArgumentMap argmap; Context ctx = ff.config.lg_ctx; @@ -157,7 +257,7 @@ void Combine::forward(FFModel const &ff) { DataType data_type = inputs[0]->data_type; IndexLauncher launcher(COMBINE_FWD_TASK_ID, outputs[0]->parallel_is, - TaskArgument(&data_type, sizeof(data_type)), + TaskArgument(nullptr, 0), argmap, Predicate::TRUE_PRED, false /*must*/, @@ -261,8 +361,11 @@ void Combine::forward_task(Task const *task, Runtime *runtime) { assert(regions.size() == 2); assert(task->regions.size() == 2); - DataType data_type = *((DataType *)task->args); - if (data_type == DT_FLOAT) { + CombineMeta const *m = *((CombineMeta **)task->local_args); + DataType data_type = m->input_type[0]; + if (data_type == DT_HALF) { + forward_task_with_type(task, regions, ctx, runtime); + } else if (data_type == DT_FLOAT) { forward_task_with_type(task, regions, ctx, runtime); } else if (data_type == DT_DOUBLE) { forward_task_with_type(task, regions, ctx, runtime); diff --git a/src/parallel_ops/kernels/allreduce_kernels.cpp b/src/parallel_ops/kernels/allreduce_kernels.cpp new file mode 100644 index 0000000000..8d0d5e97c5 --- /dev/null +++ b/src/parallel_ops/kernels/allreduce_kernels.cpp @@ -0,0 +1,57 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/parallel_ops/kernels/allreduce_kernels.h" +#include "flexflow/utils/hip_helper.h" +#include + +namespace FlexFlow { + +AllReduceMeta::AllReduceMeta(FFHandler handle, AllReduce const *reduct) + : OpMeta(handle) {} + +namespace Kernels { +namespace AllReduce { + +void inference_kernel_wrapper(AllReduceMeta const *m, + BatchConfig const *bc, + GenericTensorAccessorR const &input, + GenericTensorAccessorW const &output) { + hipStream_t stream; + checkCUDA(get_legion_stream(&stream)); + assert(input.data_type == output.data_type); + assert(input.domain == output.domain); + assert(false && "To be implemented"); +} + +void forward_kernel_wrapper(AllReduceMeta const *m, + GenericTensorAccessorR const &input, + GenericTensorAccessorW const &output) { + hipStream_t stream; + checkCUDA(get_legion_stream(&stream)); + assert(input.data_type == output.data_type); + assert(input.domain == output.domain); + assert(false && "To be implemented"); +} + +void backward_kernel_wrapper(AllReduceMeta const *m, + GenericTensorAccessorW const &input_grad, + GenericTensorAccessorR const &output_grad) { + assert(false && "To be implemented"); +} + +} // namespace AllReduce +} // namespace Kernels +} // namespace FlexFlow diff --git a/src/parallel_ops/kernels/allreduce_kernels.cu b/src/parallel_ops/kernels/allreduce_kernels.cu new file mode 100644 index 0000000000..2c000137a1 --- /dev/null +++ b/src/parallel_ops/kernels/allreduce_kernels.cu @@ -0,0 +1,80 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/parallel_ops/kernels/allreduce_kernels.h" +#include "flexflow/utils/cuda_helper.h" + +namespace FlexFlow { + +AllReduceMeta::AllReduceMeta(FFHandler handle, AllReduce const *reduct) + : OpMeta(handle) {} + +namespace Kernels { +namespace AllReduce { + +void inference_kernel_wrapper(AllReduceMeta const *m, + BatchConfig const *bc, + GenericTensorAccessorR const &input, + GenericTensorAccessorW const &output) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + assert(input.data_type == output.data_type); + assert(input.domain == output.domain); + size_t hidden_dim_size = input.domain.hi()[0] - input.domain.lo()[0] + 1; + size_t num_elements = bc->num_tokens * hidden_dim_size; +#ifdef FF_USE_NCCL + ncclDataType_t nccl_data_type = ff_to_nccl_datatype(input.data_type); + checkNCCL(ncclAllReduce(input.ptr, + output.ptr, + num_elements, + nccl_data_type, + ncclSum, + m->handle.ncclComm, + stream)); +#else + assert(false && "Must enable FF_USE_NCCL to use AllReduce operators"); +#endif +} + +void forward_kernel_wrapper(AllReduceMeta const *m, + GenericTensorAccessorR const &input, + GenericTensorAccessorW const &output) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + assert(input.data_type == output.data_type); + assert(input.domain == output.domain); +#ifdef FF_USE_NCCL + ncclDataType_t nccl_data_type = ff_to_nccl_datatype(input.data_type); + checkNCCL(ncclAllReduce(input.ptr, + output.ptr, + input.domain.get_volume(), + nccl_data_type, + ncclSum, + m->handle.ncclComm, + stream)); +#else + assert(false && "Must enable FF_USE_NCCL to use AllReduce operators"); +#endif +} + +void backward_kernel_wrapper(AllReduceMeta const *m, + GenericTensorAccessorW const &input_grad, + GenericTensorAccessorR const &output_grad) { + assert(false && "To be implemented"); +} + +} // namespace AllReduce +} // namespace Kernels +} // namespace FlexFlow diff --git a/src/parallel_ops/kernels/combine_kernels.cpp b/src/parallel_ops/kernels/combine_kernels.cpp index 2d748cfab3..d6e9568223 100644 --- a/src/parallel_ops/kernels/combine_kernels.cpp +++ b/src/parallel_ops/kernels/combine_kernels.cpp @@ -51,6 +51,9 @@ void backward_kernel(T const *output_grad_ptr, num_elements); } +template void forward_kernel(half const *input_ptr, + half *output_ptr, + size_t num_elements); template void forward_kernel(float const *input_ptr, float *output_ptr, size_t num_elements); @@ -63,6 +66,9 @@ template void forward_kernel(int32_t const *input_ptr, template void forward_kernel(int64_t const *input_ptr, int64_t *output_ptr, size_t num_elements); +template void backward_kernel(half const *output_grad_ptr, + half *input_grad_ptr, + size_t num_elements); template void backward_kernel(float const *output_grad_ptr, float *input_grad_ptr, size_t num_elements); diff --git a/src/parallel_ops/kernels/combine_kernels.cu b/src/parallel_ops/kernels/combine_kernels.cu index d8f414ef0f..1ab79a7944 100644 --- a/src/parallel_ops/kernels/combine_kernels.cu +++ b/src/parallel_ops/kernels/combine_kernels.cu @@ -44,6 +44,9 @@ void backward_kernel(T const *output_grad_ptr, input_grad_ptr, output_grad_ptr, num_elements); } +template void forward_kernel(half const *input_ptr, + half *output_ptr, + size_t num_elements); template void forward_kernel(float const *input_ptr, float *output_ptr, size_t num_elements); @@ -56,6 +59,9 @@ template void forward_kernel(int32_t const *input_ptr, template void forward_kernel(int64_t const *input_ptr, int64_t *output_ptr, size_t num_elements); +template void 
backward_kernel(half const *output_grad_ptr, + half *input_grad_ptr, + size_t num_elements); template void backward_kernel(float const *output_grad_ptr, float *input_grad_ptr, size_t num_elements); diff --git a/src/parallel_ops/kernels/reduction_kernels.cpp b/src/parallel_ops/kernels/reduction_kernels.cpp index 9143fee936..2a3fe5cca1 100644 --- a/src/parallel_ops/kernels/reduction_kernels.cpp +++ b/src/parallel_ops/kernels/reduction_kernels.cpp @@ -18,6 +18,10 @@ #include namespace FlexFlow { + +ReductionMeta::ReductionMeta(FFHandler handle, Reduction const *reduct) + : OpMeta(handle) {} + namespace Kernels { namespace Reduction { @@ -70,10 +74,18 @@ template __global__ void reduction_forward_kernel(float const *input_ptr, float *output_ptr, size_t num_elements, size_t num_replicas); +template __global__ void reduction_forward_kernel(half const *input_ptr, + half *output_ptr, + size_t num_elements, + size_t num_replicas); template void forward_kernel(float const *input_ptr, float *output_ptr, size_t num_elements, size_t num_replicas); +template void forward_kernel(half const *input_ptr, + half *output_ptr, + size_t num_elements, + size_t num_replicas); template void backward_kernel(float const *output_grad_ptr, float *input_grad_ptr, size_t num_elements); diff --git a/src/parallel_ops/kernels/reduction_kernels.cu b/src/parallel_ops/kernels/reduction_kernels.cu index 8496a107e3..34ae8007da 100644 --- a/src/parallel_ops/kernels/reduction_kernels.cu +++ b/src/parallel_ops/kernels/reduction_kernels.cu @@ -17,6 +17,10 @@ #include "flexflow/utils/cuda_helper.h" namespace FlexFlow { + +ReductionMeta::ReductionMeta(FFHandler handle, Reduction const *reduct) + : OpMeta(handle) {} + namespace Kernels { namespace Reduction { @@ -63,10 +67,18 @@ template __global__ void reduction_forward_kernel(float const *input_ptr, float *output_ptr, size_t num_elements, size_t num_replicas); +template __global__ void reduction_forward_kernel(half const *input_ptr, + half *output_ptr, + size_t num_elements, + size_t num_replicas); template void forward_kernel(float const *input_ptr, float *output_ptr, size_t num_elements, size_t num_replicas); +template void forward_kernel(half const *input_ptr, + half *output_ptr, + size_t num_elements, + size_t num_replicas); template void backward_kernel(float const *output_grad_ptr, float *input_grad_ptr, size_t num_elements); diff --git a/src/parallel_ops/kernels/replicate_kernels.cpp b/src/parallel_ops/kernels/replicate_kernels.cpp index 29f1d30d1f..1647f014be 100644 --- a/src/parallel_ops/kernels/replicate_kernels.cpp +++ b/src/parallel_ops/kernels/replicate_kernels.cpp @@ -18,6 +18,10 @@ #include namespace FlexFlow { + +ReplicateMeta::ReplicateMeta(FFHandler handle, Replicate const *repl) + : OpMeta(handle) {} + namespace Kernels { namespace Replicate { @@ -66,6 +70,9 @@ void backward_kernel(T const *output_grad_ptr, template void forward_kernel(float const *input_ptr, float *output_ptr, size_t num_elements); +template void forward_kernel(half const *input_ptr, + half *output_ptr, + size_t num_elements); template __global__ void replicate_backward_kernel(float const *input_ptr, float *output_ptr, diff --git a/src/parallel_ops/kernels/replicate_kernels.cu b/src/parallel_ops/kernels/replicate_kernels.cu index de208d2aed..35bc109bd3 100644 --- a/src/parallel_ops/kernels/replicate_kernels.cu +++ b/src/parallel_ops/kernels/replicate_kernels.cu @@ -17,6 +17,10 @@ #include "flexflow/utils/cuda_helper.h" namespace FlexFlow { + +ReplicateMeta::ReplicateMeta(FFHandler handle, 
Replicate const *repl) + : OpMeta(handle) {} + namespace Kernels { namespace Replicate { @@ -59,6 +63,9 @@ void backward_kernel(T const *output_grad_ptr, template void forward_kernel(float const *input_ptr, float *output_ptr, size_t num_elements); +template void forward_kernel(half const *input_ptr, + half *output_ptr, + size_t num_elements); template __global__ void replicate_backward_kernel(float const *input_ptr, float *output_ptr, diff --git a/src/parallel_ops/partition.cc b/src/parallel_ops/partition.cc index 727ffd3264..353b3ce398 100644 --- a/src/parallel_ops/partition.cc +++ b/src/parallel_ops/partition.cc @@ -101,6 +101,46 @@ OpMeta *Repartition::init_task(Task const *task, return nullptr; } +void Repartition::init_inference( + FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + parallel_is = batch_outputs[0]->parallel_is; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + assert(numOutputs == 1); + assert(numInputs == 1); + size_t machine_view_hash = + mv ? mv->hash() : batch_outputs[0]->machine_view.hash(); + IndexLauncher launcher(REPARTITION_INIT_TASK_ID, + parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + assert(inference_input_lps.find(batch_inputs[0]) != + inference_input_lps.end()); + launcher.add_region_requirement( + RegionRequirement(inference_input_lps[batch_inputs[0]], + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); +} + void Repartition::init(FFModel const &ff) { ArgumentMap argmap; parallel_is = outputs[0]->parallel_is; @@ -130,6 +170,7 @@ void Repartition::init(FFModel const &ff) { } void Repartition::create_input_partition(FFModel &ff) { + assert(ff.config.computationMode == COMP_MODE_TRAINING); assert(outputs[0]->part != LogicalPartition::NO_PART); assert(inputs[0]->part != LogicalPartition::NO_PART); ff.create_disjoint_partition(outputs[0]->num_dims, @@ -144,6 +185,61 @@ void Repartition::create_input_partition(FFModel &ff) { output_grad_lp); } +void Repartition::create_input_partition_inference( + FFModel &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs) { + assert(ff.config.computationMode == COMP_MODE_INFERENCE); + assert(batch_outputs[0]->part != LogicalPartition::NO_PART); + assert(batch_inputs[0]->part != LogicalPartition::NO_PART); + ff.create_disjoint_partition(batch_outputs[0]->num_dims, + batch_outputs[0]->dims, + batch_outputs[0]->parallel_is, + batch_inputs[0]->region, + inference_input_lps[batch_inputs[0]]); +} + +FutureMap + Repartition::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + assert(numOutputs == 1); + assert(numInputs == 1); + assert(batch_inputs[0]->data_type == batch_outputs[0]->data_type); + DataType data_type = batch_inputs[0]->data_type; + size_t machine_view_hash = + mv ? 
mv->hash() : batch_outputs[0]->machine_view.hash(); + /* std::cout << "Partition op machine_view: " << *(MachineView const *)mv + << std::endl; */ + IndexLauncher launcher(REPARTITION_FWD_TASK_ID, + batch_outputs[0]->parallel_is, + TaskArgument(&data_type, sizeof(DataType)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement( + RegionRequirement(inference_input_lps[batch_inputs[0]], + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + return runtime->execute_index_space(ctx, launcher); +} + void Repartition::forward(FFModel const &ff) { ArgumentMap argmap; Context ctx = ff.config.lg_ctx; diff --git a/src/parallel_ops/reduction.cc b/src/parallel_ops/reduction.cc index 737f86239c..5dca591328 100644 --- a/src/parallel_ops/reduction.cc +++ b/src/parallel_ops/reduction.cc @@ -14,6 +14,7 @@ */ #include "flexflow/parallel_ops/reduction.h" +#include "flexflow/ffconst_utils.h" #include "flexflow/model.h" #include "flexflow/parallel_ops/kernels/reduction_kernels.h" #include "flexflow/utils/hash_utils.h" @@ -77,7 +78,7 @@ Reduction::Reduction(FFModel &model, dims[reduction_dim].size /= reduction_degree; ParallelTensorBase::update_parallel_ids(numdim, dims); outputs[0] = model.create_parallel_tensor_legion_ordering( - numdim, dims, DT_FLOAT, this); + numdim, dims, _input->data_type, this); } Reduction::Reduction(FFModel &model, @@ -108,16 +109,153 @@ void Reduction::create_input_partition(FFModel &ff) { output_grad_lp); } +void Reduction::create_input_partition_inference( + FFModel &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs) { + assert(ff.config.computationMode == COMP_MODE_INFERENCE); + assert(batch_outputs[0]->part != LogicalPartition::NO_PART); + assert(batch_inputs[0]->part != LogicalPartition::NO_PART); + // input_lp is a disjoint partition + ff.create_disjoint_partition(batch_outputs[0]->num_dims, + batch_outputs[0]->dims, + batch_outputs[0]->parallel_is, + batch_inputs[0]->region, + inference_input_lps[batch_inputs[0]]); +} + +OpMeta *Reduction::init_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + Reduction *reduct = (Reduction *)task->args; + FFHandler handle = *((FFHandler const *)task->local_args); + ReductionMeta *meta = new ReductionMeta(handle, reduct); + meta->input_type[0] = reduct->inputs[0]->data_type; + meta->output_type[0] = reduct->outputs[0]->data_type; + assert(meta->input_type[0] == meta->output_type[0]); + return meta; +} + void Reduction::init(FFModel const &ff) { - forward(ff); + ArgumentMap argmap; + parallel_is = outputs[0]->parallel_is; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + assert(numOutputs == 1); + assert(numInputs == 1); + set_argumentmap_for_init(ff, argmap); + IndexLauncher launcher(REDUCTION_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(Reduction)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + outputs[0]->machine_view.hash()); + launcher.add_region_requirement(RegionRequirement( + input_lp, 0 /*projection id*/, READ_ONLY, EXCLUSIVE, inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + 
EXCLUSIVE, + outputs[0]->region)); + launcher.add_field(1, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap(ff, fm); +} + +void Reduction::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + parallel_is = batch_outputs[0]->parallel_is; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + assert(numOutputs == 1); + assert(numInputs == 1); + size_t machine_view_hash = + mv ? mv->hash() : batch_outputs[0]->machine_view.hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(REDUCTION_INIT_TASK_ID, + parallel_is, + TaskArgument(this, sizeof(Reduction)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + assert(inference_input_lps.find(batch_inputs[0]) != + inference_input_lps.end()); + launcher.add_region_requirement( + RegionRequirement(inference_input_lps[batch_inputs[0]], + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + +FutureMap Reduction::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + assert(numOutputs == 1); + assert(numInputs == 1); + assert(batch_inputs[0]->data_type == batch_outputs[0]->data_type); + DataType data_type = batch_inputs[0]->data_type; + size_t machine_view_hash = + mv ? 
mv->hash() : batch_outputs[0]->machine_view.hash(); + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(REDUCTION_FWD_TASK_ID, + batch_outputs[0]->parallel_is, + TaskArgument(NULL, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement( + RegionRequirement(inference_input_lps[batch_inputs[0]], + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + return runtime->execute_index_space(ctx, launcher); } void Reduction::forward(FFModel const &ff) { ArgumentMap argmap; Context ctx = ff.config.lg_ctx; Runtime *runtime = ff.config.lg_hlr; + parallel_is = outputs[0]->parallel_is; assert(numOutputs == 1); assert(numInputs == 1); + set_argumentmap_for_forward(ff, argmap); IndexLauncher launcher(REDUCTION_FWD_TASK_ID, outputs[0]->parallel_is, TaskArgument(NULL, 0), @@ -211,6 +349,9 @@ void Reduction::forward_task(Task const *task, Runtime *runtime) { assert(regions.size() == 2); assert(task->regions.size() == 2); + + ReductionMeta const *m = *((ReductionMeta **)task->local_args); + Domain input_domain = runtime->get_index_space_domain( ctx, task->regions[0].region.get_index_space()); Domain output_domain = runtime->get_index_space_domain( @@ -222,12 +363,26 @@ void Reduction::forward_task(Task const *task, } size_t num_elements = output_domain.get_volume(); size_t num_replicas = input_domain.get_volume() / num_elements; - float const *input_ptr = helperGetTensorPointerRO( - regions[0], task->regions[0], FID_DATA, ctx, runtime); - float *output_ptr = helperGetTensorPointerRW( - regions[1], task->regions[1], FID_DATA, ctx, runtime); - forward_kernel(input_ptr, output_ptr, num_elements, num_replicas); + GenericTensorAccessorR input = helperGetGenericTensorAccessorRO( + m->input_type[0], regions[0], task->regions[0], FID_DATA, ctx, runtime); + GenericTensorAccessorW output = helperGetGenericTensorAccessorWO( + m->output_type[0], regions[1], task->regions[1], FID_DATA, ctx, runtime); + + assert(input.data_type == output.data_type); + if (input.data_type == DT_HALF) { + forward_kernel(input.get_half_ptr(), + output.get_half_ptr(), + num_elements, + num_replicas); + } else if (input.data_type == DT_FLOAT) { + forward_kernel(input.get_float_ptr(), + output.get_float_ptr(), + num_elements, + num_replicas); + } else { + assert(false && "Unspported data type"); + } } void Reduction::backward_task(Task const *task, diff --git a/src/parallel_ops/replicate.cc b/src/parallel_ops/replicate.cc index fee78043bd..20face74e8 100644 --- a/src/parallel_ops/replicate.cc +++ b/src/parallel_ops/replicate.cc @@ -75,7 +75,7 @@ Replicate::Replicate(FFModel &model, dims[replicate_dim].degree *= replicate_degree; ParallelTensorBase::update_parallel_ids(numdim, dims); outputs[0] = model.create_parallel_tensor_legion_ordering( - numdim, dims, DT_FLOAT, this); + numdim, dims, _input->data_type, this); // inputs[0]->print("Replicate::input"); // outputs[0]->print("Replicate::output"); } @@ -108,16 +108,85 @@ void Replicate::create_input_partition(FFModel &ff) { output_grad_lp); } +void Replicate::create_input_partition_inference( + FFModel &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs) { + assert(ff.config.computationMode == COMP_MODE_INFERENCE); + 
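+  // sanity check: both batch parallel tensors must already be partitioned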
assert(batch_outputs[0]->part != LogicalPartition::NO_PART); + assert(batch_inputs[0]->part != LogicalPartition::NO_PART); + // input_lp is an aliased partitioning along the replica dim + ff.create_aliased_partition(batch_outputs[0]->num_dims, + batch_outputs[0]->dims, + replicate_dim, + batch_outputs[0]->parallel_is, + batch_inputs[0]->region, + inference_input_lps[batch_inputs[0]]); +} + +OpMeta *Replicate::init_task(Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + Replicate *repl = (Replicate *)task->args; + FFHandler handle = *((FFHandler const *)task->local_args); + ReplicateMeta *meta = new ReplicateMeta(handle, repl); + meta->input_type[0] = repl->inputs[0]->data_type; + meta->output_type[0] = repl->outputs[0]->data_type; + assert(meta->input_type[0] == meta->output_type[0]); + return meta; +} + +void Replicate::init_inference(FFModel const &ff, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + parallel_is = batch_outputs[0]->parallel_is; + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + assert(numOutputs == 1); + assert(numInputs == 1); + size_t machine_view_hash = + mv ? mv->hash() : batch_outputs[0]->machine_view.hash(); + set_argumentmap_for_init_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(REPLICATE_INIT_TASK_ID, + batch_outputs[0]->parallel_is, + TaskArgument(this, sizeof(Replicate)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement( + RegionRequirement(inference_input_lps[batch_inputs[0]], + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap_inference(ff, fm, batch_outputs[0]); +} + void Replicate::init(FFModel const &ff) { - // Do nothing + parallel_is = outputs[0]->parallel_is; ArgumentMap argmap; Context ctx = ff.config.lg_ctx; Runtime *runtime = ff.config.lg_hlr; assert(numOutputs == 1); assert(numInputs == 1); - IndexLauncher launcher(REPLICATE_FWD_TASK_ID, + set_argumentmap_for_init(ff, argmap); + IndexLauncher launcher(REPLICATE_INIT_TASK_ID, outputs[0]->parallel_is, - TaskArgument(NULL, 0), + TaskArgument(this, sizeof(Replicate)), argmap, Predicate::TRUE_PRED, false /*must*/, @@ -132,15 +201,58 @@ void Replicate::init(FFModel const &ff) { EXCLUSIVE, outputs[0]->region)); launcher.add_field(1, FID_DATA); - runtime->execute_index_space(ctx, launcher); + FutureMap fm = runtime->execute_index_space(ctx, launcher); + fm.wait_all_results(); + set_opmeta_from_futuremap(ff, fm); +} + +FutureMap Replicate::inference(FFModel const &ff, + BatchConfigFuture const &bc, + std::vector const &batch_inputs, + std::vector const &batch_outputs, + MachineView const *mv) { + ArgumentMap argmap; + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + parallel_is = batch_outputs[0]->parallel_is; + assert(numOutputs == 1); + assert(numInputs == 1); + DataType data_type = batch_inputs[0]->data_type; + size_t machine_view_hash = + mv ? 
mv->hash() : batch_outputs[0]->machine_view.hash(); + set_argumentmap_for_inference(ff, argmap, batch_outputs[0]); + IndexLauncher launcher(REPLICATE_FWD_TASK_ID, + batch_outputs[0]->parallel_is, + TaskArgument(NULL, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_region_requirement( + RegionRequirement(inference_input_lps[batch_inputs[0]], + 0 /*projection id*/, + READ_ONLY, + EXCLUSIVE, + batch_inputs[0]->region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement(RegionRequirement(batch_outputs[0]->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + batch_outputs[0]->region)); + launcher.add_field(1, FID_DATA); + return runtime->execute_index_space(ctx, launcher); } void Replicate::forward(FFModel const &ff) { ArgumentMap argmap; Context ctx = ff.config.lg_ctx; Runtime *runtime = ff.config.lg_hlr; + parallel_is = outputs[0]->parallel_is; assert(numOutputs == 1); assert(numInputs == 1); + set_argumentmap_for_forward(ff, argmap); IndexLauncher launcher(REPLICATE_FWD_TASK_ID, outputs[0]->parallel_is, TaskArgument(NULL, 0), @@ -233,6 +345,9 @@ void Replicate::forward_task(Task const *task, Runtime *runtime) { assert(regions.size() == 2); assert(task->regions.size() == 2); + + ReplicateMeta const *m = *((ReplicateMeta **)task->local_args); + Domain input_domain = runtime->get_index_space_domain( ctx, task->regions[0].region.get_index_space()); Domain output_domain = runtime->get_index_space_domain( @@ -243,12 +358,24 @@ void Replicate::forward_task(Task const *task, assert(output_domain.hi()[i] == input_domain.hi()[i]); } assert(input_domain.get_volume() == output_domain.get_volume()); - float const *input_ptr = helperGetTensorPointerRO( - regions[0], task->regions[0], FID_DATA, ctx, runtime); - float *output_ptr = helperGetTensorPointerRW( - regions[1], task->regions[1], FID_DATA, ctx, runtime); - forward_kernel(input_ptr, output_ptr, input_domain.get_volume()); + GenericTensorAccessorR input = helperGetGenericTensorAccessorRO( + m->input_type[0], regions[0], task->regions[0], FID_DATA, ctx, runtime); + GenericTensorAccessorW output = helperGetGenericTensorAccessorWO( + m->output_type[0], regions[1], task->regions[1], FID_DATA, ctx, runtime); + + assert(input.data_type == output.data_type); + + if (input.data_type == DT_HALF) { + forward_kernel( + input.get_half_ptr(), output.get_half_ptr(), input_domain.get_volume()); + } else if (input.data_type == DT_FLOAT) { + forward_kernel(input.get_float_ptr(), + output.get_float_ptr(), + input_domain.get_volume()); + } else { + assert(false && "Unspported data type"); + } } void Replicate::backward_task(Task const *task, diff --git a/src/runtime/accessor.cc b/src/runtime/accessor.cc index 809d608402..d3b94bf14a 100644 --- a/src/runtime/accessor.cc +++ b/src/runtime/accessor.cc @@ -77,6 +77,15 @@ half const *GenericTensorAccessorR::get_half_ptr() const { } } +char const *GenericTensorAccessorR::get_byte_ptr() const { + if (data_type == DT_INT4 || data_type == DT_INT8) { + return static_cast(ptr); + } else { + assert(false && "Invalid Accessor Type"); + return static_cast(nullptr); + } +} + template TensorAccessorW::TensorAccessorW(PhysicalRegion region, RegionRequirement req, @@ -156,6 +165,15 @@ half *GenericTensorAccessorW::get_half_ptr() const { } } +char *GenericTensorAccessorW::get_byte_ptr() const { + if (data_type == DT_INT4 || data_type == DT_INT8) { + return static_cast(ptr); + } else { + assert(false && "Invalid Accessor Type"); + return 
static_cast(nullptr); + } +} + template const DT *helperGetTensorPointerRO(PhysicalRegion region, RegionRequirement req, @@ -261,6 +279,14 @@ GenericTensorAccessorR ptr = helperGetTensorPointerRO(region, req, fid, ctx, runtime); break; } + case DT_INT4: { + ptr = helperGetTensorPointerRO(region, req, fid, ctx, runtime); + break; + } + case DT_INT8: { + ptr = helperGetTensorPointerRO(region, req, fid, ctx, runtime); + break; + } default: { assert(false); } @@ -299,6 +325,14 @@ GenericTensorAccessorW ptr = helperGetTensorPointerWO(region, req, fid, ctx, runtime); break; } + case DT_INT4: { + ptr = helperGetTensorPointerWO(region, req, fid, ctx, runtime); + break; + } + case DT_INT8: { + ptr = helperGetTensorPointerWO(region, req, fid, ctx, runtime); + break; + } default: { assert(false); } @@ -337,6 +371,14 @@ GenericTensorAccessorW ptr = helperGetTensorPointerRW(region, req, fid, ctx, runtime); break; } + case DT_INT4: { + ptr = helperGetTensorPointerRW(region, req, fid, ctx, runtime); + break; + } + case DT_INT8: { + ptr = helperGetTensorPointerRW(region, req, fid, ctx, runtime); + break; + } default: { assert(false); } @@ -345,10 +387,14 @@ GenericTensorAccessorW } #define DIMFUNC(DIM) \ + template class TensorAccessorR; \ + template class TensorAccessorR; \ template class TensorAccessorR; \ template class TensorAccessorR; \ template class TensorAccessorR; \ template class TensorAccessorR; \ + template class TensorAccessorW; \ + template class TensorAccessorW; \ template class TensorAccessorW; \ template class TensorAccessorW; \ template class TensorAccessorW; \ @@ -371,6 +417,22 @@ template half *helperGetTensorPointerWO(PhysicalRegion region, Context ctx, Runtime *runtime); +template char const *helperGetTensorPointerRO(PhysicalRegion region, + RegionRequirement req, + FieldID fid, + Context ctx, + Runtime *runtime); +template char *helperGetTensorPointerRW(PhysicalRegion region, + RegionRequirement req, + FieldID fid, + Context ctx, + Runtime *runtime); +template char *helperGetTensorPointerWO(PhysicalRegion region, + RegionRequirement req, + FieldID fid, + Context ctx, + Runtime *runtime); + template float const *helperGetTensorPointerRO(PhysicalRegion region, RegionRequirement req, FieldID fid, diff --git a/src/runtime/batch_config.cc b/src/runtime/batch_config.cc new file mode 100644 index 0000000000..d658b6590f --- /dev/null +++ b/src/runtime/batch_config.cc @@ -0,0 +1,113 @@ +/* Copyright 2023 CMU, Stanford, Facebook, LANL + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/batch_config.h" +#include "legion.h" +#include +#include + +namespace FlexFlow { + +LegionRuntime::Logger::Category log_bc("BatchConfig"); +using Legion::Future; +using Legion::Memory; + +BatchConfig::BatchConfig() : num_tokens(0) { + for (int i = 0; i < MAX_NUM_REQUESTS; i++) { + requestsInfo[i].token_start_offset = 0; + requestsInfo[i].num_tokens_in_batch = 0; + request_completed[i] = true; + } + for (int i = 0; i < MAX_NUM_TOKENS; i++) { + tokensInfo[i].abs_depth_in_request = 0; + tokensInfo[i].request_index = 0; + tokensInfo[i].token_id = 0; + } +} + +/*static*/ +BatchConfig const *BatchConfig::from_future(BatchConfigFuture const &future) { + BatchConfig const *bc = static_cast( + Future(future).get_buffer(Memory::SYSTEM_MEM)); + // Check future size + if (bc->get_mode() == INC_DECODING_MODE) { + assert(Future(future).get_untyped_size() == sizeof(BatchConfig)); + } else if (bc->get_mode() == BEAM_SEARCH_MODE) { + assert(Future(future).get_untyped_size() == sizeof(BeamSearchBatchConfig)); + } else if (bc->get_mode() == TREE_VERIFY_MODE) { + assert(Future(future).get_untyped_size() == sizeof(TreeVerifyBatchConfig)); + } else { + assert(false && "Unsupported inference mode"); + } + return bc; +} + +InferenceMode BatchConfig::get_mode() const { + return INC_DECODING_MODE; +} + +int BatchConfig::num_active_requests() const { + int num_requests = 0; + for (int i = 0; i < MAX_NUM_REQUESTS; i++) { + if (!request_completed[i]) { + num_requests++; + } + } + return num_requests; +} + +int BatchConfig::num_active_tokens() const { + return num_tokens; +} + +void BatchConfig::print() const { + std::cout << "@@@@@@@@@@@@@@ Batch Config (mode " << get_mode() + << ") @@@@@@@@@@@@@@" << std::endl; + std::cout << "Max number of requests: " << MAX_NUM_REQUESTS << std::endl; + std::cout << "Max number of tokens: " << MAX_NUM_TOKENS << std::endl; + std::cout << "Number of tokens: " << num_tokens << std::endl; + std::cout << "Number of requests: " << num_active_requests() << std::endl; + // std::cout << "Cached results: " << cached_results << std::endl; + + std::cout << "Per-request info:\n"; + for (int i = 0; i < MAX_NUM_REQUESTS; i++) { + if (!request_completed[i]) { + std::cout << " Request " << i << ":\n"; + std::cout << " Token start offset: " + << requestsInfo[i].token_start_offset << std::endl; + std::cout << " Number of tokens in batch: " + << requestsInfo[i].num_tokens_in_batch << std::endl; + std::cout << " GUID: " << requestsInfo[i].request_guid << std::endl; + std::cout << " Max sequence length: " + << requestsInfo[i].max_sequence_length << std::endl; + std::cout << " Request completed: " << request_completed[i] + << std::endl; + } + } + + std::cout << "Per-token info:\n"; + for (int i = 0; i < num_tokens; i++) { + std::cout << " Token " << i << ":\n"; + std::cout << " Absolute depth in request: " + << tokensInfo[i].abs_depth_in_request << std::endl; + std::cout << " Request index: " << tokensInfo[i].request_index + << std::endl; + std::cout << " Token id: " << tokensInfo[i].token_id << std::endl; + } + std::cout << "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@" + << std::endl; +} + +}; // namespace FlexFlow diff --git a/src/runtime/beam_search_batch_config.cc b/src/runtime/beam_search_batch_config.cc new file mode 100644 index 0000000000..dc30d89d78 --- /dev/null +++ b/src/runtime/beam_search_batch_config.cc @@ -0,0 +1,171 @@ +/* Copyright 2023 CMU, Stanford, Facebook, LANL + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may 
not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "flexflow/batch_config.h" +#include "legion.h" +#include +#include + +#define DEFAULT_BEAM_WIDTH 1 +#define DEFAULT_TARGET_ITERATIONS 3 + +namespace FlexFlow { + +LegionRuntime::Logger::Category log_beam_bc("BeamSearchBatchConfig"); + +BeamSearchBatchConfig::BeamSearchBatchConfig() : BatchConfig() { + this->beam_width = DEFAULT_BEAM_WIDTH; + this->target_iterations = DEFAULT_TARGET_ITERATIONS; + current_iteration = 0; +} + +BeamSearchBatchConfig::BeamSearchBatchConfig(int model_id) : BatchConfig() { + this->model_id = model_id; + std::cout << "==================\n" + << "Register Batch Config with Model " << this->model_id + << std::endl; + current_iteration = 0; +} + +BeamSearchBatchConfig::BeamSearchBatchConfig(size_t beam_width, + size_t target_iterations) + : BatchConfig() { + this->beam_width = beam_width; + this->target_iterations = target_iterations; + current_iteration = 0; +} + +BeamSearchBatchConfig::BeamSearchBatchConfig(BeamSearchBatchConfig const &other, + int model_id) + : BatchConfig() { + this->beam_width = other.beam_width; + this->target_iterations = other.target_iterations; + this->model_id = model_id; + current_iteration = 0; +} + +BeamSearchBatchConfig::~BeamSearchBatchConfig() {} + +InferenceMode BeamSearchBatchConfig::get_mode() const { + return BEAM_SEARCH_MODE; +} + +bool BeamSearchBatchConfig::done() const { + assert(current_iteration <= target_iterations); + return current_iteration == target_iterations; +} + +int BeamSearchBatchConfig::max_beam_depth_all_requests() const { + int max_depth_all_requests = 0; + for (int i = 0; i < BeamSearchBatchConfig::MAX_NUM_REQUESTS; i++) { + if (!request_completed[i] && + beamRequestsInfo[i].max_depth > max_depth_all_requests) { + /* printf("\treq %i has max_depth=%i. Increasing max_depth_all_requests " + "from %i\n", + i, + beamRequestsInfo[i].max_depth, + max_depth_all_requests); */ + max_depth_all_requests = beamRequestsInfo[i].max_depth; + } + } + assert(max_depth_all_requests <= BeamSearchBatchConfig::MAX_BEAM_DEPTH); + return max_depth_all_requests; +} + +int BeamSearchBatchConfig::current_depth_all_requests() const { + int current_depth = 0; + for (int i = 0; i < BeamSearchBatchConfig::MAX_NUM_REQUESTS; i++) { + if (!request_completed[i] && + beamRequestsInfo[i].current_depth > current_depth) { + /* printf("\treq %i has current_depth=%i. 
Increasing " + "current_depth_all_requests from %i\n", + i, + beamRequestsInfo[i].current_depth, + current_depth); */ + current_depth = beamRequestsInfo[i].current_depth; + } + } + assert(current_depth <= BeamSearchBatchConfig::MAX_BEAM_DEPTH + 1); + return current_depth; +} + +void BeamSearchBatchConfig::print() const { + std::cout << "@@@@@@@@@@@@@@ BeamSearchBatchConfig (mode " << get_mode() + << ") @@@@@@@@@@@@@@" << std::endl; + std::cout << "Max number of requests: " << MAX_NUM_REQUESTS << std::endl; + std::cout << "Max number of tokens: " << MAX_NUM_TOKENS << std::endl; + std::cout << "Number of tokens: " << num_tokens << std::endl; + std::cout << "Number of requests: " << num_active_requests() << std::endl; + std::cout << "Beam width: " << beam_width << std::endl; + std::cout << "Target Iterations: " << target_iterations << std::endl; + std::cout << "Current Iterations: " << current_iteration << std::endl; + + std::cout << "Per-request info:\n"; + for (int i = 0; i < MAX_NUM_REQUESTS; i++) { + // assert(beamRequestsInfo[i].request_completed == request_completed[i]); + if (!request_completed[i]) { + std::cout << " Request " << i << ":\n"; + std::cout << " Token start offset: " + << requestsInfo[i].token_start_offset << std::endl; + std::cout << " Number of tokens in batch: " + << requestsInfo[i].num_tokens_in_batch << std::endl; + std::cout << " GUID: " << requestsInfo[i].request_guid << std::endl; + std::cout << " Max sequence length: " + << requestsInfo[i].max_sequence_length << std::endl; + std::cout << " Beam Search Specific: " << std::endl; + std::cout << " beam_size: " << beamRequestsInfo[i].beam_size + << std::endl; + std::cout << " current_depth: " + << beamRequestsInfo[i].current_depth << std::endl; + std::cout << " max_depth: " << beamRequestsInfo[i].max_depth + << std::endl; + std::cout << " tokens: "; + for (int j = 0; j < MAX_BEAM_WIDTH; j++) { + std::cout << beamRequestsInfo[i].tokens[j] << ", "; + } + std::cout << std::endl; + std::cout << " probs: "; + for (int j = 0; j < MAX_BEAM_WIDTH; j++) { + std::cout << beamRequestsInfo[i].probs[j] << ", "; + } + std::cout << std::endl; + std::cout << " parent_id: "; + for (int j = 0; j < MAX_BEAM_WIDTH; j++) { + std::cout << beamRequestsInfo[i].parent_id[j] << ", "; + } + std::cout << std::endl; + } + } + + std::cout << "Per-token info:\n"; + for (int i = 0; i < num_tokens; i++) { + std::cout << " Token " << i << ":\n"; + std::cout << " Absolute depth in request: " + << tokensInfo[i].abs_depth_in_request << std::endl; + std::cout << " Request index: " << tokensInfo[i].request_index + << std::endl; + std::cout << " Token id: " << tokensInfo[i].token_id << std::endl; + std::cout << " Beam Search Specific: " << std::endl; + std::cout << " beam_size: " << beamTokenInfo[i].sub_request_index + << std::endl; + // std::cout << " Parent token id: " << tokensInfo[i].parent_token_id << + // std::endl; std::cout << " Accumulated log prob: " + // << tokensInfo[i].cum_log_prob << std::endl; + } + std::cout << "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@" + << std::endl; +} + +}; // namespace FlexFlow diff --git a/src/runtime/cuda_helper.cu b/src/runtime/cuda_helper.cu index 3e24f6b4e9..e4728bdb88 100644 --- a/src/runtime/cuda_helper.cu +++ b/src/runtime/cuda_helper.cu @@ -62,6 +62,14 @@ __global__ void copy_kernel(DT *dst, const DT *src, coord_t size) { } } +template +__global__ void + copy_kernel_discrete(DT *dst, const DT *src, coord_t size, size_t *index) { + CUDA_KERNEL_LOOP(i, size) { + dst[i] = src[index[i]]; + } +} + 
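+// A minimal usage sketch for the gather-style kernel above (illustrative only;
+// it assumes the GET_BLOCKS/CUDA_NUM_THREADS helpers that accompany
+// CUDA_KERNEL_LOOP in cuda_helper.h, and that `index` holds `size` valid
+// offsets into the device buffer pointed to by `src`):
+//
+//   template <typename DT>
+//   void gather_copy(DT *dst, DT const *src, size_t *index, coord_t size,
+//                    cudaStream_t stream) {
+//     copy_kernel_discrete<DT>
+//         <<<GET_BLOCKS(size), CUDA_NUM_THREADS, 0, stream>>>(dst, src, size, index);
+//   }
+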
template __global__ void reluBackward(DT *grad_ptr, const DT *output, size_t n) { CUDA_KERNEL_LOOP(i, n) { @@ -201,22 +209,24 @@ __host__ void updateGAS(float *para_ptr, } template -__host__ void - print_tensor(T const *ptr, size_t num_elements, char const *prefix) { - // device synchronize to make sure the data are ready - // checkCUDA(cudaDeviceSynchronize()); +__host__ void print_tensor(T const *ptr, + size_t num_elements, + char const *prefix, + int shard_id) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); T *host_ptr; checkCUDA(cudaHostAlloc(&host_ptr, sizeof(T) * num_elements, cudaHostAllocPortable | cudaHostAllocMapped)); - checkCUDA(cudaMemcpy( - host_ptr, ptr, sizeof(T) * num_elements, cudaMemcpyDeviceToHost)); - // checkCUDA(cudaDeviceSynchronize()); + checkCUDA(cudaMemcpyAsync( + host_ptr, ptr, sizeof(T) * num_elements, cudaMemcpyDeviceToHost, stream)); + cudaDeviceSynchronize(); int idx = 0; - printf("%s", prefix); + printf("%s, %d---->", prefix, shard_id); for (idx = 0; idx < num_elements; idx++) { - printf(" %.4lf", (float)host_ptr[idx]); - if (idx >= 16) { + printf(" %.20lf", (float)host_ptr[idx]); + if (idx >= 100) { break; } } @@ -224,22 +234,154 @@ __host__ void checkCUDA(cudaFreeHost(host_ptr)); } +template +__host__ void print_beam_tensor(T const *ptr, + size_t num_elements, + int skip, + int channel, + char const *prefix) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + T *host_ptr; + checkCUDA(cudaHostAlloc(&host_ptr, + sizeof(T) * channel * skip, + cudaHostAllocPortable | cudaHostAllocMapped)); + checkCUDA(cudaMemcpyAsync(host_ptr, + ptr, + sizeof(T) * channel * skip, + cudaMemcpyDeviceToHost, + stream)); + // checkCUDA(cudaDeviceSynchronize()); + int idx = 0; + printf("%s", prefix); + + for (int i = 0; i < channel; i += 1) { + for (idx = 0; idx < num_elements; idx++) { + printf(" %.20lf", (float)host_ptr[idx + i * skip]); + if (idx >= 100) { + break; + } + } + printf("\n-----***********------\n"); + } + + checkCUDA(cudaFreeHost(host_ptr)); +} + +template +__host__ void + save_tensor(T const *ptr, size_t num_elements, char const *file_name) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + T *host_ptr; + checkCUDA(cudaHostAlloc(&host_ptr, + sizeof(T) * num_elements, + cudaHostAllocPortable | cudaHostAllocMapped)); + checkCUDA(cudaMemcpyAsync( + host_ptr, ptr, sizeof(T) * num_elements, cudaMemcpyDeviceToHost, stream)); + // checkCUDA(cudaDeviceSynchronize()); + cudaDeviceSynchronize(); + FILE *tensor_file; + tensor_file = fopen(file_name, "w"); + for (unsigned i = 0; i < num_elements; i++) { + fprintf(tensor_file, "%.20f, ", (float)host_ptr[i]); + } + + fclose(tensor_file); + checkCUDA(cudaFreeHost(host_ptr)); +} + +template +__host__ T *download_tensor(T const *ptr, size_t num_elements) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + T *host_ptr; + checkCUDA(cudaHostAlloc(&host_ptr, + sizeof(T) * num_elements, + cudaHostAllocPortable | cudaHostAllocMapped)); + checkCUDA(cudaMemcpyAsync( + host_ptr, ptr, sizeof(T) * num_elements, cudaMemcpyDeviceToHost, stream)); + return host_ptr; +} + +template +__host__ bool download_tensor(T const *ptr, T *dst, size_t num_elements) { + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + assert(dst != nullptr); + checkCUDA(cudaMemcpyAsync( + dst, ptr, sizeof(T) * num_elements, cudaMemcpyDeviceToHost, stream)); + return true; +} +cudnnStatus_t cudnnSetTensorDescriptorFromDomain4SoftMax( + cudnnTensorDescriptor_t tensor, Domain domain, DataType 
data_type) { + int dims[MAX_TENSOR_DIM]; + cudnnDataType_t cudnn_data_type = ff_to_cudnn_datatype(data_type); + switch (domain.get_dim()) { + case 1: { + Rect<1> rect = domain; + dims[0] = rect.hi[0] - rect.lo[0] + 1; + return cudnnSetTensor4dDescriptor( + tensor, CUDNN_TENSOR_NCHW, cudnn_data_type, dims[0], 1, 1, 1); + } + case 2: { + Rect<2> rect = domain; + dims[0] = rect.hi[0] - rect.lo[0] + 1; + dims[1] = rect.hi[1] - rect.lo[1] + 1; + return cudnnSetTensor4dDescriptor( + tensor, CUDNN_TENSOR_NCHW, cudnn_data_type, dims[1], dims[0], 1, 1); + } + case 3: { + Rect<3> rect = domain; + dims[0] = rect.hi[0] - rect.lo[0] + 1; + dims[1] = rect.hi[1] - rect.lo[1] + 1; + dims[2] = rect.hi[2] - rect.lo[2] + 1; + return cudnnSetTensor4dDescriptor(tensor, + CUDNN_TENSOR_NCHW, + cudnn_data_type, + dims[2] * dims[1], + dims[0], + 1, + 1); + } + case 4: { + Rect<4> rect = domain; + dims[0] = rect.hi[0] - rect.lo[0] + 1; + dims[1] = rect.hi[1] - rect.lo[1] + 1; + dims[2] = rect.hi[2] - rect.lo[2] + 1; + dims[3] = rect.hi[3] - rect.lo[3] + 1; + return cudnnSetTensor4dDescriptor(tensor, + CUDNN_TENSOR_NCHW, + cudnn_data_type, + dims[3] * dims[2] * dims[1], + dims[0], + 1, + 1); + } + default: + assert(false && "Unsupported dim number"); + } + return CUDNN_STATUS_BAD_PARAM; +} + cudnnStatus_t cudnnSetTensorDescriptorFromDomain(cudnnTensorDescriptor_t tensor, - Domain domain) { + Domain domain, + DataType data_type) { int dims[MAX_TENSOR_DIM]; + cudnnDataType_t cudnn_data_type = ff_to_cudnn_datatype(data_type); switch (domain.get_dim()) { case 1: { Rect<1> rect = domain; dims[0] = rect.hi[0] - rect.lo[0] + 1; return cudnnSetTensor4dDescriptor( - tensor, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, dims[0], 1, 1, 1); + tensor, CUDNN_TENSOR_NCHW, cudnn_data_type, dims[0], 1, 1, 1); } case 2: { Rect<2> rect = domain; dims[0] = rect.hi[0] - rect.lo[0] + 1; dims[1] = rect.hi[1] - rect.lo[1] + 1; return cudnnSetTensor4dDescriptor( - tensor, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, dims[1], dims[0], 1, 1); + tensor, CUDNN_TENSOR_NCHW, cudnn_data_type, dims[1], dims[0], 1, 1); } case 3: { Rect<3> rect = domain; @@ -248,7 +390,7 @@ cudnnStatus_t cudnnSetTensorDescriptorFromDomain(cudnnTensorDescriptor_t tensor, dims[2] = rect.hi[2] - rect.lo[2] + 1; return cudnnSetTensor4dDescriptor(tensor, CUDNN_TENSOR_NCHW, - CUDNN_DATA_FLOAT, + cudnn_data_type, dims[2], dims[1], dims[0], @@ -262,7 +404,7 @@ cudnnStatus_t cudnnSetTensorDescriptorFromDomain(cudnnTensorDescriptor_t tensor, dims[3] = rect.hi[3] - rect.lo[3] + 1; return cudnnSetTensor4dDescriptor(tensor, CUDNN_TENSOR_NCHW, - CUDNN_DATA_FLOAT, + cudnn_data_type, dims[3], dims[2], dims[1], @@ -278,7 +420,7 @@ cudnnStatus_t cudnnSetTensorDescriptorFromDomain(cudnnTensorDescriptor_t tensor, dims[3] = rect.hi[3] - rect.lo[3] + 1; return cudnnSetTensor4dDescriptor(tensor, CUDNN_TENSOR_NCHW, - CUDNN_DATA_FLOAT, + cudnn_data_type, dims[3], dims[2], dims[1], @@ -292,6 +434,8 @@ cudnnStatus_t cudnnSetTensorDescriptorFromDomain(cudnnTensorDescriptor_t tensor, cudnnDataType_t ff_to_cudnn_datatype(DataType type) { switch (type) { + case DT_HALF: + return CUDNN_DATA_HALF; case DT_FLOAT: return CUDNN_DATA_FLOAT; case DT_DOUBLE: @@ -306,6 +450,8 @@ cudnnDataType_t ff_to_cudnn_datatype(DataType type) { cudaDataType_t ff_to_cuda_datatype(DataType type) { switch (type) { + case DT_HALF: + return CUDA_R_16F; case DT_FLOAT: return CUDA_R_32F; case DT_DOUBLE: @@ -318,6 +464,52 @@ cudaDataType_t ff_to_cuda_datatype(DataType type) { return CUDA_R_32F; } +#ifdef FF_USE_NCCL +ncclDataType_t 
ff_to_nccl_datatype(DataType type) { + switch (type) { + case DT_HALF: + return ncclHalf; + case DT_FLOAT: + return ncclFloat; + case DT_DOUBLE: + return ncclDouble; + case DT_INT32: + return ncclInt; + default: + assert(false && "Unspoorted nccl data type"); + } + return ncclFloat; +} +#endif + +cudaDataType_t cudnn_to_cuda_datatype(cudnnDataType_t type) { + switch (type) { + case CUDNN_DATA_FLOAT: + return CUDA_R_32F; + case CUDNN_DATA_DOUBLE: + return CUDA_R_64F; + case CUDNN_DATA_INT32: + return CUDA_R_32I; + default: + assert(false && "Unsupported cuda data type"); + } + return CUDA_R_32F; +} + +cudnnDataType_t cuda_to_cudnn_datatype(cudaDataType_t type) { + switch (type) { + case CUDA_R_32F: + return CUDNN_DATA_FLOAT; + case CUDA_R_64F: + return CUDNN_DATA_DOUBLE; + case CUDA_R_32I: + return CUDNN_DATA_INT32; + default: + assert(false && "Unsupported cudnn data type"); + } + return CUDNN_DATA_FLOAT; +} + template __global__ void assign_kernel(half *ptr, coord_t size, half value); template __global__ void @@ -329,6 +521,8 @@ template __global__ void template __global__ void assign_kernel(int64_t *ptr, coord_t size, int64_t value); +template __global__ void + add_kernel(half *dst, half const *src, size_t size); template __global__ void add_kernel(float *dst, float const *src, size_t size); template __global__ void @@ -338,13 +532,26 @@ template __global__ void template __global__ void add_kernel(int64_t *dst, int64_t const *src, size_t size); +template __global__ void + copy_kernel(half *dst, half const *src, coord_t size); template __global__ void copy_kernel(float *dst, float const *src, coord_t size); +template __global__ void + copy_kernel(double *dst, double const *src, coord_t size); template __global__ void copy_kernel(int32_t *dst, int32_t const *src, coord_t size); template __global__ void copy_kernel(int64_t *dst, int64_t const *src, coord_t size); +template __global__ void copy_kernel_discrete(float *dst, + float const *src, + coord_t size, + size_t *index); +template __global__ void copy_kernel_discrete(int64_t *dst, + int64_t const *src, + coord_t size, + size_t *index); + template __global__ void apply_add_with_scale(float *data_ptr, float const *grad_ptr, size_t size, @@ -362,11 +569,71 @@ template __global__ void apply_add_with_scale(int64_t *data_ptr, size_t size, int64_t scale); +template __host__ void print_tensor(float const *ptr, + size_t rect, + char const *prefix, + int shard_id); +template __host__ void print_tensor(double const *ptr, + size_t rect, + char const *prefix, + int shard_id); +template __host__ void print_tensor(int32_t const *ptr, + size_t rect, + char const *prefix, + int shard_id); +template __host__ void print_tensor(int64_t const *ptr, + size_t rect, + char const *prefix, + int shard_id); +template __host__ void print_tensor(half const *ptr, + size_t rect, + char const *prefix, + int shard_id); + +template __host__ void print_beam_tensor(float const *ptr, + size_t num_elements, + int skip, + int channel, + char const *prefix); +template __host__ void print_beam_tensor(int32_t const *ptr, + size_t num_elements, + int skip, + int channel, + char const *prefix); +template __host__ void print_beam_tensor(int64_t const *ptr, + size_t num_elements, + int skip, + int channel, + char const *prefix); + template __host__ void - print_tensor(float const *ptr, size_t rect, char const *prefix); -template __host__ void - print_tensor(double const *ptr, size_t rect, char const *prefix); -template __host__ void - print_tensor(int32_t const *ptr, size_t rect, 
char const *prefix); + save_tensor(float const *ptr, size_t rect, char const *file_name); +template __host__ void save_tensor(int64_t const *ptr, + size_t rect, + char const *file_name); template __host__ void - print_tensor(int64_t const *ptr, size_t rect, char const *prefix); + save_tensor(half const *ptr, size_t rect, char const *file_name); + +template __host__ float *download_tensor(float const *ptr, + size_t num_elements); +template __host__ half *download_tensor(half const *ptr, + size_t num_elements); +template __host__ double *download_tensor(double const *ptr, + size_t num_elements); +template __host__ int32_t *download_tensor(int32_t const *ptr, + size_t num_elements); +template __host__ int64_t *download_tensor(int64_t const *ptr, + size_t num_elements); +template __host__ bool + download_tensor(float const *ptr, float *dst, size_t num_elements); +template __host__ bool + download_tensor(half const *ptr, half *dst, size_t num_elements); +template __host__ bool download_tensor(double const *ptr, + double *dst, + size_t num_elements); +template __host__ bool download_tensor(int32_t const *ptr, + int32_t *dst, + size_t num_elements); +template __host__ bool download_tensor(int64_t const *ptr, + int64_t *dst, + size_t num_elements); diff --git a/src/runtime/ffconst_utils.cc b/src/runtime/ffconst_utils.cc index d8f4e6e179..0723ee136d 100644 --- a/src/runtime/ffconst_utils.cc +++ b/src/runtime/ffconst_utils.cc @@ -1,4 +1,5 @@ #include "flexflow/ffconst_utils.h" +#include "flexflow/accessor.h" #include namespace FlexFlow { @@ -45,6 +46,8 @@ std::string get_operator_type_name(OperatorType type) { return "Split"; case OP_EMBEDDING: return "Embedding"; + case OP_EXPERTS: + return "Experts"; case OP_GATHER: return "Gather"; case OP_GROUP_BY: @@ -111,6 +114,10 @@ std::string get_operator_type_name(OperatorType type) { return "Size"; case OP_TOPK: return "TopK"; + case OP_ARG_TOPK: + return "ArgTopK"; + case OP_BEAM_TOPK: + return "BeamTopK"; case OP_WHERE: return "Where"; case OP_CEIL: @@ -141,6 +148,12 @@ std::string get_operator_type_name(OperatorType type) { return "PReLU"; case OP_MULTIHEAD_ATTENTION: return "MultiHeadAttention"; + case OP_INC_MULTIHEAD_SELF_ATTENTION: + return "IncMultiHeadSelfAttention"; + case OP_SPEC_INC_MULTIHEAD_SELF_ATTENTION: + return "SpecIncMultiHeadSelfAttention"; + case OP_TREE_INC_MULTIHEAD_SELF_ATTENTION: + return "TreeIncMultiHeadSelfAttention"; case OP_INPUT: return "Input"; case OP_WEIGHT: @@ -157,8 +170,16 @@ std::string get_operator_type_name(OperatorType type) { return "Mean"; case OP_LAYERNORM: return "LayerNorm"; + case OP_RMS_NORM: + return "RMSNorm"; + case OP_GELU: + return "GELU"; case OP_IDENTITY: return "Identity"; + case OP_SAMPLING: + return "Sampling"; + case OP_ARGMAX: + return "ArgMax"; // Parallel Ops case OP_REPARTITION: return "Repartition"; @@ -168,6 +189,8 @@ std::string get_operator_type_name(OperatorType type) { return "Replicate"; case OP_REDUCTION: return "Reduction"; + case OP_ALLREDUCE: + return "AllReduce"; case OP_PIPELINE: return "Pipeline"; case OP_FUSED_PARALLEL: @@ -178,6 +201,34 @@ std::string get_operator_type_name(OperatorType type) { } } +size_t data_type_size(DataType type) { + switch (type) { + case DT_HALF: + return sizeof(half); + case DT_FLOAT: + return sizeof(float); + case DT_DOUBLE: + return sizeof(double); + case DT_INT32: + return sizeof(int32_t); + case DT_INT64: + return sizeof(int64_t); + case DT_BOOLEAN: + return sizeof(bool); + default: + assert(false); + } +} + +size_t 
get_quantization_to_byte_size(DataType type, + DataType quantization_type, + size_t num_elements) { + assert(quantization_type == DT_INT4 || quantization_type == DT_INT8); + return (num_elements / (quantization_type == DT_INT4 ? 2 : 1)) + + (num_elements / INT4_NUM_OF_ELEMENTS_PER_GROUP) * 2 * + data_type_size(type); +} + std::ostream &operator<<(std::ostream &s, OperatorType op_type) { s << get_operator_type_name(op_type); diff --git a/src/runtime/fftype.cc b/src/runtime/fftype.cc index 91e0d077c4..2b94f07999 100644 --- a/src/runtime/fftype.cc +++ b/src/runtime/fftype.cc @@ -1,11 +1,15 @@ #include "flexflow/fftype.h" +#include "flexflow/config.h" #include namespace FlexFlow { -LayerID::LayerID() : id(0) {} +const LayerID LayerID::NO_ID = LayerID(); -LayerID::LayerID(size_t _id) : id(_id) { +LayerID::LayerID() : id(0), transformer_layer_id(MAX_NUM_TRANSFORMER_LAYERS) {} + +LayerID::LayerID(size_t _id, size_t _transformer_layer_id) + : id(_id), transformer_layer_id(_transformer_layer_id) { assert(is_valid_id()); } @@ -14,7 +18,11 @@ bool LayerID::is_valid_id() const { } bool operator==(LayerID const &lhs, LayerID const &rhs) { + // id should be sufficient to distinguish different layers + if (lhs.id == rhs.id) { + assert(lhs.transformer_layer_id == rhs.transformer_layer_id); + } return lhs.id == rhs.id; } -}; // namespace FlexFlow \ No newline at end of file +}; // namespace FlexFlow diff --git a/src/runtime/gpt_tokenizer.cc b/src/runtime/gpt_tokenizer.cc new file mode 100644 index 0000000000..56fdd05b3b --- /dev/null +++ b/src/runtime/gpt_tokenizer.cc @@ -0,0 +1,324 @@ +// version 0.1 +// Licensed under the MIT License . +// SPDX-License-Identifier: MIT +// Copyright (c) 2019-2020 zili wang . + +#include + +using json = nlohmann::json; + +// codecvt abandoned in c++17 +std::wstring GPT_Tokenizer::utf8_to_wstring(std::string const &src) { + std::wstring_convert, wchar_t> converter; + return converter.from_bytes(src); +}; + +std::u32string GPT_Tokenizer::utf8_to_utf32(std::string const &src) { + std::wstring_convert, char32_t> converter; + return converter.from_bytes(src); +}; + +std::string GPT_Tokenizer::wstring_to_utf8(std::wstring const &src) { + std::wstring_convert, wchar_t> converter; + return converter.to_bytes(src); +}; + +std::string GPT_Tokenizer::utf32_to_utf8(std::u32string const &src) { + std::wstring_convert, char32_t> converter; + return converter.to_bytes(src); +}; + +wchar_t *GPT_Tokenizer::bytes_to_unicode() { + std::vector bs; + for (auto i = uint32_t(L'!'); i < uint32_t(L'~') + 1; ++i) { + bs.push_back(i); + } + for (auto i = uint32_t(L'¡'); i < uint32_t(L'¬') + 1; ++i) { + bs.push_back(i); + } + for (auto i = uint32_t(L'®'); i < uint32_t(L'ÿ') + 1; ++i) { + bs.push_back(i); + } + std::vector cs = bs; + uint32_t n = 0; + for (uint32_t b = 0; b < 256; ++b) { + auto p = find(bs.begin(), bs.end(), b); + if (p == bs.end()) { + bs.push_back(b); + cs.push_back(256 + n); + n++; + } + } + static wchar_t bytes_mapping[256] = {}; + for (size_t i = 0; i < 256; i++) { + bytes_mapping[i] = i; + } + for (size_t i = 0; i < bs.size(); i++) { + bytes_mapping[bs[i]] = cs[i]; + } + return bytes_mapping; +} + +void GPT_Tokenizer::unicode_to_bytes() { + for (int i = 0; i < 256; i++) { + bytes_decoder[bytes_encoder[i]] = (char)i; + } +} + +std::vector GPT_Tokenizer::split(std::string const &s, + std::regex rgx) { + std::vector elems; + std::sregex_token_iterator iter(s.begin(), s.end(), rgx, -1); + std::sregex_token_iterator end; + while (iter != end) { + elems.push_back(*iter); + ++iter; + } 
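+  // At this point `elems` holds the pieces of `s` that fall between matches of
+  // `rgx`; the delimiters themselves are dropped by the -1 token iterator.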
+ return elems; +}; + +std::string GPT_Tokenizer::strip(std::string const &inpt) { + if (inpt.length() == 0) { + return inpt; + } + auto start_it = inpt.begin(); + auto end_it = inpt.rbegin(); + while (std::isspace(*start_it)) { + ++start_it; + } + if (start_it == inpt.end()) { + return ""; + } + while (std::isspace(*end_it)) { + ++end_it; + } + return std::string(start_it, end_it.base()); +} + +std::unordered_set + GPT_Tokenizer::get_pairs(std::vector word) { + std::unordered_set pairs; + std::wstring prev_char = word[0]; + for (size_t i = 1; i < word.size(); ++i) { + pairs.insert(wbigram_pair({prev_char, word[i]})); + prev_char = word[i]; + } + return pairs; +}; + +void GPT_Tokenizer::load_vocab(std::string const &vocab_file) { + std::ifstream file_handle(vocab_file); + assert(file_handle.good() && "file not exists"); + bool discard_first_line = false; + if (discard_first_line) { + std::string first_line_discard; + std::getline(file_handle, first_line_discard); // skip the first line + } + json vocab_data_ = json::parse(file_handle, + /*parser_callback_t */ nullptr, + /*allow_exceptions */ true, + /*ignore_comments */ true); + auto vocab_ = vocab_data_.get>(); + for (auto item : vocab_) { + vocab.insert({item.first, item.second}); + inverse_vocab.insert({item.second, item.first}); + } +}; + +void GPT_Tokenizer::load_merge(std::string const &merge_file) { + bpe_ranks.reserve(60000); + std::ifstream file_handle(merge_file); + assert(file_handle.good() && "file not exists"); + std::string line; + uint32_t curr_idx = 0; + std::string version_substring = "#version:"; + while (getline(file_handle, line)) { + if (line.size() == 0 || line.rfind(version_substring, 0) == 0) { + continue; + } + std::vector bigrams = split(line); + assert(bigrams.size() == 2 && "unk format"); + wbigram_pair curr(utf8_to_wstring(bigrams[0]), utf8_to_wstring(bigrams[1])); + bpe_ranks.insert({curr, curr_idx}); + curr_idx++; + } +}; + +std::vector GPT_Tokenizer::bpe(std::wstring token) { + // bpe use wstring + if (cache.find(token) != cache.end()) { + return cache[token]; + } + std::vector wword; + for (auto c : token) { + wword.push_back(std::wstring(1, c)); + } + std::unordered_set pairs = get_pairs(wword); + if (pairs.empty()) { + return {wstring_to_utf8(token)}; + } + + while (true) { + auto bigram = pairs.begin(); + if (pairs.size() > 1) { + bigram = std::min_element( + pairs.begin(), + pairs.end(), + [this](wbigram_pair const &a, wbigram_pair const &b) -> bool { + if (bpe_ranks.find(a) == bpe_ranks.end()) { + return false; + } + if (bpe_ranks.find(b) == bpe_ranks.end()) { + return true; + } + return bpe_ranks[a] < bpe_ranks[b]; + }); + } + if (bpe_ranks.find(*bigram) == bpe_ranks.end()) { + break; + } + std::wstring first = bigram->first; + std::wstring second = bigram->second; + decltype(wword) new_wword; + + auto i = wword.begin(); + while (i < wword.end()) { + auto j = std::find(i, wword.end(), first); + if (j == wword.end()) { + new_wword.insert(new_wword.end(), i, wword.end()); + break; + } + new_wword.insert(new_wword.end(), i, j); + i = j; + // i <= wword.end + if (*i == first && i < wword.end() - 1 && *(i + 1) == second) { + new_wword.push_back(first + second); + i += 2; + } else { + new_wword.push_back(*i); + i += 1; + } + } + wword = new_wword; + if (wword.size() == 1) { + break; + } else { + pairs = get_pairs(wword); + } + } + std::vector word; + for (auto w : wword) { + word.push_back(wstring_to_utf8(w)); + } + if (token.size() < cache_word_max_length && cache.size() < cache_max_size) { + 
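+    // Memoize the merge result for short tokens (bounded by
+    // cache_word_max_length and cache_max_size) so repeated occurrences of the
+    // same token skip the BPE merge loop on later calls.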
cache.insert({token, word}); + } + return word; +}; + +std::vector GPT_Tokenizer::tokenize(std::string str) { + std::vector bpe_tokens; + std::wstring wstr = utf8_to_wstring(str); + std::wsregex_iterator iter(wstr.begin(), wstr.end(), pat); + std::wsregex_iterator end; + while (iter != end) { + std::wstring token; + for (char c : wstring_to_utf8(iter->str())) { + if (0 > c) { + token.push_back(*(bytes_encoder + c + 256)); + } else { + token.push_back(*(bytes_encoder + c)); + } + } + if (token.length() > 0) { + decltype(bpe_tokens) curr_bpe_tokens = bpe(token); + bpe_tokens.insert( + bpe_tokens.end(), curr_bpe_tokens.begin(), curr_bpe_tokens.end()); + } + ++iter; + } + return bpe_tokens; +} + +int32_t GPT_Tokenizer::convert_token_to_id(std::string token) { + auto p = vocab.find(token); + if (p != vocab.end()) { + return vocab[token]; + } else { + return vocab[unk_token]; + } +} + +void GPT_Tokenizer::encode(std::string str, + size_t max_length, + std::vector *input_ids, + std::vector *mask_ids) { + if (not input_ids->empty()) { + input_ids->clear(); + } + if (not mask_ids->empty()) { + mask_ids->clear(); + } + input_ids->reserve(max_length); + mask_ids->reserve(max_length); + // input_ids->push_back(vocab[bos_token]); + // mask_ids->push_back(1); + auto tokens = tokenize(str); + for (auto t : tokens) { + if (input_ids->size() == max_length - 1) { + break; + } + input_ids->push_back(convert_token_to_id(t)); + mask_ids->push_back(1); + } + // input_ids->push_back(vocab[eos_token]); + // mask_ids->push_back(1); + while (input_ids->size() < max_length) { + input_ids->push_back(vocab[pad_token]); + mask_ids->push_back(0); + } + if (mode == OPT_TOKENIZER) { + mask_ids->insert(mask_ids->begin(), 1); + input_ids->insert(input_ids->begin(), 2); + } +} + +std::string GPT_Tokenizer::decode(std::vector input_ids, + std::vector mask_ids) { + // look up each number in encoder.json dictionary + std::ostringstream oss; + int index = 0; + for (auto const &id : input_ids) { + if (index == 0) { + if (mode == OPT_TOKENIZER) { + if (id == 2) { + index++; + } + continue; + } + } + if (!mask_ids[index]) { + index++; + continue; + } + auto it = inverse_vocab.find(id); + if (it != inverse_vocab.end()) { + oss << it->second; + } else { + // Handle the case when the integer is not found in the inverse_vocab map. + // You can choose to ignore it, skip it, or handle it differently based on + // your requirements. 
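+      // Here an unknown id is treated as a hard error rather than skipped.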
+ assert(false); + } + index++; + } + std::string concatenated_tokens = oss.str(); + // apply byte_decoder to each character in the input_ids string, then decode + // as utf-8 + std::wstring wstr = utf8_to_wstring(concatenated_tokens); + std::string result; + for (wchar_t ch : wstr) { + result += bytes_decoder[ch]; + } + return result; +} diff --git a/src/runtime/graph.cc b/src/runtime/graph.cc index 5dbdae1ac0..f348ca9016 100644 --- a/src/runtime/graph.cc +++ b/src/runtime/graph.cc @@ -16,8 +16,11 @@ #include "flexflow/dominators.h" #include "flexflow/ffconst_utils.h" #include "flexflow/ops/aggregate.h" +#include "flexflow/ops/arg_topk.h" +#include "flexflow/ops/argmax.h" #include "flexflow/ops/attention.h" #include "flexflow/ops/batch_matmul.h" +#include "flexflow/ops/beam_topk.h" #include "flexflow/ops/cast.h" #include "flexflow/ops/concat.h" #include "flexflow/ops/conv_2d.h" @@ -25,19 +28,26 @@ #include "flexflow/ops/element_binary.h" #include "flexflow/ops/element_unary.h" #include "flexflow/ops/embedding.h" +#include "flexflow/ops/experts.h" #include "flexflow/ops/flat.h" #include "flexflow/ops/gather.h" #include "flexflow/ops/groupby.h" +#include "flexflow/ops/inc_multihead_self_attention.h" #include "flexflow/ops/layer_norm.h" #include "flexflow/ops/linear.h" #include "flexflow/ops/noop.h" #include "flexflow/ops/pool_2d.h" #include "flexflow/ops/reduce.h" #include "flexflow/ops/reshape.h" +#include "flexflow/ops/rms_norm.h" +#include "flexflow/ops/sampling.h" #include "flexflow/ops/softmax.h" +#include "flexflow/ops/spec_inc_multihead_self_attention.h" #include "flexflow/ops/split.h" #include "flexflow/ops/topk.h" #include "flexflow/ops/transpose.h" +#include "flexflow/ops/tree_inc_multihead_self_attention.h" +#include "flexflow/parallel_ops/allreduce.h" #include "flexflow/parallel_ops/combine.h" #include "flexflow/parallel_ops/fused_parallel_op.h" #include "flexflow/parallel_ops/partition.h" @@ -1953,14 +1963,61 @@ std::pair, std::unordered_map> } curr_best_graph = std::unique_ptr(graph); MachineView data_parallel_view; - data_parallel_view.device_type = MachineView::GPU; - data_parallel_view.ndims = 1; - data_parallel_view.dim[0] = - model->config.numNodes * model->config.workersPerNode; - data_parallel_view.stride[0] = 1; - data_parallel_view.start_device_id = 0; + int degree, num_transformer_layers_per_stage; + if (model->config.computationMode == COMP_MODE_TRAINING) { + data_parallel_view.device_type = MachineView::GPU; + data_parallel_view.ndims = 1; + data_parallel_view.dim[0] = + model->config.numNodes * model->config.workersPerNode; + data_parallel_view.stride[0] = 1; + data_parallel_view.start_device_id = 0; + } else { + // Currently assume a 1D machine view is needed + assert(model->config.data_parallelism_degree == 1 || + model->config.tensor_parallelism_degree == 1); + degree = model->config.data_parallelism_degree * + model->config.tensor_parallelism_degree; + num_transformer_layers_per_stage = + model->current_transformer_layer_id / + model->config.pipeline_parallelism_degree + + 1; + } for (auto const &node : curr_best_graph->inEdges) { - curr_optimal_views[node.first] = data_parallel_view; + Op const *op = node.first.ptr; + if (model->config.computationMode == COMP_MODE_TRAINING) { + curr_optimal_views[node.first] = data_parallel_view; + } else { + MachineView mv; + mv.device_type = MachineView::GPU; + mv.ndims = 1; + int total_parallel_degree = 1; + for (int i = 0; i < op->outputs[0]->num_dims; i++) { + total_parallel_degree *= op->outputs[0]->dims[i].degree; + } 
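+          // Inference machine views are 1-D: the extent is the product of the
+          // parallel degrees across every dimension of the op's first output.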
+ mv.dim[0] = total_parallel_degree; + mv.stride[0] = 1; + LayerID layer_guid = op->layer_guid; + if (op->op_type == OP_INPUT) { + // All inputs are assigned to the first stage + layer_guid.transformer_layer_id = 0; + } else if (layer_guid == LayerID::NO_ID) { + // Assert that we only have a single input + while (op->layer_guid == LayerID::NO_ID) { + assert(op->numInputs == 1); + op = op->inputs[0]->owner_op; + assert(op != nullptr); + } + layer_guid = op->layer_guid; + } + mv.start_device_id = degree * (layer_guid.transformer_layer_id / + num_transformer_layers_per_stage); + assert(mv.start_device_id + degree - 1 < + model->config.numNodes * model->config.workersPerNode); + curr_optimal_views[node.first] = mv; + for (int i = 0; i < node.first.ptr->numOutputs; i++) { + assert(node.first.ptr->outputs[i]->is_valid_machine_view(mv)); + } + } } } else { // Main step to optimize the PCG of an FFModel @@ -2229,23 +2286,17 @@ GraphOptimalViewSerialized case OP_EMBEDDING: { Embedding *embed = (Embedding *)op; sez.serialize(embed->layer_guid.id); + sez.serialize(embed->layer_guid.transformer_layer_id); sez.serialize(embed->num_entries); sez.serialize(embed->out_channels); sez.serialize(embed->aggr); sez.serialize(embed->data_type); break; } - case OP_EW_ADD: - case OP_EW_SUB: - case OP_EW_MUL: - case OP_EW_MAX: - case OP_EW_MIN: { - sez.serialize(op->op_type); - break; - } case OP_MULTIHEAD_ATTENTION: { MultiHeadAttention *attn = (MultiHeadAttention *)op; sez.serialize(attn->layer_guid.id); + sez.serialize(attn->layer_guid.transformer_layer_id); sez.serialize(attn->oProjSize); sez.serialize(attn->num_heads); sez.serialize(attn->qProjSize); @@ -2256,6 +2307,71 @@ GraphOptimalViewSerialized sez.serialize(attn->add_zero_attn); break; } + case OP_INC_MULTIHEAD_SELF_ATTENTION: { + IncMultiHeadSelfAttention *attn = (IncMultiHeadSelfAttention *)op; + sez.serialize(attn->layer_guid.id); + sez.serialize(attn->layer_guid.transformer_layer_id); + sez.serialize(attn->oProjSize); + sez.serialize(attn->num_q_heads); + sez.serialize(attn->qProjSize); + sez.serialize(attn->vProjSize); + sez.serialize(attn->dropout); + sez.serialize(attn->bias); + sez.serialize(attn->add_bias_kv); + sez.serialize(attn->add_zero_attn); + sez.serialize(attn->apply_rotary_embedding); + sez.serialize(attn->scaling_query); + sez.serialize(attn->scaling_factor); + sez.serialize(attn->qk_prod_scaling); + sez.serialize(attn->quantization_type); + sez.serialize(attn->offload); + sez.serialize(attn->num_kv_heads); + sez.serialize(attn->tensor_parallelism_degree); + break; + } + case OP_SPEC_INC_MULTIHEAD_SELF_ATTENTION: { + SpecIncMultiHeadSelfAttention *attn = + (SpecIncMultiHeadSelfAttention *)op; + sez.serialize(attn->layer_guid.id); + sez.serialize(attn->layer_guid.transformer_layer_id); + sez.serialize(attn->oProjSize); + sez.serialize(attn->num_q_heads); + sez.serialize(attn->qProjSize); + sez.serialize(attn->vProjSize); + sez.serialize(attn->dropout); + sez.serialize(attn->bias); + sez.serialize(attn->add_bias_kv); + sez.serialize(attn->add_zero_attn); + sez.serialize(attn->apply_rotary_embedding); + sez.serialize(attn->scaling_query); + sez.serialize(attn->scaling_factor); + sez.serialize(attn->qk_prod_scaling); + sez.serialize(attn->num_kv_heads); + break; + } + case OP_TREE_INC_MULTIHEAD_SELF_ATTENTION: { + TreeIncMultiHeadSelfAttention *attn = + (TreeIncMultiHeadSelfAttention *)op; + sez.serialize(attn->layer_guid.id); + sez.serialize(attn->layer_guid.transformer_layer_id); + sez.serialize(attn->oProjSize); + 
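+      // Serialization is purely positional: the field order below must mirror
+      // the order read back in FFModel::deserialize_graph_optimal_view.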
sez.serialize(attn->num_q_heads); + sez.serialize(attn->qProjSize); + sez.serialize(attn->vProjSize); + sez.serialize(attn->dropout); + sez.serialize(attn->bias); + sez.serialize(attn->add_bias_kv); + sez.serialize(attn->add_zero_attn); + sez.serialize(attn->apply_rotary_embedding); + sez.serialize(attn->scaling_query); + sez.serialize(attn->scaling_factor); + sez.serialize(attn->qk_prod_scaling); + sez.serialize(attn->quantization_type); + sez.serialize(attn->offload); + sez.serialize(attn->num_kv_heads); + sez.serialize(attn->tensor_parallelism_degree); + break; + } case OP_SOFTMAX: { Softmax *softmax = (Softmax *)op; sez.serialize(softmax->dim); @@ -2285,6 +2401,11 @@ GraphOptimalViewSerialized sez.serialize(combine->combine_degree); break; } + case OP_ALLREDUCE: { + AllReduce *allreduce = (AllReduce *)op; + sez.serialize(allreduce->allreduce_dim); + break; + } case OP_FUSED_PARALLEL: { FusedParallelOp *fused = (FusedParallelOp *)op; sez.serialize(fused->num_parallel_ops); @@ -2343,6 +2464,18 @@ void FFModel::register_all_machine_views( valid_views.push_back(view); } } + // No-parallelism views + for (int i = 1; i <= num_nodes * gpus_per_node; i++) { + if (num_nodes * gpus_per_node % i == 0) { + MachineView view; + view.device_type = MachineView::GPU; + view.ndims = 1; + view.dim[0] = i; + view.stride[0] = 0; + view.start_device_id = 0; + valid_views.push_back(view); + } + } // Two-dimensional views /* for (int i = 1; i <= num_nodes; i++) { */ /* for (int j = 1; j <= gpus_per_node; j++) { */ @@ -2499,10 +2632,11 @@ void FFModel::deserialize_graph_optimal_view( assert(num_inputs == 1); AggrMode aggr; int num_entries, out_channels; - size_t id; + size_t id, transformer_layer_id; DataType data_type; dez.deserialize(id); - LayerID layer_guid(id); + dez.deserialize(transformer_layer_id); + LayerID layer_guid(id, transformer_layer_id); dez.deserialize(num_entries); dez.deserialize(out_channels); dez.deserialize(aggr); @@ -2522,11 +2656,7 @@ void FFModel::deserialize_graph_optimal_view( case OP_EW_MUL: case OP_EW_MAX: case OP_EW_MIN: { - assert(num_inputs == 2); - OperatorType op_type; - dez.deserialize(op_type); - node = get_or_create_node({inputs[0], inputs[1]}, - {op_type}); + node = ElementBinary::deserialize(*this, dez, inputs, num_inputs); break; } case OP_CONV2D: { @@ -2577,9 +2707,10 @@ void FFModel::deserialize_graph_optimal_view( int embed_dim, num_heads, k_dim, v_dim; float dropout; bool bias, add_bias_kv, add_zero_attn; - size_t id; + size_t id, transformer_layer_id; dez.deserialize(id); - LayerID layer_guid(id); + dez.deserialize(transformer_layer_id); + LayerID layer_guid(id, transformer_layer_id); dez.deserialize(embed_dim); dez.deserialize(num_heads); dez.deserialize(k_dim); @@ -2603,26 +2734,188 @@ void FFModel::deserialize_graph_optimal_view( {inputs[0], inputs[1], inputs[2]}, params); break; } + case OP_INC_MULTIHEAD_SELF_ATTENTION: { + assert(num_inputs == 1); + int embed_dim, num_q_heads, k_dim, v_dim, num_kv_heads, + tensor_parallelism_degree; + float dropout, scaling_factor; + bool bias, add_bias_kv, add_zero_attn, apply_rotary_embedding, + scaling_query, qk_prod_scaling, offload; + DataType quantization_type; + size_t id, transformer_layer_id; + dez.deserialize(id); + dez.deserialize(transformer_layer_id); + LayerID layer_guid(id, transformer_layer_id); + dez.deserialize(embed_dim); + dez.deserialize(num_q_heads); + dez.deserialize(k_dim); + dez.deserialize(v_dim); + dez.deserialize(dropout); + dez.deserialize(bias); + dez.deserialize(add_bias_kv); + 
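+        // Keep reading fields in exactly the order they were serialized above
+        // for OP_INC_MULTIHEAD_SELF_ATTENTION; the deserializer has no field tags.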
dez.deserialize(add_zero_attn); + dez.deserialize(apply_rotary_embedding); + dez.deserialize(scaling_query); + dez.deserialize(scaling_factor); + dez.deserialize(qk_prod_scaling); + dez.deserialize(quantization_type); + dez.deserialize(offload); + dez.deserialize(num_kv_heads); + dez.deserialize(tensor_parallelism_degree); + + IncMultiHeadSelfAttentionParams params; + params.embed_dim = embed_dim; + params.num_q_heads = num_q_heads; + params.kdim = k_dim; + params.vdim = v_dim; + params.dropout = dropout; + params.bias = bias; + params.add_bias_kv = add_bias_kv; + params.add_zero_attn = add_zero_attn; + params.layer_guid = layer_guid; + params.apply_rotary_embedding = apply_rotary_embedding; + params.scaling_query = scaling_query; + params.scaling_factor = scaling_factor; + params.qk_prod_scaling = qk_prod_scaling; + params.quantization_type = quantization_type; + params.offload = offload; + params.num_kv_heads = num_kv_heads; + params.tensor_parallelism_degree = tensor_parallelism_degree; + node = get_or_create_node(inputs[0], params); + break; + } + case OP_SPEC_INC_MULTIHEAD_SELF_ATTENTION: { + assert(num_inputs == 1); + int embed_dim, num_q_heads, k_dim, v_dim, num_kv_heads; + float dropout, scaling_factor; + bool bias, add_bias_kv, add_zero_attn, apply_rotary_embedding, + scaling_query, qk_prod_scaling; + size_t id, transformer_layer_id; + dez.deserialize(id); + dez.deserialize(transformer_layer_id); + LayerID layer_guid(id, transformer_layer_id); + dez.deserialize(embed_dim); + dez.deserialize(num_q_heads); + dez.deserialize(k_dim); + dez.deserialize(v_dim); + dez.deserialize(dropout); + dez.deserialize(bias); + dez.deserialize(add_bias_kv); + dez.deserialize(add_zero_attn); + dez.deserialize(apply_rotary_embedding); + dez.deserialize(scaling_query); + dez.deserialize(scaling_factor); + dez.deserialize(qk_prod_scaling); + dez.deserialize(num_kv_heads); + + SpecIncMultiHeadSelfAttentionParams params; + params.embed_dim = embed_dim; + params.num_q_heads = num_q_heads; + params.kdim = k_dim; + params.vdim = v_dim; + params.dropout = dropout; + params.bias = bias; + params.add_bias_kv = add_bias_kv; + params.add_zero_attn = add_zero_attn; + params.layer_guid = layer_guid; + params.apply_rotary_embedding = apply_rotary_embedding; + params.scaling_query = scaling_query; + params.scaling_factor = scaling_factor; + params.qk_prod_scaling = qk_prod_scaling; + params.num_kv_heads = num_kv_heads; + node = get_or_create_node(inputs[0], + params); + break; + } + case OP_TREE_INC_MULTIHEAD_SELF_ATTENTION: { + assert(num_inputs == 1); + int embed_dim, num_q_heads, k_dim, v_dim, num_kv_heads, + tensor_parallelism_degree; + float dropout, scaling_factor; + bool bias, add_bias_kv, add_zero_attn, apply_rotary_embedding, + scaling_query, qk_prod_scaling, offload; + DataType quantization_type; + size_t id, transformer_layer_id; + dez.deserialize(id); + dez.deserialize(transformer_layer_id); + LayerID layer_guid(id, transformer_layer_id); + dez.deserialize(embed_dim); + dez.deserialize(num_q_heads); + dez.deserialize(k_dim); + dez.deserialize(v_dim); + dez.deserialize(dropout); + dez.deserialize(bias); + dez.deserialize(add_bias_kv); + dez.deserialize(add_zero_attn); + dez.deserialize(apply_rotary_embedding); + dez.deserialize(scaling_query); + dez.deserialize(scaling_factor); + dez.deserialize(qk_prod_scaling); + dez.deserialize(quantization_type); + dez.deserialize(offload); + dez.deserialize(num_kv_heads); + dez.deserialize(tensor_parallelism_degree); + + TreeIncMultiHeadSelfAttentionParams params; + 
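+        // Repack the deserialized fields into the params struct that
+        // get_or_create_node uses to look up or build the graph node.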
params.embed_dim = embed_dim; + params.num_q_heads = num_q_heads; + params.kdim = k_dim; + params.vdim = v_dim; + params.dropout = dropout; + params.bias = bias; + params.add_bias_kv = add_bias_kv; + params.add_zero_attn = add_zero_attn; + params.layer_guid = layer_guid; + params.apply_rotary_embedding = apply_rotary_embedding; + params.scaling_query = scaling_query; + params.scaling_factor = scaling_factor; + params.qk_prod_scaling = qk_prod_scaling; + params.quantization_type = quantization_type; + params.offload = offload; + params.num_kv_heads = num_kv_heads; + params.tensor_parallelism_degree = tensor_parallelism_degree; + node = get_or_create_node(inputs[0], + params); + break; + } case OP_TOPK: { node = TopK::deserialize(*this, dez, inputs, num_inputs); break; } + case OP_ARG_TOPK: { + node = ArgTopK::deserialize(*this, dez, inputs, num_inputs); + break; + } + case OP_BEAM_TOPK: { + node = BeamTopK::deserialize(*this, dez, inputs, num_inputs); + break; + } + case OP_SAMPLING: { + node = Sampling::deserialize(*this, dez, inputs, num_inputs); + break; + } + case OP_ARGMAX: { + node = ArgMax::deserialize(*this, dez, inputs, num_inputs); + break; + } case OP_GROUP_BY: { node = Group_by::deserialize(*this, dez, inputs, num_inputs); break; } case OP_AGGREGATE: { - // node = Aggregate::deserialize(*this, dez, inputs, num_inputs); - int n; - float lambda_bal; - dez.deserialize(n); - dez.deserialize(lambda_bal); - assert(num_inputs == n + 4); - AggregateParams params; - params.n = n; - params.lambda_bal = lambda_bal; - node = get_or_create_node( - {std::begin(inputs), std::begin(inputs) + num_inputs}, params); + node = Aggregate::deserialize( + *this, + dez, + {std::begin(inputs), std::begin(inputs) + num_inputs}, + num_inputs); + break; + } + case OP_EXPERTS: { + node = Experts::deserialize( + *this, + dez, + {std::begin(inputs), std::begin(inputs) + num_inputs}, + num_inputs); break; } case OP_POOL2D: { @@ -2648,6 +2941,10 @@ void FFModel::deserialize_graph_optimal_view( node = Transpose::deserialize(*this, dez, inputs, num_inputs); break; } + case OP_RMS_NORM: { + node = RMSNorm::deserialize(*this, dez, inputs, num_inputs); + break; + } case OP_COMBINE: { assert(num_inputs == 1); int combine_dim, combine_degree; @@ -2684,6 +2981,13 @@ void FFModel::deserialize_graph_optimal_view( {reduction_dim, reduction_degree}); break; } + case OP_ALLREDUCE: { + assert(num_inputs == 1); + int allreduce_dim; + dez.deserialize(allreduce_dim); + node = get_or_create_node(inputs[0], {allreduce_dim}); + break; + } case OP_FUSED_PARALLEL: { assert(num_inputs == 1); std::vector parallel_ops; diff --git a/src/runtime/hip_helper.cpp b/src/runtime/hip_helper.cpp index 375b4f3d53..fb570a33f5 100644 --- a/src/runtime/hip_helper.cpp +++ b/src/runtime/hip_helper.cpp @@ -247,22 +247,48 @@ __host__ void checkCUDA(hipHostFree(host_ptr)); } -miopenStatus_t - cudnnSetTensorDescriptorFromDomain(miopenTensorDescriptor_t tensor, - Domain domain) { +template +__host__ T *download_tensor(T const *ptr, size_t num_elements) { + // device synchronize to make sure the data are ready + // checkCUDA(hipDeviceSynchronize()); + T *host_ptr; + checkCUDA(hipHostMalloc(&host_ptr, + sizeof(T) * num_elements, + hipHostMallocPortable | hipHostMallocMapped)); + checkCUDA(hipMemcpy( + host_ptr, ptr, sizeof(T) * num_elements, hipMemcpyDeviceToHost)); + // checkCUDA(hipDeviceSynchronize()); + return host_ptr; +} + +template +__host__ bool download_tensor(T const *ptr, T *dst, size_t num_elements) { + // device synchronize to make sure the data 
are ready + // checkCUDA(hipDeviceSynchronize()); + assert(dst != nullptr); + checkCUDA( + hipMemcpy(dst, ptr, sizeof(T) * num_elements, hipMemcpyDeviceToHost)); + // checkCUDA(hipDeviceSynchronize()); + return true; +} + +miopenStatus_t cudnnSetTensorDescriptorFromDomain( + miopenTensorDescriptor_t tensor, Domain domain, DataType data_type) { int dims[MAX_TENSOR_DIM]; + miopenDataType_t cudnn_data_type = ff_to_cudnn_datatype(data_type); switch (domain.get_dim()) { case 1: { Rect<1> rect = domain; dims[0] = rect.hi[0] - rect.lo[0] + 1; - return miopenSet4dTensorDescriptor(tensor, miopenFloat, dims[0], 1, 1, 1); + return miopenSet4dTensorDescriptor( + tensor, cudnn_data_type, dims[0], 1, 1, 1); } case 2: { Rect<2> rect = domain; dims[0] = rect.hi[0] - rect.lo[0] + 1; dims[1] = rect.hi[1] - rect.lo[1] + 1; return miopenSet4dTensorDescriptor( - tensor, miopenFloat, dims[1], dims[0], 1, 1); + tensor, cudnn_data_type, dims[1], dims[0], 1, 1); } case 3: { Rect<3> rect = domain; @@ -270,7 +296,7 @@ miopenStatus_t dims[1] = rect.hi[1] - rect.lo[1] + 1; dims[2] = rect.hi[2] - rect.lo[2] + 1; return miopenSet4dTensorDescriptor( - tensor, miopenFloat, dims[2], dims[1], dims[0], 1); + tensor, cudnn_data_type, dims[2], dims[1], dims[0], 1); } case 4: { Rect<4> rect = domain; @@ -279,7 +305,7 @@ miopenStatus_t dims[2] = rect.hi[2] - rect.lo[2] + 1; dims[3] = rect.hi[3] - rect.lo[3] + 1; return miopenSet4dTensorDescriptor( - tensor, miopenFloat, dims[3], dims[2], dims[1], dims[0]); + tensor, cudnn_data_type, dims[3], dims[2], dims[1], dims[0]); } case 5: { Rect<5> rect = domain; @@ -290,7 +316,7 @@ miopenStatus_t dims[2] = rect.hi[2] - rect.lo[2] + 1; dims[3] = rect.hi[3] - rect.lo[3] + 1; return miopenSet4dTensorDescriptor( - tensor, miopenFloat, dims[3], dims[2], dims[1], dims[0]); + tensor, cudnn_data_type, dims[3], dims[2], dims[1], dims[0]); } default: assert(false && "Unsupported dim number"); @@ -300,6 +326,8 @@ miopenStatus_t miopenDataType_t ff_to_cudnn_datatype(DataType type) { switch (type) { + case DT_HALF: + return miopenHalf; case DT_FLOAT: return miopenFloat; case DT_DOUBLE: @@ -343,16 +371,23 @@ template __global__ void template __global__ void assign_kernel(int64_t *ptr, coord_t size, int64_t value); +template __global__ void + add_kernel(half *dst, half const *src, size_t size); template __global__ void add_kernel(float *dst, float const *src, size_t size); template __global__ void add_kernel(double *dst, double const *src, size_t size); -template __global__ void add_kernel(int *dst, int const *src, size_t size); template __global__ void - add_kernel(long *dst, long const *src, size_t size); + add_kernel(int32_t *dst, int32_t const *src, size_t size); +template __global__ void + add_kernel(int64_t *dst, int64_t const *src, size_t size); +template __global__ void + copy_kernel(half *dst, half const *src, coord_t size); template __global__ void copy_kernel(float *dst, float const *src, coord_t size); +template __global__ void + copy_kernel(double *dst, double const *src, coord_t size); template __global__ void copy_kernel(int32_t *dst, int32_t const *src, coord_t size); template __global__ void @@ -377,7 +412,33 @@ template __global__ void apply_add_with_scale(int64_t *data_ptr, template __host__ void print_tensor(float const *ptr, size_t rect, char const *prefix); +template __host__ void + print_tensor(double const *ptr, size_t rect, char const *prefix); template __host__ void print_tensor(int32_t const *ptr, size_t rect, char const *prefix); template __host__ void print_tensor(int64_t 
const *ptr, size_t rect, char const *prefix); +template __host__ void + print_tensor(half const *ptr, size_t rect, char const *prefix); + +template __host__ float *download_tensor(float const *ptr, + size_t num_elements); +template __host__ half *download_tensor(half const *ptr, + size_t num_elements); +template __host__ double *download_tensor(double const *ptr, + size_t num_elements); +template __host__ int32_t *download_tensor(int32_t const *ptr, + size_t num_elements); +template __host__ int64_t *download_tensor(int64_t const *ptr, + size_t num_elements); +template __host__ bool + download_tensor(float const *ptr, float *dst, size_t num_elements); +template __host__ bool download_tensor(double const *ptr, + double *dst, + size_t num_elements); +template __host__ bool download_tensor(int32_t const *ptr, + int32_t *dst, + size_t num_elements); +template __host__ bool download_tensor(int64_t const *ptr, + int64_t *dst, + size_t num_elements); diff --git a/src/runtime/inference_manager.cc b/src/runtime/inference_manager.cc new file mode 100644 index 0000000000..62ab947f8f --- /dev/null +++ b/src/runtime/inference_manager.cc @@ -0,0 +1,706 @@ +/* Copyright 2023 CMU, Stanford, Facebook, LANL + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "flexflow/ffconst_utils.h" +#include "flexflow/graph.h" +#include "flexflow/model.h" +#include "flexflow/ops/fused.h" +#include "flexflow/ops/noop.h" +#include "flexflow/parallel_ops/parallel_op.h" +#include "flexflow/request_manager.h" + +namespace FlexFlow { + +using namespace Legion; + +LegionRuntime::Logger::Category log_inf_mgr("InferenceManager"); +LegionRuntime::Logger::Category log_offload("Offloading"); + +InferenceManager::InferenceManager(FFConfig const &_config, + int _max_num_tokens_per_batch) + : ff_config(_config), max_num_tokens_per_batch(_max_num_tokens_per_batch) { + num_devices = ff_config.workersPerNode * ff_config.numNodes; + // Check parallelization degrees + assert(ff_config.data_parallelism_degree <= num_devices && + "Data parallelism degree exceeds number of available devices"); + assert(num_devices % ff_config.data_parallelism_degree == 0 && + "Number of available devices is not divisible by data parallelism " + "degree"); + assert(ff_config.tensor_parallelism_degree <= num_devices && + "Tensor parallelism degree exceeds number of available devices"); + assert(num_devices % ff_config.tensor_parallelism_degree == 0 && + "Number of available devices is not divisible by tensor parallelism " + "degree"); + assert(ff_config.pipeline_parallelism_degree <= num_devices && + "Pipeline parallelism degree exceeds number of available devices"); + assert(num_devices % ff_config.pipeline_parallelism_degree == 0 && + "Number of available devices is not divisible by pipeline parallelism " + "degree"); + assert(ff_config.data_parallelism_degree * + ff_config.tensor_parallelism_degree * + ff_config.pipeline_parallelism_degree == + num_devices && + "Product of data, tensor, and pipeline parallelism degrees does not " + "match 
the number of available devices"); +} + +InferenceManager *inference_manager_singleton = nullptr; + +/*static*/ +InferenceManager *InferenceManager::get_inference_manager() { + if (inference_manager_singleton == nullptr) { + FFConfig ffconfig; + inference_manager_singleton = + new InferenceManager(ffconfig, BatchConfig::MAX_NUM_TOKENS); + } + return inference_manager_singleton; +} + +bool parallel_tensor_list_overlaps(std::vector const &list1, + std::vector const &list2) { + for (auto const &pt1 : list1) { + for (auto const &pt2 : list2) { + if (pt1 == pt2) { + return true; + } + } + } + return false; +} + +void InferenceManager::compile_model_and_allocate_buffer(FFModel *model) { + // TODO: currently assume there is a single data-parallel pipeline + // (i.e., data-parallel-degree == 1) + assert(model->config.data_parallelism_degree == 1); + model->config.batchSize = max_num_tokens_per_batch; + model->compile_inference(); + Context ctx = model->config.lg_ctx; + Runtime *runtime = model->config.lg_hlr; + + // std::cout << std::endl << std::endl << "Operators MVs:" << std::endl; + int num_transformer_layers_per_stage = + model->current_transformer_layer_id / + model->config.pipeline_parallelism_degree + + 1; + int degree = model->config.data_parallelism_degree * + model->config.tensor_parallelism_degree; + + for (int op_idx = 0; op_idx < model->operators.size(); op_idx++) { + Op const *op = model->operators[op_idx]; + // Skip weight operators + if (op->op_type == OP_WEIGHT) { + continue; + } + // Get machine views + std::vector machine_views; + for (int j = 0; j < model->config.data_parallelism_degree; j++) { + MachineView mv; + mv.device_type == MachineView::GPU; + mv.ndims = 1; + // mv.start_device_id = 0; + mv.stride[0] = 1; + int parallel_degree = 1; + for (int k = 0; k < op->outputs[0]->num_dims; k++) { + parallel_degree *= op->outputs[0]->dims[k].degree; + } + mv.dim[0] = parallel_degree; + LayerID layer_guid = op->layer_guid; + if (op->op_type == OP_INPUT) { + // All inputs are assigned to the first stage + layer_guid.transformer_layer_id = 0; + } else if (layer_guid == LayerID::NO_ID) { + Op const *op_with_guid = op; + // Assert that we only have a single input + while (op_with_guid->layer_guid == LayerID::NO_ID) { + assert(op_with_guid->numInputs == 1); + op_with_guid = op_with_guid->inputs[0]->owner_op; + assert(op_with_guid != nullptr); + } + layer_guid = op_with_guid->layer_guid; + } + mv.start_device_id = degree * (layer_guid.transformer_layer_id / + num_transformer_layers_per_stage); + assert(mv == op->outputs[0]->machine_view); + machine_views.push_back(mv); + } + // std::cout << "operator: " << op->name << std::endl; + // for (int i = 0; i < op->numInputs; i++) { + // op->inputs[i]->print("input pt"); + // std::cout << "input mv: " << op->inputs[i]->machine_view << std::endl; + // } + // std::cout << "Op " << op->name << ": "; + for (int i = 0; i < op->numOutputs; i++) { + ParallelTensor pt_base = op->outputs[i]; + assert(tensor_buffer.find(pt_base) == tensor_buffer.end()); + + if (op->op_type == OP_REPLICATE) { + assert(op->numInputs == 1 && op->numOutputs == 1); + } + // pt_base->print("output pt"); + // std::cout << "output mv: " << pt_base->machine_view << std::endl; + + std::vector list; + bool found_parallel_tensor = false; + if (model->cpu_offload) { + for (auto const &pre_pt : tensor_buffer) { + bool used_by_future_operator = false; + bool used_by_current_operator = false; + if (pre_pt.first->get_shape() != pt_base->get_shape()) { + // Continue if shape mismatches + 
continue; + } + // Check that pt cannot be used as an input to the current operator + for (int j = 0; j < op->numInputs; j++) { + if (parallel_tensor_list_overlaps(tensor_buffer[op->inputs[j]], + pre_pt.second)) { + used_by_current_operator = true; + } + } + for (int j = 0; j < i; j++) { + assert(tensor_buffer.find(op->outputs[j]) != tensor_buffer.end()); + if (parallel_tensor_list_overlaps(tensor_buffer[op->outputs[j]], + pre_pt.second)) { + used_by_current_operator = true; + } + } + // Check that pt cannot be used by any subsequent operators + for (int op_idx2 = op_idx; op_idx2 < model->operators.size(); + op_idx2++) { + Op const *op2 = model->operators[op_idx2]; + for (int j = 0; j < op2->numInputs; j++) { + if (tensor_buffer.find(op2->inputs[j]) != tensor_buffer.end()) { + if (parallel_tensor_list_overlaps(tensor_buffer[op2->inputs[j]], + pre_pt.second)) { + used_by_future_operator = true; + } + } + } + } + if (!used_by_future_operator && !used_by_current_operator) { + found_parallel_tensor = true; + list = pre_pt.second; + } + } + if (!found_parallel_tensor) { + log_offload.print( + "Cannot find a previous tensor for operator(%d) output_idx(%d)", + op_idx, + i); + } + } + if (!found_parallel_tensor) { + for (int j = 0; j < model->config.data_parallelism_degree; j++) { + // Copy the metadata from pt_base to pt + ParallelTensor pt = new ParallelTensorBase(*pt_base); + pt->region = + runtime->create_logical_region(ctx, + pt_base->region.get_index_space(), + pt_base->region.get_field_space()); + pt->part = runtime->get_logical_partition( + ctx, pt->region, pt_base->part.get_index_partition()); + pt->machine_view = machine_views[j]; + // std::cout << "output mv: " << pt->machine_view << std::endl; + Domain part_domain = + runtime->get_index_space_domain(ctx, pt_base->parallel_is); + assert(pt->machine_view.get_domain() == part_domain); + list.push_back(pt); + } + } + assert(tensor_buffer.find(pt_base) == tensor_buffer.end()); + tensor_buffer[pt_base] = list; + } + // std::cout << std::endl; + } +} + +void InferenceManager::init_operators_inference(FFModel *model) { + for (int batch_index = 0; batch_index < model->config.data_parallelism_degree; + batch_index++) { + int expert_device_index = 0; + int device_index = batch_index % num_devices; + for (size_t o = 0; o < model->operators.size(); o++) { + Op *op = model->operators[o]; + if (op->op_type == OP_WEIGHT) { + continue; + } + std::vector inputs(op->numInputs); + std::vector outputs(op->numOutputs); + for (int i = 0; i < op->numInputs; i++) { + assert(op->inputs[i] != nullptr); + assert(op->inputs[i]->parallel_is != IndexSpace::NO_SPACE); + assert(tensor_buffer[op->inputs[i]].size() > batch_index); + inputs[i] = tensor_buffer[op->inputs[i]][batch_index]; + assert(inputs[i]->parallel_is != IndexSpace::NO_SPACE); + } + assert(op->numOutputs > 0); + for (int i = 0; i < op->numOutputs; i++) { + assert(op->outputs[i] != nullptr); + assert(op->outputs[i]->parallel_is != IndexSpace::NO_SPACE); + assert(tensor_buffer[op->outputs[i]].size() > batch_index); + outputs[i] = tensor_buffer[op->outputs[i]][batch_index]; + if (i > 0) { + assert(outputs[0]->machine_view == outputs[i]->machine_view); + } + assert(outputs[i]->parallel_is != IndexSpace::NO_SPACE); + } + if (op->is_parallel_op()) { + ((ParallelOp *)op) + ->create_input_partition_inference(*model, inputs, outputs); + } + op->init_inference(*model, inputs, outputs); + } + } +} + +FutureMap InferenceManager::inference(FFModel *model, + int index, + BatchConfig const &bc) { + if (bc.get_mode() 
== INC_DECODING_MODE) { + BatchConfigFuture bcf = Future::from_value(bc); + return inference(model, index, bcf); + } else if (bc.get_mode() == BEAM_SEARCH_MODE) { + BatchConfig const *bc_ptr = &bc; + BeamSearchBatchConfig const *bsbc_ptr = + static_cast(bc_ptr); + BeamSearchBatchConfigFuture bcf = + Future::from_value(*bsbc_ptr); + return inference(model, index, bcf); + } else if (bc.get_mode() == TREE_VERIFY_MODE) { + BatchConfig const *bc_ptr = &bc; + TreeVerifyBatchConfig const *tvbc_ptr = + static_cast(bc_ptr); + TreeVerifyBatchConfigFuture bcf = + Future::from_value(*tvbc_ptr); + return inference(model, index, bcf); + } else { + assert(false && "Unsupported inference mode"); + } +} + +FutureMap InferenceManager::inference(FFModel *model, + int index, + BatchConfigFuture const &bc) { + // log_inf_mgr.print("mode(%d) num_active_tokens(%d) num_active_requests(%d)", + // bc.get_mode(), + // bc.num_active_tokens(), + // bc.num_active_requests()); + // assert(bc.num_active_tokens() > 0 && bc.num_active_requests() > 0); + // We currently assume that the index-th batch will be placed + // on the device_index-th device (except for the experts layers) + int batch_index = index % model->config.data_parallelism_degree; + FutureMap fm; + bool found_input_operator = false; + for (size_t o = 0; o < model->operators.size(); o++) { + Op *op = model->operators[o]; + if (op->op_type == OP_WEIGHT) { + continue; + } + if (op->op_type == OP_INPUT) { + // FIXME: this is a hack, should be replace with an input ParallelTensor + if (found_input_operator) { + // there is another input for position embedding; + // now only used in opt model, this input should be init after token + // input. + assert(op->numOutputs == 1); + ParallelTensor pt = tensor_buffer[op->outputs[0]][batch_index]; + load_positions(bc, pt, model->position_offset); + } else { + found_input_operator = true; + assert(op->numOutputs == 1); + ParallelTensor pt = tensor_buffer[op->outputs[0]][batch_index]; + load_input_tokens_from_batch_config(bc, pt); + } + } + + std::vector inputs(op->numInputs); + std::vector outputs(op->numOutputs); + for (int i = 0; i < op->numInputs; i++) { + assert(op->inputs[i] != nullptr); + assert(op->inputs[i]->parallel_is != IndexSpace::NO_SPACE); + assert(tensor_buffer[op->inputs[i]].size() > batch_index); + inputs[i] = tensor_buffer[op->inputs[i]][batch_index]; + assert(inputs[i]->parallel_is != IndexSpace::NO_SPACE); + } + for (int i = 0; i < op->numOutputs; i++) { + assert(op->outputs[i] != nullptr); + assert(op->outputs[i]->parallel_is != IndexSpace::NO_SPACE); + if (op->op_type == OP_INPUT && + tensor_buffer[op->outputs[i]].size() == 0) { + continue; + } + assert(tensor_buffer[op->outputs[i]].size() > batch_index); + outputs[i] = tensor_buffer[op->outputs[i]][batch_index]; + assert(outputs[i]->parallel_is != IndexSpace::NO_SPACE); + } + fm = op->inference(*model, bc, inputs, outputs); + } + return fm; +}; + +void InferenceManager::load_input_tokens_from_batch_config( + BatchConfigFuture const &bc, ParallelTensor const input) { + Context ctx = ff_config.lg_ctx; + Runtime *runtime = ff_config.lg_hlr; + size_t machine_view_hash = input->machine_view.hash(); + ArgumentMap argmap; + IndexLauncher launcher(RM_LOAD_TOKENS_TASK_ID, + input->parallel_is, + TaskArgument(nullptr, 0), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_future(bc); + launcher.add_region_requirement(RegionRequirement( + input->part, 0 /*projection id*/, WRITE_ONLY, EXCLUSIVE, 
input->region)); + launcher.add_field(0, FID_DATA); + runtime->execute_index_space(ctx, launcher); +} + +void InferenceManager::load_positions(BatchConfigFuture const &bc, + ParallelTensor position_input, + int offset) { + Context ctx = ff_config.lg_ctx; + Runtime *runtime = ff_config.lg_hlr; + size_t machine_view_hash = position_input->machine_view.hash(); + ArgumentMap argmap; + IndexLauncher launcher(RM_LOAD_POSITION_TASK_ID, + position_input->parallel_is, + TaskArgument(&offset, sizeof(int)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + machine_view_hash); + launcher.add_future(bc); + launcher.add_region_requirement(RegionRequirement(position_input->part, + 0 /*projection id*/, + WRITE_ONLY, + EXCLUSIVE, + position_input->region)); + launcher.add_field(0, FID_DATA); + runtime->execute_index_space(ctx, launcher); +} + +void FFModel::set_transformer_layer_id(int id) { + // We assume that users call this function with + // monotonically increasing ids + assert(id == current_transformer_layer_id + 1 || + (id == 0 && current_transformer_layer_id == 0)); + current_transformer_layer_id = id; + assert(id < MAX_NUM_TRANSFORMER_LAYERS); +} + +void FFModel::set_position_offset(int offset) { + assert(offset == 0 || offset == 2); + position_offset = offset; +} + +void FFModel::compile_inference() { + Context ctx = config.lg_ctx; + Runtime *runtime = config.lg_hlr; + config.computationMode = COMP_MODE_INFERENCE; + create_operators_from_layers(); + // Launch the graph optimize task + { + FFModel *model = this; + TaskLauncher launcher(GRAPH_OPTIMIZE_TASK_ID, + TaskArgument(&model, sizeof(FFModel *))); + Future future = runtime->execute_task(ctx, launcher); + + PCG::GraphOptimalViewSerialized ret = + future.get_result(); + Deserializer dez(ret.data, ret.total_bytes); + // Reconstruct operators + PCG::Graph *best_graph = new PCG::Graph(this); + std::unordered_map optimal_views; + deserialize_graph_optimal_view(dez, best_graph, optimal_views); + operators.clear(); + convert_graph_to_operators(best_graph, optimal_views); + best_graph->print_dot(); + delete best_graph; + for (auto const &layer : layers) { + // map inputs to parallel tensor + if (layer->op_type == OP_INPUT) { + Tensor tensor = layer->outputs[0]; + ParallelTensor parallel_tensor = nullptr; + for (auto const &op : operators) { + if (op->op_type == OP_INPUT) { + NoOp *noop = (NoOp *)op; + if (noop->input_tensor_guid == tensor->tensor_guid) { + parallel_tensor = op->outputs[0]; + } + } + } + assert(parallel_tensor != nullptr); + tensor->parallel_tensor = parallel_tensor; + } + // map weights to parallel_tensor + for (int i = 0; i < layer->numWeights; i++) { + assert(layer->weights[i] != nullptr); + Tensor weight = layer->weights[i]; + ParallelTensor parallel_weight = nullptr; + for (auto const &op : operators) { + if (op->layer_guid == layer->layer_guid) { + assert(op->op_type == layer->op_type); + assert(op->numWeights == layer->numWeights); + parallel_weight = op->weights[i]; + } + } + assert(parallel_weight != nullptr); + weight->parallel_tensor = parallel_weight; + } + } + } + loss_op = nullptr; + metrics_op = nullptr; + // Perform inplace optimizations + if (config.enable_inplace_optimizations) { + for (size_t l = 1; l < operators.size(); l++) { + if (operators[l]->can_inplace_output()) { + // Assume outputs[0] is inplace with inputs[0] + assert(operators[l]->numOutputs == 1); + if (operators[l]->inputs[0]->owner_op != NULL) { + // int dim1 = operators[l]->outputs[0]->num_dims; + // int dim2 = 
operators[l]->inputs[0]->num_dims; + MachineView view1 = operators[l]->outputs[0]->machine_view; + MachineView view2 = operators[l]->inputs[0]->machine_view; + if (view1 == view2) { + // Check no others also need operators[l]->inputs[0] + bool found = false; + for (size_t i = 0; i < operators.size(); i++) { + if (i == l) { + continue; + } + for (int j = 0; j < operators[i]->numInputs; j++) { + if ((operators[i]->inputs[j]->owner_op == + operators[l]->inputs[0]->owner_op) && + (operators[i]->inputs[j]->owner_idx == + operators[l]->inputs[0]->owner_idx)) { + found = true; + } + } + } + if (!found) { + // Perform inplace + operators[l]->do_inplace_output(); + } + } + } + } + } + } + + for (size_t l = 0; l < operators.size(); l++) { + Op *op = operators[l]; + + for (int i = 0; i < op->numInputs; i++) { + assert(op->inputs[i]->owner_op != NULL); + } + for (int i = 0; i < op->numWeights; i++) { + assert(op->weights[i]->owner_op != NULL); + assert(op->weights[i]->region != LogicalRegion::NO_REGION); + parameters.push_back(op->weights[i]); + } + op->map_output_tensors(*this); + } + + // Check correctness + for (size_t l = 0; l < operators.size(); l++) { + Op *op = operators[l]; + for (int i = 0; i < op->numOutputs; i++) { + assert(op->outputs[i]->owner_op == op); + assert(op->outputs[i]->owner_idx == i); + assert(op->outputs[i]->parallel_tensor_guid != 0); + } + } + // Perform fusion optimizations + if (config.perform_fusion) { + fprintf(stderr, "Applying fusion optimizations during compilation...\n"); + fprintf(stderr, "%zu operators before fusion...\n", operators.size()); + std::vector new_operators; + std::vector old_operators = operators; + while (apply_fusion(operators, new_operators)) { + for (size_t i = 0; i < new_operators.size(); i++) { + for (int idx = 0; idx < new_operators[i]->numInputs; idx++) { + for (size_t j = i + 1; j < new_operators.size(); j++) { + if (new_operators[i]->inputs[idx]->owner_op == new_operators[j]) { + assert(false); + } + } + } + } + operators = new_operators; + } + // Check integrity + for (size_t l = 0; l < operators.size(); l++) { + if (operators[l]->op_type == OP_FUSED) { + FusedOp *fused = (FusedOp *)operators[l]; + int ioff = 0, woff = 0, ooff = 0; + for (int op = 0; op < fused->numOperators; op++) { + Op *old_op = fused->operators[op]; + for (int i = 0; i < fused->op_num_inputs[op]; i++) { + int my_off = fused->op_input_idx[i + ioff]; + if (fused->op_input_source[i + ioff] == FusedOp::SOURCE_INPUT) { + assert(fused->inputs[my_off]->region == + old_op->inputs[i]->region); + } else if (fused->op_input_source[i + ioff] == + FusedOp::SOURCE_OUTPUT) { + assert(fused->outputs[my_off]->region == + old_op->inputs[i]->region); + } else { + assert(false); + } + } + for (int i = 0; i < fused->op_num_weights[op]; i++) { + int my_off = fused->op_weight_idx[i + woff]; + assert(fused->op_weight_source[i + woff] == FusedOp::SOURCE_WEIGHT); + assert(fused->weights[my_off]->region == + old_op->weights[i]->region); + } + for (int i = 0; i < fused->op_num_outputs[op]; i++) { + int my_off = fused->op_output_idx[i + ooff]; + assert(fused->op_output_source[i + ooff] == FusedOp::SOURCE_OUTPUT); + assert(fused->outputs[my_off]->region == + old_op->outputs[i]->region); + } + ioff += fused->op_num_inputs[op]; + woff += fused->op_num_weights[op]; + ooff += fused->op_num_outputs[op]; + } + } else { + bool found = false; + for (size_t i = 0; i < old_operators.size(); i++) { + if (old_operators[i] == operators[l]) { + assert(!found); + found = true; + } + } + assert(found); + } + } + 
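+    // Worked example of the index bookkeeping checked above (numbers are
+    // illustrative): if a FusedOp wraps two operators with 2 and 1 inputs
+    // respectively, the first operator's entries in op_input_idx /
+    // op_input_source are read at offsets ioff = 0 and 1, and the second
+    // operator's single entry at ioff = 2; woff and ooff advance over the
+    // weight and output index arrays in the same way.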
fprintf(stderr, "%zu operators after fusion...\n", operators.size()); + for (size_t i = 0; i < operators.size(); i++) { + Op *op = operators[i]; + printf("operator[%zu]: type(%s) guid(%lu)\n", + i, + get_operator_type_name(operators[i]->op_type).c_str(), + operators[i]->op_guid); + for (int j = 0; j < op->numInputs; j++) { + LogicalRegion handle = op->inputs[j]->region; + printf("\tinputs[%d] region(%d,%d,%d)\n", + j, + handle.get_index_space().get_id(), + handle.get_field_space().get_id(), + handle.get_tree_id()); + } + for (int j = 0; j < op->numOutputs; j++) { + LogicalRegion handle = op->outputs[j]->region; + printf("\toutputs[%d] region(%d,%d,%d)\n", + j, + handle.get_index_space().get_id(), + handle.get_field_space().get_id(), + handle.get_tree_id()); + } + for (int j = 0; j < op->numWeights; j++) { + LogicalRegion handle = op->weights[j]->region; + printf("\tweights[%d] region(%d,%d,%d)\n", + j, + handle.get_index_space().get_id(), + handle.get_field_space().get_id(), + handle.get_tree_id()); + } + } + } + for (size_t i = 0; i < operators.size(); i++) { + Op *op = operators[i]; + printf("operator[%zu]: type(%d)\n", i, operators[i]->op_type); + for (int j = 0; j < op->numInputs; j++) { + LogicalRegion handle = op->inputs[j]->region; + printf("\tinputs[%d] region(%d,%d,%d)\n", + j, + handle.get_index_space().get_id(), + handle.get_field_space().get_id(), + handle.get_tree_id()); + } + for (int j = 0; j < op->numOutputs; j++) { + LogicalRegion handle = op->outputs[j]->region; + printf("\toutputs[%d] region(%d,%d,%d)\n", + j, + handle.get_index_space().get_id(), + handle.get_field_space().get_id(), + handle.get_tree_id()); + } + } +#ifdef FF_USE_NCCL + for (size_t l = 0; l < operators.size(); l++) { + // Only create nccl for allreduce and fusedop for inference + // (fusedop may include allreduces) + if (operators[l]->op_type == OP_ALLREDUCE || + operators[l]->op_type == OP_FUSED) { + MachineView view = operators[l]->outputs[0]->machine_view; + if (view_hash_to_nccl_comms.find(view.hash()) == + view_hash_to_nccl_comms.end()) { + TaskLauncher launcher(NCCL_GETUNIQUEID_TASK_ID, TaskArgument(NULL, 0)); + Future future = runtime->execute_task(ctx, launcher); + ncclUniqueId ncclId = future.get_result(); + IndexSpace task_is = get_or_create_task_is(view); + ArgumentMap argmap; + IndexLauncher index_launcher( + NCCL_INIT_COMMS_TASK_ID, + task_is, + TaskArgument(&ncclId, sizeof(ncclUniqueId)), + argmap, + Predicate::TRUE_PRED, + false /*must*/, + 0 /*mapper_id*/, + view.hash() /*MappingTagID*/); + FutureMap fm = runtime->execute_index_space(ctx, index_launcher); + fm.wait_all_results(); + int idx = 0; + Domain task_domain = runtime->get_index_space_domain(ctx, task_is); + ncclComm_t *nccl_comms = + (ncclComm_t *)malloc(sizeof(ncclComm_t) * task_domain.get_volume()); + for (Domain::DomainPointIterator it(task_domain); it; it++, idx++) { + nccl_comms[idx] = fm.get_result(*it); + } + view_hash_to_nccl_comms[view.hash()] = nccl_comms; + } + } + } +#endif +} + +std::string join_path(std::vector const &paths) { + std::string joined; + for (auto const &path : paths) { + if (joined.empty()) { + joined = path; + } else { + if (path[0] == '/') { + joined = path; + } else if (joined.back() != '/') { + joined += '/'; + joined += path; + } else { + joined += path; + } + } + } + return joined; +} + +}; // namespace FlexFlow diff --git a/src/runtime/layer.cc b/src/runtime/layer.cc index 6dfd5f2f35..d2473f4b2b 100644 --- a/src/runtime/layer.cc +++ b/src/runtime/layer.cc @@ -16,8 +16,9 @@ Layer::Layer(FFModel 
*model, const Tensor _input3, const Tensor _input4) : op_type(_otype), data_type(_dtype), - layer_guid(model->layer_global_guid++), numInputs(_numInputs), - numWeights(_numWeights), numOutputs(_numOutputs) { + layer_guid(model->layer_global_guid++, + model->current_transformer_layer_id), + numInputs(_numInputs), numWeights(_numWeights), numOutputs(_numOutputs) { std::string pcname; if (_name == nullptr) { pcname = get_operator_type_name(op_type); @@ -50,8 +51,9 @@ Layer::Layer(FFModel *model, int _numOutputs, Tensor const *_tensors) : op_type(_otype), data_type(_dtype), - layer_guid(model->layer_global_guid++), numInputs(_numInputs), - numWeights(_numWeights), numOutputs(_numOutputs) { + layer_guid(model->layer_global_guid++, + model->current_transformer_layer_id), + numInputs(_numInputs), numWeights(_numWeights), numOutputs(_numOutputs) { std::string pcname; if (_name == nullptr) { pcname = get_operator_type_name(op_type); diff --git a/src/runtime/memory_allocator.cc b/src/runtime/memory_allocator.cc new file mode 100644 index 0000000000..06a7c468a4 --- /dev/null +++ b/src/runtime/memory_allocator.cc @@ -0,0 +1,54 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/utils/memory_allocator.h" + +namespace FlexFlow { + +// declare Legion names +using Legion::coord_t; +using Legion::Memory; +using Realm::RegionInstance; + +MemoryAllocator::MemoryAllocator(Memory _memory) + : memory(_memory), reserved_ptr(nullptr), instance_ptr(nullptr), + reserved_total_size(0), reserved_allocated_size(0), + instance_total_size(0), instance_allocated_size(0) {} + +void MemoryAllocator::create_legion_instance(RegionInstance &inst, + size_t size) { + // Assert that we have used up previously created region instance + assert(instance_total_size == instance_allocated_size); + Realm::Rect<1, coord_t> bounds(Realm::Point<1, coord_t>(0), + Realm::Point<1, coord_t>(size - 1)); + std::vector field_sizes; + field_sizes.push_back(sizeof(char)); + Realm::RegionInstance::create_instance( + inst, memory, bounds, field_sizes, 0, Realm::ProfilingRequestSet()) + .wait(); + instance_ptr = inst.pointer_untyped(0, 0); + instance_total_size = size; + instance_allocated_size = 0; +} + +void MemoryAllocator::register_reserved_work_space(void *base, size_t size) { + // Assert that we haven't allocated anything before + assert(reserved_total_size == 0); + reserved_ptr = base; + reserved_total_size = size; + reserved_allocated_size = 0; +} + +}; // namespace FlexFlow diff --git a/src/runtime/model.cc b/src/runtime/model.cc index dbe4a7d92c..43b5df1f39 100644 --- a/src/runtime/model.cc +++ b/src/runtime/model.cc @@ -24,9 +24,12 @@ #include "flexflow/mapper.h" #include "flexflow/ops/aggregate.h" #include "flexflow/ops/aggregate_spec.h" +#include "flexflow/ops/arg_topk.h" +#include "flexflow/ops/argmax.h" #include "flexflow/ops/attention.h" #include "flexflow/ops/batch_matmul.h" #include "flexflow/ops/batch_norm.h" +#include "flexflow/ops/beam_topk.h" #include "flexflow/ops/cache.h" #include "flexflow/ops/cast.h" #include "flexflow/ops/concat.h" @@ -35,10 +38,12 @@ #include "flexflow/ops/element_binary.h" #include "flexflow/ops/element_unary.h" #include "flexflow/ops/embedding.h" +#include "flexflow/ops/experts.h" #include "flexflow/ops/flat.h" #include "flexflow/ops/fused.h" #include "flexflow/ops/gather.h" #include "flexflow/ops/groupby.h" +#include "flexflow/ops/inc_multihead_self_attention.h" #include "flexflow/ops/layer_norm.h" #include "flexflow/ops/linear.h" #include "flexflow/ops/noop.h" @@ -46,15 +51,21 @@ #include "flexflow/ops/reduce.h" #include "flexflow/ops/reshape.h" #include "flexflow/ops/reverse.h" +#include "flexflow/ops/rms_norm.h" +#include "flexflow/ops/sampling.h" #include "flexflow/ops/softmax.h" +#include "flexflow/ops/spec_inc_multihead_self_attention.h" #include "flexflow/ops/split.h" #include "flexflow/ops/topk.h" #include "flexflow/ops/transpose.h" +#include "flexflow/ops/tree_inc_multihead_self_attention.h" +#include "flexflow/parallel_ops/allreduce.h" #include "flexflow/parallel_ops/combine.h" #include "flexflow/parallel_ops/fused_parallel_op.h" #include "flexflow/parallel_ops/partition.h" #include "flexflow/parallel_ops/reduction.h" #include "flexflow/parallel_ops/replicate.h" +#include "flexflow/request_manager.h" #include "flexflow/substitution.h" #include "flexflow/utils/random_utils.h" #include "flexflow/utils/test_utils.h" @@ -591,11 +602,35 @@ ncclComm_t Op::init_nccl_comms_task(Task const *task, } #endif +/** + * @brief The ParallelDimMappingRecord class's constructor. It sets the object's + * type field equal to the value passed as the constructor's argument, and + * initializes all other fields to -1. 
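+ *
+ * For instance (values purely illustrative, and assuming the trailing
+ * MappingOperation argument of the factory methods below is optional), a
+ * record built via the INPUT_OUTPUT factory leaves the weight fields at -1:
+ * @code
+ *   ParallelDimMappingRecord r =
+ *       ParallelDimMappingRecord::input_output_record(0, 1, 0, 1);
+ *   // r.weight_idx and r.weight_dim remain -1
+ * @endcode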
+ * + * @param[in] type The MappingRecordType to use to initialize the + * ParallelDimMappingRecord. + */ ParallelDimMappingRecord::ParallelDimMappingRecord(MappingRecordType type) : type(type), output_dim(-1), input_dim(-1), weight_dim(-1), output_idx(-1), input_idx(-1), weight_idx(-1) {} /*static*/ +/** + * @brief Builds and initializes a ParallelDimMappingRecord object of + * INPUT_OUTPUT MappingRecordType. + * + * This function should be used to create a ParallelDimMappingRecord to track an + * operator's dimension relation between the input and the output tensor + * + * @param[in] input_idx The index of the input tensor (nonzero if there are + * multiple inputs) + * @param[in] input_dim The index of the input dimension part of the + * dimension relation + * @param[in] output_idx The index of the output tensor (nonzero if there are + * multiple outputs) + * @param[in] output_dim The index of the output dimension part of the + * dimension relation + */ ParallelDimMappingRecord ParallelDimMappingRecord::input_output_record( int input_idx, int input_dim, @@ -619,6 +654,22 @@ ParallelDimMappingRecord ParallelDimMappingRecord::input_output_record( } /*static*/ +/** + * @brief Builds and initializes a ParallelDimMappingRecord object of + * INPUT_WEIGHT MappingRecordType. + * + * This function should be used to create a ParallelDimMappingRecord to track an + * operator's dimension relation between the input and the weights tensor + * + * @param[in] input_idx The index of the input tensor (nonzero if there are + * multiple inputs) + * @param[in] input_dim The index of the input dimension part of the + * dimension relation + * @param[in] weight_idx The index of the weight tensor (nonzero if there are + * multiple weights) + * @param[in] weight_dim The index of the weight dimension part of the + * dimension relation + */ ParallelDimMappingRecord ParallelDimMappingRecord::input_weight_record( int input_idx, int input_dim, @@ -646,6 +697,39 @@ MappingRecordType ParallelDimMappingRecord::get_type() const { } /*static*/ +/** @brief A wrapper around the main version of the + * construct_weight_parallel_dims function. + * + * This wrapper allows you to append multiple dimension relations at once to a + * vector of ParallelDimMappingRecord entries. The relations must be between + * dimensions of the same pair of input and weight tensors. Unlike the other + * construct_weight_parallel_dims wrapper below, this function allows you to + * specify the MappingOperation for each pair of dimensions for which you will + * be creating a new ParallelDimMappingRecord. + * + * The function takes a vector of (int, MappingOperation, int) tuples, where the + * int members represent the indexes of the two dimensions in a relation, and + * the MappingOperation member specifies the type of mapping operation. Just + * like the other wrapper, this function simply calls the main version of + * construct_weight_parallel_dims for each pair, using the same values across + * all calls for all other parameters. + * + * This function should NOT be used to track dimension relations between the + * input and weights tensors; construct_weight_parallel_dims should be used + * instead. 
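+ *
+ * A minimal call sketch (the indices and the mapping operation are
+ * illustrative placeholders, not taken from a real operator):
+ * @code
+ *   std::vector<ParallelDimMappingRecord> records;
+ *   std::vector<std::tuple<int, MappingOperation, int>> mappings;
+ *   mappings.push_back(std::make_tuple(0, MappingOperation::PARTITION, 0));
+ *   Op::construct_weight_parallel_dims(records, mappings,
+ *                                      0,   // input_idx
+ *                                      0);  // weight_idx
+ * @endcode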
+ * + * @param[out] records The (potentially empty) vector of existing + * ParallelDimMappingRecord entries + * @param[in] mappings A vector of tuples, each including a pair of + * integers (representing the indexes of the input and weight dimensions in a + * relation), and a MappingOperation, specifying the mapping operation for the + * pair of dimensions. + * @param[in] input_idx The index of the input tensor (nonzero if there are + * multiple inputs) + * @param[in] weight_idx The index of the weight tensor (nonzero if there are + * multiple weights) + * + */ void Op::construct_weight_parallel_dims( std::vector &records, std::vector> mappings, @@ -662,6 +746,30 @@ void Op::construct_weight_parallel_dims( } /*static*/ +/** @brief A wrapper around the main version of the + * construct_weight_parallel_dims function. + * + * This wrapper allows you to append multiple dimension relations at once to a + * vector of ParallelDimMappingRecord entries. The relations must be between + * dimensions of the same pair of input and weight tensors. The function takes a + * vector of (input, weight) dimension index pairs and simply calls the main + * version of construct_weight_parallel_dims for each such pair, using the same + * values across all calls for all other parameters. + * + * This function should NOT be used to track dimension relations between the + * input and weights tensors; construct_weight_parallel_dims should be used + * instead. + * + * @param[out] records The (potentially empty) vector of existing + * ParallelDimMappingRecord entries + * @param[in] mappings A vector of integer pairs, each representing the + * indexes of the input and weight dimensions in a relation. + * @param[in] input_idx The index of the input tensor (nonzero if there are + * multiple inputs) + * @param[in] weight_idx The index of the weight tensor (nonzero if there are + * multiple weights) + * + */ void Op::construct_weight_parallel_dims( std::vector &records, std::vector> mappings, @@ -674,6 +782,30 @@ void Op::construct_weight_parallel_dims( } /*static*/ +/** + * @brief Creates a new ParallelDimMappingRecord (of the INPUT_WEIGHT + * MappingRecordType flavor) and appends it to an existing vector of + * ParallelDimMappingRecord entries. + * + * This function creates a new ParallelDimMappingRecord to track a dimension + * relation between a dimension from the input tensor and a dimension from the + * weight tensor. This function should NOT be used to track dimension relations + * between the input and output tensors; construct_output_parallel_dims should + * be used instead. 
+ * + * @param[out] records The (potentially empty) vector of existing + * ParallelDimMappingRecord entries + * @param[in] input_dim The index of the input dimension part of the + * dimension relation + * @param[in] weight_dim The index of the weight dimension part of the + * dimension relation + * @param[in] input_idx The index of the input tensor (nonzero if there are + * multiple inputs) + * @param[in] weight_idx The index of the weight tensor (nonzero if there are + * multiple weights) + * @param[in] operation The parallelization operation (partition or + * replication) associated with the dimension relation + */ void Op::construct_weight_parallel_dims( std::vector &records, int input_dim, @@ -685,12 +817,20 @@ void Op::construct_weight_parallel_dims( input_idx, input_dim, weight_idx, weight_dim, operation)); } +/** @brief Calls the corresponding version of construct_weight_parallel_dims, + * and passes the Op class's parallel_dims_mapping vector, so that the resulting + * ParallelDimMappingRecord are appended to it + */ void Op::register_weight_parallel_dims( std::vector> mappings, int input_idx, int weight_idx) { Op::construct_weight_parallel_dims( *this->parallel_dims_mapping, mappings, input_idx, weight_idx); } +/** @brief Calls the corresponding version of construct_weight_parallel_dims, + * and passes the Op class's parallel_dims_mapping vector, so that the resulting + * ParallelDimMappingRecord are appended to it + */ void Op::register_weight_parallel_dims( std::vector> mappings, int input_idx, @@ -699,6 +839,10 @@ void Op::register_weight_parallel_dims( *this->parallel_dims_mapping, mappings, input_idx, weight_idx); } +/** @brief Calls the corresponding version of construct_weight_parallel_dims, + * and passes the Op class's parallel_dims_mapping vector, so that the resulting + * ParallelDimMappingRecord are appended to it + */ void Op::register_weight_parallel_dims( int input_dim, int weight_dim, @@ -714,6 +858,39 @@ void Op::register_weight_parallel_dims( } /*static*/ +/** @brief A wrapper around the main version of the + * construct_output_parallel_dims function. + * + * This wrapper allows you to append multiple dimension relations at once to a + * vector of ParallelDimMappingRecord entries. The relations must be between + * dimensions of the same pair of input and output tensors. Unlike the other + * construct_output_parallel_dims wrapper below, this function allows you to + * specify the MappingOperation for each pair of dimensions for which you will + * be creating a new ParallelDimMappingRecord. + * + * The function takes a vector of (int, MappingOperation, int) tuples, where the + * int members represent the indexes of the two dimensions in a relation, and + * the MappingOperation member specifies the type of mapping operation. Just + * like the other wrapper, this function simply calls the main version of + * construct_output_parallel_dims for each pair, using the same values across + * all calls for all other parameters. + * + * This function should NOT be used to track dimension relations between the + * input and weights tensors; construct_weight_parallel_dims should be used + * instead. + * + * @param[out] records The (potentially empty) vector of existing + * ParallelDimMappingRecord entries + * @param[in] mappings A vector of tuples, each including a pair of + * integers (representing the indexes of the input and output dimensions in a + * relation), and a MappingOperation, specifying the mapping operation for the + * pair of dimensions. 
+ * @param[in] input_idx The index of the input tensor (nonzero if there are + * multiple inputs) + * @param[in] output_idx The index of the output tensor (nonzero if there are + * multiple outputs) + * + */ void Op::construct_output_parallel_dims( std::vector &records, std::vector> mappings, @@ -730,6 +907,30 @@ void Op::construct_output_parallel_dims( } /*static*/ +/** @brief A wrapper around the main version of the + * construct_output_parallel_dims function. + * + * This wrapper allows you to append multiple dimension relations at once to a + * vector of ParallelDimMappingRecord entries. The relations must be between + * dimensions of the same pair of input and output tensors. The function takes a + * vector of (input, output) dimension index pairs and simply calls the main + * version of construct_output_parallel_dims for each such pair, using the same + * values across all calls for all other parameters. + * + * This function should NOT be used to track dimension relations between the + * input and weights tensors; construct_weight_parallel_dims should be used + * instead. + * + * @param[out] records The (potentially empty) vector of existing + * ParallelDimMappingRecord entries + * @param[in] mappings A vector of integer pairs, each representing the + * indexes of the input and output dimensions in a relation. + * @param[in] input_idx The index of the input tensor (nonzero if there are + * multiple inputs) + * @param[in] output_idx The index of the output tensor (nonzero if there are + * multiple outputs) + * + */ void Op::construct_output_parallel_dims( std::vector &records, std::vector> mappings, @@ -742,6 +943,30 @@ void Op::construct_output_parallel_dims( } /*static*/ +/** + * @brief Creates a new ParallelDimMappingRecord (of the INPUT_OUTPUT + * MappingRecordType flavor) and appends it to an existing vector of + * ParallelDimMappingRecord entries. + * + * This function creates a new ParallelDimMappingRecord to track a dimension + * relation between a dimension from the input tensor and a dimension from the + * output tensor. This function should NOT be used to track dimension relations + * between the input and weights tensors; construct_weight_parallel_dims should + * be used instead. 
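+ *
+ * Illustrative call (all indices chosen arbitrarily; the MappingOperation
+ * value is shown only as an example):
+ * @code
+ *   std::vector<ParallelDimMappingRecord> records;
+ *   Op::construct_output_parallel_dims(records,
+ *                                      0,  // input_dim
+ *                                      0,  // output_dim
+ *                                      0,  // input_idx
+ *                                      0,  // output_idx
+ *                                      MappingOperation::PARTITION);
+ * @endcode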
+ * + * @param[out] records The (potentially empty) vector of existing + * ParallelDimMappingRecord entries + * @param[in] input_dim The index of the input dimension part of the + * dimension relation + * @param[in] output_dim The index of the output dimension part of the + * dimension relation + * @param[in] input_idx The index of the input tensor (nonzero if there are + * multiple inputs) + * @param[in] output_idx The index of the output tensor (nonzero if there are + * multiple outputs) + * @param[in] operation The parallelization operation (partition or + * replication) associated with the dimension relation + */ void Op::construct_output_parallel_dims( std::vector &records, int input_dim, @@ -753,12 +978,20 @@ void Op::construct_output_parallel_dims( input_idx, input_dim, output_idx, output_dim, operation)); } +/** @brief Calls the corresponding version of construct_output_parallel_dims, + * and passes the Op class's parallel_dims_mapping vector, so that the resulting + * ParallelDimMappingRecord are appended to it + */ void Op::register_output_parallel_dims( std::vector> mappings, int input_idx, int output_idx) { Op::construct_output_parallel_dims( *this->parallel_dims_mapping, mappings, input_idx, output_idx); } +/** @brief Calls the corresponding version of construct_output_parallel_dims, + * and passes the Op class's parallel_dims_mapping vector, so that the resulting + * ParallelDimMappingRecord are appended to it + */ void Op::register_output_parallel_dims( std::vector> mappings, int input_idx, @@ -767,6 +1000,10 @@ void Op::register_output_parallel_dims( *this->parallel_dims_mapping, mappings, input_idx, output_idx); } +/** @brief Calls the corresponding version of construct_output_parallel_dims, + * and passes the Op class's parallel_dims_mapping vector, so that the resulting + * ParallelDimMappingRecord are appended to it + */ void Op::register_output_parallel_dims( int input_dim, int output_dim, @@ -975,6 +1212,50 @@ void Op::set_argumentmap_for_init(FFModel const &ff, ArgumentMap &argmap) { } } +void Op::set_argumentmap_for_init_inference(FFModel const &ff, + ArgumentMap &argmap, + ParallelTensor const output0) { + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + Domain domain = runtime->get_index_space_domain(ctx, this->parallel_is); + MachineView const view = output0->machine_view; + assert(ff.config.computationMode == COMP_MODE_INFERENCE); + switch (domain.get_dim()) { +#ifdef FF_USE_NCCL +#define DIMFUNC(DIM) \ + case DIM: { \ + Rect rect = domain; \ + int idx = 0; \ + for (PointInRectIterator it(rect); it(); it++) { \ + FFHandler handle = ff.handlers[view.get_device_id(*it)]; \ + if (op_type == OP_ALLREDUCE) { \ + ncclComm_t *nccl_comms = ff.find_nccl_comms(view); \ + handle.ncclComm = nccl_comms[idx++]; \ + } \ + argmap.set_point(*it, TaskArgument(&handle, sizeof(FFHandler))); \ + } \ + break; \ + } + LEGION_FOREACH_N(DIMFUNC) +#undef DIMFUNC +#else +#define DIMFUNC(DIM) \ + case DIM: { \ + Rect rect = domain; \ + for (PointInRectIterator it(rect); it(); it++) { \ + FFHandler handle = ff.handlers[view.get_device_id(*it)]; \ + argmap.set_point(*it, TaskArgument(&handle, sizeof(FFHandler))); \ + } \ + break; \ + } + LEGION_FOREACH_N(DIMFUNC) +#undef DIMFUNC +#endif + default: + assert(false); + } +} + void Op::set_opmeta_from_futuremap(FFModel const &ff, FutureMap const &fm) { Context ctx = ff.config.lg_ctx; Runtime *runtime = ff.config.lg_hlr; @@ -996,6 +1277,29 @@ void Op::set_opmeta_from_futuremap(FFModel const &ff, FutureMap const &fm) { } } 
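+
+// Rough sketch of how an operator's init_inference() is expected to use the
+// *_inference helpers in this file, mirroring the launcher pattern used by
+// the inference runtime (MY_INIT_TASK_ID and the TaskArgument payload are
+// placeholders; variable names follow init_operators_inference above):
+//
+//   ArgumentMap argmap;
+//   set_argumentmap_for_init_inference(ff, argmap, outputs[0]);
+//   IndexLauncher launcher(MY_INIT_TASK_ID,
+//                          parallel_is,
+//                          TaskArgument(this, sizeof(*this)),
+//                          argmap,
+//                          Predicate::TRUE_PRED,
+//                          false /*must*/,
+//                          0 /*mapper_id*/,
+//                          outputs[0]->machine_view.hash());
+//   FutureMap fm = runtime->execute_index_space(ctx, launcher);
+//   fm.wait_all_results();
+//   set_opmeta_from_futuremap_inference(ff, fm, outputs[0]);
+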
+void Op::set_opmeta_from_futuremap_inference(FFModel const &ff, + FutureMap const &fm, + ParallelTensor const output) { + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + Domain domain = runtime->get_index_space_domain(ctx, parallel_is); + switch (domain.get_dim()) { +#define DIMFUNC(DIM) \ + case DIM: { \ + Rect rect = domain; \ + int idx = 0; \ + for (PointInRectIterator it(rect); it(); it++) { \ + inference_meta[output][idx++] = fm.get_result(*it); \ + } \ + break; \ + } + LEGION_FOREACH_N(DIMFUNC) +#undef DIMFUNC + default: + assert(false); + } +} + void Op::set_argumentmap_for_forward(FFModel const &ff, ArgumentMap &argmap) { Context ctx = ff.config.lg_ctx; Runtime *runtime = ff.config.lg_hlr; @@ -1018,6 +1322,30 @@ void Op::set_argumentmap_for_forward(FFModel const &ff, ArgumentMap &argmap) { } } +void Op::set_argumentmap_for_inference(FFModel const &ff, + ArgumentMap &argmap, + ParallelTensor const output) { + Context ctx = ff.config.lg_ctx; + Runtime *runtime = ff.config.lg_hlr; + Domain domain = runtime->get_index_space_domain(ctx, parallel_is); + switch (domain.get_dim()) { +#define DIMFUNC(DIM) \ + case DIM: { \ + Rect rect = domain; \ + int idx = 0; \ + for (PointInRectIterator it(rect); it(); it++) { \ + OpMeta *mp = inference_meta[output][idx++]; \ + argmap.set_point(*it, TaskArgument(&mp, sizeof(OpMeta *))); \ + } \ + break; \ + } + LEGION_FOREACH_N(DIMFUNC) +#undef DIMFUNC + default: + assert(false); + } +} + void Op::set_argumentmap_for_backward(FFModel const &ff, ArgumentMap &argmap) { Context ctx = ff.config.lg_ctx; Runtime *runtime = ff.config.lg_hlr; @@ -1157,37 +1485,9 @@ OpMeta::OpMeta(FFHandler _handle, Op const *op) : OpMeta(_handle) { } } -FFModel::FFModel(FFConfig &_config) - : op_global_guid(OP_GUID_FIRST_VALID), - layer_global_guid(LAYER_GUID_FIRST_VALID), - tensor_global_guid(TENSOR_GUID_FIRST_VALID), - parallel_tensor_global_guid(PARALLEL_TENSOR_GUID_FIRST_VALID), - node_global_guid(NODE_GUID_FIRST_VALID), config(_config), optimizer(NULL), - loss_op(NULL), metrics_op(NULL), simulator(NULL) { - this->search = new PCG::SearchHelper(this); - this->graph_search = new PCG::GraphSearchHelper(this); - +FFRuntime::FFRuntime(FFConfig &config) { Runtime *runtime = config.lg_hlr; Context ctx = config.lg_ctx; - // Register machine views - register_all_machine_views(config.numNodes, - config.workersPerNode, - config.cpusPerNode, - all_valid_views); - metrics_input = -1; - // Load strategy file - // Create field space - { - FieldAllocator allocator = - runtime->create_field_allocator(ctx, config.field_space); - allocator.allocate_field(sizeof(float), FID_DATA); - } - // Build training dataset - // if (config.datasetPath.length() == 0) { - // dataLoader = NULL; - //} else { - // dataLoader = new DataLoader(config.datasetPath); - //} ArgumentMap argmap; Rect<1> task_rect(Point<1>(0), @@ -1200,6 +1500,9 @@ FFModel::FFModel(FFConfig &_config) // info.myRank = rank++; // info.allRanks = config.workersPerNode * config.numNodes; info.workSpaceSize = config.workSpaceSize; + info.offload_reserve_space_size = + config.cpu_offload ? 
config.offload_reserve_space_size : 0; + info.quantization_type = config.quantization_type; info.allowTensorOpMathConversion = config.allow_tensor_op_math_conversion; argmap.set_point(*it, TaskArgument(&info, sizeof(FFInitInfo))); } @@ -1221,6 +1524,50 @@ FFModel::FFModel(FFConfig &_config) } } +FFRuntime *ffruntime_singleton = nullptr; + +FFModel::FFModel(FFConfig &_config, bool cpu_offload) + : op_global_guid(OP_GUID_FIRST_VALID), + layer_global_guid(LAYER_GUID_FIRST_VALID), + tensor_global_guid(TENSOR_GUID_FIRST_VALID), + parallel_tensor_global_guid(PARALLEL_TENSOR_GUID_FIRST_VALID), + node_global_guid(NODE_GUID_FIRST_VALID), current_transformer_layer_id(0), + config(_config), optimizer(NULL), loss_op(NULL), metrics_op(NULL), + simulator(NULL) { + this->search = new PCG::SearchHelper(this); + this->graph_search = new PCG::GraphSearchHelper(this); + this->cpu_offload = cpu_offload; + + if (ffruntime_singleton == nullptr) { + ffruntime_singleton = new FFRuntime(_config); + } + + Runtime *runtime = config.lg_hlr; + Context ctx = config.lg_ctx; + // Register machine views + register_all_machine_views(config.numNodes, + config.workersPerNode, + config.cpusPerNode, + all_valid_views); + metrics_input = -1; + // Load strategy file + // Create field space + //{ + // FieldAllocator allocator = + // runtime->create_field_allocator(ctx, config.field_space); + // allocator.allocate_field(sizeof(float), FID_DATA); + //} + // Build training dataset + // if (config.datasetPath.length() == 0) { + // dataLoader = NULL; + //} else { + // dataLoader = new DataLoader(config.datasetPath); + //} + for (int idx = 0; idx < config.workersPerNode * config.numNodes; idx++) { + handlers[idx] = ffruntime_singleton->handlers[idx]; + } +} + void FFModel::clear_graph_search_cache() { this->graph_search->clear_cache(); this->search->clear_cache(); @@ -1231,7 +1578,7 @@ ncclComm_t *FFModel::find_nccl_comms(MachineView const &view) const { auto const &it = view_hash_to_nccl_comms.find(view.hash()); if (it == view_hash_to_nccl_comms.end()) { assert(config.computationMode == COMP_MODE_INFERENCE); - return NULL; + return nullptr; } else { return it->second; } @@ -1487,6 +1834,7 @@ ParallelParameter FFModel::create_parallel_weight(const ParallelDim dims[], for (int i = 0; i < NDIM; i++) { p->dims[i] = dims[NDIM - 1 - i]; } + assert(p->get_volume() > 0); assert(p->check_valid()); return p; @@ -1601,6 +1949,12 @@ void FFModel::map_tensor_with_dim2(ParallelTensor tensor, case DT_INT64: allocator.allocate_field(sizeof(int64_t), FID_DATA); break; + case DT_INT4: + allocator.allocate_field(sizeof(char), FID_DATA); + break; + case DT_INT8: + allocator.allocate_field(sizeof(char), FID_DATA); + break; default: assert(false); } @@ -1649,8 +2003,10 @@ void FFModel::map_tensor_with_dim2(ParallelTensor tensor, runtime->get_logical_partition(ctx, tensor->region_grad, ip); } } - // Step 3: initialize the tensor - if (tensor->initializer != NULL) { + // Step 3: initialize the tensor; don't randomly initialize weights + // for inference + if (tensor->initializer != NULL && + config.computationMode == COMP_MODE_TRAINING) { tensor->initializer->init(this, tensor); } } @@ -1687,6 +2043,7 @@ void FFModel::map_weight_with_dim(ParallelTensor weight, switch (parallel_op->op_type) { case OP_LINEAR: case OP_EMBEDDING: + case OP_EXPERTS: case OP_MULTIHEAD_ATTENTION: { switch (tdim) { #define DIMFUNC(TDIM) \ @@ -2503,9 +2860,10 @@ bool FFModel::apply_fusion(std::vector const &operators, operators[l]->op_type == OP_WEIGHT) { continue; } - // don't fuse 
parallel op since they have different parallel_is in - // forward/backward - if (operators[l]->is_parallel_op()) { + // don't fuse parallel op except allReduce since they have different + // parallel_is in forward/backward + if (operators[l]->is_parallel_op() && + operators[l]->op_type != OP_ALLREDUCE) { continue; } size_t start = 0; @@ -2548,9 +2906,10 @@ bool FFModel::apply_fusion(std::vector const &operators, operators[i]->op_type == OP_WEIGHT) { continue; } - // don't fuse parallel op since they have different parallel_is in - // forward/backward - if (operators[i]->is_parallel_op()) { + // don't fuse parallel op except allReduce since they have different + // parallel_is in forward/backward + if (operators[i]->is_parallel_op() && + operators[i]->op_type != OP_ALLREDUCE) { continue; } fused_op = new FusedOp(*this, operators[i]); @@ -2635,7 +2994,8 @@ Op *FFModel::create_operator_from_layer( assert(tensor->parallel_tensor == nullptr); tensor->parallel_tensor = pt; // start from data parllel tensor - if (config.only_data_parallel) { + if (config.only_data_parallel && + config.computationMode == COMP_MODE_TRAINING) { Repartition *part = new Repartition( *this, pt, num_dims - 1, config.numNodes * config.workersPerNode); operators.push_back(part); @@ -2648,6 +3008,24 @@ Op *FFModel::create_operator_from_layer( operators.push_back(op); return op; } + case OP_SPEC_INC_MULTIHEAD_SELF_ATTENTION: { + Op *op = SpecIncMultiHeadSelfAttention::create_operator_from_layer( + *this, layer, inputs); + operators.push_back(op); + return op; + } + case OP_INC_MULTIHEAD_SELF_ATTENTION: { + Op *op = IncMultiHeadSelfAttention::create_operator_from_layer( + *this, layer, inputs); + operators.push_back(op); + return op; + } + case OP_TREE_INC_MULTIHEAD_SELF_ATTENTION: { + Op *op = TreeIncMultiHeadSelfAttention::create_operator_from_layer( + *this, layer, inputs); + operators.push_back(op); + return op; + } case OP_BATCHMATMUL: { Op *op = BatchMatmul::create_operator_from_layer(*this, layer, inputs); operators.push_back(op); @@ -2722,6 +3100,11 @@ Op *FFModel::create_operator_from_layer( operators.push_back(op); return op; } + case OP_RMS_NORM: { + Op *op = RMSNorm::create_operator_from_layer(*this, layer, inputs); + operators.push_back(op); + return op; + } case OP_LINEAR: { Op *op = Linear::create_operator_from_layer(*this, layer, inputs); operators.push_back(op); @@ -2762,6 +3145,26 @@ Op *FFModel::create_operator_from_layer( operators.push_back(op); return op; } + case OP_ARG_TOPK: { + Op *op = ArgTopK::create_operator_from_layer(*this, layer, inputs); + operators.push_back(op); + return op; + } + case OP_BEAM_TOPK: { + Op *op = BeamTopK::create_operator_from_layer(*this, layer, inputs); + operators.push_back(op); + return op; + } + case OP_SAMPLING: { + Op *op = Sampling::create_operator_from_layer(*this, layer, inputs); + operators.push_back(op); + return op; + } + case OP_ARGMAX: { + Op *op = ArgMax::create_operator_from_layer(*this, layer, inputs); + operators.push_back(op); + return op; + } case OP_GROUP_BY: { Op *op = Group_by::create_operator_from_layer(*this, layer, inputs); operators.push_back(op); @@ -2777,6 +3180,11 @@ Op *FFModel::create_operator_from_layer( operators.push_back(op); return op; } + case OP_EXPERTS: { + Op *op = Experts::create_operator_from_layer(*this, layer, inputs); + operators.push_back(op); + return op; + } default: assert(false); } @@ -2784,7 +3192,9 @@ Op *FFModel::create_operator_from_layer( void FFModel::create_operators_from_layers() { std::map 
tensors_to_parallel_tensors; - for (auto const &l : layers) { + // for (auto const &l : layers) { + for (int layer_idx = 0; layer_idx < layers.size(); layer_idx++) { + auto const &l = layers[layer_idx]; std::vector inputs; for (int i = 0; i < l->numInputs; i++) { // create new input tensors @@ -2792,7 +3202,109 @@ void FFModel::create_operators_from_layers() { tensors_to_parallel_tensors.end()); inputs.push_back(tensors_to_parallel_tensors[l->inputs[i]]); } - Op *op = create_operator_from_layer(l, inputs); + Op *op = nullptr; + // add a combine before arg_topk + if (config.computationMode == COMP_MODE_INFERENCE && + config.tensor_parallelism_degree > 1 && + (l->op_type == OP_ARG_TOPK || l->op_type == OP_SOFTMAX || + l->op_type == OP_ARGMAX)) { + std::vector partitioned_inputs; + assert(inputs.size() == 1); + Combine *comb = new Combine(*this, + inputs[0], + 0 /*inner most dim*/, + config.tensor_parallelism_degree); + partitioned_inputs.push_back(comb->outputs[0]); + operators.push_back(comb); + op = create_operator_from_layer(l, partitioned_inputs); + } else { + op = create_operator_from_layer(l, inputs); + } + // add replicate operators after op if needed + if (config.computationMode == COMP_MODE_INFERENCE && + config.tensor_parallelism_degree > 1 && l->op_type == OP_EMBEDDING) { + assert(op->numOutputs == 1); + Replicate *repl = new Replicate(*this, + op->outputs[0], + op->outputs[0]->num_dims - 1, + config.tensor_parallelism_degree); + operators.push_back(repl); + op = repl; + } else if (config.computationMode == COMP_MODE_INFERENCE && + config.tensor_parallelism_degree > 1 && + (l->op_type == OP_INC_MULTIHEAD_SELF_ATTENTION || + l->op_type == OP_TREE_INC_MULTIHEAD_SELF_ATTENTION || + (l->op_type == OP_LINEAR && layer_idx >= 2 && + layers[layer_idx - 1]->op_type == OP_RELU && + layers[layer_idx - 2]->op_type == OP_LINEAR) || + (l->op_type == OP_LINEAR && layer_idx >= 5 && + layers[layer_idx - 1]->op_type == OP_EW_MUL && + layers[layer_idx - 2]->op_type == OP_EW_MUL && + layers[layer_idx - 3]->op_type == OP_SIGMOID && + layers[layer_idx - 4]->op_type == OP_LINEAR && + layers[layer_idx - 5]->op_type == OP_LINEAR))) { + assert(op->numOutputs == 1); + AllReduce *allreduce = + new AllReduce(*this, op->outputs[0], op->outputs[0]->num_dims - 1); + operators.push_back(allreduce); + op = allreduce; + } +#ifdef DEADCODE + if (config.computationMode == COMP_MODE_INFERENCE && + config.tensor_parallelism_degree > 1 && + (l->op_type == OP_INC_MULTIHEAD_SELF_ATTENTION || + l->op_type == OP_TREE_INC_MULTIHEAD_SELF_ATTENTION || + (l->op_type == OP_LINEAR && layer_idx + 3 <= layers.size() && + layers[layer_idx + 1]->op_type == OP_RELU && + layers[layer_idx + 2]->op_type == OP_LINEAR) || + (l->op_type == OP_LINEAR && layer_idx + 6 <= layers.size() && + layers[layer_idx + 1]->op_type == OP_LINEAR && + layers[layer_idx + 2]->op_type == OP_SIGMOID && + layers[layer_idx + 3]->op_type == OP_EW_MUL && + layers[layer_idx + 4]->op_type == OP_EW_MUL && + layers[layer_idx + 5]->op_type == OP_LINEAR) || + (l->op_type == OP_LINEAR && layer_idx + 5 <= layers.size() && + layer_idx >= 1 && layers[layer_idx - 1]->op_type == OP_LINEAR && + layers[layer_idx + 1]->op_type == OP_SIGMOID && + layers[layer_idx + 2]->op_type == OP_EW_MUL && + layers[layer_idx + 3]->op_type == OP_EW_MUL && + layers[layer_idx + 4]->op_type == OP_LINEAR))) { + std::vector partitioned_inputs; + assert(inputs.size() == 1); + Replicate *repl = new Replicate(*this, + inputs[0], + inputs[0]->num_dims - 1, + config.tensor_parallelism_degree); + 
partitioned_inputs.push_back(repl->outputs[0]); + operators.push_back(repl); + op = create_operator_from_layer(l, partitioned_inputs); + } else { + op = create_operator_from_layer(l, inputs); + } + // Op *op = create_operator_from_layer(l, inputs); + // add reduce operators if needed + if (config.computationMode == COMP_MODE_INFERENCE && + config.tensor_parallelism_degree > 1 && + (l->op_type == OP_INC_MULTIHEAD_SELF_ATTENTION || + l->op_type == OP_TREE_INC_MULTIHEAD_SELF_ATTENTION || + (l->op_type == OP_LINEAR && layer_idx >= 2 && + layers[layer_idx - 1]->op_type == OP_RELU && + layers[layer_idx - 2]->op_type == OP_LINEAR) || + (l->op_type == OP_LINEAR && layer_idx >= 5 && + layers[layer_idx - 1]->op_type == OP_EW_MUL && + layers[layer_idx - 2]->op_type == OP_EW_MUL && + layers[layer_idx - 3]->op_type == OP_SIGMOID && + layers[layer_idx - 4]->op_type == OP_LINEAR && + layers[layer_idx - 5]->op_type == OP_LINEAR))) { + assert(op->numOutputs == 1); + Reduction *reduct = new Reduction(*this, + op->outputs[0], + op->outputs[0]->num_dims - 1, + config.tensor_parallelism_degree); + operators.push_back(reduct); + op = reduct; + } +#endif assert(op->numOutputs == l->numOutputs); for (int i = 0; i < op->numOutputs; i++) { tensors_to_parallel_tensors[l->outputs[i]] = op->outputs[i]; @@ -2861,6 +3373,7 @@ void FFModel::compile(LossType loss_type, ParallelTensor parallel_weight = nullptr; for (auto const &op : operators) { if (op->layer_guid == layer->layer_guid) { + std::cout << "opopop: " << op->name << "\n"; assert(op->op_type == layer->op_type); assert(op->numWeights == layer->numWeights); parallel_weight = op->weights[i]; @@ -2920,6 +3433,7 @@ void FFModel::compile(LossType loss_type, for (size_t l = 0; l < operators.size(); l++) { Op *op = operators[l]; + for (int i = 0; i < op->numInputs; i++) { assert(op->inputs[i]->owner_op != NULL); } @@ -2928,13 +3442,16 @@ void FFModel::compile(LossType loss_type, assert(op->weights[i]->region != LogicalRegion::NO_REGION); parameters.push_back(op->weights[i]); } + op->map_output_tensors(*this); // for (int i = 0; i < op->numOutputs; i++) { // // Output tensor // map_tensor(op->outputs[i], op); // } - if (op->is_parallel_op()) { - ((ParallelOp *)op)->create_input_partition(*this); + if (config.computationMode == COMP_MODE_TRAINING) { + if (op->is_parallel_op()) { + ((ParallelOp *)op)->create_input_partition(*this); + } } // op->map_output_tensors(*this); } @@ -3035,7 +3552,7 @@ void FFModel::compile(LossType loss_type, operators[i]->op_guid); for (int j = 0; j < op->numInputs; j++) { LogicalRegion handle = op->inputs[j]->region; - printf("inputs[%d] region(%d,%d,%d)\n", + printf("\tinputs[%d] region(%d,%d,%d)\n", j, handle.get_index_space().get_id(), handle.get_field_space().get_id(), @@ -3043,7 +3560,7 @@ void FFModel::compile(LossType loss_type, } for (int j = 0; j < op->numOutputs; j++) { LogicalRegion handle = op->outputs[j]->region; - printf("outputs[%d] region(%d,%d,%d)\n", + printf("\toutputs[%d] region(%d,%d,%d)\n", j, handle.get_index_space().get_id(), handle.get_field_space().get_id(), @@ -3051,7 +3568,7 @@ void FFModel::compile(LossType loss_type, } for (int j = 0; j < op->numWeights; j++) { LogicalRegion handle = op->weights[j]->region; - printf("weights[%d] region(%d,%d,%d)\n", + printf("\tweights[%d] region(%d,%d,%d)\n", j, handle.get_index_space().get_id(), handle.get_field_space().get_id(), @@ -3064,22 +3581,22 @@ void FFModel::compile(LossType loss_type, assert(final_operator->numOutputs == 1); for (size_t i = 0; i < operators.size(); i++) 
{ Op *op = operators[i]; - printf("operator[%zu]: type(%d)\n", i, operators[i]->op_type); + log_model.print("operator[%zu]: type(%d)", i, operators[i]->op_type); for (int j = 0; j < op->numInputs; j++) { LogicalRegion handle = op->inputs[j]->region; - printf("inputs[%d] region(%d,%d,%d)\n", - j, - handle.get_index_space().get_id(), - handle.get_field_space().get_id(), - handle.get_tree_id()); + log_model.print("\tinputs[%d] region(%d,%d,%d)", + j, + handle.get_index_space().get_id(), + handle.get_field_space().get_id(), + handle.get_tree_id()); } for (int j = 0; j < op->numOutputs; j++) { LogicalRegion handle = op->outputs[j]->region; - printf("outputs[%d] region(%d,%d,%d)\n", - j, - handle.get_index_space().get_id(), - handle.get_field_space().get_id(), - handle.get_tree_id()); + log_model.print("\toutputs[%d] region(%d,%d,%d)", + j, + handle.get_index_space().get_id(), + handle.get_field_space().get_id(), + handle.get_tree_id()); } } // assert(final_operator->outputs[0].num_dims == 2); @@ -3122,18 +3639,17 @@ void FFModel::compile(LossType loss_type, assert(false && "Unsupported dim"); } } - // init optimizer - assert(optimizer != NULL); - optimizer->init(); + if (config.computationMode == COMP_MODE_TRAINING) { + // init optimizer + assert(optimizer != NULL); + optimizer->init(); + } #ifdef FF_USE_NCCL - if (config.computationMode == COMP_MODE_TRAINING) { - // init all nccl communicators - for (size_t l = 0; l < operators.size(); l++) { - // Only create nccl for weights - if (operators[l]->op_type != OP_WEIGHT) { - continue; - } + for (size_t l = 0; l < operators.size(); l++) { + // Only create nccl for weights in training + if ((operators[l]->op_type == OP_WEIGHT && + config.computationMode == COMP_MODE_TRAINING)) { MachineView view = operators[l]->outputs[0]->machine_view; if (view_hash_to_nccl_comms.find(view.hash()) == view_hash_to_nccl_comms.end()) { @@ -3480,10 +3996,13 @@ struct DefaultConfig { const static int cpusPerNode = 0; const static size_t searchBudget = -1; const static size_t simulatorWorkSpaceSize = - (size_t)2 * 1024 * 1024 * 1024; // 2GB + (size_t)2 * 1024 * 1024 * 1024; // 2 GB constexpr static float searchAlpha = 1.2f; const static bool searchOverlapBackwardUpdate = false; - const static bool onlyDataParallel = false; + const static size_t offloadReserveSpaceSize = + (size_t)8 * 1024 * 1024 * 1024; // 8 GB + const static bool cpuOffload = false; + const static bool onlyDataParallel = true; const static bool enableSampleParallel = true; const static bool enableParameterParallel = false; const static bool enableAttributeParallel = false; @@ -3514,7 +4033,13 @@ FFConfig::FFConfig() { search_alpha = DefaultConfig::searchAlpha; search_overlap_backward_update = DefaultConfig::searchOverlapBackwardUpdate; computationMode = COMP_MODE_TRAINING; + cpu_offload = DefaultConfig::cpuOffload; + offload_reserve_space_size = DefaultConfig::offloadReserveSpaceSize; + quantization_type = DT_NONE; only_data_parallel = DefaultConfig::onlyDataParallel; + data_parallelism_degree = 1; + tensor_parallelism_degree = 1; + pipeline_parallelism_degree = 1; enable_sample_parallel = DefaultConfig::enableSampleParallel; enable_parameter_parallel = DefaultConfig::enableParameterParallel; enable_attribute_parallel = DefaultConfig::enableAttributeParallel; @@ -3560,7 +4085,7 @@ FFConfig::FFConfig() { Runtime *runtime = Runtime::get_runtime(); lg_hlr = runtime; lg_ctx = Runtime::get_context(); - field_space = runtime->create_field_space(lg_ctx); + // field_space = runtime->create_field_space(lg_ctx); 
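+  // The parallelism degrees above default to 1 and can be overridden on the
+  // command line via the flags handled in FFConfig::parse_args below, e.g.
+  // (binary name and values illustrative only):
+  //   ./inference_app -tensor-parallelism-degree 4 \
+  //       -pipeline-parallelism-degree 2 -offload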
} void FFConfig::parse_args(char **argv, int argc) { @@ -3616,10 +4141,41 @@ void FFConfig::parse_args(char **argv, int argc) { export_strategy_file = std::string(argv[++i]); continue; } + if ((!strcmp(argv[i], "-offload"))) { + cpu_offload = true; + continue; + } + if (!strcmp(argv[i], "-offload-reserve-space-size")) { + offload_reserve_space_size = atoll(argv[++i]) * 1024 * 1024; + continue; + } + if ((!strcmp(argv[i], "--4bit-quantization"))) { + quantization_type = DT_INT4; + continue; + } + if ((!strcmp(argv[i], "--8bit-quantization"))) { + quantization_type = DT_INT8; + continue; + } if ((!strcmp(argv[i], "--only-data-parallel"))) { only_data_parallel = true; continue; } + // data parallelism degree + if (!strcmp(argv[i], "-data-parallelism-degree")) { + data_parallelism_degree = std::stoi(argv[++i]); + continue; + } + // tensor parallelism degree + if (!strcmp(argv[i], "-tensor-parallelism-degree")) { + tensor_parallelism_degree = std::stoi(argv[++i]); + continue; + } + // pipeline parallelism degree + if (!strcmp(argv[i], "-pipeline-parallelism-degree")) { + pipeline_parallelism_degree = std::stoi(argv[++i]); + continue; + } if ((!strcmp(argv[i], "--enable-parameter-parallel"))) { enable_parameter_parallel = true; continue; @@ -3752,56 +4308,174 @@ void register_flexflow_internal_tasks(Runtime *runtime, registrar); } } - // ElementUnary task + // RequestManager load_tokens { - TaskVariantRegistrar registrar(ELEMENTUNARY_INIT_TASK_ID, - "ElementWiseUnary Init"); + TaskVariantRegistrar registrar(RM_LOAD_TOKENS_TASK_ID, + "RequestManager Load Tokens"); registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); registrar.set_leaf(); if (pre_register) { - Runtime::preregister_task_variant( - registrar, "ElementWiseUnary Init Task"); + Runtime::preregister_task_variant( + registrar, "RequestManager Load Tokens Task"); } else { if (enable_control_replication) { registrar.global_registration = false; } - runtime->register_task_variant( + runtime->register_task_variant( registrar); } } + // RequestManager load position tokens { - TaskVariantRegistrar registrar(ELEMENTUNARY_FWD_TASK_ID, - "ElementWiseUnary Forward"); + TaskVariantRegistrar registrar(RM_LOAD_POSITION_TASK_ID, + "RequestManager Load Position tokens"); registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); registrar.set_leaf(); if (pre_register) { - Runtime::preregister_task_variant( - registrar, "ElementWiseUnary Forward Task"); + Runtime::preregister_task_variant( + registrar, "RequestManager Load Position Tokens Task"); } else { if (enable_control_replication) { registrar.global_registration = false; } - runtime->register_task_variant(registrar); + runtime->register_task_variant( + registrar); } } + // RequestManager prepare_next_batch { - TaskVariantRegistrar registrar(ELEMENTUNARY_BWD_TASK_ID, - "ElementWiseUnary Backward"); - registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + TaskVariantRegistrar registrar(RM_PREPARE_NEXT_BATCH_TASK_ID, + "RequestManager Prepare Next Batch"); + registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC)); registrar.set_leaf(); if (pre_register) { - Runtime::preregister_task_variant( - registrar, "ElementWiseUnary Backward Task"); + Runtime::preregister_task_variant< + BatchConfig, + RequestManager::prepare_next_batch_task>( + registrar, "RequestManager Prepare Next Batch Task"); } else { if (enable_control_replication) { registrar.global_registration = false; } - runtime->register_task_variant(registrar); + runtime->register_task_variant( + 
registrar); } } - // ElementBinary task + // RequestManager prepare_next_batch_beam { - TaskVariantRegistrar registrar(ELEMENTBINARY_INIT_TASK_ID, + TaskVariantRegistrar registrar(RM_PREPARE_NEXT_BATCH_BEAM_TASK_ID, + "RequestManager Prepare Next Batch (Beam)"); + registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant< + BeamSearchBatchConfig, + RequestManager::prepare_next_batch_beam_task>( + registrar, "RequestManager Prepare Next Batch (Beam) Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime + ->register_task_variant( + registrar); + } + } + // RequestManager prepare_next_batch_init + { + TaskVariantRegistrar registrar( + RM_PREPARE_NEXT_BATCH_INIT_TASK_ID, + "RequestManager Prepare Next Batch (Init Beam)"); + registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant< + BeamSearchBatchConfig, + RequestManager::prepare_next_batch_init_task>( + registrar, "RequestManager Prepare Next Batch (Init Beam) Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime + ->register_task_variant( + registrar); + } + } + // RequestManager prepare_next_batch_verify + { + TaskVariantRegistrar registrar( + RM_PREPARE_NEXT_BATCH_VERIFY_TASK_ID, + "RequestManager Prepare Next Batch (Verify)"); + registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant< + TreeVerifyBatchConfig, + RequestManager::prepare_next_batch_verify_task>( + registrar, "RequestManager Prepare Next Batch (Verify) Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant< + TreeVerifyBatchConfig, + RequestManager::prepare_next_batch_verify_task>(registrar); + } + } + // ElementUnary task + { + TaskVariantRegistrar registrar(ELEMENTUNARY_INIT_TASK_ID, + "ElementWiseUnary Init"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "ElementWiseUnary Init Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant( + registrar); + } + } + { + TaskVariantRegistrar registrar(ELEMENTUNARY_FWD_TASK_ID, + "ElementWiseUnary Forward"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "ElementWiseUnary Forward Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } + { + TaskVariantRegistrar registrar(ELEMENTUNARY_BWD_TASK_ID, + "ElementWiseUnary Backward"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "ElementWiseUnary Backward Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } + // ElementBinary task + { + TaskVariantRegistrar registrar(ELEMENTBINARY_INIT_TASK_ID, "ElementWiseBinary Init"); registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); 
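// The registration blocks in this function all follow the same pattern; a
// condensed sketch (mirroring the surrounding code, not adding behavior):
//
//   TaskVariantRegistrar registrar(SOME_TASK_ID, "Task Name");
//   registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); // GPU; LOC_PROC = CPU
//   registrar.set_leaf();
//   if (pre_register) {
//     // static registration before the Legion runtime starts
//     Runtime::preregister_task_variant<ReturnT, TaskClass::task_body>(registrar, "Task Name");
//   } else {
//     // dynamic registration; under control replication each shard registers locally
//     if (enable_control_replication) { registrar.global_registration = false; }
//     runtime->register_task_variant<ReturnT, TaskClass::task_body>(registrar);
//   }
//
// The RequestManager prepare_next_batch* tasks above run on LOC_PROC (CPU)
// and return BatchConfig-family values; most operator tasks run on TOC_PROC.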
registrar.set_leaf(); @@ -3846,6 +4520,63 @@ void register_flexflow_internal_tasks(Runtime *runtime, runtime->register_task_variant(registrar); } } + // Experts + { + TaskVariantRegistrar registrar(EXPERTS_INIT_TASK_ID, "Experts Init"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "Experts Init Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } + { + TaskVariantRegistrar registrar(EXPERTS_FWD_TASK_ID, "Experts Forward"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "Experts Forward Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } + { + TaskVariantRegistrar registrar(EXPERTS_BWD_TASK_ID, "Experts Backward"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "Experts Backward Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } + { + TaskVariantRegistrar registrar(EXPERTS_INF_TASK_ID, "Experts Inference"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "Experts Inference Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } // Cast { TaskVariantRegistrar registrar(CAST_INIT_TASK_ID, "Cast Init"); @@ -4426,6 +5157,35 @@ void register_flexflow_internal_tasks(Runtime *runtime, runtime->register_task_variant(registrar); } } + // rms norm task + { + TaskVariantRegistrar registrar(RMSNROM_INIT_TASK_ID, "rmsnorm_init_task"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "rmsnorm_init_task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } + { + TaskVariantRegistrar registrar(RMSNROM_FWD_TASK_ID, "rmsnorm_fwd_task"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "rmsnorm_fwd_task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } { TaskVariantRegistrar registrar(LAYERNORM_BWD_TASK_ID, "layernorm_bwd_task"); registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); @@ -4455,6 +5215,20 @@ void register_flexflow_internal_tasks(Runtime *runtime, runtime->register_task_variant(registrar); } } + { + TaskVariantRegistrar registrar(LINEAR_INF_TASK_ID, "Linear Inference"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "Linear Inference Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } { 
TaskVariantRegistrar registrar(LINEAR_FWD_TASK_ID, "Linear Forward"); registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); @@ -4569,6 +5343,22 @@ void register_flexflow_internal_tasks(Runtime *runtime, runtime->register_task_variant(registrar); } } + { + TaskVariantRegistrar registrar(SOFTMAX_INF_TASK_ID, "softmax_inf_task"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "softmax_inf_task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant( + registrar); + } + } // compute Loss { TaskVariantRegistrar registrar(LOSS_BWD_TASK_ID, "Loss Backward"); @@ -4883,6 +5673,149 @@ void register_flexflow_internal_tasks(Runtime *runtime, runtime->register_task_variant(registrar); } } + // ArgTopk task + { + TaskVariantRegistrar registrar(ARG_TOPK_INIT_TASK_ID, "ArgTopK Init"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "ArgTopK Init Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } + { + TaskVariantRegistrar registrar(ARG_TOPK_INF_TASK_ID, "ArgTopK Inference"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "ArgTopK Inference Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant( + registrar); + } + } + // BeamTopk task + { + TaskVariantRegistrar registrar(BEAM_TOPK_INIT_TASK_ID, "BeamTopK Init"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "BeamTopK Init Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } + { + TaskVariantRegistrar registrar(BEAM_TOPK_INF_TASK_ID, "BeamTopK Inference"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "BeamTopK Inference Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } + // Sampling task + { + TaskVariantRegistrar registrar(SAMPLING_INIT_TASK_ID, "Sampling Init"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "Sampling Init Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } + { + TaskVariantRegistrar registrar(SAMPLING_INF_TASK_ID, "Sampling Inference"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "Sampling Inference Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant( + registrar); + } + } + // ArgMax task + { + TaskVariantRegistrar registrar(ARGMAX_INIT_TASK_ID, "ArgMax Init"); + 
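// The preceding blocks register the token-selection operators added for the
// inference path: ArgTopK, Sampling and ArgMax pick the next token(s) during
// incremental decoding, while BeamTopK keeps the top candidates per beam for
// the speculative path; ArgMax additionally gets separate "Beam" and "Norm"
// inference variants below. (Summary inferred from the task names and the
// RequestManager code later in this patch.)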
registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "ArgMax Init Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } + { + TaskVariantRegistrar registrar(ARGMAX_BEAM_INF_TASK_ID, + "ArgMax Beam Inference"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "ArgMax Inference Task Beam"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } + { + TaskVariantRegistrar registrar(ARGMAX_NORM_INF_TASK_ID, + "ArgMax Norm Inference"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "ArgMax Inference Task Norm"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime + ->register_task_variant( + registrar); + } + } // Transpose task { TaskVariantRegistrar registrar(TRANSPOSE_INIT_TASK_ID, "Transpose Init"); @@ -4976,6 +5909,119 @@ void register_flexflow_internal_tasks(Runtime *runtime, registrar); } } + // MultiHeadAttention task + { + TaskVariantRegistrar registrar(INC_MULTIHEAD_SELF_ATTENTION_INIT_TASK_ID, + "IncMultiHeadSelfAttention Init"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "IncMultiHeadSelfAttention Init Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant( + registrar); + } + } + { + TaskVariantRegistrar registrar(INC_MULTIHEAD_SELF_ATTENTION_INF_TASK_ID, + "IncMultiHeadSelfAttention Inference"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant< + IncMultiHeadSelfAttention::inference_task>( + registrar, "IncMultiHeadSelfAttention Inference Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant( + registrar); + } + } + // speculative MultiHeadAttention task + { + TaskVariantRegistrar registrar( + SPEC_INC_MULTIHEAD_SELF_ATTENTION_INIT_TASK_ID, + "Speculative IncMultiHeadSelfAttention Init"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant< + OpMeta *, + SpecIncMultiHeadSelfAttention::init_task>( + registrar, "Speculative IncMultiHeadSelfAttention Init Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant( + registrar); + } + } + { + TaskVariantRegistrar registrar( + SPEC_INC_MULTIHEAD_SELF_ATTENTION_INF_TASK_ID, + "Speculative IncMultiHeadSelfAttention Inference"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant< + SpecIncMultiHeadSelfAttention::inference_task>( + registrar, "Speculative IncMultiHeadSelfAttention Inference Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + 
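// Three attention task families serve the inference path:
// IncMultiHeadSelfAttention (plain incremental decoding),
// SpecIncMultiHeadSelfAttention (registered here, run by the small
// speculative models during beam search), and TreeIncMultiHeadSelfAttention
// (registered just below, run by the LLM to verify the speculated token
// tree). This mirrors the prepare_next_batch / prepare_next_batch_beam /
// prepare_next_batch_verify split among the RequestManager tasks registered
// earlier; interpretation based on the task and batch-config names.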
runtime->register_task_variant< + SpecIncMultiHeadSelfAttention::inference_task>(registrar); + } + } + { + TaskVariantRegistrar registrar( + TREE_INC_MULTIHEAD_SELF_ATTENTION_INIT_TASK_ID, + "TreeIncMultiHeadSelfAttention Init"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant< + OpMeta *, + TreeIncMultiHeadSelfAttention::init_task>( + registrar, "TreeIncMultiHeadSelfAttention Init Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant( + registrar); + } + } + { + TaskVariantRegistrar registrar( + TREE_INC_MULTIHEAD_SELF_ATTENTION_INF_TASK_ID, + "TreeIncMultiHeadSelfAttention Inference"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant< + TreeIncMultiHeadSelfAttention::inference_task>( + registrar, "TreeIncMultiHeadSelfAttention Inference Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant< + TreeIncMultiHeadSelfAttention::inference_task>(registrar); + } + } // NoOp { TaskVariantRegistrar registrar(NOOP_INIT_TASK_ID, "Weight NCCL Init"); @@ -5020,6 +6066,20 @@ void register_flexflow_internal_tasks(Runtime *runtime, runtime->register_task_variant(registrar); } } + { + TaskVariantRegistrar registrar(FUSEDOP_INF_TASK_ID, "FusedOp Inference"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "FusedOp Inference Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } { TaskVariantRegistrar registrar(FUSEDOP_BWD_TASK_ID, "FusedOp Backward"); registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); @@ -5126,6 +6186,20 @@ void register_flexflow_internal_tasks(Runtime *runtime, } } // Replicate + { + TaskVariantRegistrar registrar(REPLICATE_INIT_TASK_ID, "Replicate Init"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "Replicate init Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } { TaskVariantRegistrar registrar(REPLICATE_FWD_TASK_ID, "Replicate Forward"); registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); @@ -5155,6 +6229,20 @@ void register_flexflow_internal_tasks(Runtime *runtime, } } // Reduction + { + TaskVariantRegistrar registrar(REDUCTION_INIT_TASK_ID, "Reduction Init"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "Reduction init Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } { TaskVariantRegistrar registrar(REDUCTION_FWD_TASK_ID, "Reduction Forward"); registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); @@ -5183,6 +6271,64 @@ void register_flexflow_internal_tasks(Runtime *runtime, runtime->register_task_variant(registrar); } } + // AllReduce + { + TaskVariantRegistrar registrar(ALLREDUCE_INIT_TASK_ID, "AllReduce Init"); + 
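// AllReduce is the parallel op that combines partial results across
// tensor-parallel shards; it receives init/inference/forward/backward task
// variants here, and a matching OP_ALLREDUCE case is added to
// get_op_parameters() later in this patch. Replicate and Reduction gain init
// variants as well. (The tensor-parallel role is inferred from the
// tensor_parallelism_degree configuration added above, not stated in this
// file.)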
registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "AllReduce init Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } + { + TaskVariantRegistrar registrar(ALLREDUCE_INF_TASK_ID, + "AllReduce Inference"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "AllReduce Inference Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } + { + TaskVariantRegistrar registrar(ALLREDUCE_FWD_TASK_ID, "AllReduce Forward"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "AllReduce Forward Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } + { + TaskVariantRegistrar registrar(ALLREDUCE_BWD_TASK_ID, "AllReduce Backward"); + registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "AllReduce Backward Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime->register_task_variant(registrar); + } + } // FusedParallelOp { TaskVariantRegistrar registrar(FUSED_PARALLELOP_FWD_TASK_ID, @@ -5414,12 +6560,12 @@ void register_flexflow_internal_tasks(Runtime *runtime, #endif // Search { - TaskVariantRegistrar registrar(STRATEGY_SEARCH_TASK_ID, "Stretegy Search"); + TaskVariantRegistrar registrar(STRATEGY_SEARCH_TASK_ID, "Strategy Search"); registrar.add_constraint(ProcessorConstraint(Processor::TOC_PROC)); registrar.set_leaf(); if (pre_register) { Runtime::preregister_task_variant( - registrar, "Stretegy Search Task"); + registrar, "Strategy Search Task"); } else { if (enable_control_replication) { registrar.global_registration = false; @@ -5461,6 +6607,24 @@ void register_flexflow_internal_tasks(Runtime *runtime, runtime->register_task_variant(registrar); } } + // Tensor Equal task + { + TaskVariantRegistrar registrar(TENSOR_EQUAL_TASK_ID, "Tensor Equal"); + registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC)); + registrar.set_leaf(); + if (pre_register) { + Runtime::preregister_task_variant( + registrar, "Tensor Equal Task"); + } else { + if (enable_control_replication) { + registrar.global_registration = false; + } + runtime + ->register_task_variant( + registrar); + } + } } // template instantiations diff --git a/src/runtime/model.cu b/src/runtime/model.cu index e07a7465a9..17401a0f14 100644 --- a/src/runtime/model.cu +++ b/src/runtime/model.cu @@ -86,6 +86,8 @@ FFHandler printf("workSpaceSize (%zu MB)\n", info->workSpaceSize / 1024 / 1024); FFHandler handle; handle.workSpaceSize = info->workSpaceSize; + handle.offload_reserve_space_size = info->offload_reserve_space_size; + handle.quantization_type = info->quantization_type; handle.allowTensorOpMathConversion = info->allowTensorOpMathConversion; checkCUDA(cublasCreate(&handle.blas)); if (handle.allowTensorOpMathConversion) { @@ -125,6 +127,31 @@ FFHandler .wait(); handle.workSpace = workspaceInst.pointer_untyped(0, sizeof(char)); } + if 
(handle.offload_reserve_space_size > 0) { + // allocate memory for offload reserve space + Memory gpu_mem = Machine::MemoryQuery(Machine::get_machine()) + .only_kind(Memory::GPU_FB_MEM) + .best_affinity_to(task->target_proc) + .first(); + Realm::Rect<1, coord_t> bounds( + Realm::Point<1, coord_t>(0), + Realm::Point<1, coord_t>(handle.offload_reserve_space_size - 1)); + std::vector field_sizes; + field_sizes.push_back(sizeof(char)); + Realm::RegionInstance workspaceInst; + Realm::RegionInstance::create_instance(workspaceInst, + gpu_mem, + bounds, + field_sizes, + 0, + Realm::ProfilingRequestSet()) + .wait(); + handle.offload_reserve_space = + workspaceInst.pointer_untyped(0, sizeof(char)); + } else { + handle.offload_reserve_space = nullptr; + } + // checkCUDA(cudaMalloc(&handle.workSpace, handle.workSpaceSize)); #ifdef FF_USE_NCCL handle.ncclComm = NULL; diff --git a/src/runtime/operator_params.cc b/src/runtime/operator_params.cc index 41dd37dec7..5f9ae98936 100644 --- a/src/runtime/operator_params.cc +++ b/src/runtime/operator_params.cc @@ -1,9 +1,12 @@ #include "flexflow/operator_params.h" #include "flexflow/ops/aggregate.h" #include "flexflow/ops/aggregate_spec.h" +#include "flexflow/ops/arg_topk.h" +#include "flexflow/ops/argmax.h" #include "flexflow/ops/attention.h" #include "flexflow/ops/batch_matmul.h" #include "flexflow/ops/batch_norm.h" +#include "flexflow/ops/beam_topk.h" #include "flexflow/ops/cache.h" #include "flexflow/ops/cast.h" #include "flexflow/ops/concat.h" @@ -15,6 +18,7 @@ #include "flexflow/ops/flat.h" #include "flexflow/ops/gather.h" #include "flexflow/ops/groupby.h" +#include "flexflow/ops/inc_multihead_self_attention.h" #include "flexflow/ops/layer_norm.h" #include "flexflow/ops/linear.h" #include "flexflow/ops/mean.h" @@ -23,10 +27,15 @@ #include "flexflow/ops/reduce.h" #include "flexflow/ops/reshape.h" #include "flexflow/ops/reverse.h" +#include "flexflow/ops/rms_norm.h" +#include "flexflow/ops/sampling.h" #include "flexflow/ops/softmax.h" +#include "flexflow/ops/spec_inc_multihead_self_attention.h" #include "flexflow/ops/split.h" #include "flexflow/ops/topk.h" #include "flexflow/ops/transpose.h" +#include "flexflow/ops/tree_inc_multihead_self_attention.h" +#include "flexflow/parallel_ops/allreduce.h" #include "flexflow/parallel_ops/combine.h" #include "flexflow/parallel_ops/fused_parallel_op.h" #include "flexflow/parallel_ops/partition.h" @@ -78,6 +87,10 @@ tl::optional get_op_parameters(Op const *op) { return ((Gather *)op)->get_params(); case OP_MULTIHEAD_ATTENTION: return ((MultiHeadAttention *)op)->get_params(); + case OP_INC_MULTIHEAD_SELF_ATTENTION: + return ((IncMultiHeadSelfAttention *)op)->get_params(); + case OP_TREE_INC_MULTIHEAD_SELF_ATTENTION: + return ((TreeIncMultiHeadSelfAttention *)op)->get_params(); case OP_LAYERNORM: return ((LayerNorm *)op)->get_params(); case OP_REDUCE_SUM: @@ -94,6 +107,8 @@ tl::optional get_op_parameters(Op const *op) { return ((Reduction *)op)->get_params(); case OP_COMBINE: return ((Combine *)op)->get_params(); + case OP_ALLREDUCE: + return ((AllReduce *)op)->get_params(); case OP_FUSED_PARALLEL: return ((FusedParallelOp *)op)->get_params(); case OP_TRANSPOSE: @@ -110,6 +125,16 @@ tl::optional get_op_parameters(Op const *op) { return ((Aggregate *)op)->get_params(); case OP_AGG_SPEC: return ((AggregateSpec *)op)->get_params(); + case OP_RMS_NORM: + return ((RMSNorm *)op)->get_params(); + case OP_ARG_TOPK: + return ((ArgTopK *)op)->get_params(); + case OP_BEAM_TOPK: + return ((BeamTopK *)op)->get_params(); + case 
OP_SAMPLING: + return ((Sampling *)op)->get_params(); + case OP_ARGMAX: + return ((ArgMax *)op)->get_params(); // TODO: implement the get_params() function for the operators below and // uncomment the lines below diff --git a/src/runtime/parallel_tensor.cc b/src/runtime/parallel_tensor.cc index 963ad8af73..8f1be15fd1 100644 --- a/src/runtime/parallel_tensor.cc +++ b/src/runtime/parallel_tensor.cc @@ -1,3 +1,4 @@ +#include "flexflow/ffconst_utils.h" #include "flexflow/model.h" #include "flexflow/ops/attention.h" #include "flexflow/ops/concat.h" @@ -655,10 +656,15 @@ bool ParallelTensorBase::set_tensor(FFModel const *ff, // TODO: check data type matches // TODO: Currently we use a task launch, change to index launch for NCCL // parameter - size_t volume = 1, num_replicas = 0; + size_t volume = 1, num_replicas = 1; if (sync_type == ParameterSyncType::NCCL) { - Domain domain = runtime->get_index_space_domain(ctx, parallel_is); - num_replicas = domain.get_volume(); + // Domain domain = runtime->get_index_space_domain(ctx, parallel_is); + // num_replicas = domain.get_volume(); + for (int i = 0; i < this->num_dims; i++) { + if (this->dims[i].is_replica_dim) { + num_replicas *= this->dims[i].size; + } + } } else if (sync_type == ParameterSyncType::PS) { num_replicas = 1; } else { @@ -667,7 +673,7 @@ bool ParallelTensorBase::set_tensor(FFModel const *ff, for (size_t i = 0; i < dim_sizes.size(); i++) { volume = volume * dim_sizes[i]; } - RegionRequirement req(region, READ_WRITE, EXCLUSIVE, region); + RegionRequirement req(region, WRITE_ONLY, EXCLUSIVE, region); req.add_field(FID_DATA); InlineLauncher launcher(req); PhysicalRegion pr = runtime->map_region(ctx, launcher); @@ -675,7 +681,7 @@ bool ParallelTensorBase::set_tensor(FFModel const *ff, switch (num_dims) { #define DIMFUNC(DIM) \ case DIM: { \ - TensorAccessorW acc(pr, req, FID_DATA, ctx, runtime, true); \ + TensorAccessorW acc(pr, req, FID_DATA, ctx, runtime, false); \ assert(acc.rect.volume() == volume * num_replicas); \ T *ptr = acc.ptr; \ for (size_t i = 0; i < num_replicas; i++) { \ @@ -747,6 +753,65 @@ bool ParallelTensorBase::get_tensor(FFModel const *ff, return true; } +template +bool ParallelTensorBase::tensor_equal(FFConfig &config, + ParallelTensorBase &tensor) { + Context ctx = config.lg_ctx; + Runtime *runtime = config.lg_hlr; + TaskLauncher launcher(TENSOR_EQUAL_TASK_ID, + TaskArgument(&num_dims, sizeof(num_dims))); + launcher.add_region_requirement( + RegionRequirement(region, READ_ONLY, EXCLUSIVE, region)); + launcher.add_field(0, FID_DATA); + launcher.add_region_requirement( + RegionRequirement(tensor.region, READ_ONLY, EXCLUSIVE, tensor.region)); + launcher.add_field(1, FID_DATA); + Future result = runtime->execute_task(ctx, launcher); + bool equals = result.get_result(); + return equals; +} + +bool ParallelTensorBase::tensor_equal_task( + Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(regions.size() == 2); + int dim = *(int const *)task->args; + switch (dim) { +#define DIMFUNC(DIM) \ + case DIM: \ + return tensor_equal_task_with_dim(task, regions, ctx, runtime); + LEGION_FOREACH_N(DIMFUNC) +#undef DIMFUNC + default: + assert(false); + } + assert(false); +} + +template +bool ParallelTensorBase::tensor_equal_task_with_dim( + Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + TensorAccessorR acc1( + regions[0], task->regions[0], FID_DATA, ctx, runtime); + TensorAccessorR acc2( + regions[1], task->regions[1], FID_DATA, ctx, runtime); + float const 
*data1 = acc1.ptr; + float const *data2 = acc2.ptr; + bool equal = true; + for (int i = 0; i < acc1.rect.volume(); i++) { + if (data1[i] != data2[i]) { + equal = false; + break; + } + } + return equal; +} + template float *ParallelTensorBase::get_raw_ptr(FFConfig &config); template int32_t *ParallelTensorBase::get_raw_ptr(FFConfig &config); @@ -775,6 +840,20 @@ template bool TensorBase::get_tensor(FFModel const *ff, int64_t *data, bool get_gradients); +template bool ParallelTensorBase::set_tensor(FFModel const *ff, + std::vector const &dims, + half const *data); +template bool ParallelTensorBase::get_tensor(FFModel const *ff, + half *data, + bool get_gradients); + +template bool ParallelTensorBase::set_tensor(FFModel const *ff, + std::vector const &dims, + char const *data); +template bool ParallelTensorBase::get_tensor(FFModel const *ff, + char *data, + bool get_gradients); + template bool ParallelTensorBase::set_tensor( FFModel const *ff, std::vector const &dims, float const *data); template bool ParallelTensorBase::get_tensor(FFModel const *ff, @@ -796,6 +875,10 @@ template bool ParallelTensorBase::get_tensor(FFModel const *ff, int64_t *data, bool get_gradients); +template bool + ParallelTensorBase::tensor_equal(FFConfig &config, + ParallelTensorBase &tensor); + template bool TensorBase::get_output_parallel_tensor(FFModel const *ff, float *data, bool get_gradients); diff --git a/src/runtime/request_manager.cc b/src/runtime/request_manager.cc new file mode 100644 index 0000000000..d75b0fbe0b --- /dev/null +++ b/src/runtime/request_manager.cc @@ -0,0 +1,1737 @@ +/* Copyright 2023 CMU, Stanford, Facebook, LANL + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/request_manager.h" +#include "flexflow/parallel_ops/parallel_op.h" +// #include "flexflow/tokenizers.h" +#include +#include +#include +#include +#include + +namespace FlexFlow { + +using namespace Legion; +using tokenizers::Tokenizer; + +LegionRuntime::Logger::Category log_req_mgr("RequestManager"); + +std::string LoadBytesFromFile(std::string const &path) { + std::ifstream fs(path, std::ios::in | std::ios::binary); + assert(!fs.fail() && "no such file"); + std::string data; + fs.seekg(0, std::ios::end); + size_t size = static_cast(fs.tellg()); + fs.seekg(0, std::ios::beg); + data.resize(size); + fs.read(data.data(), size); + return data; +} + +RequestManager::RequestManager() + : verbose(false), next_available_guid(1000000), num_processed_requests(0) { + { + // Initialize futures for spec infer + TreeVerifyBatchConfig tree_bc; + InferenceResult tree_ir; + TreeVerifyBatchConfigFuture tree_bcf = + Future::from_value(tree_bc); + InferenceResultFuture tree_irf = + Future::from_value(tree_ir); + last_tree_bcf = tree_bcf; + last_tree_irf = tree_irf; + } + { + // Initialize futures for incr decoding + BatchConfig bc; + InferenceResult ir; + BatchConfigFuture bcf = Future::from_value(bc); + InferenceResultFuture irf = Future::from_value(ir); + last_bcf = bcf; + last_irf = irf; + } +} + +void RequestManager::register_tokenizer(ModelType type, + int bos_token_id, + int eos_token_id, + std::string const &path) { + this->model_type = type; + this->bos_token_id = bos_token_id; + this->eos_token_id = eos_token_id; + std::string tokenizer_folder = + (!path.empty() && path.back() != '/') ? path + '/' : path; + if (model_type == ModelType::LLAMA || model_type == ModelType::LLAMA2) { + bool path_to_file = !path.empty() && + (path.size() >= strlen("tokenizer.model")) && + path.find("tokenizer.model") == + (path.size() - strlen("tokenizer.model")); + std::string tokenizer_filepath = + path_to_file ? 
path : tokenizer_folder + "tokenizer.model"; + this->tokenizer_ = + Tokenizer::FromBlobSentencePiece(LoadBytesFromFile(tokenizer_filepath)); + } else if (model_type == ModelType::OPT) { + std::string vocab_file = tokenizer_folder + "vocab.json"; + std::string merges_file = tokenizer_folder + "merges.txt"; + std::string added_tokens_file = + tokenizer_folder + "special_tokens_map.json"; + std::filesystem::path path1(vocab_file); + std::filesystem::path path2(merges_file); + std::filesystem::path path3(added_tokens_file); + assert(std::filesystem::exists(path1) && + "Vocab file vocab.json does not exist at the specified path"); + assert(std::filesystem::exists(path2) && + "Merge file merges.txt does not exist at the specified path"); + // opt_tokenizer = new OptTokenizer(vocab_file, merges_file); + std::string vocab = LoadBytesFromFile(path1.string()); + std::string merges = LoadBytesFromFile(path2.string()); + std::string added_tokens = LoadBytesFromFile(path3.string()); + + this->tokenizer_ = + Tokenizer::FromBlobByteLevelBPE(vocab, merges, added_tokens); + } else if (model_type == ModelType::FALCON || + model_type == ModelType::STARCODER) { + std::string falcon_tokenizer_path = join_path({path, "tokenizer.json"}); + this->tokenizer_ = + Tokenizer::FromBlobJSON(LoadBytesFromFile(falcon_tokenizer_path)); + } +} + +void RequestManager::register_output_filepath( + std::string const &_output_filepath) { + this->output_filepath = _output_filepath; +} + +int RequestManager::register_ssm_model(FFModel *model) { + int model_id = models.size(); + models.push_back(model); + std::cout << "Register new model with id: " << model_id << std::endl; + return model_id; +} + +FFModel *RequestManager::get_model(int model_id) { + assert(model_id < models.size()); + return models[model_id]; +} + +size_t RequestManager::get_num_ssms() { + return models.size(); +} + +RequestManager::RequestGuid + RequestManager::register_new_request(std::vector const &prompt, + int max_sequence_length) { + const std::lock_guard lock(request_queue_mutex); + + // Add a new request + Request request; + request.status = Request::PENDING; + request.guid = next_available_guid++; + request.max_sequence_length = max_sequence_length; + + if (prompt.size() > BatchConfig::MAX_PROMPT_LENGTH) { + std::cout << "Warning: too many tokens in prompt, only load up to " + << BatchConfig::MAX_PROMPT_LENGTH << " tokens, but got " + << prompt.size() << ".\n"; + // Truncate the prompt to MAX_NUM_TOKENS + // request.tokens.insert(request.tokens.end(), + // prompt.begin(), + // prompt.begin() + BatchConfig::MAX_PROMPT_LENGTH); + // request.initial_len = BatchConfig::MAX_PROMPT_LENGTH; + printf("tokens size: %zu\n", request.tokens.size()); + // assert(false); + return 0; + } else { + request.initial_len = prompt.size(); + request.tokens = prompt; + } + + if (get_num_ssms() == 0) { + std::cout << "No small speculative model registered, using incremental " + "decoding." 
+ << std::endl; + } else { + std::cout << "Num of models: " << get_num_ssms() << std::endl; + for (int i = 0; i < get_num_ssms(); i++) { + BeamTree beam_tree = BeamTree{}; + request.beam_trees.push_back(beam_tree); + } + } + + pending_request_queue.push(request); + all_requests[request.guid] = request; + + if (verbose) { + std::cout << "new req: " << request.tokens.size() << std::endl; + for (int i = 0; i < request.tokens.size(); i++) { + std::cout << i << " : " << request.tokens[i] << std::endl; + } + } + + GenerationResult gr; + gr.guid = request.guid; + gr.input_text = ""; + gr.input_tokens = prompt; + gr.output_text = ""; + gr.output_tokens = prompt; + request_generation_results[request.guid] = gr; + + return request.guid; +} + +RequestManager::RequestGuid + RequestManager::register_new_request(std::string const &prompt, + int max_sequence_length) { + const std::lock_guard lock(request_queue_mutex); + // Add a new request + Request request; + request.status = Request::PENDING; + request.guid = next_available_guid++; + request.max_sequence_length = max_sequence_length; + request.tokens.push_back(bos_token_id); + + std::vector tokens = this->tokenizer_->Encode(prompt); + if (tokens.size() > BatchConfig::MAX_PROMPT_LENGTH) { + std::cout << "Warning: too many tokens in prompt, only load up to " + << BatchConfig::MAX_PROMPT_LENGTH << " tokens, but got " + << tokens.size() << ".\n"; + // Truncate the prompt to MAX_NUM_TOKENS + // tokens.resize(BatchConfig::MAX_PROMPT_LENGTH); + printf("tokens size: %zu\n", tokens.size()); + // assert(false); + return 0; + } + for (int i = 0; i < tokens.size(); i++) { + std::cout << "[" << i << "]" << tokens.at(i) << "\n"; + } + request.tokens.insert(request.tokens.end(), tokens.begin(), tokens.end()); + request.initial_len = request.tokens.size(); + + if (get_num_ssms() == 0) { + std::cout << "No small speculative model registered, using incremental " + "decoding." 
+ << std::endl; + } else { + std::cout << "Num of models: " << get_num_ssms() << std::endl; + for (int i = 0; i < get_num_ssms(); i++) { + BeamTree beam_tree = BeamTree{}; + request.beam_trees.push_back(beam_tree); + } + } + + pending_request_queue.push(request); + all_requests[request.guid] = request; + { + std::string output = "New request tokens:"; + for (int i = 0; i < request.tokens.size(); i++) { + output = output + " " + std::to_string(request.tokens[i]); + } + log_req_mgr.print("%s", output.c_str()); + } + + GenerationResult gr; + gr.guid = request.guid; + gr.input_text = prompt; + gr.input_tokens = request.tokens; + gr.output_text = prompt; + gr.output_tokens = request.tokens; + request_generation_results[request.guid] = gr; + return request.guid; +} + +bool RequestManager::is_request_completed(RequestGuid const &guid) { + const std::lock_guard lock(request_queue_mutex); + assert(all_requests.find(guid) != all_requests.end()); + Request const &request = all_requests[guid]; + // return request.tokens.size() >= request.max_sequence_length; + return request.status == Request::COMPLETED; +} + +GenerationResult + RequestManager::get_generation_result(RequestGuid const &guid) { + const std::lock_guard lock(request_queue_mutex); + assert(request_generation_results.find(guid) != + request_generation_results.end()); + return request_generation_results[guid]; +} + +size_t RequestManager::get_num_processed_requests() { + return num_processed_requests; +} + +BatchConfigFuture + RequestManager::prepare_next_batch(BatchConfigFuture const &old_bc, + InferenceResultFuture const &result) { + Runtime *runtime = Runtime::get_runtime(); + Context ctx = Runtime::get_context(); + + RequestManager *rm = this; + TaskLauncher launcher(RM_PREPARE_NEXT_BATCH_TASK_ID, + TaskArgument(&rm, sizeof(RequestManager *))); + launcher.add_future(old_bc); + launcher.add_future(result); + return runtime->execute_task(ctx, launcher); +} + +BatchConfig RequestManager::prepare_next_batch_task( + Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + RequestManager *rm = *((RequestManager **)task->args); + BatchConfig const *bc = BatchConfig::from_future(task->futures[0]); + InferenceResult const &result = + Future(task->futures[1]).get_result(); + return rm->prepare_next_batch(*bc, result); +} + +BatchConfig RequestManager::prepare_next_batch(BatchConfig const &old_bc, + InferenceResult const &result) { + const std::lock_guard lock(request_queue_mutex); + // Step 1: append result from previous iteration to request's tokens + for (int i = 0; i < old_bc.num_tokens; i++) { + size_t guid = + old_bc.requestsInfo[old_bc.tokensInfo[i].request_index].request_guid; + Request &request = all_requests[guid]; + if (old_bc.tokensInfo[i].abs_depth_in_request + 1 < request.tokens.size()) { + // This is a prompt token + continue; + } else { + assert(old_bc.tokensInfo[i].abs_depth_in_request + 1 == + request.tokens.size()); + // This is a decoding token + log_req_mgr.print("Output token is: %d", result.token_ids[i]); + request.tokens.push_back(result.token_ids[i]); + // std::string output = this->tokenizer_->Decode(request.tokens); + // log_req_mgr.print("Output: %s", output.c_str()); + } + } + // Step 2: prepare the next batch for existing requests + BatchConfig new_bc; + for (int i = 0; i < BatchConfig::MAX_NUM_REQUESTS; i++) { + if (old_bc.request_completed[i]) { + continue; + } + assert(old_bc.requestsInfo[i].num_tokens_in_batch > 0); + Request &request = all_requests[old_bc.requestsInfo[i].request_guid]; + 
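// Continuous-batching note (summary of the code that follows, with an
// illustrative example): a request still in its prompt phase is given as many
// prompt tokens as fit in the remaining BatchConfig::MAX_NUM_TOKENS budget,
// while a request already in the decode phase contributes exactly one token
// per iteration; Step 3 below then admits pending requests into freed slots.
// Hypothetical numbers: with MAX_NUM_TOKENS = 64, one decoding request plus a
// newly admitted request with a 100-token prompt yields a batch of 1 decode
// token and the first 63 prompt tokens; the remaining 37 prompt tokens are
// scheduled on the next iteration.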
int processed_tokens = old_bc.requestsInfo[i].token_start_offset + + old_bc.requestsInfo[i].num_tokens_in_batch; + assert(processed_tokens < request.tokens.size()); + bool request_completed = false; + // printf("model_type = %d\n", this->model_type); + if (request.tokens.size() >= old_bc.requestsInfo[i].max_sequence_length) { + request_completed = true; + } else if (request.tokens.back() == eos_token_id) { + // Encounter EOS token id + request_completed = true; + } + if (request_completed) { + request.status = Request::COMPLETED; + log_req_mgr.print("[Done] guid(%zu) final_length(%zu)", + old_bc.requestsInfo[i].request_guid, + request.tokens.size()); + std::string output = this->tokenizer_->Decode(request.tokens); + + { + // update generation result and trigger future + GenerationResult &gr = request_generation_results[request.guid]; + assert(gr.guid == request.guid); + gr.output_tokens = request.tokens; + gr.output_text = output; + } + log_req_mgr.print("Final output: %s", output.c_str()); + num_processed_requests++; + ProfileInfo profile_info = profiling_requests[request.guid]; + profile_info.finish_time = Realm::Clock::current_time_in_microseconds(); + total_request_run_time += + profile_info.finish_time - profile_info.start_time; + profiling_requests[request.guid] = profile_info; + log_req_mgr.print("[Profile] guid(%zu) decoding_steps(%d) start(%.1lf) " + "finish(%.1lf) latency(%.1lf)", + request.guid, + profile_info.decoding_steps, + profile_info.start_time, + profile_info.finish_time, + profile_info.finish_time - profile_info.start_time); + // Write output to file if needed: + if (!output_filepath.empty()) { + std::ofstream outputFile(output_filepath); + if (outputFile.is_open()) { + outputFile << "end-to-end latency: " << std::fixed + << std::setprecision(3) << total_request_run_time + << std::endl; + outputFile << "num decoding steps: " << profile_info.decoding_steps + << std::endl; + outputFile << "token IDs: "; + for (int i = 0; i < request.tokens.size(); i++) { + outputFile << request.tokens[i]; + if (i < request.tokens.size() - 1) { + outputFile << ","; + } + } + outputFile << std::endl; + outputFile << output; + outputFile.close(); + } else { + std::cout << "Unable to open the output file: " << output_filepath + << std::endl; + assert(false); + } + } + + // std::cout << "print results: " << std::endl; + // for (int i = 0; i < request.tokens.size(); i++) { + // std::cout << request.tokens.at(i) << ", "; + // } + } else { + new_bc.request_completed[i] = false; + new_bc.requestsInfo[i].token_start_offset = processed_tokens; + new_bc.requestsInfo[i].request_guid = old_bc.requestsInfo[i].request_guid; + new_bc.requestsInfo[i].max_sequence_length = + old_bc.requestsInfo[i].max_sequence_length; + if (new_bc.requestsInfo[i].token_start_offset + 1 == + request.tokens.size()) { + // Incremental phase + new_bc.requestsInfo[i].num_tokens_in_batch = 1; + } else { + // Prompt phase + new_bc.requestsInfo[i].num_tokens_in_batch = + std::min(BatchConfig::MAX_NUM_TOKENS - new_bc.num_tokens, + (int)request.tokens.size() - + new_bc.requestsInfo[i].token_start_offset); + } + for (int j = 0; j < new_bc.requestsInfo[i].num_tokens_in_batch; j++) { + int depth = new_bc.requestsInfo[i].token_start_offset + j; + new_bc.tokensInfo[new_bc.num_tokens].request_index = i; + new_bc.tokensInfo[new_bc.num_tokens].abs_depth_in_request = depth; + assert(depth < request.tokens.size()); + new_bc.tokensInfo[new_bc.num_tokens].token_id = request.tokens[depth]; + new_bc.num_tokens++; + } + // Update profiling + 
profiling_requests[new_bc.requestsInfo[i].request_guid].decoding_steps++; + } + } + // Step 3: add new requests to the next batch + for (int i = 0; i < BatchConfig::MAX_NUM_REQUESTS; i++) { + if (new_bc.request_completed[i]) { + if (!pending_request_queue.empty() && + new_bc.num_tokens < BatchConfig::MAX_NUM_TOKENS) { + Request new_request = pending_request_queue.front(); + pending_request_queue.pop(); + // all_requests[new_request.guid] = new_request; + new_bc.requestsInfo[i].token_start_offset = 0; + new_bc.requestsInfo[i].request_guid = new_request.guid; + new_bc.requestsInfo[i].num_tokens_in_batch = + std::min(BatchConfig::MAX_NUM_TOKENS - new_bc.num_tokens, + (int)new_request.tokens.size()); + new_bc.requestsInfo[i].max_sequence_length = + new_request.max_sequence_length; + new_bc.request_completed[i] = false; + // add profile_info for the new request + ProfileInfo profile_info; + profile_info.decoding_steps = 1; + profile_info.start_time = Realm::Clock::current_time_in_microseconds(); + profiling_requests[new_request.guid] = profile_info; + for (int j = 0; j < new_bc.requestsInfo[i].num_tokens_in_batch; j++) { + int depth = new_bc.requestsInfo[i].token_start_offset + j; + new_bc.tokensInfo[new_bc.num_tokens].request_index = i; + new_bc.tokensInfo[new_bc.num_tokens].abs_depth_in_request = depth; + assert(depth < new_request.tokens.size()); + new_bc.tokensInfo[new_bc.num_tokens].token_id = + new_request.tokens[depth]; + new_bc.num_tokens++; + } + if (new_bc.num_tokens == BatchConfig::MAX_NUM_TOKENS) { + break; + } + } + } + } + return new_bc; +} + +/* ----- Speculative Inference Specific functions ----- */ +BeamSearchBatchConfigFuture RequestManager::prepare_next_batch_beam( + BeamSearchBatchConfigFuture const &old_bc, + BeamInferenceResultFuture const &result) { + Runtime *runtime = Runtime::get_runtime(); + Context ctx = Runtime::get_context(); + + RequestManager *rm = this; + TaskLauncher launcher(RM_PREPARE_NEXT_BATCH_BEAM_TASK_ID, + TaskArgument(&rm, sizeof(RequestManager *))); + launcher.add_future(old_bc); + launcher.add_future(result); + return runtime->execute_task(ctx, launcher); +} + +BeamSearchBatchConfig RequestManager::prepare_next_batch_beam_task( + Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + RequestManager *rm = *((RequestManager **)task->args); + BeamSearchBatchConfig const &bc = + Future(task->futures[0]).get_result(); + BeamInferenceResult const &result = + Future(task->futures[1]).get_result(); + return rm->prepare_next_batch_beam(bc, result); +} + +// update beam search metadata +BeamSearchBatchConfig + RequestManager::prepare_next_batch_beam(BeamSearchBatchConfig const &old_bc, + BeamInferenceResult const &result) { + const std::lock_guard lock(request_queue_mutex); + if (verbose) { + std::cout << "\n############### prepare_next_batch_beam ###############\n"; + } + if (verbose) { + std::cout << "print all results" + << "\n"; + for (int i = 0; i < 40; i++) { + std::cout << result.token_ids[i] << ", "; + } + std::cout << "Current Beam Depth: " + << old_bc.beamRequestsInfo[0].current_depth << "\n"; + } + + // Step 1: Store result to the beam tree struct + store_beam_metadata(old_bc, result); + + // Step 2: preparing the next batch for existing requests + BeamSearchBatchConfig new_bc; + new_bc.max_init_length = 0; + new_bc.model_id = old_bc.model_id; + // std::cout << "old_bc.model_id: " << old_bc.model_id << "\n"; + + for (int i = 0; i < BatchConfig::MAX_NUM_REQUESTS; i++) { + if (old_bc.request_completed[i]) { + continue; + } 
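// For each live request, the code below advances the beam search by one
// level: store_beam_metadata() above has already recorded the speculated
// tokens in the request's beam tree, current_depth is incremented, and every
// token position is expanded into beam_size sub-requests (see the
// sub_requests / beamTokenInfo bookkeeping further down). Requests whose
// speculation has reached max_depth keep their slot with no new tokens,
// presumably so the verification step can still pick them up.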
+ // Comment out this assertion since num_tokens_in_batch can be + // zero when beam search has reached the required sequence length + // assert(old_bc.requestsInfo[i].num_tokens_in_batch > 0); + Request &request = all_requests[old_bc.requestsInfo[i].request_guid]; + int processed_tokens = old_bc.requestsInfo[i].token_start_offset + + old_bc.requestsInfo[i].num_tokens_in_batch; + + // assert(processed_tokens < request.tokens.size()); + log_req_mgr.debug() << "processed_tokens: " << processed_tokens << "\n"; + if (processed_tokens > + old_bc.beamRequestsInfo[i].max_depth + request.tokens.size() + // || ir.results[t] == 0 TODO: replace this with + ) { + log_req_mgr.print("[Done] guid(%zu) with spec_tree_depth(%d)", + old_bc.requestsInfo[i].request_guid, + old_bc.beamRequestsInfo[i].max_depth); + // new_bc.request_completed[i] = true; + new_bc.request_completed[i] = false; + new_bc.requestsInfo[i].token_start_offset = processed_tokens; + new_bc.requestsInfo[i].request_guid = old_bc.requestsInfo[i].request_guid; + new_bc.requestsInfo[i].max_sequence_length = + old_bc.requestsInfo[i].max_sequence_length; + } else { + log_req_mgr.debug() << "num tokens: " << old_bc.num_tokens << ", " + << new_bc.num_tokens; + new_bc.request_completed[i] = false; + new_bc.requestsInfo[i].token_start_offset = processed_tokens; + new_bc.requestsInfo[i].request_guid = old_bc.requestsInfo[i].request_guid; + new_bc.requestsInfo[i].max_sequence_length = + old_bc.requestsInfo[i].max_sequence_length; + + // update the beam search metadata + // how many sub-requests are in the current request + // why does sub_requests have MAX_NUM_REQUESTS * MAX_BEAM_WIDTH entries? + new_bc.sub_requests[i] = old_bc.beamRequestsInfo[i].beam_size; + // update the parent_id, accumulated_probs, depth, and token_ids + new_bc.beamRequestsInfo[i].current_depth = + old_bc.beamRequestsInfo[i].current_depth + 1; + new_bc.beamRequestsInfo[i].beam_size = + old_bc.beamRequestsInfo[i].beam_size; + new_bc.beamRequestsInfo[i].max_depth = + old_bc.beamRequestsInfo[i].max_depth; + + // do the slot exchange to minimize cache movement in the kernel. 
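// A small worked example (values illustrative only): with beam_size = 2 and
// num_tokens_in_batch = 1, the token loop below pushes two entries into
// new_bc.tokensInfo for this request -- same depth, one per sub-request --
// and tags each with beamTokenInfo[...].sub_request_index = k, presumably so
// the speculative attention kernel can keep the per-beam KV caches apart.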
+ // std::cout << "update metadata" << std::endl; + update_beam_metadata(new_bc, request.beam_trees.at(old_bc.model_id), i); + + if (new_bc.requestsInfo[i].token_start_offset + 1 >= + request.tokens.size()) { + // Incremental phase + new_bc.requestsInfo[i].num_tokens_in_batch = 1; + } else { + // Prompt phase + new_bc.requestsInfo[i].num_tokens_in_batch = + std::min(BatchConfig::MAX_NUM_TOKENS - new_bc.num_tokens, + (int)request.tokens.size() - + new_bc.requestsInfo[i].token_start_offset); + } + + // register more tokens due to the beam width + for (int j = 0; j < new_bc.requestsInfo[i].num_tokens_in_batch; j++) { + int depth = new_bc.requestsInfo[i].token_start_offset + j; + for (int k = 0; k < new_bc.sub_requests[i]; k++) { + new_bc.tokensInfo[new_bc.num_tokens].request_index = i; + new_bc.tokensInfo[new_bc.num_tokens].abs_depth_in_request = depth; + + // get value from requestinfo + new_bc.tokensInfo[new_bc.num_tokens].token_id = + new_bc.beamRequestsInfo[i].tokens[k]; + // request.tokens[depth]; + new_bc.beamTokenInfo[new_bc.num_tokens].sub_request_index = k; + new_bc.num_tokens++; + } + } + } + } + if (verbose) { + std::cout << "prepare_next_batch_beam OLD vs NEW batchconfigs:" + << std::endl; + old_bc.print(); + new_bc.print(); + } + return new_bc; +} + +BeamSearchBatchConfigFuture RequestManager::prepare_next_batch_init( + TreeVerifyBatchConfigFuture const &old_bc, + InferenceResultFuture const &result, + int model_id) { + Runtime *runtime = Runtime::get_runtime(); + Context ctx = Runtime::get_context(); + + RequestManager *rm = this; + TaskLauncher launcher(RM_PREPARE_NEXT_BATCH_INIT_TASK_ID, + TaskArgument(&rm, sizeof(RequestManager *))); + launcher.add_future(old_bc); + launcher.add_future(result); + launcher.add_future(Future::from_value(model_id)); + return runtime->execute_task(ctx, launcher); +} + +BeamSearchBatchConfig RequestManager::prepare_next_batch_init_task( + Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + RequestManager *rm = *((RequestManager **)task->args); + TreeVerifyBatchConfig const &bc = + Future(task->futures[0]).get_result(); + InferenceResult const &result = + Future(task->futures[1]).get_result(); + int model_id = Future(task->futures[2]).get_result(); + return rm->prepare_next_batch_init(bc, result, model_id); +} + +BeamSearchBatchConfig + RequestManager::prepare_next_batch_init(TreeVerifyBatchConfig const &old_bc, + InferenceResult const &result, + int model_id) { + const std::lock_guard lock(request_queue_mutex); + if (verbose) { + std::cout << "\n############### prepare_next_batch_init ###############\n"; + } + // Step 1: use result to update requests + BeamSearchBatchConfig new_bc; + new_bc.num_tokens = 0; + new_bc.model_id = model_id; + int result_index = 0; + + for (int i = 0; i < BatchConfig::MAX_NUM_REQUESTS; i++) { + if (old_bc.request_completed[i]) { + continue; + } + size_t guid = old_bc.requestsInfo[i].request_guid; + Request &request = all_requests[guid]; + + // Verify this: get verified tokens from result + std::vector> tree_outputs = + std::vector>(); + + assert(old_bc.num_tokens > 0); + + int start_depth = old_bc.tokensInfo[result_index].abs_depth_in_request; + if (committed_tokens.find(guid) == committed_tokens.end()) { + committed_tokens[guid] = std::vector>(); + } else { + committed_tokens.at(guid).clear(); + } + // iterate through all the tokens that belong to request i + while (result_index < old_bc.num_tokens && + old_bc.tokensInfo[result_index].request_index == i) { + // new tokens have not been 
appended yet, so the last appended token is + // the root of the beam search token tree + int root_abs_depth = request.tokens.size() - 1; + if (old_bc.tokensInfo[result_index].abs_depth_in_request >= + root_abs_depth) { + // append to tree_outputs a pair consisting of (token id, depth) + tree_outputs.push_back(std::make_pair( + result.token_ids[result_index], + old_bc.tokensInfo[result_index].abs_depth_in_request + 1)); + // append (depth, index of the token in result) to committed_tokens + // array + committed_tokens.at(guid).push_back( + std::make_pair(old_bc.tokensInfo[result_index].abs_depth_in_request, + result_index)); + + if (verbose) { + std::cout << "Index within old batch: " << result_index << std::endl; + printf(" Input: [%d] %d ---> [%d] %d \n", + old_bc.tokensInfo[result_index].abs_depth_in_request, + old_bc.tokensInfo[result_index].token_id, + tree_outputs.back().second, + tree_outputs.back().first); + } + // std::cout << " Input: " << old_bc.tokensInfo[result_index].token_id + // << "" + // << old_bc.tokensInfo[result_index].abs_depth_in_request << + // std::endl; + // std::cout << " Result: " << result.token_ids[result_index] << ", + // depth: " + // << old_bc.tokensInfo[result_index].abs_depth_in_request + 1 << + // std::endl; + } + result_index++; + } + + std::vector> verified_tokens = + traverse_verify_tree(guid, dfs_tree_inputs.at(guid), tree_outputs); + log_req_mgr.print("Number of Verified Tokens = %zu", + verified_tokens.size()); + // check if the request is finished + if (verified_tokens.size() + request.tokens.size() >= + request.max_sequence_length) { + // Append all verified tokens to the request + for (int j = 0; j < verified_tokens.size(); j++) { + if (verified_tokens[j].second < request.max_sequence_length) { + request.tokens.push_back(verified_tokens[j].first); + } + } + request.status = Request::COMPLETED; + log_req_mgr.print("[Done] guid(%zu) with final length(%zu)", + request.guid, + request.tokens.size()); + std::string output = this->tokenizer_->Decode(request.tokens); + { + // update generation result and trigger future + GenerationResult &gr = request_generation_results[request.guid]; + assert(gr.guid == request.guid); + gr.output_tokens = request.tokens; + gr.output_text = output; + } + log_req_mgr.print("Final output: %s", output.c_str()); + new_bc.request_completed[i] = true; + num_processed_requests++; + ProfileInfo profile_info = profiling_requests[request.guid]; + profile_info.finish_time = Realm::Clock::current_time_in_microseconds(); + total_request_run_time += + profile_info.finish_time - profile_info.start_time; + profiling_requests[request.guid] = profile_info; + log_req_mgr.print("[Profile] guid(%zu) decoding_steps(%d) start(%.1lf) " + "finish(%.1lf) latency(%.1lf)", + request.guid, + profile_info.decoding_steps, + profile_info.start_time, + profile_info.finish_time, + profile_info.finish_time - profile_info.start_time); + + // Write output to file if needed: + if (!output_filepath.empty()) { + std::ofstream outputFile(output_filepath); + if (outputFile.is_open()) { + outputFile << "end-to-end latency: " << std::fixed + << std::setprecision(3) << total_request_run_time + << std::endl; + outputFile << "num decoding steps: " << profile_info.decoding_steps + << std::endl; + outputFile << "token IDs: "; + for (int i = 0; i < request.tokens.size(); i++) { + outputFile << request.tokens[i]; + if (i < request.tokens.size() - 1) { + outputFile << ","; + } + } + outputFile << std::endl; + outputFile << output; + outputFile.close(); + } else { + 
std::cout << "Unable to open the output file: " << output_filepath + << std::endl; + assert(false); + } + } + + // delete the old input tree from cache + dfs_tree_inputs.erase(request.guid); + + continue; + } + + new_bc.request_completed[i] = false; + + // Normal Request Info + new_bc.requestsInfo[i].token_start_offset = verified_tokens.front().second; + new_bc.requestsInfo[i].request_guid = old_bc.requestsInfo[i].request_guid; + new_bc.requestsInfo[i].max_sequence_length = + old_bc.requestsInfo[i].max_sequence_length; + new_bc.requestsInfo[i].num_tokens_in_batch = verified_tokens.size(); + + // TODO: Beam Request Info, missing from VerifyTreeBatchConfig + int new_max_depth = new_bc.requestsInfo[i].max_sequence_length - + new_bc.requestsInfo[i].token_start_offset - + verified_tokens.size(); + new_bc.beamRequestsInfo[i].current_depth = 1; + new_bc.beamRequestsInfo[i].beam_size = + BeamSearchBatchConfig::MAX_BEAM_WIDTH; + new_bc.beamRequestsInfo[i].max_depth = + std::min(new_max_depth, BeamSearchBatchConfig::MAX_BEAM_DEPTH); + for (int j = 0; j < BeamSearchBatchConfig::MAX_BEAM_WIDTH; j++) { + new_bc.beamRequestsInfo[i].parent_id[j] = 0; + new_bc.beamRequestsInfo[i].probs[j] = 1; + } + + new_bc.sub_requests[i] = 1; + + // Token Info + for (int j = 0; j < verified_tokens.size(); j++) { + auto token = verified_tokens.at(j); + + // Normal Token Info + new_bc.tokensInfo[new_bc.num_tokens].request_index = i; + new_bc.tokensInfo[new_bc.num_tokens].token_id = token.first; + new_bc.tokensInfo[new_bc.num_tokens].abs_depth_in_request = token.second; + + // Beam Token Info + new_bc.beamTokenInfo[new_bc.num_tokens].sub_request_index = 0; + new_bc.num_tokens++; + + // Add verified token to request's token list + request.tokens.push_back(token.first); + + if (new_bc.num_tokens == BatchConfig::MAX_NUM_TOKENS) { + break; + } + } + std::string output = this->tokenizer_->Decode(request.tokens); + log_req_mgr.print("Output: %s", output.c_str()); + } + + // Step 2: Initialize new request + new_bc.max_init_length = 0; + for (int i = 0; i < BeamSearchBatchConfig::MAX_NUM_REQUESTS; i++) { + if (new_bc.request_completed[i]) { + if (!pending_request_queue.empty() && + new_bc.num_tokens < BeamSearchBatchConfig::MAX_NUM_TOKENS) { + Request new_request = pending_request_queue.front(); + pending_request_queue.pop(); + new_bc.max_init_length = + std::max(new_bc.max_init_length, new_request.initial_len); + // all_requests[new_request.guid] = new_request; + new_bc.requestsInfo[i].token_start_offset = 0; + new_bc.requestsInfo[i].request_guid = new_request.guid; + new_bc.requestsInfo[i].num_tokens_in_batch = + std::min(BeamSearchBatchConfig::MAX_NUM_TOKENS - new_bc.num_tokens, + (int)new_request.tokens.size()); + new_bc.requestsInfo[i].max_sequence_length = + new_request.max_sequence_length; + + // add profile_info for the new request + ProfileInfo profile_info; + profile_info.decoding_steps = 0; + profile_info.start_time = Realm::Clock::current_time_in_microseconds(); + profiling_requests[new_request.guid] = profile_info; + // init the beam search metadata per request + new_bc.beamRequestsInfo[i].beam_size = + BeamSearchBatchConfig::MAX_BEAM_WIDTH; + new_bc.beamRequestsInfo[i].current_depth = 1; + new_bc.beamRequestsInfo[i].max_depth = + std::min(BeamSearchBatchConfig::MAX_BEAM_DEPTH, + BatchConfig::MAX_NUM_TOKENS - + new_bc.requestsInfo[i].num_tokens_in_batch - 1); + for (int j = 0; j < BeamSearchBatchConfig::MAX_BEAM_WIDTH; j++) { + new_bc.beamRequestsInfo[i].parent_id[j] = 0; + new_bc.beamRequestsInfo[i].probs[j] = 1; + 
} + + new_bc.request_completed[i] = false; + new_bc.sub_requests[i] = 1; + + for (int j = 0; j < new_bc.requestsInfo[i].num_tokens_in_batch; j++) { + int depth = new_bc.requestsInfo[i].token_start_offset + j; + new_bc.tokensInfo[new_bc.num_tokens].request_index = i; + new_bc.tokensInfo[new_bc.num_tokens].abs_depth_in_request = depth; + assert(depth < new_request.tokens.size()); + new_bc.tokensInfo[new_bc.num_tokens].token_id = + new_request.tokens[depth]; + + // beam search meta data, indicate which sub request this token + // belongs to, init to 0; + new_bc.beamTokenInfo[new_bc.num_tokens].sub_request_index = 0; + new_bc.num_tokens++; + } + if (new_bc.num_tokens == BatchConfig::MAX_NUM_TOKENS) { + break; + } + } + } + } + + if (verbose) { + std::cout << "prepare_next_batch_init OLD vs NEW batchconfigs below:" + << std::endl; + old_bc.print(); + new_bc.print(); + } + return new_bc; +} + +TreeVerifyBatchConfigFuture RequestManager::prepare_next_batch_verify( + std::vector const &old_batches) { + Runtime *runtime = Runtime::get_runtime(); + Context ctx = Runtime::get_context(); + + RequestManager *rm = this; + TaskLauncher launcher(RM_PREPARE_NEXT_BATCH_VERIFY_TASK_ID, + TaskArgument(&rm, sizeof(RequestManager *))); + for (auto const &bcf : old_batches) { + launcher.add_future(bcf); + } + return runtime->execute_task(ctx, launcher); +} + +TreeVerifyBatchConfig RequestManager::prepare_next_batch_verify_task( + Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + RequestManager *rm = *((RequestManager **)task->args); + std::vector old_batches; + for (auto const &bcf : task->futures) { + old_batches.push_back(Future(bcf).get_result()); + } + return rm->prepare_next_batch_verify(old_batches); +} + +TreeVerifyBatchConfig RequestManager::prepare_next_batch_verify( + std::vector const &old_batches) { + const std::lock_guard lock(request_queue_mutex); + + if (verbose) { + std::cout + << "\n############### prepare_next_batch_verify ###############\n"; + } + assert(old_batches.size() > 0); + + TreeVerifyBatchConfig new_bc; + new_bc.num_tokens_to_commit = 0; + new_bc.num_tokens = 0; + + for (int i = 0; i < TreeVerifyBatchConfig::MAX_NUM_REQUESTS; i++) { + if (old_batches.at(0).request_completed[i]) { + continue; + } + size_t guid = old_batches.at(0).requestsInfo[i].request_guid; + Request &request = all_requests[guid]; + + // Get the dfs tree + std::vector>> + all_dfs_trees; + + for (int j = 0; j < old_batches.size(); j++) { + std::vector> new_tree = + traverse_beam_tree(old_batches.at(j), i, request.tokens.size() - 1); + all_dfs_trees.push_back(new_tree); + } + assert(all_dfs_trees.size() == old_batches.size()); + std::vector> dfs_tree_inputs = + merge_dfs_trees(all_dfs_trees, request.tokens.size() - 1, guid); + + if (verbose) { + std::cout << "Request Tokens Size: " << request.tokens.size() + << std::endl; + for (int k = 0; k < request.tokens.size(); k++) { + std::cout << k << ": " << request.tokens[k] << std::endl; + } + } + + // Normal Request Info + new_bc.requestsInfo[i].token_start_offset = dfs_tree_inputs.front().second; + new_bc.requestsInfo[i].request_guid = + old_batches.at(0).requestsInfo[i].request_guid; + new_bc.requestsInfo[i].max_sequence_length = + old_batches.at(0).requestsInfo[i].max_sequence_length; + // TODO: Check this + new_bc.requestsInfo[i].num_tokens_in_batch = 0; + new_bc.request_completed[i] = false; + + // Profiling + profiling_requests[new_bc.requestsInfo[i].request_guid].decoding_steps += 1; + // TODO: Add prompt token first in first verify 
iteration + if (request.tokens.size() == request.initial_len) { + // Initialization (prompt) phase + for (int j = 0; j < request.initial_len; j++) { + new_bc.tokensInfo[new_bc.num_tokens].request_index = i; + new_bc.tokensInfo[new_bc.num_tokens].token_id = request.tokens[j]; + new_bc.tokensInfo[new_bc.num_tokens].abs_depth_in_request = j; + + new_bc.num_tokens++; + new_bc.requestsInfo[i].num_tokens_in_batch++; + } + + std::cout << "new_bc.num_tokens: " << new_bc.num_tokens << std::endl; + if (new_bc.num_tokens >= BatchConfig::MAX_NUM_TOKENS) { + assert(false && + "Exceeding the space available in the TreeVerify batch"); + break; + } + + new_bc.requestsInfo[i].token_start_offset = 0; + } else { + // Incremental phase: only add the last committed token + new_bc.tokensInfo[new_bc.num_tokens].request_index = i; + new_bc.tokensInfo[new_bc.num_tokens].token_id = request.tokens.back(); + new_bc.tokensInfo[new_bc.num_tokens].abs_depth_in_request = + request.tokens.size() - 1; + + new_bc.num_tokens++; + new_bc.requestsInfo[i].num_tokens_in_batch++; + + if (new_bc.num_tokens == BatchConfig::MAX_NUM_TOKENS) { + assert(false && + "Exceeding the space available in the TreeVerify batch"); + break; + } + + new_bc.requestsInfo[i].token_start_offset = request.tokens.size() - 1; + } + + if (verbose) { + std::cout << "dfs_tree_inputs.size(): " << dfs_tree_inputs.size() + << std::endl; + } + + // add prompt to the dfs tree + if (committed_tokens.find(guid) != committed_tokens.end()) { + if (dfs_tree_inputs.at(0).second == + request.initial_len + committed_tokens.at(guid).size() - 1) { + for (int j = 0; j < request.initial_len; j++) { + new_bc.committed_tokens[new_bc.num_tokens_to_commit].token_index = j; + new_bc.committed_tokens[new_bc.num_tokens_to_commit].request_index = + i; + new_bc.committed_tokens[new_bc.num_tokens_to_commit].token_depth = j; + if (verbose) { + std::cout << new_bc.num_tokens_to_commit + << "- committed_token.token_depth: " << j + << ", token_index: " << j << std::endl; + } + new_bc.num_tokens_to_commit++; + } + } else { + // only add the root token + auto committed_token = committed_tokens.at(guid).at(0); + new_bc.committed_tokens[new_bc.num_tokens_to_commit].token_index = + committed_token.second; + new_bc.committed_tokens[new_bc.num_tokens_to_commit].request_index = i; + new_bc.committed_tokens[new_bc.num_tokens_to_commit].token_depth = + committed_token.first; + if (verbose) { + std::cout << new_bc.num_tokens_to_commit + << "- committed_token.token_depth: " + << committed_token.first + << ", token_index: " << committed_token.second << std::endl; + } + new_bc.num_tokens_to_commit++; + } + if (verbose) { + std::cout << "new_bc.num_tokens_to_commit: " + << new_bc.num_tokens_to_commit << std::endl; + } + } + + // Token Info + for (int j = 1; j < dfs_tree_inputs.size(); j++) { + auto token = dfs_tree_inputs.at(j); + if (verbose) { + std::cout << "[" << j << "] Token: " << token.first + << ", Depth:" << token.second << std::endl; + } + // Normal Token Info + new_bc.tokensInfo[new_bc.num_tokens].request_index = i; + new_bc.tokensInfo[new_bc.num_tokens].token_id = token.first; + new_bc.tokensInfo[new_bc.num_tokens].abs_depth_in_request = token.second; + + // TODO: Add committed token info + if (verbose) { + std::cout << "committed_tokens.size(): " << new_bc.num_tokens_to_commit + << std::endl; + } + + if (committed_tokens.find(guid) != committed_tokens.end()) { + if (j < committed_tokens.at(guid).size()) { + auto committed_token = committed_tokens.at(guid).at(j); + 
new_bc.committed_tokens[new_bc.num_tokens_to_commit].token_index = + committed_token.second; + new_bc.committed_tokens[new_bc.num_tokens_to_commit].request_index = + i; + new_bc.committed_tokens[new_bc.num_tokens_to_commit].token_depth = + committed_token.first; + if (verbose) { + std::cout << new_bc.num_tokens_to_commit + << "- committed_token.token_depth: " + << committed_token.first + << ", token_index: " << committed_token.second + << std::endl; + } + new_bc.num_tokens_to_commit++; + } + } + if (verbose) { + std::cout << "new_bc.num_tokens_to_commit: " + << new_bc.num_tokens_to_commit << std::endl; + } + + new_bc.num_tokens++; + new_bc.requestsInfo[i].num_tokens_in_batch++; + + if (new_bc.num_tokens == BatchConfig::MAX_NUM_TOKENS - 1) { + break; + } + } + + std::cout << "new_bc.num_tokens: " << new_bc.num_tokens << std::endl; + } + + if (verbose) { + std::cout << "prepare_next_batch_verify OLD vs NEW batchconfigs below:" + << std::endl; + // old_batches.print(); + // new_bc.print(); + } + + return new_bc; +} + +void RequestManager::store_beam_metadata(BeamSearchBatchConfig const &old_bc, + BeamInferenceResult const &result) { + // step1 store the outputs + if (old_bc.num_tokens <= 0) { + return; + } + auto guid = + old_bc.requestsInfo[old_bc.tokensInfo[0].request_index].request_guid; + auto start_depth = old_bc.tokensInfo[0].abs_depth_in_request; + int result_index = 0; + + if (verbose) { + std::cout << "Store total of " << old_bc.num_tokens + << " tokens in the current batch.\n"; + } + + for (int i = 0; i <= old_bc.num_tokens; i++) { + int request_index = old_bc.tokensInfo[i].request_index; + + // End of the request + if (i == old_bc.num_tokens || + old_bc.requestsInfo[request_index].request_guid != guid) { + + // Each token yields (beam_width) results + int beam_width = old_bc.beamRequestsInfo[request_index].beam_size; + + // Count tokens sent to model in this request to find the final token's + // index + result_index += + (old_bc.tokensInfo[i - 1].abs_depth_in_request - start_depth) * + beam_width; + + if (verbose) { + std::cout << "i = " << i << ", result index = " << result_index + << ", value: " << result.token_ids[result_index] << "\n"; + } + + int index = old_bc.tokensInfo[i - 1].request_index; + int beam_size = old_bc.beamRequestsInfo[index].beam_size; + int depth = old_bc.beamRequestsInfo[index].current_depth; + + Request &request = all_requests[old_bc.requestsInfo[index].request_guid]; + + if (depth == 1) { + // store the last input into the tree; + if (verbose) { + std::cout << "try to store the input" + << "\n"; + } + + request.beam_trees.at(old_bc.model_id).treeLayers[0].tokens[0] = + request.tokens.back(); + request.beam_trees.at(old_bc.model_id).treeLayers[0].probs[0] = 1; + request.beam_trees.at(old_bc.model_id).treeLayers[0].parent_ids[0] = -1; + + if (verbose) { + std::cout << "Store the previous last token to the tree root: " + << request.tokens.back() << "\n"; + } + } + + for (int beam_id = 0; beam_id < beam_width; beam_id++) { + request.beam_trees.at(old_bc.model_id) + .treeLayers[depth] + .tokens[beam_id] = result.token_ids[result_index]; + request.beam_trees.at(old_bc.model_id) + .treeLayers[depth] + .probs[beam_id] = result.probs[result_index]; + request.beam_trees.at(old_bc.model_id) + .treeLayers[depth] + .parent_ids[beam_id] = result.parent_id[result_index]; + + if (verbose) { + std::cout << "tree value: " << depth << "token: " + << request.beam_trees.at(old_bc.model_id) + .treeLayers[depth] + .tokens[beam_id] + << "result tokens: " << 
result.token_ids[result_index]; + } + result_index += 1; + } + + // update the guid and start_depth for current request + if (i < old_bc.num_tokens) { + guid = old_bc.requestsInfo[request_index].request_guid; + start_depth = old_bc.tokensInfo[i].abs_depth_in_request; + } + } + } +} + +// for updating the beam search metadata in requests in incremental phase +void RequestManager::update_beam_metadata(BeamSearchBatchConfig &new_bc, + BeamTree &tree, + int request_index) { + + // do the exchange + if (new_bc.request_completed[request_index]) { + assert(false); + } + int depth = new_bc.beamRequestsInfo[request_index].current_depth - 1; + int beam_size = new_bc.beamRequestsInfo[request_index].beam_size; + + if (new_bc.beamRequestsInfo[request_index].current_depth == + 1) { // TODO: check if this is correct + // for (int j = 0; j < beam_size; j++) { + // new_bc.beamRequestsInfo[request_index].parent_id[j] = j; + // new_bc.beamRequestsInfo[request_index].probs[j] = + // tree.treeLayers[depth].probs[j]; // ? + // new_bc.beamRequestsInfo[request_index].tokens[j] = + // tree.treeLayers[depth].tokens[j]; // ? + // } + // Do nothing + // assert(false); + } else { + std::set parents; + std::set childs; + // cache stealing + for (int j = 0; j < beam_size; j++) { + int parent_id = tree.treeLayers[depth].parent_ids[j]; + if (childs.find(parent_id) == childs.end()) { + // copy beam slot + new_bc.beamRequestsInfo[request_index].parent_id[parent_id] = + tree.treeLayers[depth].parent_ids[j]; + new_bc.beamRequestsInfo[request_index].probs[parent_id] = + tree.treeLayers[depth].probs[j]; + new_bc.beamRequestsInfo[request_index].tokens[parent_id] = + tree.treeLayers[depth].tokens[j]; + parents.emplace(j); + childs.emplace(parent_id); + } + } + if (parents.size() < beam_size) { + for (int j = 0; j < beam_size; j++) { + if (parents.find(j) == parents.end()) { + // this slot has not been assigned + // find the smallest not assigned child and put in + if (verbose) { + std::cout << "request_index" << request_index + << ", miss slot: " << j << "\n"; + } + for (int k = 0; k < beam_size; k++) { + if (childs.find(k) == childs.end()) { + // parent -> j to child k; + new_bc.beamRequestsInfo[request_index].parent_id[k] = + tree.treeLayers[depth].parent_ids[j]; + new_bc.beamRequestsInfo[request_index].probs[k] = + tree.treeLayers[depth].probs[j]; + new_bc.beamRequestsInfo[request_index].tokens[k] = + tree.treeLayers[depth].tokens[j]; + parents.emplace(j); + childs.emplace(k); + break; + } + } + } + } + } + } + if (verbose) { + std::cout << "-----------after parent id exchange-----------" << std::endl; + for (int j = 0; j < beam_size; j++) { + std::cout << "after request id: " << request_index << "beam id = " << j + << "parent: " + << new_bc.beamRequestsInfo[request_index].parent_id[j] + << "token: " << new_bc.beamRequestsInfo[request_index].tokens[j] + << "probs: " << new_bc.beamRequestsInfo[request_index].probs[j] + << std::endl; + } + } +} + +bool PreOrder( + BeamTree const &tree, + int max_depth, + int current_depth, + int beam_width, + int id, + std::vector> &serializedTree, + bool verbose) { + // terminate + if (current_depth >= max_depth) { + serializedTree.push_back(std::make_pair( + tree.treeLayers[current_depth].tokens[id], current_depth)); + if (verbose) { + std::cout << "last tokens: " << tree.treeLayers[current_depth].tokens[id] + << "\n"; + std::cout << "return true" + << "\n"; + } + return true; + } + + // add to tree; + // std::cout<<"node: " << current_depth << ", id: " << + serializedTree.push_back( + 
std::make_pair(tree.treeLayers[current_depth].tokens[id], current_depth)); + if (verbose) { + std::cout << "push something: " << tree.treeLayers[current_depth].tokens[id] + << ", " << current_depth << std::endl; + } + int index = serializedTree.size() - 1; + int next_layers = current_depth + 1; + + bool flag = false; + // recursion + for (int i = 0; i < beam_width; i++) { + int child_id = i; + int child_parent = tree.treeLayers[next_layers].parent_ids[i]; + + // for all childs, do preOrder + if (child_parent == id) { + if (verbose) { + std::cout << "current depth: " << current_depth << ", child_parent, " + << child_parent << ", child_id, " << child_id << "\n"; + } + bool res = PreOrder(tree, + max_depth, + current_depth + 1, + beam_width, + child_id, + serializedTree, + verbose); + flag = flag || res; + } + } + // if (!flag) { + // // no child for this token, delete it + // std::cout << "delete a node: " << + // tree.treeLayers[current_depth].tokens[id] + // << ", " << current_depth << std::endl; + // serializedTree.erase(serializedTree.begin() + index); + // } + return flag; +} + +std::vector> + RequestManager::traverse_verify_tree( + size_t guid, + std::vector> const + &inputSerializedTree, + std::vector> const + &outputSerializedTree) { + std::vector> verifiedTree; + // verifiedTree.push_back(inputSerializedTree.at(0)); + std::vector> new_committed_tokens = + std::vector>(); + + log_req_mgr.print("Input tree size (%zu) Output tree size (%zu)", + inputSerializedTree.size(), + outputSerializedTree.size()); + { // Input tree + std::ostringstream oss; + // inputSerializedTree is the dfs_tree_inputs_map[guid] array og (token id, + // depth) pairs + for (auto const &pair : inputSerializedTree) { + oss << " " << pair.second << ":" << pair.first; + // log_req_mgr.print("(%d, %d)", pair.first, pair.second); + } + log_req_mgr.print("Input tree:%s", oss.str().c_str()); + } + { // Output tree + // log_req_mgr.print("========Output============"); + // outputSerializedTree is an array of (token id, depth + 1) pairs + std::ostringstream oss; + for (auto const &pair : outputSerializedTree) { + // log_req_mgr.print("(%d, %d)", pair.first, pair.second); + oss << " " << pair.second << ":" << pair.first; + } + log_req_mgr.print("Output tree:%s", oss.str().c_str()); + } + { + // log_req_mgr.print("========Committed============"); + // committed_tokens[guid] is an array of (depth, result_index) pairs for + // the given request + std::ostringstream oss; + for (auto const &pair : committed_tokens.at(guid)) { + // log_req_mgr.print("(%d, %d)", pair.first, pair.second); + oss << " " << pair.second << ":" << pair.first; + } + log_req_mgr.print("Committed tokens:%s", oss.str().c_str()); + } + + // It's safe to have inputSerializedTree.size() > outputSerializedTree.size() + // In this case the inputSeriedTree ends with padding 0s + assert(inputSerializedTree.size() >= outputSerializedTree.size()); + + for (int i = 0; i < outputSerializedTree.size(); i++) { + auto input = inputSerializedTree.at(i); + auto output = outputSerializedTree.at(i); + + if (i == 0) { + verifiedTree.push_back(output); + new_committed_tokens.push_back(std::make_pair( + input.second, + committed_tokens.at(guid).at(i).second)); // + // std::cout << committed_tokens.at(guid).at(i).first << ", " + // << committed_tokens.at(guid).at(i).second << std::endl; + // std::cout << input.first << ", " << input.second << std::endl; + + assert(committed_tokens.at(guid).at(i).first == input.second); + continue; + } + + if (input.first == 
verifiedTree.back().first && + input.second == verifiedTree.back().second) { + verifiedTree.push_back(output); + new_committed_tokens.push_back(std::make_pair( + input.second, + committed_tokens.at(guid).at(i).second)); // + assert(committed_tokens.at(guid).at(i).first == input.second); + } + } + committed_tokens[guid] = new_committed_tokens; + { + // log_req_mgr.print("========Verified============"); + std::ostringstream oss; + for (auto const &pair : verifiedTree) { + // log_req_mgr.print("(%d, %d)", pair.first, pair.second); + oss << " " << pair.second << ":" << pair.first; + } + log_req_mgr.print("Verified:%s", oss.str().c_str()); + } + { + // log_req_mgr.print("========New Committed============"); + std::ostringstream oss; + for (auto const &pair : committed_tokens.at(guid)) { + // log_req_mgr.print("(%d, %d)", pair.first, pair.second); + oss << " " << pair.second << ":" << pair.first; + } + log_req_mgr.print("New committed:%s", oss.str().c_str()); + } + + return verifiedTree; +} + +std::vector> + RequestManager::traverse_beam_tree(BeamSearchBatchConfig const &old_bc, + int request_index, + int token_start_offset) { + if (verbose) { + std::cout << "[Traverse Beam Tree] request_index: " << request_index + << "\n"; + std::cout << "[Traverse Beam Tree] max_depth: " + << old_bc.beamRequestsInfo[request_index].max_depth << "\n"; + std::cout << "[Traverse Beam Tree] current_depth: " + << old_bc.beamRequestsInfo[request_index].current_depth << "\n"; + std::cout << "[Traverse Beam Tree] beam_width: " + << old_bc.beamRequestsInfo[request_index].beam_size << "\n"; + } + + auto guid = old_bc.requestsInfo[request_index].request_guid; + Request &request = all_requests[guid]; + // std::cout << "request.beam_trees.size(): " << request.beam_trees.size() + // << std::endl; + BeamTree tree = request.beam_trees.at(old_bc.model_id); + // std::cout << "\n\n"; + + // token, index + // todo make this one global for different stages + std::vector> serializedTree; + PreOrder(tree, + old_bc.beamRequestsInfo[request_index].max_depth, + 0, + old_bc.beamRequestsInfo[request_index].beam_size, + 0, + serializedTree, + verbose); + + // print it + if (verbose) { + std::cout << "Print serialized tree: size:" << request_index + << serializedTree.size() << "\n"; + } + for (int k = 0; k < serializedTree.size(); k++) { + serializedTree.at(k).second += token_start_offset; + if (verbose) { + std::cout << "token id: " << serializedTree.at(k).first + << ", depth: " << serializedTree.at(k).second << "\n"; + } + } + + // if (dfs_tree_inputs.find(old_bc.requestsInfo[request_index].request_guid) + // != + // dfs_tree_inputs.end()) { + // dfs_tree_inputs[old_bc.requestsInfo[request_index].request_guid] = + // serializedTree; + // } else { + // dfs_tree_inputs.insert(std::make_pair( + // old_bc.requestsInfo[request_index].request_guid, serializedTree)); + // } + + return serializedTree; + // } +} + +std::vector> + RequestManager::merge_dfs_trees( + std::vector>> + input_trees, + int root_depth, + RequestGuid guid) { + std::vector> merged_tree; + + std::unordered_map> childrens; + std::unordered_map curr_path; + + // convert pair to an integer + auto root = input_trees.at(0).at(0); + int root_id = root.first * 10000 + root.second; + + for (int i = 0; i < input_trees.size(); i++) { + auto tree = input_trees.at(i); + // all trees should have the same root + assert(tree.at(0) == root); + + for (auto const &pair : tree) { + int id = pair.first * 10000 + pair.second; // current node + curr_path[pair.second] = id; // log node in current 
search + + if (childrens.find(id) == childrens.end()) { + // init empty set + childrens[id] = std::set(); + } + + if (pair.second > root_depth) { + int parent_id = curr_path[pair.second - 1]; + childrens[parent_id].insert(id); + } + } + } + + std::stack q; + q.push(root_id); + + while (!q.empty()) { + int curr = q.top(); + q.pop(); + merged_tree.push_back(std::make_pair(curr / 10000, curr % 10000)); + for (int child : childrens[curr]) { + q.push(child); + } + } + + if (verbose) { + for (auto &pair : merged_tree) { + std::cout << pair.first << ", depth: " << pair.second << std::endl; + } + } + + dfs_tree_inputs[guid] = merged_tree; + + return merged_tree; +} + +GenerationResult FFModel::generate(std::string const &text, + int max_seq_length) { + RequestManager *rm = RequestManager::get_request_manager(); + if (rm->get_num_ssms() == 0) { + // No SSMs: perform incremental decoding + return rm->generate_incr_decoding(this, text, max_seq_length); + } else { + // Registered SSMs: perform speculative inference + return rm->generate_spec_infer(this, text, max_seq_length); + } +} + +/*static*/ +GenerationResult RequestManager::generate_incr_decoding(FFModel *llm, + std::string const &text, + int max_seq_length) { + InferenceManager *im = InferenceManager::get_inference_manager(); + RequestGuid guid = register_new_request(text, max_seq_length); + if (guid == 0) { + std::cout + << "=========== Discard request exceed prompt maximum... ===========" + << std::endl; + return GenerationResult(); + } + + int tokens_to_generate = max_seq_length - all_requests[guid].tokens.size(); + std::queue> + batch_pipeline; + { batch_pipeline.push(std::make_pair(last_bcf, last_irf)); } + while (!is_request_completed(guid)) { + if (batch_pipeline.size() >= 4) { + // Block here to avoid launching too many batches + auto const &batch = batch_pipeline.front(); + batch.second.get_void_result(); + } + // deque finished batches + while (batch_pipeline.size() > 1) { + auto const &batch = batch_pipeline.front(); + if (batch.second.is_ready()) { + batch_pipeline.pop(); + } else { + break; + } + } + if (is_request_completed(guid)) { + break; + } + Runtime *runtime = Runtime::get_runtime(); + Context ctx = Runtime::get_context(); + runtime->begin_trace(ctx, 12346 /*trace_id*/); + auto const &next_batch = batch_pipeline.back(); + BatchConfigFuture bcf = + prepare_next_batch(next_batch.first, next_batch.second); + FutureMap fm = im->inference(llm, 0, bcf); + assert(fm.get_future_map_domain().get_volume() == 1); + InferenceResultFuture irf = fm.get_future(0); + batch_pipeline.push(std::make_pair(bcf, irf)); + last_bcf = bcf; + last_irf = irf; + runtime->end_trace(ctx, 12346 /*trace_id*/); + } + GenerationResult gr = get_generation_result(guid); + // assert(gr.output_tokens.size() >= max_seq_length); + return gr; +} + +/*static*/ +GenerationResult RequestManager::generate_spec_infer(FFModel *llm, + std::string const &text, + int max_seq_length) { + InferenceManager *im = InferenceManager::get_inference_manager(); + RequestGuid guid = register_new_request(text, max_seq_length); + if (guid == 0) { + std::cout + << "=========== Discard request exceed prompt maximum... 
===========" + << std::endl; + return GenerationResult(); + } + + std::queue> + batch_pipeline; + batch_pipeline.push(std::make_pair(last_tree_bcf, last_tree_irf)); + while (!is_request_completed(guid)) { + if (batch_pipeline.size() >= 4) { + // Block here to avoid launching too many batches + auto const &batch = batch_pipeline.front(); + batch.second.get_void_result(); + } + // deque finished batches + while (batch_pipeline.size() > 1) { + auto const &batch = batch_pipeline.front(); + if (batch.second.is_ready()) { + batch_pipeline.pop(); + } else { + break; + } + } + auto const &next_batch = batch_pipeline.back(); + BeamSearchBatchConfigFuture beam_bcf = + prepare_next_batch_init(next_batch.first, next_batch.second, 0); + std::vector beam_bcf_vec(get_num_ssms()); + for (size_t ssm_id = 0; ssm_id < get_num_ssms(); ssm_id++) { + beam_bcf_vec[ssm_id] = beam_bcf; + } + // if (is_request_completed(guid)) { + // break; + // } + Runtime *runtime = Runtime::get_runtime(); + Context ctx = Runtime::get_context(); + runtime->begin_trace(ctx, 12345 /*trace_id*/); + + for (size_t i = 0; i < get_num_ssms(); i++) { + for (int depth = 0; depth < BeamSearchBatchConfig::MAX_BEAM_DEPTH; + depth++) { + beam_bcf = beam_bcf_vec[i]; + + FutureMap fm = im->inference(get_model(i), 0, beam_bcf_vec[i]); + assert(fm.get_future_map_domain().get_volume() == 1); + BeamInferenceResultFuture beam_irf = fm.get_future(0); + beam_bcf_vec[i] = prepare_next_batch_beam(beam_bcf_vec[i], beam_irf); + } + } + // Token Tree Verification + { + TreeVerifyBatchConfigFuture tree_bcf = + prepare_next_batch_verify(beam_bcf_vec); + FutureMap fm = im->inference(llm, 0, tree_bcf); + assert(fm.get_future_map_domain().get_volume() == 1); + InferenceResultFuture tree_irf = fm.get_future(0); + batch_pipeline.push(std::make_pair(tree_bcf, tree_irf)); + last_tree_bcf = tree_bcf; + last_tree_irf = tree_irf; + } + runtime->end_trace(ctx, 12345 /*trace_id*/); + } + + GenerationResult gr = get_generation_result(guid); + // assert(gr.output_tokens.size() >= max_seq_length); + return gr; +} + +RequestManager *request_manager_singleton = nullptr; + +/*static*/ +RequestManager *RequestManager::get_request_manager() { + if (request_manager_singleton == nullptr) { + request_manager_singleton = new RequestManager(); + } + return request_manager_singleton; +} + +}; // namespace FlexFlow diff --git a/src/runtime/request_manager.cpp b/src/runtime/request_manager.cpp new file mode 100644 index 0000000000..80554c2add --- /dev/null +++ b/src/runtime/request_manager.cpp @@ -0,0 +1,78 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/request_manager.h" +#include "flexflow/utils/hip_helper.h" +#include + +namespace FlexFlow { + +using namespace Legion; + +void RequestManager::load_tokens_task( + Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(regions.size() == 1); + assert(task->regions.size() == 1); + + BatchConfig const batch_config = *((BatchConfig *)task->args); + BatchConfig::TokenId dram_copy[BatchConfig::MAX_NUM_TOKENS]; + for (int i = 0; i < batch_config.num_tokens; i++) { + dram_copy[i] = batch_config.tokensInfo[i].token_id; + } + TokenId *fb_ptr = helperGetTensorPointerWO( + regions[0], task->regions[0], FID_DATA, ctx, runtime); + Domain domain = runtime->get_index_space_domain( + ctx, task->regions[0].region.get_index_space()); + assert(batch_config.num_tokens <= domain.get_volume()); + hipStream_t stream; + checkCUDA(get_legion_stream(&stream)); + checkCUDA(hipMemcpyAsync(fb_ptr, + dram_copy, + sizeof(TokenId) * batch_config.num_tokens, + hipMemcpyHostToDevice, + stream)); +} + +void RequestManager::load_positions_task( + Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(regions.size() == 1); + assert(task->regions.size() == 1); + BatchConfig const batch_config = *((BatchConfig *)task->args); + int offset = 2; + int *pos_ptr = helperGetTensorPointerWO( + regions[0], task->regions[0], FID_DATA, ctx, runtime); + Domain domain = runtime->get_index_space_domain( + ctx, task->regions[0].region.get_index_space()); + int dram_copy[BatchConfig::MAX_NUM_TOKENS]; + + for (int i = 0; i < batch_config.num_tokens; i++) { + dram_copy[i] = batch_config.tokensInfo[i].abs_depth_in_request + offset; + } + hipStream_t stream; + checkCUDA(get_legion_stream(&stream)); + checkCUDA(hipMemcpyAsync(pos_ptr, + dram_copy, + sizeof(int) * batch_config.num_tokens, + hipMemcpyHostToDevice, + stream)); +} + +}; // namespace FlexFlow diff --git a/src/runtime/request_manager.cu b/src/runtime/request_manager.cu new file mode 100644 index 0000000000..58e996629e --- /dev/null +++ b/src/runtime/request_manager.cu @@ -0,0 +1,92 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/request_manager.h" +#include "flexflow/utils/cuda_helper.h" + +namespace FlexFlow { + +using namespace Legion; + +void RequestManager::load_tokens_task( + Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(regions.size() == 1); + assert(task->regions.size() == 1); + + // BatchConfig const batch_config = *((BatchConfig *)task->args); + BatchConfig const *batch_config = BatchConfig::from_future(task->futures[0]); + BatchConfig::TokenId dram_copy[BatchConfig::MAX_NUM_TOKENS]; + + // Extreme long prompts are not supported, only load up to MAX_NUM_TOKENS as + // prompt + if (batch_config->num_tokens > BatchConfig::MAX_NUM_TOKENS) { + printf("Warning: too many tokens in prompt, only load up to %d tokens\n", + BatchConfig::MAX_NUM_TOKENS); + printf("Got: %d tokens\n", batch_config->num_tokens); + } + // assert(batch_config->num_tokens <= BatchConfig::MAX_NUM_TOKENS); + + for (int i = 0; i < batch_config->num_tokens; i++) { + dram_copy[i] = batch_config->tokensInfo[i].token_id; + } + TokenId *fb_ptr = helperGetTensorPointerWO( + regions[0], task->regions[0], FID_DATA, ctx, runtime); + Domain domain = runtime->get_index_space_domain( + ctx, task->regions[0].region.get_index_space()); + assert(batch_config->num_tokens <= domain.get_volume()); + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + checkCUDA(cudaMemcpyAsync(fb_ptr, + dram_copy, + sizeof(TokenId) * batch_config->num_tokens, + cudaMemcpyHostToDevice, + stream)); +} + +void RequestManager::load_positions_task( + Task const *task, + std::vector const ®ions, + Context ctx, + Runtime *runtime) { + assert(regions.size() == 1); + assert(task->regions.size() == 1); + + // BatchConfig const batch_config = *((BatchConfig *)task->args); + BatchConfig const *batch_config = BatchConfig::from_future(task->futures[0]); + + int const offset = *((int const *)task->args); + int *pos_ptr = helperGetTensorPointerWO( + regions[0], task->regions[0], FID_DATA, ctx, runtime); + Domain domain = runtime->get_index_space_domain( + ctx, task->regions[0].region.get_index_space()); + int dram_copy[BatchConfig::MAX_NUM_TOKENS]; + + for (int i = 0; i < batch_config->num_tokens; i++) { + dram_copy[i] = batch_config->tokensInfo[i].abs_depth_in_request + offset; + } + + cudaStream_t stream; + checkCUDA(get_legion_stream(&stream)); + checkCUDA(cudaMemcpyAsync(pos_ptr, + dram_copy, + sizeof(int) * batch_config->num_tokens, + cudaMemcpyHostToDevice, + stream)); +} + +}; // namespace FlexFlow diff --git a/src/runtime/simulator.cc b/src/runtime/simulator.cc index c363cdd296..d943376416 100644 --- a/src/runtime/simulator.cc +++ b/src/runtime/simulator.cc @@ -14,6 +14,7 @@ */ #include "flexflow/simulator.h" +#include "flexflow/ffconst_utils.h" #include "flexflow/model.h" #include "flexflow/parallel_ops/combine.h" #include "flexflow/parallel_ops/partition.h" @@ -349,25 +350,6 @@ void Simulator::free_all() { offset = 0; } -size_t data_type_size(DataType type) { - switch (type) { - case DT_HALF: - return sizeof(half); - case DT_FLOAT: - return sizeof(float); - case DT_DOUBLE: - return sizeof(double); - case DT_INT32: - return sizeof(int32_t); - case DT_INT64: - return sizeof(int64_t); - case DT_BOOLEAN: - return sizeof(bool); - default: - assert(false); - } -} - void *Simulator::allocate(size_t num_elements, DataType type) { size_t element_size = data_type_size(type); void *ret_ptr = base_ptr + offset; diff --git a/src/runtime/simulator.cpp b/src/runtime/simulator.cpp index f1d076b4c9..e10923cd8d 100644 
--- a/src/runtime/simulator.cpp +++ b/src/runtime/simulator.cpp @@ -83,10 +83,10 @@ Simulator::Simulator(FFModel const *model, hipEventCreate(&start_event); hipEventCreate(&end_event); conv2d_meta = new Conv2DMeta(handler); - linear_meta = new LinearMeta(handler, 4096); + // linear_meta = new LinearMeta(handler, 4096); pool2d_meta = new Pool2DMeta(handler); ele_unary_meta = new ElementUnaryMeta(handler); - ele_binary_meta = new ElementBinaryMeta(handler); + // ele_binary_meta = new ElementBinaryMeta(handler); // embedding_meta = new EmbeddingMeta(handler); // softmax_meta = new SoftmaxMeta(handler); batch_matmul_meta = new BatchMatmulMeta(handler); diff --git a/src/runtime/simulator.cu b/src/runtime/simulator.cu index 8f109d0edb..b44ce1690a 100644 --- a/src/runtime/simulator.cu +++ b/src/runtime/simulator.cu @@ -82,10 +82,10 @@ Simulator::Simulator(FFModel const *model, cudaEventCreate(&start_event); cudaEventCreate(&end_event); conv2d_meta = new Conv2DMeta(handler); - linear_meta = new LinearMeta(handler, 4096); + // linear_meta = new LinearMeta(handler, 4096); pool2d_meta = new Pool2DMeta(handler); ele_unary_meta = new ElementUnaryMeta(handler); - ele_binary_meta = new ElementBinaryMeta(handler); + // ele_binary_meta = new ElementBinaryMeta(handler); // embedding_meta = new EmbeddingMeta(handler); // softmax_meta = new SoftmaxMeta(handler); batch_matmul_meta = new BatchMatmulMeta(handler); @@ -106,7 +106,6 @@ Simulator::~Simulator(void) { delete conv2d_meta; delete pool2d_meta; delete ele_unary_meta; - delete ele_binary_meta; delete batch_matmul_meta; delete concat_meta; delete transpose_meta; diff --git a/src/runtime/substitution.cc b/src/runtime/substitution.cc index f852acaa6b..3a25d99b6f 100644 --- a/src/runtime/substitution.cc +++ b/src/runtime/substitution.cc @@ -26,12 +26,17 @@ #include "flexflow/ops/element_binary.h" #include "flexflow/ops/element_unary.h" #include "flexflow/ops/embedding.h" +#include "flexflow/ops/experts.h" #include "flexflow/ops/flat.h" +#include "flexflow/ops/inc_multihead_self_attention.h" #include "flexflow/ops/linear.h" #include "flexflow/ops/noop.h" #include "flexflow/ops/pool_2d.h" +#include "flexflow/ops/rms_norm.h" #include "flexflow/ops/softmax.h" #include "flexflow/ops/split.h" +#include "flexflow/ops/tree_inc_multihead_self_attention.h" +#include "flexflow/parallel_ops/allreduce.h" #include "flexflow/parallel_ops/combine.h" #include "flexflow/parallel_ops/fused_parallel_op.h" #include "flexflow/parallel_ops/partition.h" @@ -893,8 +898,11 @@ bool GraphXfer::create_new_operator(OpX const *opx, Node &op) { case OP_EW_MUL: case OP_EW_MAX: case OP_EW_MIN: { + ElementBinaryParams params; + params.type = opx->type; + params.inplace_a = false; op = model->get_or_create_node({inputs[0], inputs[1]}, - {opx->type}); + params); break; } case OP_RELU: { @@ -3651,6 +3659,13 @@ bool FFModel::convert_graph_to_operators( new_op = new Aggregate(*this, inputs, aggr->n, aggr->lambda_bal, NULL); break; } + case OP_EXPERTS: { + Experts *exp = (Experts *)node.ptr; + ExpertsParams params = exp->get_params(); + new_op = new Experts( + *this, params, {std::begin(inputs), std::end(inputs)}, true); + break; + } case OP_SPLIT: { Split *split = (Split *)node.ptr; std::vector splits; @@ -3671,8 +3686,13 @@ bool FFModel::convert_graph_to_operators( case OP_EW_MIN: { assert(inList.size() == 2); ElementBinary *eb = (ElementBinary *)node.ptr; - new_op = new ElementBinary( - *this, eb->op_type, inputs[0], inputs[1], eb->inplace_a, NULL); + new_op = new ElementBinary(*this, + 
eb->layer_guid, + eb->op_type, + inputs[0], + inputs[1], + eb->inplace_a, + NULL); break; } case OP_POOL2D: { @@ -3697,6 +3717,25 @@ bool FFModel::convert_graph_to_operators( new_op = new MultiHeadAttention( *this, *attn, inputs[0], inputs[1], inputs[2], true); break; + } + case OP_INC_MULTIHEAD_SELF_ATTENTION: { + assert(inList.size() == 1); + IncMultiHeadSelfAttention *attn = (IncMultiHeadSelfAttention *)node.ptr; + new_op = new IncMultiHeadSelfAttention(*this, *attn, inputs[0], true); + break; + } + case OP_TREE_INC_MULTIHEAD_SELF_ATTENTION: { + assert(inList.size() == 1); + TreeIncMultiHeadSelfAttention *attn = + (TreeIncMultiHeadSelfAttention *)node.ptr; + new_op = + new TreeIncMultiHeadSelfAttention(*this, *attn, inputs[0], true); + break; + } + case OP_RMS_NORM: { + assert(inList.size() == 1); + RMSNorm *rms = (RMSNorm *)node.ptr; + new_op = new RMSNorm(*this, *rms, inputs[0], true); break; } case OP_SOFTMAX: { @@ -3739,6 +3778,12 @@ bool FFModel::convert_graph_to_operators( reduction->reduction_degree); break; } + case OP_ALLREDUCE: { + assert(inList.size() == 1); + AllReduce *allreduce = (AllReduce *)node.ptr; + new_op = new AllReduce(*this, inputs[0], allreduce->allreduce_dim); + break; + } case OP_FUSED_PARALLEL: { assert(inList.size() == 1); FusedParallelOp *fused = (FusedParallelOp *)node.ptr; diff --git a/src/runtime/tree_verify_batch_config.cc b/src/runtime/tree_verify_batch_config.cc new file mode 100644 index 0000000000..78eff184c4 --- /dev/null +++ b/src/runtime/tree_verify_batch_config.cc @@ -0,0 +1,83 @@ +/* Copyright 2023 CMU, Stanford, Facebook, LANL + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include "flexflow/batch_config.h" +#include "legion.h" +#include +#include + +namespace FlexFlow { + +LegionRuntime::Logger::Category log_tree_bc("TreeVerifyBatchConfig"); + +TreeVerifyBatchConfig::TreeVerifyBatchConfig() : BatchConfig() {} + +TreeVerifyBatchConfig::~TreeVerifyBatchConfig() {} + +InferenceMode TreeVerifyBatchConfig::get_mode() const { + return TREE_VERIFY_MODE; +} + +void TreeVerifyBatchConfig::print() const { + std::cout << "@@@@@@@@@@@@@@ TreeVerifyBatchConfig (mode " << get_mode() + << ") @@@@@@@@@@@@@@" << std::endl; + std::cout << "Max number of requests: " << MAX_NUM_REQUESTS << std::endl; + std::cout << "Max number of tokens: " << MAX_NUM_TOKENS << std::endl; + std::cout << "Number of tokens: " << num_tokens << std::endl; + std::cout << "Number of requests: " << num_active_requests() << std::endl; + // std::cout << "Cached results: " << cached_results << std::endl; + + std::cout << "Per-request info:\n"; + for (int i = 0; i < MAX_NUM_REQUESTS; i++) { + if (!request_completed[i]) { + std::cout << " Request " << i << ":\n"; + std::cout << " Token start offset: " + << requestsInfo[i].token_start_offset << std::endl; + std::cout << " Number of tokens in batch: " + << requestsInfo[i].num_tokens_in_batch << std::endl; + std::cout << " GUID: " << requestsInfo[i].request_guid << std::endl; + std::cout << " Max sequence length: " + << requestsInfo[i].max_sequence_length << std::endl; + std::cout << " Request completed: " << request_completed[i] + << std::endl; + } + } + + std::cout << "Per-token info:\n"; + for (int i = 0; i < num_tokens; i++) { + std::cout << " Token " << i << ":\n"; + std::cout << " Absolute depth in request: " + << tokensInfo[i].abs_depth_in_request << std::endl; + std::cout << " Request index: " << tokensInfo[i].request_index + << std::endl; + std::cout << " Token id: " << tokensInfo[i].token_id << std::endl; + } + + std::cout << "Tokens to commit info:\n"; + for (int i = 0; i < num_tokens_to_commit; i++) { + std::cout << " Token " << i << ":\n"; + std::cout << " token_index: " << committed_tokens[i].token_index + << std::endl; + std::cout << " request_index: " << committed_tokens[i].request_index + << std::endl; + std::cout << " token_depth: " << committed_tokens[i].token_depth + << std::endl; + } + + std::cout << "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@" + << std::endl; +} + +}; // namespace FlexFlow diff --git a/tests/.gitignore b/tests/.gitignore new file mode 100644 index 0000000000..f3732d54f4 --- /dev/null +++ b/tests/.gitignore @@ -0,0 +1 @@ +inference/python_test_configs/*.json diff --git a/tests/align/align_create_tensor_torch.py b/tests/align/align_create_tensor_torch.py index 8b835a5276..ca1be143ed 100644 --- a/tests/align/align_create_tensor_torch.py +++ b/tests/align/align_create_tensor_torch.py @@ -2,7 +2,6 @@ import sys import torch - sys.path.append("./align/") from align_utils import gen_tensor, parse_create_tensor_args, create_general_test_tensor_torch, BATCH_SIZE, INPUT_SIZE, SEQ_LENGTH diff --git a/tests/align/align_utils.py b/tests/align/align_utils.py index 34f07a4928..368893c5eb 100644 --- a/tests/align/align_utils.py +++ b/tests/align/align_utils.py @@ -102,7 +102,7 @@ def align_tensors(tensor_alignment_data_iter: Iterable[TensorAlignmentData]): ff_tensor = torch.load(ff_filepath).cpu() torch_tensor = torch.load(torch_filepath).cpu() print(f"Checking {tensor_alignment_data.tensor_name} alignment...") - torch.testing.assert_close(ff_tensor, torch_tensor) + torch.testing.assert_close(ff_tensor, 
torch_tensor, rtol=1e-2, atol=1e-4) def parse_create_tensor_args(): diff --git a/tests/cpp_gpu_tests.sh b/tests/cpp_gpu_tests.sh index 92d3280a1f..29e377e5bc 100755 --- a/tests/cpp_gpu_tests.sh +++ b/tests/cpp_gpu_tests.sh @@ -13,6 +13,9 @@ BATCHSIZE=$((GPUS * 64)) FSIZE=13800 ZSIZE=12192 +GPU_AVAILABLE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l) +if [ $(( GPUS )) -gt $(( GPU_AVAILABLE )) ]; then echo "The test requires $GPUS GPUs, but only $GPU_AVAILABLE are available. Try reducing the number of nodes, or the number of gpus/node." ; exit; fi + remove_mnist() { rm -f train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz train-images-idx3-ubyte train-labels-idx1-ubyte } @@ -48,6 +51,11 @@ if [[ -f "$FF_HOME/build/examples/cpp/AlexNet/alexnet" ]]; then # TODO: fix split tests # "$FF_HOME"/build/examples/cpp/split_test/split_test -ll:gpu "$GPUS" -ll:fsize "$FSIZE" -ll:zsize "$ZSIZE" -b ${BATCHSIZE} --only-data-parallel # "$FF_HOME"/build/examples/cpp/split_test_2/split_test_2 -ll:gpu "$GPUS" -ll:fsize "$FSIZE" -ll:zsize "$ZSIZE" -b ${BATCHSIZE} --only-data-parallel + # Inference examples + # if [ $(( GPU_AVAILABLE )) -lt $(( 4 )) ]; then echo "Skipping LLAMA test because it requires 4 GPUs, but only $GPU_AVAILABLE are available. " ; exit 1; fi + # "$FF_HOME"/build/examples/cpp/inference/LLAMA/LLAMA -ll:gpu "$GPUS" -ll:util 8 -ll:fsize "$FSIZE" -ll:zsize 30000 --only-data-parallel + #"$FF_HOME"/build/examples/cpp/inference/mixture_of_experts/inference_moe -ll:gpu "$GPUS" -ll:util 8 -ll:fsize "$FSIZE" -ll:zsize "$ZSIZE" --only-data-parallel + #"$FF_HOME"/build/examples/cpp/inference/transformers/inference_transformers -ll:gpu "$GPUS" -ll:util 8 -ll:fsize "$FSIZE" -ll:zsize "$ZSIZE" --only-data-parallel else python_packages=$(python -c "from distutils import sysconfig; print(sysconfig.get_python_lib(plat_specific=False,standard_lib=False))") OLD_PATH="$PATH" @@ -76,6 +84,11 @@ else # TODO: fix split tests # split_test -ll:gpu "$GPUS" -ll:fsize "$FSIZE" -ll:zsize "$ZSIZE" -b ${BATCHSIZE} --only-data-parallel # split_test_2 -ll:gpu "$GPUS" -ll:fsize "$FSIZE" -ll:zsize "$ZSIZE" -b ${BATCHSIZE} --only-data-parallel + # Inference examples + # if [ $(( GPU_AVAILABLE )) -lt $(( 4 )) ]; then echo "Skipping LLAMA test because it requires 4 GPUs, but only $GPU_AVAILABLE are available. " ; exit 1; fi + # LLAMA -ll:gpu "$GPUS" -ll:util 8 -ll:fsize "$FSIZE" -ll:zsize 30000 --only-data-parallel + #inference_moe -ll:gpu "$GPUS" -ll:util 8 -ll:fsize "$FSIZE" -ll:zsize "$ZSIZE" --only-data-parallel + #inference_transformers -ll:gpu "$GPUS" -ll:util 8 -ll:fsize "$FSIZE" -ll:zsize "$ZSIZE" --only-data-parallel fi done export PATH="$OLD_PATH" diff --git a/tests/gpt_tokenizer.cpp b/tests/gpt_tokenizer.cpp new file mode 100644 index 0000000000..eb8ea069af --- /dev/null +++ b/tests/gpt_tokenizer.cpp @@ -0,0 +1,80 @@ +/* Copyright 2023 CMU, Facebook, LANL, MIT, NVIDIA, and Stanford (alphabetical) + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include + +#include + +int main(int argc, char *argv[]) { + if (argc != 2 || (strcmp(argv[1], "gpt-2") && strcmp(argv[1], "opt"))) { + fprintf(stderr, "Usage: %s \n", argv[0]); + return 1; + } + tokenizer_mode mode = + strcmp(argv[1], "gpt-2") == 0 ? GPT2_TOKENIZER : OPT_TOKENIZER; + std::string vocab_file = mode == GPT2_TOKENIZER ? "./gpt2_bpe/vocab.bpe" + : "opt_bpe/gpt2-merges.txt"; + std::string merge_file = mode == GPT2_TOKENIZER ? "./gpt2_bpe/encoder.json" + : "opt_bpe/gpt2-vocab.json"; + + GPT_Tokenizer tokenizer(mode, merge_file, vocab_file); + + std::string line; + std::vector lines; + std::ifstream infile("./wikitext-103-raw/wiki.valid.raw"); + if (!infile) { + std::cout << "Error opening input file" << std::endl; + return -1; + } + std::ofstream outfile(mode == GPT2_TOKENIZER + ? "./wikitext-103-raw/wiki.valid.bpe.flexflow.gpt2" + : "./wikitext-103-raw/wiki.valid.bpe.flexflow.opt", + std::ofstream::out); + if (!outfile) { + std::cout << "Error opening output file" << std::endl; + return -1; + } + while (std::getline(infile, line)) { + lines.push_back(line); + } + + std::vector input_ids; + std::vector mask_ids; + for (auto l = lines.begin(); l != lines.end(); ++l) { + std::string stripped_line = tokenizer.strip(*l); + if (stripped_line.length() == 0) { + outfile << *l << std::endl; + } else { + tokenizer.encode( + stripped_line, stripped_line.length(), &input_ids, &mask_ids); + bool first = true; + for (std::size_t i = 0; i < input_ids.size(); ++i) { + if (mask_ids[i]) { + if (!first) { + outfile << " "; + } else { + first = false; + } + outfile << input_ids[i]; + } + } + outfile << std::endl; + std::string decoded_line = tokenizer.decode(input_ids, mask_ids); + assert(decoded_line == stripped_line); + input_ids.clear(); + mask_ids.clear(); + } + } +} diff --git a/tests/gpt_tokenizer_test.sh b/tests/gpt_tokenizer_test.sh new file mode 100755 index 0000000000..de6d018372 --- /dev/null +++ b/tests/gpt_tokenizer_test.sh @@ -0,0 +1,107 @@ +#! 
/usr/bin/env bash +set -x +set -e + +cleanup() { + rm -rf wikitext-103-raw-v1.zip wikitext-103-raw gpt2_bpe opt_bpe gpt_tokenizer pytokenizer.py bpe.py hf_tokenizer.py +} + +# Cd into directory holding this script +cd "${BASH_SOURCE[0]%/*}" + +# Clean up before test (just in case) +cleanup + +# Compile the FlexFlow C++ tokenizer stand-alone +g++ -std=c++11 -I../deps/json/include -I../include -o gpt_tokenizer gpt_tokenizer.cpp ../src/runtime/gpt_tokenizer.cc +chmod +x gpt_tokenizer + +# Download and inflate wikitext dataset +wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip +unzip wikitext-103-raw-v1.zip +rm wikitext-103-raw-v1.zip + +############################################################################################### +##################################### GPT-2 tests ############################################# +############################################################################################### + +# Download GPT-2 BPE vocab and merges files +mkdir -p gpt2_bpe +wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json +wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe + +# Download minGPT bpe tokenizer for comparison +wget -O bpe.py https://raw.githubusercontent.com/karpathy/minGPT/master/mingpt/bpe.py +chmod +x bpe.py + +# Run the FlexFlow C++ tokenizer (standard GPT-2) +./gpt_tokenizer gpt-2 + +# Run the minGPT tokenizer +cat << EOF > pytokenizer.py +#!/usr/bin/env python +from bpe import BPETokenizer + +tokenizer = BPETokenizer() +inp="./wikitext-103-raw/wiki.valid.raw" +outp="./wikitext-103-raw/wiki.valid.bpe.minGPT" +with open(inp, "r") as infile: + with open(outp, "w+") as outfile: + for l in infile.readlines(): + if len(l.strip()) == 0: + outfile.write(l) + else: + out = tokenizer(l.strip()).tolist()[0] + out = [str(x) for x in out] + out = " ".join(out) + outfile.write(out) + outfile.write("\n") +EOF +chmod +x pytokenizer.py +./pytokenizer.py + +# Check that the outputs match +diff ./wikitext-103-raw/wiki.valid.bpe.flexflow.gpt2 ./wikitext-103-raw/wiki.valid.bpe.minGPT + +############################################################################################### +##################################### OPT tests ############################################### +############################################################################################### + +# Download OPT vocab and merge files +mkdir -p opt_bpe +wget -O opt_bpe/gpt2-vocab.json https://raw.githubusercontent.com/facebookresearch/metaseq/main/projects/OPT/assets/gpt2-vocab.json +wget -O opt_bpe/gpt2-merges.txt https://raw.githubusercontent.com/facebookresearch/metaseq/main/projects/OPT/assets/gpt2-merges.txt + +# Run the FlexFlow C++ tokenizer (OPT) +./gpt_tokenizer opt + +# Run the Huggingface tokenizer +pip3 install transformers +cat << EOF > hf_tokenizer.py +#!/usr/bin/env python +from transformers import GPT2Tokenizer +model_id = "facebook/opt-6.7b" +tokenizer = GPT2Tokenizer.from_pretrained(model_id) +inp="./wikitext-103-raw/wiki.valid.raw" +outp="./wikitext-103-raw/wiki.valid.bpe.OPT" +with open(inp, "r") as infile: + with open(outp, "w+") as outfile: + for l in infile.readlines(): + if len(l.strip()) == 0: + outfile.write(l) + else: + input_ids = tokenizer(l.strip(), return_tensors="pt", padding=False).input_ids + out = input_ids.tolist()[0] + out = [str(x) for x in out] + out = " ".join(out) + outfile.write(out) + outfile.write("\n") +EOF +chmod +x hf_tokenizer.py +./hf_tokenizer.py + 
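+# Note: the comparison below is valid because OPT reuses the GPT-2 byte-level
+# BPE (the gpt2-vocab.json / gpt2-merges.txt assets downloaded above), so a
+# stock GPT2Tokenizer loaded from facebook/opt-6.7b serves as the reference
+# tokenizer here.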
+# Check that the outputs match +diff ./wikitext-103-raw/wiki.valid.bpe.flexflow.opt ./wikitext-103-raw/wiki.valid.bpe.OPT + +# Clean up after test +cleanup diff --git a/tests/inference/cpp_inference_tests.sh b/tests/inference/cpp_inference_tests.sh new file mode 100755 index 0000000000..6a108303d6 --- /dev/null +++ b/tests/inference/cpp_inference_tests.sh @@ -0,0 +1,250 @@ +#! /usr/bin/env bash +set -x +set -e + +# Cd into directory holding this script +cd "${BASH_SOURCE[0]%/*}" + +############################################################################################### +############################ Speculative inference tests ###################################### +############################################################################################### + +# LLAMA +../../build/inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion --use-full-precision -llm-model decapoda-research/llama-7b-hf -ssm-model JackFram/llama-160m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/spec_inference_llama.txt -pipeline-parallelism-degree 4 +# LLAMA (half precision) +../../build/inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion -llm-model decapoda-research/llama-7b-hf -ssm-model JackFram/llama-160m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/spec_inference_llama_half.txt -pipeline-parallelism-degree 4 + +# OPT +../../build/inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion --use-full-precision -llm-model facebook/opt-6.7b -ssm-model facebook/opt-125m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/spec_inference_opt.txt -pipeline-parallelism-degree 4 +# OPT (half precision) +../../build/inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion -llm-model facebook/opt-6.7b -ssm-model facebook/opt-125m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/spec_inference_opt_half.txt -pipeline-parallelism-degree 4 + +# Tensor parallelism tests +if [ "$TENSOR_PARALLELISM_TESTS" = "ON" ]; then + # LLAMA + ../../build/inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion --use-full-precision -llm-model decapoda-research/llama-7b-hf -ssm-model JackFram/llama-160m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/spec_inference_llama_tp.txt -pipeline-parallelism-degree 2 -tensor-parallelism-degree 2 + # LLAMA (half precision) + ../../build/inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion -llm-model decapoda-research/llama-7b-hf -ssm-model JackFram/llama-160m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/spec_inference_llama_half_tp.txt -pipeline-parallelism-degree 2 -tensor-parallelism-degree 2 + + # OPT + ../../build/inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion --use-full-precision -llm-model facebook/opt-6.7b -ssm-model facebook/opt-125m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/spec_inference_opt_tp.txt -pipeline-parallelism-degree 2 -tensor-parallelism-degree 2 + # OPT (half precision) + ../../build/inference/spec_infer/spec_infer -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion -llm-model facebook/opt-6.7b -ssm-model facebook/opt-125m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/spec_inference_opt_half_tp.txt -pipeline-parallelism-degree 2 
-tensor-parallelism-degree 2 +fi + +############################################################################################### +############################ Incremental decoding tests ####################################### +############################################################################################### + +# LLAMA (small model) +../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion --use-full-precision -llm-model JackFram/llama-160m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_llama_160M.txt -pipeline-parallelism-degree 4 +# LLAMA (small model, half precision) +../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion -llm-model JackFram/llama-160m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_llama_160M_half.txt -pipeline-parallelism-degree 4 + +# LLAMA (big model) +../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion --use-full-precision -llm-model decapoda-research/llama-7b-hf -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_llama_7B.txt -pipeline-parallelism-degree 4 +# LLAMA (big model, half precision) +../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion -llm-model decapoda-research/llama-7b-hf -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_llama_7B_half.txt -pipeline-parallelism-degree 4 + +# OPT (small model) +../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion --use-full-precision -llm-model facebook/opt-125m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_opt_125M.txt -pipeline-parallelism-degree 4 +# OPT (small model, half precision) +../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion -llm-model facebook/opt-125m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_opt_125M_half.txt -pipeline-parallelism-degree 4 + +# OPT (big model) +../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion --use-full-precision -llm-model facebook/opt-6.7b -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_opt_6B.txt -pipeline-parallelism-degree 4 +# OPT (big model, half precision) +../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion -llm-model facebook/opt-6.7b -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_opt_6B_half.txt -pipeline-parallelism-degree 4 + +# Falcon (full precision) +../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion --use-full-precision -llm-model tiiuae/falcon-7b -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_falcon_7B.txt -pipeline-parallelism-degree 4 +# Falcon (half precision) +../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion -llm-model tiiuae/falcon-7b -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_falcon_7B_half.txt -pipeline-parallelism-degree 4 + +# # StarCoder (full precision) +# ../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 
-ll:zsize 30000 --fusion --use-full-precision -llm-model bigcode/starcoderbase-7b -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_starcoder_7B.txt -pipeline-parallelism-degree 4 +# # StarCoder (half precision) +# ../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion -llm-model bigcode/starcoderbase-7b -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_starcoder_7B_half.txt -pipeline-parallelism-degree 4 + +# Tensor parallelism tests +if [ "$TENSOR_PARALLELISM_TESTS" = "ON" ]; then + # LLAMA (small model) + ../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion --use-full-precision -llm-model JackFram/llama-160m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_llama_160M_tp.txt -pipeline-parallelism-degree 2 -tensor-parallelism-degree 2 + ../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion --use-full-precision -llm-model JackFram/llama-160m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_llama_160M_tp4.txt -pipeline-parallelism-degree 1 -tensor-parallelism-degree 4 + # LLAMA (small model, half precision) + ../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion -llm-model JackFram/llama-160m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_llama_160M_half_tp.txt -pipeline-parallelism-degree 2 -tensor-parallelism-degree 2 + ../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion -llm-model JackFram/llama-160m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_llama_160M_half_tp4.txt -pipeline-parallelism-degree 1 -tensor-parallelism-degree 4 + + # LLAMA (big model) + ../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion --use-full-precision -llm-model decapoda-research/llama-7b-hf -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_llama_7B_tp.txt -pipeline-parallelism-degree 2 -tensor-parallelism-degree 2 + # LLAMA (big model, half precision) + ../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion -llm-model decapoda-research/llama-7b-hf -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_llama_7B_half_tp.txt -pipeline-parallelism-degree 2 -tensor-parallelism-degree 2 + + # OPT (small model) + ../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion --use-full-precision -llm-model facebook/opt-125m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_opt_125M_tp.txt -pipeline-parallelism-degree 2 -tensor-parallelism-degree 2 + ../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion --use-full-precision -llm-model facebook/opt-125m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_opt_125M_tp4.txt -pipeline-parallelism-degree 1 -tensor-parallelism-degree 4 + # OPT (small model, half precision) + ../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion -llm-model facebook/opt-125m -prompt ../../inference/prompt/test.json -output-file 
../../inference/output/incr_decoding_opt_125M_half_tp.txt -pipeline-parallelism-degree 2 -tensor-parallelism-degree 2 + ../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion -llm-model facebook/opt-125m -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_opt_125M_half_tp4.txt -pipeline-parallelism-degree 1 -tensor-parallelism-degree 4 + + # OPT (big model) + ../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion --use-full-precision -llm-model facebook/opt-6.7b -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_opt_6B_tp.txt -pipeline-parallelism-degree 2 -tensor-parallelism-degree 2 + # OPT (big model, half precision) + ../../build/inference/incr_decoding/incr_decoding -ll:gpu 4 -ll:fsize 14000 -ll:zsize 30000 --fusion -llm-model facebook/opt-6.7b -prompt ../../inference/prompt/test.json -output-file ../../inference/output/incr_decoding_opt_6B_half_tp.txt -pipeline-parallelism-degree 2 -tensor-parallelism-degree 2 +fi + +############################################################################################### +############################### Alignment and Speed tests ##################################### +############################################################################################### + +##################################### Helper functions ######################################## +function check_partial_token_match { + local file1="$1" + local file2="$2" + local num_tokens_to_match=30 + + # Read the third line of the first file + third_line=$(sed -n '3p' "$file1") + read -r line1 <<< "$third_line" + tokens1=${line1#*: } + IFS=',' read -ra arr1 <<< "$tokens1" + + # Read the third line of the second file + third_line=$(sed -n '3p' "$file2") + read -r line2 <<< "$third_line" + tokens2=${line2#*: } + IFS=',' read -ra arr2 <<< "$tokens2" + + # Compare the first few integers in the two lists + for ((i = 0; i < num_tokens_to_match; i++)); do + if [[ "${arr1[$i]}" != "${arr2[$i]}" ]]; then + echo "The first $num_tokens_to_match tokens in files $file1 and $file2 are not identical." + exit 1 + fi + done + #echo "The first $num_tokens_to_match integers are identical." +} + +function compare_speed_spec_infer_incr_decoding { + local incrDec_file="$1" + local specInf_file="$2" + + # Read the float numbers from the first line of the files + incrDec=$(sed -n '1 s/end-to-end latency: \(.*\)/\1/p' "$incrDec_file") + specInf=$(sed -n '1 s/end-to-end latency: \(.*\)/\1/p' "$specInf_file") + + if ! command -v bc &> /dev/null; then + echo "bc is not installed. 
Installing..." + sudo apt-get install -y bc + fi + + # Perform the comparison + threshold=$(bc <<< "$specInf * 1.5") + if (( $(echo "$incrDec >= $threshold" | bc -l) )); then + #echo "The decoding steps in $specInf_file are at least 1.5x less than those in $incrDec_file." + : + else + echo "Error: The decoding steps in $specInf_file are not at least 1.5x less than those in $incrDec_file!" + exit 1 + fi +} + +############ Alignment between speculative inference and incremental decoding ################# +# Full precision +diff <(tail -n +3 "../../inference/output/incr_decoding_llama_7B.txt") <(tail -n +3 "../../inference/output/spec_inference_llama.txt") +diff <(tail -n +3 "../../inference/output/incr_decoding_opt_6B.txt") <(tail -n +3 "../../inference/output/spec_inference_opt.txt") +# Half precision +check_partial_token_match "../../inference/output/incr_decoding_llama_7B_half.txt" "../../inference/output/spec_inference_llama_half.txt" +check_partial_token_match "../../inference/output/incr_decoding_opt_6B_half.txt" "../../inference/output/spec_inference_opt_half.txt" + +# Speed test: speculative inference should be at very least 1.5x faster than incremental decoding +# Full precision +#compare_speed_spec_infer_incr_decoding "../../inference/output/incr_decoding_llama_7B.txt" "../../inference/output/spec_inference_llama.txt" +#compare_speed_spec_infer_incr_decoding "../../inference/output/incr_decoding_opt_6B.txt" "../../inference/output/spec_inference_opt.txt" +compare_decoding_steps_spec_infer_incr_decoding "../../inference/output/incr_decoding_llama_7B.txt" "../../inference/output/spec_inference_llama.txt" +compare_decoding_steps_spec_infer_incr_decoding "../../inference/output/incr_decoding_opt_6B.txt" "../../inference/output/spec_inference_opt.txt" +# Half precision +#compare_speed_spec_infer_incr_decoding "../../inference/output/incr_decoding_llama_7B_half.txt" "../../inference/output/spec_inference_llama_half.txt" +#compare_speed_spec_infer_incr_decoding "../../inference/output/incr_decoding_opt_6B_half.txt" "../../inference/output/spec_inference_opt_half.txt" +compare_decoding_steps_spec_infer_incr_decoding "../../inference/output/incr_decoding_llama_7B_half.txt" "../../inference/output/spec_inference_llama_half.txt" +compare_decoding_steps_spec_infer_incr_decoding "../../inference/output/incr_decoding_opt_6B_half.txt" "../../inference/output/spec_inference_opt_half.txt" + +############ Alignment between tensor model parallelism and pipeline parallelism only ################# +if [ "$TENSOR_PARALLELISM_TESTS" = "ON" ]; then + diff <(tail -n +3 "../../inference/output/spec_inference_llama_tp.txt") <(tail -n +3 "../../inference/output/spec_inference_llama.txt") + diff <(tail -n +3 "../../inference/output/spec_inference_opt_tp.txt") <(tail -n +3 "../../inference/output/spec_inference_opt.txt") + check_partial_token_match "../../inference/output/spec_inference_llama_half_tp.txt" "../../inference/output/spec_inference_llama_half.txt" + check_partial_token_match "../../inference/output/spec_inference_opt_half_tp.txt" "../../inference/output/spec_inference_opt_half.txt" + diff <(tail -n +3 "../../inference/output/incr_decoding_llama_160M_tp.txt") <(tail -n +3 "../../inference/output/incr_decoding_llama_160M.txt") + check_partial_token_match "../../inference/output/incr_decoding_llama_160M_half_tp.txt" "../../inference/output/incr_decoding_llama_160M_half.txt" + diff <(tail -n +3 "../../inference/output/incr_decoding_llama_7B_tp.txt") <(tail -n +3 
"../../inference/output/incr_decoding_llama_7B.txt") + check_partial_token_match "../../inference/output/incr_decoding_llama_7B_half_tp.txt" "../../inference/output/incr_decoding_llama_7B_half.txt" + diff <(tail -n +3 "../../inference/output/incr_decoding_opt_125M_tp.txt") <(tail -n +3 "../../inference/output/incr_decoding_opt_125M.txt") + check_partial_token_match "../../inference/output/incr_decoding_opt_125M_half_tp.txt" "../../inference/output/incr_decoding_opt_125M_half.txt" + diff <(tail -n +3 "../../inference/output/incr_decoding_opt_6B_tp.txt") <(tail -n +3 "../../inference/output/incr_decoding_opt_6B.txt") + check_partial_token_match "../../inference/output/incr_decoding_opt_6B_half_tp.txt" "../../inference/output/incr_decoding_opt_6B_half.txt" +fi + +######################### Alignment tests with HuggingFace #################################### + +# LLAMA (small model, full precision) +python3 ./huggingface_inference.py --model-name "JackFram/llama-160m" --tokenizer-model-name "JackFram/llama-160m" --use-full-precision --prompt-file "../../inference/prompt/test.json" --output-file "../../inference/output/huggingface_llama_160M.txt" --gpu + +# LLAMA (small model, half precision) +python3 ./huggingface_inference.py --model-name "JackFram/llama-160m" --tokenizer-model-name "JackFram/llama-160m" --prompt-file "../../inference/prompt/test.json" --output-file "../../inference/output/huggingface_llama_160M_half.txt" --gpu + +# LLAMA (big model, full precision) +python3 ./huggingface_inference.py --model-name "decapoda-research/llama-7b-hf" --tokenizer-model-name "JackFram/llama-160m" --use-full-precision --prompt-file "../../inference/prompt/test.json" --output-file "../../inference/output/huggingface_llama_7B.txt" + +# LLAMA (big model, half precision) +python3 ./huggingface_inference.py --model-name "decapoda-research/llama-7b-hf" --tokenizer-model-name "JackFram/llama-160m" --prompt-file "../../inference/prompt/test.json" --output-file "../../inference/output/huggingface_llama_7B_half.txt" --gpu + +# OPT (small model, full precision) +python3 ./huggingface_inference.py --model-name "facebook/opt-125m" --tokenizer-model-name "facebook/opt-125m" --use-full-precision --prompt-file "../../inference/prompt/test.json" --output-file "../../inference/output/huggingface_opt_125M.txt" --gpu --max-length 128 + +# OPT (small model, half precision) +python3 ./huggingface_inference.py --model-name "facebook/opt-125m" --tokenizer-model-name "facebook/opt-125m" --prompt-file "../../inference/prompt/test.json" --output-file "../../inference/output/huggingface_opt_125M_half.txt" --gpu --max-length 128 + +# OPT (big model, full precision) +#python3 ./huggingface_inference.py --model-name "facebook/opt-6.7b" --tokenizer-model-name "facebook/opt-6.7b" --use-full-precision --prompt-file "../../inference/prompt/test.json" --output-file "../../inference/output/huggingface_opt_6B.txt" --max-length 127 + +# OPT (big model, half precision) +#python3 ./huggingface_inference.py --model-name "facebook/opt-6.7b" --tokenizer-model-name "facebook/opt-6.7b" --prompt-file "../../inference/prompt/test.json" --output-file "../../inference/output/huggingface_opt_6B_half.txt" --gpu --max-length 127 + +diff <(tail -n +2 "../../inference/output/huggingface_llama_160M.txt") <(tail -n +5 "../../inference/output/incr_decoding_llama_160M.txt") +diff <(tail -n +2 "../../inference/output/huggingface_llama_160M_half.txt" | tr -s '[:space:]' '\n' | head -n 20) <(tail -n +5 "../../inference/output/incr_decoding_llama_160M_half.txt" 
| tr -s '[:space:]' '\n' | head -n 20) +diff <(tail -n +2 "../../inference/output/huggingface_llama_7B.txt") <(tail -n +5 "../../inference/output/incr_decoding_llama_7B.txt") +diff <(tail -n +2 "../../inference/output/huggingface_llama_7B_half.txt" | tr -s '[:space:]' '\n' | head -n 20) <(tail -n +5 "../../inference/output/incr_decoding_llama_7B_half.txt" | tr -s '[:space:]' '\n' | head -n 20) + +diff <(tail -n +2 "../../inference/output/huggingface_opt_125M.txt") <(tail -n +5 "../../inference/output/incr_decoding_opt_125M.txt") +diff <(tail -n +2 "../../inference/output/huggingface_opt_125M_half.txt" | tr -s '[:space:]' '\n' | head -n 20) <(tail -n +5 "../../inference/output/incr_decoding_opt_125M_half.txt" | tr -s '[:space:]' '\n' | head -n 20) +#diff <(tail -n +2 "../../inference/output/huggingface_opt_6B.txt") <(tail -n +5 "../../inference/output/incr_decoding_opt_6B.txt") +#diff <(tail -n +2 "../../inference/output/huggingface_opt_6B_half.txt") <(tail -n +5 "../../inference/output/incr_decoding_opt_6B_half.txt") diff --git a/tests/inference/huggingface_inference.py b/tests/inference/huggingface_inference.py new file mode 100644 index 0000000000..788d001dd8 --- /dev/null +++ b/tests/inference/huggingface_inference.py @@ -0,0 +1,67 @@ +import argparse +import json +import os +from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaTokenizer + +def main(): + # Change working dir to folder storing this script + abspath = os.path.abspath(__file__) + dname = os.path.dirname(abspath) + os.chdir(dname) + + # Parse command line arguments + parser = argparse.ArgumentParser() + parser.add_argument("--model-name", type=str, required=True) + parser.add_argument("--tokenizer-model-name", type=str, required=True) + parser.add_argument("--max-length", type=int, default=128) + parser.add_argument("--prompt-file", type=str, required=True) + parser.add_argument("--output-file", type=str, required=True) + parser.add_argument( + "--use-full-precision", action="store_true", help="Use full precision" + ) + parser.add_argument("--gpu", action="store_true", help="Run on GPU") + args = parser.parse_args() + # Check if max-length is greater than 0 + if args.max_length <= 0: + print("Error: max-length must be greater than 0.") + return + # Check if prompt-file exists + if not os.path.isfile(args.prompt_file): + print(f"Error: {args.prompt_file} does not exist.") + return + + # Read prompt-file into a list of strings + with open(args.prompt_file, "r") as f: + try: + prompt_list = json.load(f) + except json.JSONDecodeError: + print(f"Error: Unable to parse {args.prompt_file} as JSON.") + return + + # Set default tensor type depending on argument indicating the float type to use + if not args.use_full_precision: + import torch + + torch.set_default_tensor_type(torch.HalfTensor) + + # Run huggingface model + device = "cuda" if args.gpu else "cpu" + model = AutoModelForCausalLM.from_pretrained(args.model_name).to(device) + if args.tokenizer_model_name == "JackFram/llama-160m": + tokenizer = LlamaTokenizer.from_pretrained("JackFram/llama-160m", use_fast=True) + else: + tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_model_name) + with open(args.output_file, "w") as f: + for i, prompt in enumerate(prompt_list): + batch = tokenizer( + prompt, return_tensors="pt", add_special_tokens=True + ).to(device) + generated = model.generate(batch["input_ids"], max_length=args.max_length) + out = tokenizer.decode(generated[0]) + # Write output to file + out_str = out if i == (len(prompt_list) - 1) else out 
+ "\n" + f.write(out_str) + + +if __name__ == "__main__": + main() diff --git a/tests/inference/python_inference_tests.sh b/tests/inference/python_inference_tests.sh new file mode 100755 index 0000000000..800c0ad043 --- /dev/null +++ b/tests/inference/python_inference_tests.sh @@ -0,0 +1,191 @@ +#! /usr/bin/env bash +set -x +set -e + +# Cd into directory holding this script +cd "${BASH_SOURCE[0]%/*}" + +# Generate test configs +python python_test_configs/generate_configs.py + +# Run all tests +# Loop through .json files in the ./python_test_configs dir +for file in ./python_test_configs/*.json; do + # Check filename prefix + if [[ $file == *"incr_dec"* ]]; then + script="../../inference/python/incr_decoding.py" + elif [[ $file == *"spec_infer"* ]]; then + script="../../inference/python/spec_infer.py" + fi + # Run script + python "$script" -config-file "$file" +done + + +############################################################################################### +############################### Alignment and Speed tests ##################################### +############################################################################################### + +##################################### Helper functions ######################################## +function check_partial_token_match { + local file1="$1" + local file2="$2" + local num_tokens_to_match=30 + + # Read the third line of the first file + third_line=$(sed -n '3p' "$file1") + read -r line1 <<< "$third_line" + tokens1=${line1#*: } + IFS=',' read -ra arr1 <<< "$tokens1" + + # Read the third line of the second file + third_line=$(sed -n '3p' "$file2") + read -r line2 <<< "$third_line" + tokens2=${line2#*: } + IFS=',' read -ra arr2 <<< "$tokens2" + + # Compare the first few integers in the two lists + for ((i = 0; i < num_tokens_to_match; i++)); do + if [[ "${arr1[$i]}" != "${arr2[$i]}" ]]; then + echo "The first $num_tokens_to_match tokens in files $file1 and $file2 are not identical." + exit 1 + fi + done + #echo "The first $num_tokens_to_match integers are identical." +} + +function compare_speed_spec_infer_incr_decoding { + local incrDec_file="$1" + local specInf_file="$2" + + # Read the float numbers from the first line of the files + incrDec=$(sed -n '1 s/end-to-end latency: \(.*\)/\1/p' "$incrDec_file") + specInf=$(sed -n '1 s/end-to-end latency: \(.*\)/\1/p' "$specInf_file") + + if ! command -v bc &> /dev/null; then + echo "bc is not installed. Installing..." + sudo apt-get install -y bc + fi + + # Perform the comparison + threshold=$(bc <<< "$specInf * 1.5") + if (( $(echo "$incrDec >= $threshold" | bc -l) )); then + #echo "The latency in $specInf_file is at least 1.5x smaller than the latency from $incrDec_file." + : + else + echo "Error: The latency in $specInf_file is not at least 1.5x smaller than the latency in $incrDec_file!" + exit 1 + fi +} + +function compare_decoding_steps_spec_infer_incr_decoding { + local incrDec_file="$1" + local specInf_file="$2" + + # Read the number of decoding steps from the second line of the files + second_line=$(sed -n '2p' "$incrDec_file") + read -r line <<< "$second_line" + incrDec=${line#*: } + second_line=$(sed -n '2p' "$specInf_file") + read -r line <<< "$second_line" + specInf=${line#*: } + + if ! command -v bc &> /dev/null; then + echo "bc is not installed. Installing..." 
+ sudo apt-get install -y bc + fi + + # Perform the comparison + threshold=$(bc <<< "$specInf * 1.5") + if (( $(echo "$incrDec >= $threshold" | bc -l) )); then + #echo "The decoding steps in $specInf_file are at least 1.5x less than those in $incrDec_file." + : + else + echo "Error: The decoding steps in $specInf_file are not at least 1.5x less than those in $incrDec_file!" + exit 1 + fi +} + +############ Alignment between speculative inference and incremental decoding ################# +# Full precision +diff <(tail -n +3 "../../inference/output/incr_dec-python-llama-7b-hf-full_prec-1_tp_4_pp.txt") <(tail -n +3 "../../inference/output/spec_infer-python-llama-7b-hf-full_prec-1_tp_4_pp.txt") +diff <(tail -n +3 "../../inference/output/incr_dec-python-opt-6.7b-full_prec-1_tp_4_pp.txt") <(tail -n +3 "../../inference/output/spec_infer-python-opt-6.7b-full_prec-1_tp_4_pp.txt") +# Half precision +check_partial_token_match "../../inference/output/incr_dec-python-llama-7b-hf-half_prec-1_tp_4_pp.txt" "../../inference/output/spec_infer-python-llama-7b-hf-half_prec-1_tp_4_pp.txt" +check_partial_token_match "../../inference/output/incr_dec-python-opt-6.7b-half_prec-1_tp_4_pp.txt" "../../inference/output/spec_infer-python-opt-6.7b-half_prec-1_tp_4_pp.txt" + +# Speed test: speculative inference should be at very least 1.5x faster than incremental decoding +# Full precision +compare_decoding_steps_spec_infer_incr_decoding "../../inference/output/incr_dec-python-llama-7b-hf-full_prec-1_tp_4_pp.txt" "../../inference/output/spec_infer-python-llama-7b-hf-full_prec-1_tp_4_pp.txt" +compare_decoding_steps_spec_infer_incr_decoding "../../inference/output/incr_dec-python-opt-6.7b-full_prec-1_tp_4_pp.txt" "../../inference/output/spec_infer-python-opt-6.7b-full_prec-1_tp_4_pp.txt" +# Half precision +compare_decoding_steps_spec_infer_incr_decoding "../../inference/output/incr_dec-python-llama-7b-hf-half_prec-1_tp_4_pp.txt" "../../inference/output/spec_infer-python-llama-7b-hf-half_prec-1_tp_4_pp.txt" +compare_decoding_steps_spec_infer_incr_decoding "../../inference/output/incr_dec-python-opt-6.7b-half_prec-1_tp_4_pp.txt" "../../inference/output/spec_infer-python-opt-6.7b-half_prec-1_tp_4_pp.txt" + +############ Alignment between tensor model parallelism and pipeline parallelism only ################# +## Specinfer +# LLAMA +diff <(tail -n +3 "../../inference/output/spec_infer-python-llama-7b-hf-full_prec-2_tp_2_pp.txt") <(tail -n +3 "../../inference/output/spec_infer-python-llama-7b-hf-full_prec-1_tp_4_pp.txt") +check_partial_token_match "../../inference/output/spec_infer-python-llama-7b-hf-half_prec-2_tp_2_pp.txt" "../../inference/output/spec_infer-python-llama-7b-hf-half_prec-1_tp_4_pp.txt" +# OPT +diff <(tail -n +3 "../../inference/output/spec_infer-python-opt-6.7b-full_prec-2_tp_2_pp.txt") <(tail -n +3 "../../inference/output/spec_infer-python-opt-6.7b-full_prec-1_tp_4_pp.txt") +check_partial_token_match "../../inference/output/spec_infer-python-opt-6.7b-half_prec-2_tp_2_pp.txt" "../../inference/output/spec_infer-python-opt-6.7b-half_prec-1_tp_4_pp.txt" + +## Incremental decoding +# Small LLAMA +diff <(tail -n +3 "../../inference/output/incr_dec-python-llama-160m-full_prec-2_tp_2_pp.txt") <(tail -n +3 "../../inference/output/incr_dec-python-llama-160m-full_prec-1_tp_4_pp.txt") +check_partial_token_match "../../inference/output/incr_dec-python-llama-160m-half_prec-2_tp_2_pp.txt" "../../inference/output/incr_dec-python-llama-160m-half_prec-1_tp_4_pp.txt" +diff <(tail -n +3 
"../../inference/output/incr_dec-python-llama-160m-full_prec-4_tp_1_pp.txt") <(tail -n +3 "../../inference/output/incr_dec-python-llama-160m-full_prec-1_tp_4_pp.txt") +check_partial_token_match "../../inference/output/incr_dec-python-llama-160m-half_prec-4_tp_1_pp.txt" "../../inference/output/incr_dec-python-llama-160m-half_prec-1_tp_4_pp.txt" +# Big LLAMA +diff <(tail -n +3 "../../inference/output/incr_dec-python-llama-7b-hf-full_prec-2_tp_2_pp.txt") <(tail -n +3 "../../inference/output/incr_dec-python-llama-7b-hf-full_prec-1_tp_4_pp.txt") +check_partial_token_match "../../inference/output/incr_dec-python-llama-7b-hf-half_prec-2_tp_2_pp.txt" "../../inference/output/incr_dec-python-llama-7b-hf-half_prec-1_tp_4_pp.txt" +#diff <(tail -n +3 "../../inference/output/incr_dec-python-llama-7b-hf-full_prec-4_tp_1_pp.txt") <(tail -n +3 "../../inference/output/incr_dec-python-llama-7b-hf-full_prec-1_tp_4_pp.txt") +#check_partial_token_match "../../inference/output/incr_dec-python-llama-7b-hf-half_prec-4_tp_1_pp.txt" "../../inference/output/incr_dec-python-llama-7b-hf-half_prec-1_tp_4_pp.txt" +# Small OPT +diff <(tail -n +3 "../../inference/output/incr_dec-python-opt-125m-full_prec-2_tp_2_pp.txt") <(tail -n +3 "../../inference/output/incr_dec-python-opt-125m-full_prec-1_tp_4_pp.txt") +check_partial_token_match "../../inference/output/incr_dec-python-opt-125m-half_prec-2_tp_2_pp.txt" "../../inference/output/incr_dec-python-opt-125m-half_prec-1_tp_4_pp.txt" +diff <(tail -n +3 "../../inference/output/incr_dec-python-opt-125m-full_prec-4_tp_1_pp.txt") <(tail -n +3 "../../inference/output/incr_dec-python-opt-125m-full_prec-1_tp_4_pp.txt") +check_partial_token_match "../../inference/output/incr_dec-python-opt-125m-half_prec-4_tp_1_pp.txt" "../../inference/output/incr_dec-python-opt-125m-half_prec-1_tp_4_pp.txt" +# Big OPT +diff <(tail -n +3 "../../inference/output/incr_dec-python-opt-6.7b-full_prec-2_tp_2_pp.txt") <(tail -n +3 "../../inference/output/incr_dec-python-opt-6.7b-full_prec-1_tp_4_pp.txt") +check_partial_token_match "../../inference/output/incr_dec-python-opt-6.7b-half_prec-2_tp_2_pp.txt" "../../inference/output/incr_dec-python-opt-6.7b-half_prec-1_tp_4_pp.txt" +#diff <(tail -n +3 "../../inference/output/incr_dec-python-opt-6.7b-full_prec-4_tp_1_pp.txt") <(tail -n +3 "../../inference/output/incr_dec-python-opt-6.7b-full_prec-1_tp_4_pp.txt") +#check_partial_token_match "../../inference/output/incr_dec-python-opt-6.7b-half_prec-4_tp_1_pp.txt" "../../inference/output/incr_dec-python-opt-6.7b-half_prec-1_tp_4_pp.txt" + + +######################### Alignment tests with HuggingFace #################################### + +# LLAMA (small model, full precision) +python3 ./huggingface_inference.py --model-name "JackFram/llama-160m" --tokenizer-model-name "JackFram/llama-160m" --use-full-precision --prompt-file "../../inference/prompt/test.json" --output-file "../../inference/output/huggingface_llama_160M.txt" --gpu + +# LLAMA (small model, half precision) +python3 ./huggingface_inference.py --model-name "JackFram/llama-160m" --tokenizer-model-name "JackFram/llama-160m" --prompt-file "../../inference/prompt/test.json" --output-file "../../inference/output/huggingface_llama_160M_half.txt" --gpu + +# LLAMA (big model, full precision) +python3 ./huggingface_inference.py --model-name "decapoda-research/llama-7b-hf" --tokenizer-model-name "JackFram/llama-160m" --use-full-precision --prompt-file "../../inference/prompt/test.json" --output-file "../../inference/output/huggingface_llama_7B.txt" + +# LLAMA (big 
model, half precision) +python3 ./huggingface_inference.py --model-name "decapoda-research/llama-7b-hf" --tokenizer-model-name "JackFram/llama-160m" --prompt-file "../../inference/prompt/test.json" --output-file "../../inference/output/huggingface_llama_7B_half.txt" --gpu + +# OPT (small model, full precision) +python3 ./huggingface_inference.py --model-name "facebook/opt-125m" --tokenizer-model-name "facebook/opt-125m" --use-full-precision --prompt-file "../../inference/prompt/test.json" --output-file "../../inference/output/huggingface_opt_125M.txt" --gpu --max-length 128 + +# OPT (small model, half precision) +python3 ./huggingface_inference.py --model-name "facebook/opt-125m" --tokenizer-model-name "facebook/opt-125m" --prompt-file "../../inference/prompt/test.json" --output-file "../../inference/output/huggingface_opt_125M_half.txt" --gpu --max-length 128 + +# OPT (big model, full precision) +#python3 ./huggingface_inference.py --model-name "facebook/opt-6.7b" --tokenizer-model-name "facebook/opt-6.7b" --use-full-precision --prompt-file "../../inference/prompt/test.json" --output-file "../../inference/output/huggingface_opt_6B.txt" --max-length 127 + +# OPT (big model, half precision) +#python3 ./huggingface_inference.py --model-name "facebook/opt-6.7b" --tokenizer-model-name "facebook/opt-6.7b" --prompt-file "../../inference/prompt/test.json" --output-file "../../inference/output/huggingface_opt_6B_half.txt" --gpu --max-length 127 + +diff <(tail -n +2 "../../inference/output/huggingface_llama_160M.txt") <(tail -n +5 "../../inference/output/incr_dec-python-llama-160m-full_prec-1_tp_4_pp.txt") +diff <(tail -n +2 "../../inference/output/huggingface_llama_160M_half.txt" | tr -s '[:space:]' '\n' | head -n 20) <(tail -n +5 "../../inference/output/incr_dec-python-llama-160m-half_prec-1_tp_4_pp.txt" | tr -s '[:space:]' '\n' | head -n 20) +diff <(tail -n +2 "../../inference/output/huggingface_llama_7B.txt") <(tail -n +5 "../../inference/output/incr_dec-python-llama-7b-hf-full_prec-1_tp_4_pp.txt") +diff <(tail -n +2 "../../inference/output/huggingface_llama_7B_half.txt" | tr -s '[:space:]' '\n' | head -n 20) <(tail -n +5 "../../inference/output/incr_dec-python-llama-7b-hf-half_prec-1_tp_4_pp.txt" | tr -s '[:space:]' '\n' | head -n 20) + +diff <(tail -n +2 "../../inference/output/huggingface_opt_125M.txt") <(tail -n +5 "../../inference/output/incr_dec-python-opt-125m-full_prec-1_tp_4_pp.txt") +diff <(tail -n +2 "../../inference/output/huggingface_opt_125M_half.txt" | tr -s '[:space:]' '\n' | head -n 20) <(tail -n +5 "../../inference/output/incr_dec-python-opt-125m-half_prec-1_tp_4_pp.txt" | tr -s '[:space:]' '\n' | head -n 20) +#diff <(tail -n +2 "../../inference/output/huggingface_opt_6B.txt") <(tail -n +5 "../../inference/output/incr_dec-python-opt-6.7b-full_prec-1_tp_4_pp.txt") +#diff <(tail -n +2 "../../inference/output/huggingface_opt_6B_half.txt") <(tail -n +5 "../../inference/output/incr_dec-python-opt-6.7b-half_prec-1_tp_4_pp.txt") diff --git a/tests/inference/python_test_configs/generate_configs.py b/tests/inference/python_test_configs/generate_configs.py new file mode 100644 index 0000000000..9c4c37b2e7 --- /dev/null +++ b/tests/inference/python_test_configs/generate_configs.py @@ -0,0 +1,124 @@ +#!/usr/bin/env python +import os, json + +# Base configs dictionaries +ff_init_configs = { + # required parameters + "num_gpus": 4, + "memory_per_gpu": 14000, + "zero_copy_memory_per_node": 30000, + # optional parameters + "num_cpus": 4, + "legion_utility_processors": 4, + 
"data_parallelism_degree": 1, + "tensor_parallelism_degree": 1, + "pipeline_parallelism_degree": 4, + "offload": False, + "offload_reserve_space_size": 1024**2, + "use_4bit_quantization": False, + "use_8bit_quantization": False, + "profiling": False, + "fusion": True, +} +llm_configs = { + # required parameters + "llm_model": "tiiuae/falcon-7b", + # optional parameters + "cache_path": "", + "refresh_cache": False, + "full_precision": True, + "prompt": "", + "output_file": "", +} +ssm_configs = { + "ssms": [ + { + # required ssm parameter + "ssm_model": "JackFram/llama-160m", + # optional ssm parameters + "cache_path": "", + "refresh_cache": False, + "full_precision": False, + }, + ] +} +# Merge dictionaries +ff_init_configs.update(llm_configs) + +# Test parameters to fill in +llama_models = ["decapoda-research/llama-7b-hf", "JackFram/llama-160m"] +opt_models = ["facebook/opt-6.7b", "facebook/opt-125m"] +falcon_models = ["tiiuae/falcon-7b",] +# starcoder_models = ["bigcode/starcoderbase-7b",] +parallelism_settings = [(1,4), (2,2), (4,1)] + +# The paths below should be with respect to the folder from which the tests are launched (FF_HOME/tests/inference) +prompt_file = "../../inference/prompt/test.json" +output_folder = "../../inference/output" + +# Change working dir to folder storing this script +abspath = os.path.abspath(__file__) +dname = os.path.dirname(abspath) +os.chdir(dname) + + +# Generate incremental decoding configs +all_models = llama_models + opt_models + falcon_models +for model_name in all_models: + for full_precision in (True, False): + for parallelism_degrees in parallelism_settings: + + tp, pp = parallelism_degrees + + # Tensor parallelism not supported by small Falcon model atm + if tp > 1 and ("falcon" in model_name or "starcoder" in model_name): + continue + # skip tp=4 for big models + if tp > 2 and ("7b" in model_name or "6.7b" in model_name): + continue + + _, after_slash = model_name.rsplit("/", maxsplit=1) + filename = "incr_dec-" + "python-" + after_slash + ("-full_prec-" if full_precision else "-half_prec-") + f"{tp}_tp_{pp}_pp" + test_configs_file = "./" + filename + ".json" + output_file = os.path.join(output_folder, filename+".txt") + + ff_init_configs["tensor_parallelism_degree"] = tp + ff_init_configs["pipeline_parallelism_degree"] = pp + ff_init_configs["llm_model"] = model_name + ff_init_configs["full_precision"] = full_precision + ff_init_configs["output_file"] = output_file + ff_init_configs["prompt"] = prompt_file + + with open(test_configs_file, "w+") as outfile: + json.dump(ff_init_configs, outfile, indent=4) + +# Generate speculative inference configs +model_pairs = [llama_models, opt_models] +for model_pair in model_pairs: + for full_precision in (True, False): + for parallelism_degrees in parallelism_settings: + big_model, small_model = model_pair + tp, pp = parallelism_degrees + + # Skip fully tp tests + if tp > 2: + continue + + _, after_slash = big_model.rsplit("/", maxsplit=1) + filename = "spec_infer-" + "python-" + after_slash + ("-full_prec-" if full_precision else "-half_prec-") + f"{tp}_tp_{pp}_pp" + test_configs_file = "./" + filename + ".json" + output_file = os.path.join(output_folder, filename+".txt") + + ff_init_configs["tensor_parallelism_degree"] = tp + ff_init_configs["pipeline_parallelism_degree"] = pp + ff_init_configs["llm_model"] = big_model + ff_init_configs["full_precision"] = full_precision + ff_init_configs["output_file"] = output_file + ff_init_configs["prompt"] = prompt_file + + ssm_configs["ssms"][0]["ssm_model"] = 
small_model + ssm_configs["ssms"][0]["full_precision"] = full_precision + ff_init_configs.update(ssm_configs) + + with open(test_configs_file, "w+") as outfile: + json.dump(ff_init_configs, outfile, indent=4) diff --git a/tests/inference_tests.sh b/tests/inference_tests.sh new file mode 100755 index 0000000000..b1d45853e2 --- /dev/null +++ b/tests/inference_tests.sh @@ -0,0 +1,43 @@ +#! /usr/bin/env bash +set -x +set -e + +cleanup() { + rm -rf ../inference/prompt ../inference/output +} + +# Cd into directory holding this script +cd "${BASH_SOURCE[0]%/*}" + +# Enable Python tests (on by default) +PYTHON_INFERENCE_TESTS=${PYTHON_INFERENCE_TESTS:-ON} +# Enable C++ tests, (off by default) +CPP_INFERENCE_TESTS=${CPP_INFERENCE_TESTS:-OFF} +# Enable model parallelism tests in C++, if desired +TENSOR_PARALLELISM_TESTS=${TENSOR_PARALLELISM_TESTS:-OFF} + +# Clean up before test (just in case) +cleanup + +# Make sure supported version of protobuf is installed +pip3 install protobuf==3.20.3 + +# Download the weights in both half and full precision +python3 ../inference/utils/download_hf_model.py "decapoda-research/llama-7b-hf" "JackFram/llama-160m" "facebook/opt-6.7b" "facebook/opt-125m" "tiiuae/falcon-7b" + +# Create test prompt file +mkdir -p ../inference/prompt +echo '["Give three tips for staying healthy."]' > ../inference/prompt/test.json + +# Create output folder +mkdir -p ../inference/output + +if [[ "$PYTHON_INFERENCE_TESTS" == "ON" ]]; then + echo "Running Python inference tests..." + ./inference/python_inference_tests.sh +fi +if [[ "$CPP_INFERENCE_TESTS" == "ON" ]]; then + echo "Running C++ inference tests..." + ./inference/cpp_inference_tests.sh +fi + diff --git a/triton/src/model.cc b/triton/src/model.cc index a61b207bdd..6d5da30bea 100644 --- a/triton/src/model.cc +++ b/triton/src/model.cc @@ -22,20 +22,22 @@ using namespace Legion; -namespace triton { namespace backend { namespace legion { - -TRITONSERVER_Error* -LegionModelState::Create( - TRITONBACKEND_Model* triton_model, const std::string& name, - uint64_t version, LegionTritonRuntime* runtime, LegionModelState** state) -{ +namespace triton { +namespace backend { +namespace legion { + +TRITONSERVER_Error *LegionModelState::Create(TRITONBACKEND_Model *triton_model, + std::string const &name, + uint64_t version, + LegionTritonRuntime *runtime, + LegionModelState **state) { std::unique_ptr lstate; try { lstate.reset(new LegionModelState(triton_model, runtime, name, version)); - } - catch (const BackendModelException& ex) { + } catch (BackendModelException const &ex) { RETURN_ERROR_IF_TRUE( - ex.err_ == nullptr, TRITONSERVER_ERROR_INTERNAL, + ex.err_ == nullptr, + TRITONSERVER_ERROR_INTERNAL, std::string("unexpected nullptr in BackendModelException")); RETURN_IF_ERROR(ex.err_); } @@ -45,15 +47,15 @@ LegionModelState::Create( // Auto-complete the configuration if requested... 
bool auto_complete_config = false; - RETURN_IF_ERROR(TRITONBACKEND_ModelAutoCompleteConfig( - triton_model, &auto_complete_config)); + RETURN_IF_ERROR(TRITONBACKEND_ModelAutoCompleteConfig(triton_model, + &auto_complete_config)); if (auto_complete_config) { RETURN_IF_ERROR(lstate->AutoCompleteConfig()); triton::common::TritonJson::WriteBuffer json_buffer; lstate->ModelConfig().Write(&json_buffer); - TRITONSERVER_Message* message; + TRITONSERVER_Message *message; RETURN_IF_ERROR(TRITONSERVER_MessageNewFromSerializedJson( &message, json_buffer.Base(), json_buffer.Size())); RETURN_IF_ERROR(TRITONBACKEND_ModelSetConfig( @@ -62,21 +64,21 @@ LegionModelState::Create( RETURN_IF_ERROR(lstate->ValidateModelConfig()); *state = lstate.release(); runtime->RecordModel(*state); - return nullptr; // success + return nullptr; // success } -LegionModelState::~LegionModelState(void) -{ +LegionModelState::~LegionModelState(void) { FreeLayers(); - for (auto& input : inputs_) delete input.second; - if (strategy_) + for (auto &input : inputs_) { + delete input.second; + } + if (strategy_) { delete strategy_; + } runtime_->RemoveModel(this); } -TRITONSERVER_Error* -LegionModelState::LoadModel() -{ +TRITONSERVER_Error *LegionModelState::LoadModel() { // TODO: load files based on the default / cc file name that may be set // in model config auto model_path = JoinPath({RepositoryPath(), std::to_string(Version())}); @@ -87,12 +89,16 @@ LegionModelState::LoadModel() // load the ONNX model description as a list of layers // with tensor dependences between then and put them in layers_ RETURN_IF_ERROR(OnnxParser::LoadModel( - [this]( - Realm::Processor::Kind kind) -> const std::vector& { + [this](Realm::Processor::Kind kind) + -> std::vector const & { return runtime_->FindLocalProcessors(kind); }, - this, strategy_, JoinPath({model_path, "model.onnx"}), &inputs_, - &outputs_, &layers_)); + this, + strategy_, + JoinPath({model_path, "model.onnx"}), + &inputs_, + &outputs_, + &layers_)); RETURN_IF_ERROR(SetOutputInfos()); // Should have the same number of layers in both cases @@ -107,18 +113,14 @@ LegionModelState::LoadModel() return nullptr; } -unsigned -LegionModelState::ReserveInstance(void) -{ +unsigned LegionModelState::ReserveInstance(void) { AutoLock lock(lock_); unsigned result = instances_.size(); instances_.resize(result + 1, nullptr); return result; } -void -LegionModelState::RecordInstance(LegionModelInstance* instance) -{ +void LegionModelState::RecordInstance(LegionModelInstance *instance) { assert(instance->model_state_ == this); AutoLock lock(lock_, false /*exclusive*/); assert(instance->index_ < instances_.size()); @@ -126,27 +128,30 @@ LegionModelState::RecordInstance(LegionModelInstance* instance) instances_[instance->index_] = instance; } -void -LegionModelState::initialize( - LegionModelInstance* instance, const unsigned instance_index, - Runtime* runtime, Context ctx, MapperID mapper) -{ +void LegionModelState::initialize(LegionModelInstance *instance, + unsigned const instance_index, + Runtime *runtime, + Context ctx, + MapperID mapper) { // First create logical regions for all the input tensors - for (auto& input : inputs_) instance->create_tensor_region(input.second); + for (auto &input : inputs_) { + instance->create_tensor_region(input.second); + } - for (auto layer : layers_) + for (auto layer : layers_) { layer->initialize(instance, instance_index, runtime, ctx, mapper); + } } -void -LegionModelState::forward( - LegionModelInstance* instance, const unsigned instance_index, - Runtime* runtime, 
Context ctx, MapperID mapper, - const std::vector& inputs, - const std::vector& outputs, - std::vector& compute_input_end_ns, - std::vector& compute_output_start_ns) -{ +void LegionModelState::forward(LegionModelInstance *instance, + unsigned const instance_index, + Runtime *runtime, + Context ctx, + MapperID mapper, + std::vector const &inputs, + std::vector const &outputs, + std::vector &compute_input_end_ns, + std::vector &compute_output_start_ns) { assert(inputs.size() == inputs_.size()); assert(outputs.size() == outputs_.size()); // Attach the external memory allocations to the logical regions for the @@ -154,34 +159,40 @@ LegionModelState::forward( const std::vector fields(1, FID_DATA); std::vector input_regions(inputs.size()); for (unsigned idx = 0; idx < inputs.size(); idx++) { - const InputTensor& input = inputs[idx]; + InputTensor const &input = inputs[idx]; assert(input.buffers_.size() == 1); assert(input.buffer_locations_.size() == 1); assert(input.buffer_memories_.size() == 1); assert(input.strides_.size() == inputs_[idx].second->bounds.size()); LogicalRegion region = inputs_[idx].second->region[instance_index]; - AttachLauncher launcher( - LEGION_EXTERNAL_INSTANCE, region, region, false /*restricted*/, - false /*mapped*/); - launcher.attach_array_soa( - const_cast(input.buffers_[0]), false /*not column major*/, - fields, input.buffer_memories_[0]); + AttachLauncher launcher(LEGION_EXTERNAL_INSTANCE, + region, + region, + false /*restricted*/, + false /*mapped*/); + launcher.attach_array_soa(const_cast(input.buffers_[0]), + false /*not column major*/, + fields, + input.buffer_memories_[0]); input_regions[idx] = runtime->attach_external_resource(ctx, launcher); } std::vector output_regions(outputs.size()); for (unsigned idx = 0; idx < outputs.size(); idx++) { - const OutputTensor& output = outputs[idx]; + OutputTensor const &output = outputs[idx]; assert(output.buffers_.size() == 1); assert(output.buffer_locations_.size() == 1); assert(output.buffer_memories_.size() == 1); assert(output.strides_.size() == outputs_[idx].second->bounds.size()); LogicalRegion region = outputs_[idx].second->region[instance_index]; - AttachLauncher launcher( - LEGION_EXTERNAL_INSTANCE, region, region, false /*restricted*/, - false /*mapped*/); - launcher.attach_array_soa( - output.buffers_[0], false /*not column major*/, fields, - output.buffer_memories_[0]); + AttachLauncher launcher(LEGION_EXTERNAL_INSTANCE, + region, + region, + false /*restricted*/, + false /*mapped*/); + launcher.attach_array_soa(output.buffers_[0], + false /*not column major*/, + fields, + output.buffer_memories_[0]); output_regions[idx] = runtime->attach_external_resource(ctx, launcher); } // Execution fence for timing operation @@ -191,45 +202,50 @@ LegionModelState::forward( // We can trace the execution of this model since it should be the same runtime->begin_trace(ctx, 0 /*only ever have one trace*/); - for (auto layer : layers_) + for (auto layer : layers_) { layer->forward(instance, instance_index, runtime, ctx, mapper); + } runtime->end_trace(ctx, 0 /*only ever have one trace*/); // Execution fence for timing operation runtime->issue_execution_fence(ctx); Future stop = runtime->issue_timing_measurement(ctx, timing_launcher); // Detach the external memory allocations - for (unsigned idx = 0; idx < input_regions.size(); idx++) + for (unsigned idx = 0; idx < input_regions.size(); idx++) { runtime->detach_external_resource(ctx, input_regions[idx], false /*flush*/); - for (unsigned idx = 0; idx < output_regions.size(); 
idx++)
+  }
+  for (unsigned idx = 0; idx < output_regions.size(); idx++) {
     runtime->detach_external_resource(ctx, output_regions[idx], true /*flush*/);
+  }
   const uint64_t start_time = start.get_result();
-  for (unsigned idx = 0; idx < compute_input_end_ns.size(); idx++)
+  for (unsigned idx = 0; idx < compute_input_end_ns.size(); idx++) {
     compute_input_end_ns[idx] = start_time;
+  }
   const uint64_t stop_time = stop.get_result();
-  for (unsigned idx = 0; idx < compute_output_start_ns.size(); idx++)
+  for (unsigned idx = 0; idx < compute_output_start_ns.size(); idx++) {
     compute_output_start_ns[idx] = stop_time;
+  }
   // Wait for everything to be done before we return
   Future done = runtime->issue_execution_fence(ctx);
   done.wait();
 }

-void
-LegionModelState::finalize(
-    LegionModelInstance* instance, const unsigned instance_index,
-    Runtime* runtime, Context ctx, MapperID mapper)
-{
-  for (auto layer : layers_)
+void LegionModelState::finalize(LegionModelInstance *instance,
+                                unsigned const instance_index,
+                                Runtime *runtime,
+                                Context ctx,
+                                MapperID mapper) {
+  for (auto layer : layers_) {
     layer->finalize(instance, instance_index, runtime, ctx, mapper);
+  }
 }

-LegionModelInstance*
-LegionModelState::FindInstance(
-    unsigned instance_index, bool external, bool need_lock)
-{
+LegionModelInstance *LegionModelState::FindInstance(unsigned instance_index,
+                                                    bool external,
+                                                    bool need_lock) {
   if (need_lock) {
     if (external) {
       AutoLock lock(lock_, false /*exclusive*/);
@@ -243,23 +259,17 @@ LegionModelState::FindInstance(
   return instances_[instance_index];
 }

-const PartitionStrategy*
-LegionModelState::GetStrategy(void) const
-{
+PartitionStrategy const *LegionModelState::GetStrategy(void) const {
   assert(strategy_ != nullptr);
   return strategy_;
 }

-TRITONSERVER_Error*
-LegionModelState::AutoCompleteConfig()
-{
+TRITONSERVER_Error *LegionModelState::AutoCompleteConfig() {
   // FIXME: Check with the FFModel
-  return nullptr;  // success
+  return nullptr; // success
 }

-TRITONSERVER_Error*
-LegionModelState::ValidateModelConfig()
-{
+TRITONSERVER_Error *LegionModelState::ValidateModelConfig() {
   // Constraints that apply to models in general
   {
     triton::common::TritonJson::Value igs;
@@ -295,8 +305,8 @@ LegionModelState::ValidateModelConfig()

   {
     // Build a map from name to tensors of the model for easy lookup
-    std::map<std::string, Tensor*> tensors;
-    for (const auto& io : inputs_) {
+    std::map<std::string, Tensor *> tensors;
+    for (auto const &io : inputs_) {
       tensors.emplace(io.first, io.second);
     }

@@ -306,10 +316,10 @@ LegionModelState::ValidateModelConfig()

     if (ios.ArraySize() != tensors.size()) {
       return TRITONSERVER_ErrorNew(
           TRITONSERVER_ERROR_INVALID_ARG,
-          (std::string(
-               "configuration for model '" + Name() + "' specifies " +
-               std::to_string(ios.ArraySize()) + " inputs, the model has " +
-               std::to_string(tensors.size()))
+          (std::string("configuration for model '" + Name() + "' specifies " +
+                       std::to_string(ios.ArraySize()) +
+                       " inputs, the model has " +
+                       std::to_string(tensors.size()))
               .c_str()));
     }
@@ -322,10 +332,11 @@ LegionModelState::ValidateModelConfig()
       // Check datatypes
       std::string io_dtype;
       RETURN_IF_ERROR(io.MemberAsString("data_type", &io_dtype));
-      RETURN_ERROR_IF_TRUE(
-          (io_dtype == "TYPE_STRING"), TRITONSERVER_ERROR_INVALID_ARG,
-          std::string("unsupported datatype '") + io_dtype + "' for tensor '" +
-              io_name + "' for model '" + Name() + "'");
+      RETURN_ERROR_IF_TRUE((io_dtype == "TYPE_STRING"),
+                           TRITONSERVER_ERROR_INVALID_ARG,
+                           std::string("unsupported datatype '") + io_dtype +
+                               "' for tensor '" + io_name + "' for model '" +
+                               Name() + "'");
       // If a reshape is provided for the input then use that when
       // validating that the model matches what is expected.
       std::vector<int64_t> dims;
@@ -335,11 +346,12 @@ LegionModelState::ValidateModelConfig()
       } else {
         RETURN_IF_ERROR(ParseShape(io, "dims", &dims));
       }
-      for (const auto dim : dims) {
+      for (auto const dim : dims) {
         RETURN_ERROR_IF_TRUE(
-            (dim == WILDCARD_DIM), TRITONSERVER_ERROR_INVALID_ARG,
-            std::string(
-                "dynamic tensor is not supported for model '" + Name() + "'"));
+            (dim == WILDCARD_DIM),
+            TRITONSERVER_ERROR_INVALID_ARG,
+            std::string("dynamic tensor is not supported for model '" + Name() +
+                        "'"));
       }

       // Check the properties against the corresponding tensor
@@ -347,28 +359,26 @@ LegionModelState::ValidateModelConfig()
       if (it == tensors.end()) {
         return TRITONSERVER_ErrorNew(
             TRITONSERVER_ERROR_INVALID_ARG,
-            (std::string(
-                 "configuration for model '" + Name() + "' specifies tensor '" +
-                 io_name + "' which is not found in the model")
+            (std::string("configuration for model '" + Name() +
+                         "' specifies tensor '" + io_name +
+                         "' which is not found in the model")
                 .c_str()));
       }
-      const auto& tensor = it->second;
+      auto const &tensor = it->second;
       if (ToDataType(ModelConfigDataTypeToTritonServerDataType(io_dtype)) !=
           tensor->type) {
         return TRITONSERVER_ErrorNew(
             TRITONSERVER_ERROR_INVALID_ARG,
-            (std::string(
-                 "configuration for model '" + Name() + "' specifies tensor '" +
-                 io_name + "' with type '" + io_dtype +
-                 "', the tensor in the model has type '" +
-                 DataTypeString(tensor->type) + "'")
+            (std::string("configuration for model '" + Name() +
+                         "' specifies tensor '" + io_name + "' with type '" +
+                         io_dtype + "', the tensor in the model has type '" +
+                         DataTypeString(tensor->type) + "'")
                 .c_str()));
       } else if (tensor->type == DT_NONE) {
         return TRITONSERVER_ErrorNew(
             TRITONSERVER_ERROR_INVALID_ARG,
-            (std::string(
-                 "tensor '" + io_name + "' in the model '" + Name() +
-                 "' has unknown type")
+            (std::string("tensor '" + io_name + "' in the model '" + Name() +
+                         "' has unknown type")
                 .c_str()));
       }
       if (max_batch_size_ != 0) {
@@ -376,17 +386,17 @@ LegionModelState::ValidateModelConfig()
       }
       // put tensor's bound in int64_t to utilize backend common utilities
       std::vector<int64_t> tensor_bounds;
-      for (const auto bound : tensor->bounds) {
+      for (auto const bound : tensor->bounds) {
         tensor_bounds.emplace_back(bound);
       }
       if (dims != tensor_bounds) {
         return TRITONSERVER_ErrorNew(
             TRITONSERVER_ERROR_INVALID_ARG,
-            (std::string(
-                 "configuration for model '" + Name() + "' specifies tensor '" +
-                 io_name + "' with full shape " + ShapeToString(dims) +
-                 ", the tensor in the model has shape " +
-                 ShapeToString(tensor_bounds))
+            (std::string("configuration for model '" + Name() +
+                         "' specifies tensor '" + io_name +
+                         "' with full shape " + ShapeToString(dims) +
+                         ", the tensor in the model has shape " +
+                         ShapeToString(tensor_bounds))
                 .c_str()));
       }
     }
@@ -395,8 +405,8 @@ LegionModelState::ValidateModelConfig()
   // Outputs
   {
     // Build a map from name to tensors of the model for easy lookup
-    std::map<std::string, Tensor*> tensors;
-    for (const auto& io : outputs_) {
+    std::map<std::string, Tensor *> tensors;
+    for (auto const &io : outputs_) {
       tensors.emplace(io.first, io.second);
     }

@@ -407,10 +417,10 @@ LegionModelState::ValidateModelConfig()

     if (ios.ArraySize() > tensors.size()) {
       return TRITONSERVER_ErrorNew(
           TRITONSERVER_ERROR_INVALID_ARG,
-          (std::string(
-               "configuration for model '" + Name() + "' specifies " +
-               std::to_string(ios.ArraySize()) + " outputs, the model has " +
-               std::to_string(tensors.size()))
+          (std::string("configuration for model '" + Name() + "' specifies " +
+                       std::to_string(ios.ArraySize()) +
+                       " outputs, the model has " +
+                       std::to_string(tensors.size()))
              .c_str()));
     }
@@ -422,10 +432,11 @@ LegionModelState::ValidateModelConfig()
       // Check datatypes
       std::string io_dtype;
      RETURN_IF_ERROR(io.MemberAsString("data_type", &io_dtype));
-      RETURN_ERROR_IF_TRUE(
-          (io_dtype == "TYPE_STRING"), TRITONSERVER_ERROR_INVALID_ARG,
-          std::string("unsupported datatype '") + io_dtype + "' for tensor '" +
-              io_name + "' for model '" + Name() + "'");
+      RETURN_ERROR_IF_TRUE((io_dtype == "TYPE_STRING"),
+                           TRITONSERVER_ERROR_INVALID_ARG,
+                           std::string("unsupported datatype '") + io_dtype +
+                               "' for tensor '" + io_name + "' for model '" +
+                               Name() + "'");
       // If a reshape is provided for the input then use that when
       // validating that the model matches what is expected.
       std::vector<int64_t> dims;
@@ -435,11 +446,12 @@ LegionModelState::ValidateModelConfig()
       } else {
         RETURN_IF_ERROR(ParseShape(io, "dims", &dims));
       }
-      for (const auto dim : dims) {
+      for (auto const dim : dims) {
         RETURN_ERROR_IF_TRUE(
-            (dim == WILDCARD_DIM), TRITONSERVER_ERROR_INVALID_ARG,
-            std::string(
-                "dynamic tensor is not supported for model '" + Name() + "'"));
+            (dim == WILDCARD_DIM),
+            TRITONSERVER_ERROR_INVALID_ARG,
+            std::string("dynamic tensor is not supported for model '" + Name() +
+                        "'"));
       }

       // Check the properties against the corresponding tensor
@@ -447,28 +459,26 @@ LegionModelState::ValidateModelConfig()
       if (it == tensors.end()) {
         return TRITONSERVER_ErrorNew(
             TRITONSERVER_ERROR_INVALID_ARG,
-            (std::string(
-                 "configuration for model '" + Name() + "' specifies tensor '" +
-                 io_name + "' which is not found in the model")
+            (std::string("configuration for model '" + Name() +
+                         "' specifies tensor '" + io_name +
+                         "' which is not found in the model")
                 .c_str()));
       }
-      const auto& tensor = it->second;
+      auto const &tensor = it->second;
       if (ToDataType(ModelConfigDataTypeToTritonServerDataType(io_dtype)) !=
           tensor->type) {
         return TRITONSERVER_ErrorNew(
             TRITONSERVER_ERROR_INVALID_ARG,
-            (std::string(
-                 "configuration for model '" + Name() + "' specifies tensor '" +
-                 io_name + "' with type '" + io_dtype +
-                 "', the tensor in the model has type '" +
-                 DataTypeString(tensor->type) + "'")
+            (std::string("configuration for model '" + Name() +
+                         "' specifies tensor '" + io_name + "' with type '" +
+                         io_dtype + "', the tensor in the model has type '" +
+                         DataTypeString(tensor->type) + "'")
                 .c_str()));
       } else if (tensor->type == DT_NONE) {
         return TRITONSERVER_ErrorNew(
             TRITONSERVER_ERROR_INVALID_ARG,
-            (std::string(
-                 "tensor '" + io_name + "' in the model '" + Name() +
-                 "' has unknown type")
+            (std::string("tensor '" + io_name + "' in the model '" + Name() +
+                         "' has unknown type")
                 .c_str()));
       }
       if (max_batch_size_ != 0) {
@@ -476,80 +486,78 @@ LegionModelState::ValidateModelConfig()
       }
       // put tensor's bound in int64_t to utilize backend common utilities
       std::vector<int64_t> tensor_bounds;
-      for (const auto bound : tensor->bounds) {
+      for (auto const bound : tensor->bounds) {
         tensor_bounds.emplace_back(bound);
       }
       if (dims != tensor_bounds) {
         return TRITONSERVER_ErrorNew(
             TRITONSERVER_ERROR_INVALID_ARG,
-            (std::string(
-                 "configuration for model '" + Name() + "' specifies tensor '" +
-                 io_name + "' with full shape " + ShapeToString(dims) +
-                 ", the tensor in the model has shape " +
-                 ShapeToString(tensor_bounds))
+            (std::string("configuration for model '" + Name() +
+                         "' specifies tensor '" + io_name +
+                         "' with full shape " + ShapeToString(dims) +
+                         ", the tensor in the model has shape " +
+                         ShapeToString(tensor_bounds))
                 .c_str()));
       }
     }
   }

-  return nullptr;  // success
+  return nullptr; // success
 }

-TRITONSERVER_Error*
-LegionModelState::SetOutputInfos()
-{
-  for (const auto& output : outputs_) {
+TRITONSERVER_Error *LegionModelState::SetOutputInfos() {
+  for (auto const &output : outputs_) {
     std::vector<int64_t> tensor_bounds;
-    for (const auto bound : output.second->bounds) {
+    for (auto const bound : output.second->bounds) {
       tensor_bounds.emplace_back(bound);
     }
     auto triton_dtype = ToTritonDataType(output.second->type);
     output_infos_.emplace_back(output.first, triton_dtype, tensor_bounds);
   }
-  return nullptr;  // success
+  return nullptr; // success
 }

-void
-LegionModelState::LoadLayers(void) const
-{
+void LegionModelState::LoadLayers(void) const {
   std::vector<Realm::Event> loaded_events;
   for (unsigned idx1 = 0; idx1 < layers_.size(); idx1++) {
-    Operator* op = layers_[idx1];
-    const LayerStrategy* config = strategy_->layers[idx1];
+    Operator *op = layers_[idx1];
+    LayerStrategy const *config = strategy_->layers[idx1];
     for (unsigned idx2 = 0; idx2 < config->nProcs; idx2++) {
       Realm::Processor proc = config->local_processors[idx2];
       loaded_events.push_back(runtime_->LoadLayer(proc, op));
     }
   }
   const Realm::Event wait_on = Realm::Event::merge_events(loaded_events);
-  if (wait_on.exists() && !wait_on.has_triggered())
+  if (wait_on.exists() && !wait_on.has_triggered()) {
     wait_on.external_wait();
+  }
 }

-void
-LegionModelState::FuseLayers(void)
-{
+void LegionModelState::FuseLayers(void) {
   // FIXME: add support for layer fusion
 }

-void
-LegionModelState::FreeLayers(void) const
-{
+void LegionModelState::FreeLayers(void) const {
   std::vector<Realm::Event> freed_events;
   for (unsigned idx1 = 0; idx1 < layers_.size(); idx1++) {
-    Operator* op = layers_[idx1];
-    const LayerStrategy* config = strategy_->layers[idx1];
+    Operator *op = layers_[idx1];
+    LayerStrategy const *config = strategy_->layers[idx1];
     for (unsigned idx2 = 0; idx2 < config->nProcs; idx2++) {
       Realm::Processor proc = config->local_processors[idx2];
       freed_events.push_back(runtime_->FreeLayer(proc, op));
     }
   }
   const Realm::Event wait_on = Realm::Event::merge_events(freed_events);
-  if (wait_on.exists() && !wait_on.has_triggered())
+  if (wait_on.exists() && !wait_on.has_triggered()) {
     wait_on.external_wait();
+  }
   // Delete layers back to front
-  for (std::vector<Operator*>::const_reverse_iterator it = layers_.rbegin();
-       it != layers_.rend(); it++)
+  for (std::vector<Operator *>::const_reverse_iterator it = layers_.rbegin();
+       it != layers_.rend();
+       it++) {
     delete (*it);
+  }
 }
-}}}  // namespace triton::backend::legion
+} // namespace legion
+} // namespace backend
+} // namespace triton
diff --git a/triton/src/types.h b/triton/src/types.h
index a034d5f685..b964f3455c 100644
--- a/triton/src/types.h
+++ b/triton/src/types.h
@@ -151,6 +151,7 @@ enum OperatorType {
   OP_PRELU, // https://github.com/onnx/onnx/blob/master/docs/Operators.md#PRelu
   OP_GELU,
   OP_MULTIHEAD_ATTENTION,
+  OP_INC_MULTIHEAD_SELF_ATTENTION,
   OP_FUSED, // Fused operator type for internal fusion optimizations
   // Parallel Ops
   OP_REPARTITION,