MoE BMM FP8 rowwise
Summary:
Enable MoE BMM FP8 rowwise:
- MoE BMM FP8 rowwise achieves **up to 4.5x (2.1x on average) speedup over BF16 BMM**
- End-to-end with MoE 16b x 16, FP8 with BMM achieves a **1.2x speedup over BF16**
- Integrated into the E2E flow and verified that generations match BF16 output
- More results are in the [data sheet](https://docs.google.com/spreadsheets/d/1OLdz4MlzWS9pdgTBq4Jjy0-9_nPn-NmdrMolY0jZOXE/edit?gid=0#gid=0)
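The idea behind FP8 rowwise batched matmul can be sketched in plain NumPy: each row of the left operand and each row of the (transposed) right operand gets its own scale so that the largest magnitude maps to the FP8 e4m3 dynamic range, the low-precision product is accumulated in higher precision, and the per-row scales are applied as an outer product on the output. This is a hypothetical illustration of the rowwise-scaling math only, not the CUTLASS kernel added in this commit; the rounding to actual e4m3 values is approximated by clipping.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in float8 e4m3

def rowwise_quantize(x, axis):
    # One scale per row: the row's absolute max maps to FP8_E4M3_MAX.
    amax = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = amax / FP8_E4M3_MAX
    # Approximate the FP8 cast by clipping; a real kernel rounds to e4m3.
    xq = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return xq, scale

def bmm_fp8_rowwise(a, b):
    # a: (B, M, K), b: (B, N, K) -> out: (B, M, N), scales applied rowwise.
    aq, a_scale = rowwise_quantize(a, axis=2)  # scale per (batch, row of a)
    bq, b_scale = rowwise_quantize(b, axis=2)  # scale per (batch, row of b)
    acc = np.einsum("bmk,bnk->bmn", aq, bq)    # low-precision multiply, fp32 accumulate
    # Outer product of the two per-row scale vectors rescales the output.
    return acc * a_scale * np.swapaxes(b_scale, 1, 2)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8, 16)).astype(np.float32)
b = rng.standard_normal((4, 8, 16)).astype(np.float32)
out = bmm_fp8_rowwise(a, b)
ref = np.einsum("bmk,bnk->bmn", a, b)
```

Because each scale is chosen per row rather than per tensor, outlier rows do not force the whole operand into a coarse quantization grid, which is what makes rowwise scaling accurate enough to match BF16 generations.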


Differential Revision: D63681109
jiawenliu64 authored and facebook-github-bot committed Oct 2, 2024
1 parent c24a72d commit e3d6b1e
Showing 5 changed files with 560 additions and 1 deletion.
1 change: 1 addition & 0 deletions fbgemm_gpu/experimental/gen_ai/CMakeLists.txt
```diff
@@ -43,6 +43,7 @@ else()
     src/quantize/cutlass_extensions/f8f8bf16_blockwise.cu
     src/quantize/cutlass_extensions/f8f8bf16_cublas.cu
     src/quantize/cutlass_extensions/f8f8bf16_rowwise.cu
+    src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched.cu
     src/quantize/cutlass_extensions/f8f8bf16_tensorwise.cu
     src/quantize/cutlass_extensions/i8i8bf16.cu
     src/quantize/cutlass_extensions/f8i4bf16_rowwise.cu
```
