FP8 KV + Disagg unit test (#3218)
Summary:
Pull Request resolved: #3218

X-link: facebookresearch/FBGEMM#315

Adds the FP8 KV cache to the disagg test for mp2.
Changes include switching the test to the 7B Llama model. The small model has a D_H of 64, which does not work with the dequantization kernel (the issue will be investigated in a separate diff).

TODO:
Add FP8 KV cache + paged KV to the test.

Reviewed By: jianyuh

Differential Revision: D62772678

fbshipit-source-id: 775f572e2c345354844e24d80e2481284ac6f1a3
ayaIbrah authored and facebook-github-bot committed Oct 3, 2024
1 parent 788cd2a commit f4710c1
Showing 1 changed file with 1 addition and 0 deletions: fbgemm_gpu/experimental/gen_ai/src/kv_cache/kv_cache.cu
@@ -1437,6 +1437,7 @@ __global__ void dequantize_fp8_cache_kernel(
   auto MAX_T = cache_K.size(1);
   auto D_H = cache_K_dq.size(3);
   auto D_H_q = cache_K.size(3);
+  // TODO: support D_H < 128 for small model used in testing.
   CUDA_KERNEL_ASSERT(D_H == 128);
 
   auto b = blockIdx.x;
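For context on the assertion above, here is a minimal, hypothetical sketch of why a fixed head dimension can get baked into a dequantization kernel: if one warp handles one head with 4 elements per lane, 32 lanes * 4 = 128 is hard-wired into the indexing, and a 64-dim head would need a different lane/element mapping. The kernel name, qparam layout, and affine decode below are illustrative assumptions, not the actual FBGEMM implementation.

#include <cuda_runtime.h>
#include <cstdint>

// Illustrative only: one warp dequantizes one head, 4 elements per lane.
// This fixed packing assumes D_H == 32 * kElemsPerLane == 128, which is
// the kind of constraint CUDA_KERNEL_ASSERT(D_H == 128) enforces.
__global__ void dequantize_head_sketch(
    const uint8_t* q,   // [num_heads, D_H] quantized bytes (placeholder fp8)
    const float* scale, // [num_heads] per-head scale (assumed qparam layout)
    const float* shift, // [num_heads] per-head shift (assumed qparam layout)
    float* out,         // [num_heads, D_H] dequantized output
    int D_H) {
  constexpr int kElemsPerLane = 4;
  int head = blockIdx.x;
  int lane = threadIdx.x; // launched with 32 threads per block (one warp)

  int base = head * D_H + lane * kElemsPerLane;
#pragma unroll
  for (int i = 0; i < kElemsPerLane; ++i) {
    // Placeholder decode: affine dequant of the raw byte. A real fp8
    // (e4m3/e5m2) path would use hardware conversion instead.
    out[base + i] =
        static_cast<float>(q[base + i]) * scale[head] + shift[head];
  }
}

// Launch sketch: one warp per head; assumes D_H == 128 exactly.
// dequantize_head_sketch<<<num_heads, 32>>>(q, scale, shift, out, 128);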
