Decoding Attention is specially optimized for multi-head attention (MHA), using CUDA cores for the decoding stage of LLM inference.
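To illustrate what a decode-stage attention kernel computes, here is a minimal CUDA sketch: a single new query token attends over the cached keys and values of one head per thread block. All names, memory layouts, and launch parameters are assumptions for readability; this is not the library's optimized implementation.

```cuda
// Minimal single-query (decode-step) attention sketch.
// One thread block per head; scores staged in dynamic shared memory.
#include <cuda_runtime.h>
#include <math.h>

__global__ void decode_mha_kernel(
    const float* __restrict__ q,   // [num_heads, head_dim]          query of the new token
    const float* __restrict__ k,   // [seq_len, num_heads, head_dim] key cache
    const float* __restrict__ v,   // [seq_len, num_heads, head_dim] value cache
    float* __restrict__ out,       // [num_heads, head_dim]
    int seq_len, int num_heads, int head_dim, float scale)
{
    extern __shared__ float scores[];   // seq_len floats per block
    const int h   = blockIdx.x;         // head handled by this block
    const int tid = threadIdx.x;

    // 1) q·k for every cached position, strided across threads.
    for (int t = tid; t < seq_len; t += blockDim.x) {
        float dot = 0.f;
        for (int d = 0; d < head_dim; ++d)
            dot += q[h * head_dim + d] * k[(t * num_heads + h) * head_dim + d];
        scores[t] = dot * scale;
    }
    __syncthreads();

    // 2) Softmax over the scores (done serially by thread 0 for clarity).
    if (tid == 0) {
        float m = scores[0];
        for (int t = 1; t < seq_len; ++t) m = fmaxf(m, scores[t]);
        float sum = 0.f;
        for (int t = 0; t < seq_len; ++t) { scores[t] = expf(scores[t] - m); sum += scores[t]; }
        for (int t = 0; t < seq_len; ++t) scores[t] /= sum;
    }
    __syncthreads();

    // 3) Weighted sum of values, one output dimension per thread (strided).
    for (int d = tid; d < head_dim; d += blockDim.x) {
        float acc = 0.f;
        for (int t = 0; t < seq_len; ++t)
            acc += scores[t] * v[(t * num_heads + h) * head_dim + d];
        out[h * head_dim + d] = acc;
    }
}

// Hypothetical launch:
// decode_mha_kernel<<<num_heads, 128, seq_len * sizeof(float)>>>(
//     q, k, v, out, seq_len, num_heads, head_dim, 1.f / sqrtf((float)head_dim));
```

Because the decode step has a query length of 1, the workload is memory-bound over the KV cache rather than compute-bound, which is why CUDA-core kernels like the sketch above can be competitive at this stage.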
gpu cuda inference nvidia mha multi-head-attention llm large-language-model flash-attention cuda-core decoding-attention flashinfer
Updated Nov 5, 2024 - C++