Bruce-Lee-LY

Pinned

  1. decoding_attention Public

    Decoding Attention is optimized for multi-head attention (MHA) in the decoding stage of LLM inference, using CUDA cores.

    C++, 28 stars, 1 fork

  2. flash_attention_inference Public

    Performance of the C++ interfaces of Flash Attention and Flash Attention v2 in large language model (LLM) inference scenarios.

    C++, 32 stars, 3 forks

  3. cuda_hgemm Public

    Several optimization methods for half-precision general matrix multiplication (HGEMM) on Tensor Cores, using the WMMA API and MMA PTX instructions.

    Cuda, 323 stars, 68 forks

  4. cuda_hook Public

    Hooks CUDA-related dynamic libraries using automated code generation tools.

    C, 142 stars, 38 forks

  5. cuda_hgemv Public

    Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.

    Cuda, 52 stars, 4 forks

  6. cutlass_gemm Public

    Multiple GEMM operators built with CUTLASS to support LLM inference.

    C++, 14 stars, 2 forks