📖 A curated list of Awesome LLM/VLM inference papers with code, covering FlashAttention, PagedAttention, parallelism, and more. 🎉🎉
Updated Dec 22, 2024
Shush deploys a WhisperV3 model with Flash Attention v2 on Modal and sends requests to it from a NextJS app.
Triton implementation of FlashAttention2 with support for custom masks.
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
Uses the WhisperS2T and CTranslate2 libraries to batch-transcribe multiple files.
Flash Attention implementation with multiple backend support and sharding. This module provides a flexible implementation of Flash Attention with support for different backends (GPU, TPU, CPU) and platforms (Triton, Pallas, JAX).
Poplar implementation of FlashAttention for IPU
Toy Flash Attention implementation in torch
Transcribe audio in minutes with OpenAI's WhisperV3 and Flash Attention v2 + Transformers without relying on third-party providers and APIs. Host it yourself or try it out.
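Several of the entries above implement FlashAttention. For orientation, here is a minimal pure-Python sketch of the blockwise "online softmax" idea at its core: keys and values are visited one block at a time, and a running maximum and running sum are rescaled so the full score row is never stored. The function names `naive_attention` and `tiled_attention` and the `block_size` parameter are hypothetical, chosen for illustration. This is not code from any repository listed here, and it leaves out what a real GPU kernel adds (query tiling in on-chip memory, masking, the backward pass).

```python
import math


def naive_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v, materializing each full score row."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, block_size=2):
    """Same result, but keys/values are consumed in blocks with an online softmax."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        running_max = float("-inf")  # largest score seen so far
        running_sum = 0.0            # softmax denominator relative to running_max
        acc = [0.0] * dv             # unnormalized weighted sum of value rows
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new reference maximum.
            correction = math.exp(running_max - new_max)
            running_sum *= correction
            acc = [a * correction for a in acc]
            for s, vj in zip(scores, v_blk):
                w = math.exp(s - new_max)
                running_sum += w
                for c in range(dv):
                    acc[c] += w * vj[c]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Both functions take plain lists of rows (`q`, `k`, `v`) and should agree to floating-point tolerance for any positive `block_size`. The tiled version only ever holds one block of scores per query, which is the property that lets fused kernels avoid writing the full attention matrix to memory.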