Awesome-LLM-KV-Cache: A curated list of 📙Awesome LLM KV Cache Papers with Codes. This repository is maintained for personal use: learning from and categorizing the fast-growing body of KV-cache-related papers!
- 📖Trending Inference Topics🔥🔥🔥
- 📖KV Cache Compression🔥🔥
- 📖KV Cache Merge🔥🔥
- 📖Budget Allocation🔥
- 📖Cross-Layer KV Cache Utilization🔥
- 📖KV Cache Quantization🔥
- 📖Low-Rank KV Cache Decomposition🔥
- 📖Observation🔥🔥
- 📖Evaluation🔥
- 📖Systems
- 📖Others
📖Trending Inference Topics (©️back👆🏻)
Date | Title | Paper | Code | Recom | Comment |
---|---|---|---|---|---|
2024.05 | 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(@DeepSeek-AI) | [pdf] | [DeepSeek-V2] | ⭐️⭐️⭐️ | |
2024.05 | 🔥🔥🔥[YOCO] You Only Cache Once: Decoder-Decoder Architectures for Language Models(@Microsoft) | [pdf] | [unilm-YOCO] | ⭐️⭐️⭐️ | |
2024.06 | 🔥🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving(@Moonshot AI) | [pdf] | [Mooncake] | ⭐️⭐️⭐️ | |
2024.07 | 🔥🔥🔥[FlashAttention-3] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision(@TriDao etc) | [pdf] | [flash-attention] | ⭐️⭐️⭐️ | |
2024.07 | 🔥🔥🔥[MInference 1.0] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention(@Microsoft) | [pdf] | [MInference 1.0] | ⭐️⭐️⭐️ |
LLM KV Cache Compression (©️back👆🏻)
Date | Title | Paper | Code | Recom | Comment |
---|---|---|---|---|---|
2023.06 | 🔥🔥[H2O] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | [pdf] | [H2O] | ⭐️⭐️⭐️ | Attention-based selection |
2023.09 | 🔥🔥🔥[StreamingLLM] Efficient Streaming Language Models with Attention Sinks | [pdf] | [streaming-llm] | ⭐️⭐️⭐️ | Retain first few tokens |
2023.10 | 🔥[FastGen] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | [pdf] | | ⭐️⭐️ | Head-specific compression strategies |
2023.10 | 🔥🔥[CacheGen] KV Cache Compression and Streaming for Fast Large Language Model Serving | [pdf] | [LMCache] | ⭐️⭐️⭐️ | Compress KV cache to bitstreams for storage and sharing |
2024.04 | 🔥🔥[SnapKV] SnapKV: LLM Knows What You are Looking for Before Generation | [pdf] | [SnapKV] | ⭐️⭐️⭐️ | Attention Pooling before selection |
2023.05 | [Scissorhands] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | [pdf] | | ⭐️ | |
2024.06 | 🔥A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression | [pdf] | | ⭐️ | L2 norm is a better importance metric than attention scores |
2024.06 | CORM: Cache Optimization with Recent Message for Large Language Model Inference | [pdf] | | ⭐️ | |
2024.07 | Efficient Sparse Attention needs Adaptive Token Release | [pdf] | | ⭐️ | |
2024.03 | [ALISA] ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching | [pdf] | | ⭐️ | |
2024.03 | 🔥🔥🔥[FastV] An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models | [pdf] | [EasyKV] | ⭐️⭐️⭐️ | |
2024.03 | [Keyformer] Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | [pdf] | [keyformer-llm] | ⭐️⭐️ | |
2024.06 | Effectively Compress KV Heads for LLM | [pdf] | | ⭐️ | |
2024.06 | 🔥 Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters | [pdf] | | ⭐️ | |
2024.06 | On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference | [pdf] | [EasyKV] | ⭐️ |
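
Most eviction-style methods above (H2O, Scissorhands, SnapKV) share a common skeleton: score each cached token by the attention mass it has received and keep only a fixed budget of top-scoring positions. Below is a minimal PyTorch sketch of that shared idea for a single head; the function name, the fixed `budget`, and the tensor layout are illustrative assumptions, not any paper's exact implementation.

```python
import torch

def evict_kv_by_attention(keys, values, attn_weights, budget):
    """Keep only the `budget` cached positions that received the most
    accumulated attention (heavy-hitter-style selection).

    keys, values : [seq_len, head_dim]     cached K/V for one head
    attn_weights : [num_queries, seq_len]  softmaxed attention of recent
                                           queries over the cached positions
    """
    scores = attn_weights.sum(dim=0)                       # attention mass per position
    keep = torch.topk(scores, k=min(budget, scores.numel())).indices
    keep = keep.sort().values                              # restore original token order
    return keys[keep], values[keep]

# Toy usage
keys, values = torch.randn(128, 64), torch.randn(128, 64)
attn = torch.softmax(torch.randn(8, 128), dim=-1)
k_small, v_small = evict_kv_by_attention(keys, values, attn, budget=32)
print(k_small.shape)  # torch.Size([32, 64])
```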
KV Cache Merge (©️back👆🏻)
Date | Title | Paper | Code | Recom | Comment |
---|---|---|---|---|---|
2023.10 | 🔥🔥[CacheBlend] Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | [pdf] | [LMCache] | ⭐️⭐️⭐️ | Selective update when merging KV caches |
2023.12 | 🔥 Compressed Context Memory For Online Language Model Interaction | [pdf] | [ContextMemory] | ⭐️⭐️⭐️ | Finetuning LLMs to recurrently compress KV caches |
2024.01 | [CaM] CaM: Cache Merging for Memory-efficient LLMs Inference | [pdf] | [cam] | ⭐️⭐️ | |
2024.05 | 🔥🔥 You Only Cache Once: Decoder-Decoder Architectures for Language Models | [pdf] | [unilm] | ⭐️⭐️ | |
2024.06 | 🔥🔥[D2O] D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models | [pdf] | | ⭐️⭐️⭐️ | |
2024.07 | 🔥[KVMerger] Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks | [pdf] | | ⭐️⭐️⭐️ | |
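
Merging methods (CaM, D2O, KVMerger) try to preserve the information of tokens that eviction would discard by folding their states into retained cache entries. The sketch below merges each to-be-evicted key/value into its most similar retained entry with a running average; the cosine-similarity criterion and unweighted averaging are simplifying assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def merge_evicted_into_kept(keys, values, keep_idx, evict_idx):
    """Fold each evicted KV entry into its most similar retained entry
    via a running average, instead of discarding it outright.

    keys, values: [seq_len, head_dim] cached K/V for one head.
    """
    k_keep, v_keep = keys[keep_idx].clone(), values[keep_idx].clone()
    counts = torch.ones(len(keep_idx))
    # Similarity between evicted and retained keys: [num_evicted, num_kept]
    sim = F.cosine_similarity(keys[evict_idx].unsqueeze(1), k_keep.unsqueeze(0), dim=-1)
    targets = sim.argmax(dim=1)              # nearest retained slot per evicted token
    for i, t in enumerate(targets):
        k_keep[t] = (k_keep[t] * counts[t] + keys[evict_idx[i]]) / (counts[t] + 1)
        v_keep[t] = (v_keep[t] * counts[t] + values[evict_idx[i]]) / (counts[t] + 1)
        counts[t] += 1
    return k_keep, v_keep

# Toy usage: keep the first 8 positions, merge the last 8 into them.
keys, values = torch.randn(16, 64), torch.randn(16, 64)
k_merged, v_merged = merge_evicted_into_kept(keys, values, torch.arange(0, 8), torch.arange(8, 16))
print(k_merged.shape)  # torch.Size([8, 64])
```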
Budget Allocation (©️back👆🏻)
Date | Title | Paper | Code | Recom | Comment |
---|---|---|---|---|---|
2024.05 | 🔥[PyramidInfer] PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference | [pdf] | [PyramidInfer] | ⭐️⭐️⭐️ | Layer-wise budget allocation |
2024.06 | 🔥[PyramidKV] PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | [pdf] | [PyramidKV] | ⭐️⭐️⭐️ | Layer-wise budget allocation |
2024.07 | 🔥[Ada-KV] Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference | [pdf] | | ⭐️⭐️⭐️ | Head-wise budget allocation |
2024.07 | RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | [pdf] | | ⭐️ | |
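
PyramidInfer and PyramidKV allocate more cache to shallow layers than to deep ones, while Ada-KV redistributes the budget across heads. A minimal sketch of a linearly tapering layer-wise schedule follows; the linear decay and the `taper` parameter are assumptions, since each paper derives its own allocation rule.

```python
def pyramid_layer_budgets(total_budget, num_layers, taper=0.5):
    """Split a total KV-cache token budget across layers so that earlier
    layers get more slots than later ones (a pyramid-shaped schedule).

    taper: the last layer's weight relative to the first layer's weight.
    """
    denom = max(1, num_layers - 1)
    weights = [1.0 - (1.0 - taper) * layer / denom for layer in range(num_layers)]
    scale = total_budget / sum(weights)
    # Rounding means the per-layer budgets may sum only approximately to total_budget.
    return [max(1, round(w * scale)) for w in weights]

print(pyramid_layer_budgets(total_budget=4096, num_layers=8))
# [683, 634, 585, 536, 488, 439, 390, 341]
```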
Cross-Layer KV Cache Utilization (©️back👆🏻)
Date | Title | Paper | Code | Recom | Comment |
---|---|---|---|---|---|
2024.05 | 🔥 Reducing Transformer Key-Value Cache Size with Cross-Layer Attention | [pdf] | | ⭐️ | |
2024.05 | 🔥 Layer-Condensed KV Cache for Efficient Inference of Large Language Models | [pdf] | [LCKV] | ⭐️⭐️ | |
2024.05 | 🔥🔥🔥[MiniCache] MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | [pdf] | | ⭐️⭐️⭐️ | |
2024.06 | 🔥[MLKV] MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding | [pdf] | [pythia-mlkv] | ⭐️⭐️ |
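
Cross-layer methods (CLA, LCKV, MLKV, MiniCache) shrink memory by letting groups of adjacent layers read from one shared KV cache instead of storing a cache per layer. The sketch below only builds a layer-to-cache mapping with a fixed sharing factor; the grouping scheme is an illustrative assumption.

```python
def cross_layer_cache_map(num_layers, share_factor=2):
    """Map each transformer layer to a shared KV-cache slot so that
    `share_factor` consecutive layers read from a single cache.
    Returns {layer_index: cache_slot_index}.
    """
    return {layer: layer // share_factor for layer in range(num_layers)}

# 8 layers sharing pairwise -> only 4 KV caches need to be stored.
print(cross_layer_cache_map(8, share_factor=2))
# {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
```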
KV Cache Quantization (©️back👆🏻)
Date | Title | Paper | Code | Recom | Comment |
---|---|---|---|---|---|
2024.03 | 🔥[GEAR] GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM | [pdf] | [GEAR] | ⭐️⭐️ | |
2024.01 | 🔥🔥[KVQuant] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | [pdf] | [KVQuant] | ⭐️⭐️ | Quantizes the entire KV cache |
2024.02 | [No Token Left Behind] No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization | [pdf] | | ⭐️⭐️⭐️ | |
2024.02 | [KIVI] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | [pdf] | [KIVI] | ⭐️⭐️ | |
2024.02 | [WKVQuant] WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More | [pdf] | | | |
2024.03 | [QAQ] QAQ: Quality Adaptive Quantization for LLM KV Cache | [pdf] | [QAQ-KVCacheQuantization] | ⭐️ | Attention-aware KV cache quantization |
2024.05 | [ZipCache] ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification | [pdf] | | ⭐️ | |
2024.05 | Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression | [pdf] | | ⭐️ | |
2024.05 | [SKVQ] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | [pdf] | [SKVQ] | ⭐️ | |
2024.07 | [PQCache] PQCache: Product Quantization-based KVCache for Long Context LLM Inference | [pdf] | | ⭐️ | |
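
Most quantization entries above store K/V at 2-4 bits with per-channel or per-token asymmetric scales (KIVI, for example, quantizes keys per channel and values per token). Below is a minimal asymmetric round-to-nearest sketch; the grouping axis and the 2-bit default are illustrative assumptions, not any specific paper's recipe.

```python
import torch

def quantize_asymmetric(x, num_bits=2, dim=0):
    """Asymmetric round-to-nearest quantization along `dim`, returning
    integer codes plus the scale / zero-point needed to dequantize."""
    qmax = 2 ** num_bits - 1
    x_min = x.amin(dim=dim, keepdim=True)
    x_max = x.amax(dim=dim, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round((x - x_min) / scale), 0, qmax)
    return q.to(torch.uint8), scale, x_min

def dequantize(q, scale, zero_point):
    return q.float() * scale + zero_point

keys = torch.randn(128, 64)                                  # [tokens, head_dim]
q, scale, zp = quantize_asymmetric(keys, num_bits=2, dim=0)  # per-channel scales over tokens
print((dequantize(q, scale, zp) - keys).abs().mean())        # mean quantization error
```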
Evaluation (©️back👆🏻)
Date | Title | Paper | Code | Recom | Comment |
---|---|---|---|---|---|
2024.07 | 🔥[Benchmark] KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches | [pdf] | | ⭐️ | |
Low-Rank KV Cache Decomposition (©️back👆🏻)
Date | Title | Paper | Code | Recom | Comment |
---|---|---|---|---|---|
2024.02 | Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference | [pdf] | [LESS] | ⭐️⭐️⭐️ | Fine-tunes the model to make the KV cache low-rank |
2024.05 | 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | [pdf] | [DeepSeek-V2] | ⭐️⭐️⭐️ | Train low-rank KV cache from scratch |
2024.06 | [Loki] Loki: Low-Rank Keys for Efficient Sparse Attention | [pdf] | | ⭐️ | |
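
These works exploit the observation that keys (and sometimes values) lie close to a low-dimensional subspace, so the cache can hold small latent codes instead of full vectors (LESS and Loki exploit this post hoc, while DeepSeek-V2's MLA trains a low-rank cache from scratch). The sketch below fits a key subspace with a truncated SVD on calibration data and caches rank-`r` codes; the calibration-SVD procedure and all names are illustrative assumptions rather than any paper's exact method.

```python
import torch

def fit_key_subspace(calib_keys, rank):
    """Fit a rank-`rank` key subspace from calibration data with a
    truncated SVD; returns a [head_dim, rank] projection basis."""
    _, _, vh = torch.linalg.svd(calib_keys, full_matrices=False)
    return vh[:rank].T

def compress_keys(keys, basis):
    """Cache low-rank codes instead of full keys: [tokens, rank]."""
    return keys @ basis

def reconstruct_keys(codes, basis):
    """Approximately recover full keys at attention time: [tokens, head_dim]."""
    return codes @ basis.T

calib_keys = torch.randn(1024, 64)            # keys collected on calibration prompts
basis = fit_key_subspace(calib_keys, rank=16)
codes = compress_keys(torch.randn(128, 64), basis)
print(codes.shape)                            # cache 16 dims per token instead of 64
```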
Observation (©️back👆🏻)
Date | Title | Paper | Code | Recom | Comment |
---|---|---|---|---|---|
2022.09 | In-context Learning and Induction Heads | [pdf] | | ⭐️⭐️ | |
2024.01 | 🔥Transformers are Multi-State RNNs | [pdf] | [TOVA] | ⭐️⭐️ | |
2024.04 | 🔥[Retrieval Head] Retrieval Head Mechanistically Explains Long-Context Factuality | [pdf] | [Retrieval_Head] | ⭐️⭐️⭐️ | |
2024.04 | 🔥[Massive Activations] Massive Activations in Large Language Models | [pdf] | [Massive Activation] | ⭐️⭐️⭐️ |
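
A recurring finding in this group is that attention mass concentrates on a handful of positions, most notably the initial "attention sink" tokens highlighted by StreamingLLM and Massive Activations. The utility below simply aggregates how much attention each key position receives; fed with attention maps from a trained model (e.g. obtained with `output_attentions=True` in Hugging Face transformers), the first positions typically stand out. The shapes and the random dummy input are illustrative.

```python
import torch

def attention_received_per_position(attn_maps):
    """Average attention mass each key position receives, aggregated
    over layers, heads, and query positions.

    attn_maps: [num_layers, num_heads, q_len, k_len], already softmaxed.
    """
    return attn_maps.mean(dim=(0, 1, 2))      # [k_len]

# Dummy input; real model attention maps show the sink effect at position 0.
dummy = torch.softmax(torch.randn(2, 4, 16, 16), dim=-1)
print(attention_received_per_position(dummy))
```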
Systems (©️back👆🏻)
Date | Title | Paper | Code | Recom | Comment |
---|---|---|---|---|---|
2024.06 | 🔥🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving(@Moonshot AI) | [pdf] | [Mooncake] | ⭐️⭐️⭐️ | |
2024.02 | MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool | [pdf] | | ⭐️ | |
Others (©️back👆🏻)
Date | Title | Paper | Code | Recom | Comment |
---|---|---|---|---|---|
2024.07 | 🔥🔥Q-Sparse: All Large Language Models can be Fully Sparsely-Activated | [pdf] | [GeneralAI] | ⭐️⭐️⭐️ |
GNU General Public License v3.0
Welcome to star & submit a PR to this repo!
@misc{Awesome-LLM-KV-Cache@2024,
title={Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with codes},
url={https://github.com/Zefan-Cai/Awesome-LLM-KV-Cache},
note={Open-source software available at https://github.com/Zefan-Cai/Awesome-LLM-KV-Cache},
author={Zefan Cai and others},
year={2024}
}