Key-Value Cache Compression
Key-value (KV) cache compression aims to reduce the substantial memory footprint that large language models (LLMs) incur during inference, a critical bottleneck for long-context generation. Current research focuses on efficient compression techniques, including low-rank approximation, uncertainty-aware compression, and variable compression rates across attention heads, often integrated into existing transformer architectures. These advances enable significant memory savings and throughput improvements, paving the way for more efficient and scalable LLM deployment in resource-constrained environments.
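To make the low-rank idea concrete, below is a minimal, illustrative sketch, not taken from any of the papers listed here: a per-head key cache is factored with a truncated SVD so that only two small rank-r matrices are stored instead of the full cache. The function names (compress_kv, decompress_kv), the rank-16 setting, and the synthetic low-rank cache are assumptions for the example.

```python
# Minimal sketch of low-rank KV cache compression via truncated SVD.
# Assumption: the cached key matrix has approximate low-rank structure,
# which is the property low-rank compression methods exploit.
import numpy as np

def compress_kv(cache: np.ndarray, rank: int):
    """Factor a (seq_len, head_dim) cache into rank-r matrices A @ B."""
    U, S, Vt = np.linalg.svd(cache, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (seq_len, rank), singular values folded in
    B = Vt[:rank, :]             # (rank, head_dim)
    return A, B

def decompress_kv(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Reconstruct an approximation of the original cache."""
    return A @ B

# Toy per-head key cache: 1024 tokens, head dimension 128, built to be
# approximately rank-16 plus small noise (a hypothetical test input).
rng = np.random.default_rng(0)
K = rng.standard_normal((1024, 16)) @ rng.standard_normal((16, 128))
K += 0.01 * rng.standard_normal((1024, 128))

A, B = compress_kv(K, rank=16)
storage_ratio = (A.size + B.size) / K.size   # fraction of floats kept
rel_error = np.linalg.norm(K - decompress_kv(A, B)) / np.linalg.norm(K)
print(f"storage ratio: {storage_ratio:.3f}, relative error: {rel_error:.3f}")
```

Under these assumptions the factors occupy roughly 14% of the original cache while the reconstruction error stays small; the same scheme could be applied per attention head with different ranks, which is one way to realize variable compression rates across heads.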
16 papers
Papers
March 14, 2025
Key, Value, Compress: A Systematic Exploration of KV Cache Compression Techniques
Neusha Javidnia, Bita Darvish Rouhani, Farinaz Koushanfar (University of California San Diego; NVIDIA)

Time and Memory Trade-off of KV-Cache Compression in Tensor Transformer Decoding
Yifang Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Yu Tian (The University of Chicago; Stevens Institute of Technology; The University of Hong Kong; University of Wisconsin-Madison; The Simons Institute; ...)