KV Cache
The KV cache, a crucial component of large language model (LLM) inference, accelerates autoregressive decoding by storing the key and value tensors computed for previous tokens, so they do not have to be recomputed at every step. Because the cache grows with context length, current research focuses on improving its memory efficiency through compression techniques such as quantization, low-rank projection, and selective token eviction, often guided by attention-weight analysis and adaptive budget allocation strategies. These advances are vital for efficient inference with expanding context windows, affecting both the scalability of LLM applications and the resources required to deploy these models.
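As a rough illustration (not taken from any of the papers below), the sketch here shows how a per-step KV cache might be maintained during greedy decoding with a single toy attention head; the hidden size, random weights, and cache layout are assumptions made purely for the example.

```python
# Minimal single-head attention with a KV cache (illustrative sketch only).
import numpy as np

D = 16  # hidden size (assumed for the example)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

def attend_with_cache(x_t, cache):
    """Process one new token embedding x_t (shape [D]) using cached keys/values.

    cache holds {"k": [T, D], "v": [T, D]} for all previously seen tokens,
    so only the new token's key/value are computed here.
    """
    q = x_t @ W_q
    k_new = (x_t @ W_k)[None, :]
    v_new = (x_t @ W_v)[None, :]
    # Append the new key/value instead of recomputing them for the whole prefix.
    cache["k"] = k_new if cache["k"] is None else np.vstack([cache["k"], k_new])
    cache["v"] = v_new if cache["v"] is None else np.vstack([cache["v"], v_new])
    scores = cache["k"] @ q / np.sqrt(D)      # [T] attention scores over the prefix
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache["v"]               # attention output for the new token

# Each decoding step costs O(T) with the cache, rather than recomputing
# keys and values for the entire prefix (O(T^2) total work per step).
cache = {"k": None, "v": None}
for t in range(5):
    x_t = rng.standard_normal(D)              # stand-in for a token embedding
    out = attend_with_cache(x_t, cache)
print("cached keys:", cache["k"].shape)       # (5, 16)
```

The compression methods surveyed below operate on exactly these cached key/value tensors, for example by quantizing them, projecting them to a lower rank, or evicting entries for tokens that receive little attention.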
Papers
Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference
Zeyu Zhang, Haiying Shen
NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time
Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, Hua Wu
Palu: Compressing KV-Cache with Low-Rank Projection
Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S. Abdelfattah, Kai-Chiang Wu
ThinK: Thinner Key Cache by Query-Driven Pruning
Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo