KV Cache
The KV cache, a core component of large language model (LLM) inference, accelerates autoregressive decoding by storing the attention keys and values computed for previous tokens so they need not be recomputed at every step. Because the cache grows linearly with sequence length (and batch size), its memory footprint, rather than compute, often becomes the bottleneck for long-context inference. Current research therefore focuses on compressing the cache through techniques such as quantization, low-rank projection, and selective token eviction, often guided by attention-weight analysis and adaptive budget allocation. These advances are essential for efficient inference as context windows expand, affecting both the scalability of LLM applications and the resources required to deploy these models.
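The caching-plus-eviction mechanism described above is easy to see in miniature. The sketch below is a toy illustration in plain NumPy, not any listed paper's method: the `KVCache` class, its `budget` parameter, and the cumulative-attention eviction heuristic are all assumptions made for this example. It caches one key/value vector per decoded token so each step attends over past tokens without recomputing them, and once a fixed budget is exceeded it evicts the token that has received the least attention mass, a crude stand-in for the attention-guided eviction strategies surveyed here.

```python
# Toy single-head KV cache with budgeted eviction (illustrative sketch only).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class KVCache:
    """Stores past keys/values so each decode step reuses them instead of recomputing."""
    def __init__(self, budget=None):
        self.keys, self.values = [], []   # one (d,) vector per cached token
        self.scores = []                  # cumulative attention mass per token
        self.budget = budget              # max tokens kept (None = unbounded)

    def step(self, q, k, v):
        # Append the new token's key/value, then attend over the whole cache.
        self.keys.append(k); self.values.append(v); self.scores.append(0.0)
        K, V = np.stack(self.keys), np.stack(self.values)
        w = softmax(K @ q / np.sqrt(q.size))   # attention weights over cache
        for i, wi in enumerate(w):             # track how much each token is attended
            self.scores[i] += float(wi)
        out = w @ V
        self._evict()
        return out

    def _evict(self):
        # Selective token eviction (toy): drop the least-attended token
        # once the cache exceeds its budget.
        if self.budget is not None and len(self.keys) > self.budget:
            drop = int(np.argmin(self.scores))
            for buf in (self.keys, self.values, self.scores):
                del buf[drop]

d, rng = 16, np.random.default_rng(0)
cache = KVCache(budget=8)
for _ in range(20):                        # 20 decode steps; cache capped at 8 tokens
    q, k, v = (rng.standard_normal(d) for _ in range(3))
    out = cache.step(q, k, v)
print(len(cache.keys), out.shape)          # -> 8 (16,)
```

Real systems differ in the obvious ways: the cache is kept per layer and per head as contiguous tensors, and the compression methods in the papers below quantize, low-rank-project, or share those tensors rather than trimming Python lists.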
31 papers
Papers
March 24, 2025
xKV: Cross-Layer SVD for KV-Cache Compression
Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, Mohamed S. Abdelfattah
Cornell University ● University of Washington ● National Yang Ming Chiao Tung University

March 17, 2025
Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
Minsu Kim, Seongmin Hong, RyeoWook Ko, Soongyu Choi, Hunjong Lee, Junsoo Kim, Joo-Young Kim, Jongse Park
KAIST ● HyperAccel

February 24, 2025
KV-Edit: Training-Free Image Editing for Precise Background Preservation
Tianrui Zhu, Shiyi Zhang, Jiawei Shao, Yansong Tang
Tsinghua University ● China Telecom

DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance
Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li
WeChat AI

February 21, 2025
KVCrush: Key value cache size-reduction using similarity in head-behaviour
Gopi Krishna Jha, Sameh Gobriel, Liubov Talamanova, Alexander Kozlov, Nilesh Jain
Intel Corporation

February 19, 2025
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov

ETS: Efficient Tree Search for Inference-Time Scaling
Coleman Hooper, Sehoon Kim, Suhong Moon, Kerem Dilmen, Monishwaran Maheswaran, Nicholas Lee, Michael W. Mahoney, Sophia Shao +2
UC Berkeley ● ICSI ● LBNL

FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference
Bingzhe Zhao, Ke Cheng, Aomufei Yuan, Yuxuan Tian, Ruiguang Zhong, Chengchen Hu, Tong Yang, Lian Yu