Key-Value Cache Compression
Key-value (KV) cache compression aims to reduce the substantial memory footprint of large language models (LLMs) during inference, a critical bottleneck for deploying models on long-context generation tasks. Current research focuses on developing efficient compression techniques, including low-rank approximations, uncertainty-aware compression, and variable compression rates across attention heads, often integrated within existing transformer architectures. These advancements enable significant memory savings and throughput improvements, paving the way for more efficient and scalable LLM deployment in resource-constrained environments.
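To make the low-rank idea concrete, below is a minimal sketch of compressing a cached key (or value) matrix with a truncated SVD. The shapes, the `rank` choice, and the function names are illustrative assumptions for a single attention head, not the specific method of either paper listed under Papers.

```python
# Minimal sketch: low-rank compression of a single-head KV cache matrix
# via truncated SVD. Shapes and rank are illustrative assumptions.
import numpy as np

def compress_kv(cache: np.ndarray, rank: int):
    """Factor a (seq_len, head_dim) K or V matrix into two low-rank factors."""
    U, S, Vt = np.linalg.svd(cache, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (seq_len, rank), singular values folded in
    B = Vt[:rank, :]             # (rank, head_dim)
    return A, B

def decompress_kv(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Reconstruct an approximation of the original cache for attention."""
    return A @ B

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.standard_normal((1024, 128))  # toy single-head key cache
    A, B = compress_kv(keys, rank=32)
    approx = decompress_kv(A, B)
    saved = 1 - (A.size + B.size) / keys.size
    err = np.linalg.norm(keys - approx) / np.linalg.norm(keys)
    print(f"memory saved: {saved:.1%}, relative reconstruction error: {err:.3f}")
```

Storing the two factors instead of the full matrix trades a small reconstruction error for memory proportional to the chosen rank; methods that vary the rank or compression rate per head apply the same idea with a head-specific budget.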
Papers
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy
Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen
UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference
Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Lingpeng Kong, Ngai Wong