KV Cache
The KV cache is a crucial component of large language model (LLM) inference: it stores the key and value tensors computed for previous tokens so they are not recomputed at every decoding step, trading memory for a large reduction in per-token computation. Because the cache grows linearly with context length, current research focuses on compressing it through techniques such as quantization, low-rank projection, and selective token eviction, often guided by attention-weight analysis and adaptive budget-allocation strategies. These advances are vital for efficient inference over expanding context windows, affecting both the scalability of LLM applications and the resources required to deploy these models.
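As a concrete illustration, the sketch below shows how a KV cache changes autoregressive decoding: keys and values for past tokens are projected once and appended to a cache, so each new token attends over stored tensors instead of re-running projections for the whole prefix. This is a minimal, self-contained PyTorch sketch; the class and variable names are illustrative assumptions, not any particular library's API.

```python
# Minimal sketch (illustrative only): single-head attention with a KV cache.
# Class and function names here are hypothetical, not a real library API.
import torch


class SingleHeadAttention(torch.nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.k_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.v_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, x_new, kv_cache=None):
        """x_new: (batch, 1, d_model) embedding of the newest token only."""
        q = self.q_proj(x_new)
        k_new, v_new = self.k_proj(x_new), self.v_proj(x_new)
        if kv_cache is None:
            kv_cache = {"k": k_new, "v": v_new}
        else:
            # Append the new token's key/value; past tokens are never re-projected.
            kv_cache["k"] = torch.cat([kv_cache["k"], k_new], dim=1)
            kv_cache["v"] = torch.cat([kv_cache["v"], v_new], dim=1)
        attn = torch.softmax(q @ kv_cache["k"].transpose(1, 2) * self.scale, dim=-1)
        return attn @ kv_cache["v"], kv_cache


# Decode a few tokens: per-step work grows linearly with the cached prefix,
# instead of re-running attention over the entire sequence from scratch.
layer = SingleHeadAttention(d_model=16)
cache = None
for _ in range(4):
    new_token = torch.randn(1, 1, 16)  # stand-in for the next token's embedding
    _, cache = layer(new_token, cache)
print(cache["k"].shape)  # torch.Size([1, 4, 16]): one cached key per token
```

Compression methods such as selective token eviction operate on this cache directly. Below is a hedged sketch of attention-guided eviction, assuming a fixed token budget and an accumulated-attention score per cached token; both are illustrative choices, not any specific paper's method.

```python
# Hedged sketch of attention-guided token eviction under a fixed cache budget.
# The scoring rule (accumulated attention mass) is an assumption for illustration.
import torch


def evict_to_budget(k, v, attn_scores, budget):
    """k, v: (batch, seq_len, d); attn_scores: (batch, seq_len) accumulated
    attention each cached token has received. Keeps the `budget` highest-scoring
    tokens (in original order) and drops the rest."""
    if k.shape[1] <= budget:
        return k, v
    keep = attn_scores.topk(budget, dim=1).indices.sort(dim=1).values
    batch_idx = torch.arange(k.shape[0]).unsqueeze(-1)
    return k[batch_idx, keep], v[batch_idx, keep]
```

Quantization and low-rank projection shrink the cache along a different axis: rather than dropping tokens, they reduce the precision or dimensionality of the stored key and value tensors themselves.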
Papers
ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
Meizhi Zhong, Xikai Liu, Chen Zhang, Yikun Lei, Yan Gao, Yao Hu, Kehai Chen, Min Zhang
Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries
Junhyuck Kim, Jongho Park, Jaewoong Cho, Dimitris Papailiopoulos