Key-Value Cache
Key-value (KV) caching is a crucial technique for accelerating large language model (LLM) inference: it stores the key and value tensors of previously processed tokens so they need not be recomputed at each decoding step, but its memory consumption scales linearly with sequence length, hindering efficient deployment. Current research focuses on optimizing KV cache management through various strategies, including low-rank compression, layer-wise allocation and offloading, sliding window attention, and quantization techniques, often combined with novel attention mechanisms or model architectures such as MixAttention. These advances aim to reduce memory footprint and improve inference speed and throughput, significantly impacting the scalability and cost-effectiveness of LLMs in practical applications.
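For intuition, the sketch below shows a single-head attention decode step backed by a KV cache. It is a minimal illustration, not drawn from any of the listed papers: the `KVCache` class, `decode_step` function, and head dimension `d` are assumptions made for the example. It shows why caching keys and values avoids recomputing them for past tokens, and why cache memory grows linearly with the number of generated tokens.

```python
# Minimal KV-cache sketch (illustrative; real LLMs use multi-head attention per layer).
import numpy as np

d = 64  # head dimension (assumed for this example)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class KVCache:
    """Stores keys/values of past tokens; memory grows linearly with sequence length."""
    def __init__(self):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])

def decode_step(q, k, v, cache):
    # Append the new token's key/value, then attend over all cached tokens.
    cache.append(k, v)
    scores = cache.keys @ q / np.sqrt(d)   # (seq_len,)
    weights = softmax(scores)
    return weights @ cache.values          # (d,)

# Usage: each generated token computes and stores only its own K/V pair;
# without the cache, K and V for every past token would be recomputed.
cache = KVCache()
rng = np.random.default_rng(0)
for _ in range(8):
    q, k, v = rng.standard_normal((3, d))
    out = decode_step(q, k, v, cache)
print(cache.keys.shape)  # (8, 64): cache size scales with sequence length
```

The compression, eviction, and quantization strategies surveyed above all target the `cache.keys` / `cache.values` storage in this picture, shrinking or offloading it while trying to preserve the attention output.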
Papers
PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation
Ao Wang, Hui Chen, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Zijia Lin, Jungong Han, Guiguang Ding
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression
Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, Minyi Guo
Unifying KV Cache Compression for Large Language Models with LeanKV
Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen