KV Cache Compression
KV cache compression reduces the memory footprint of the key-value (KV) caches that large language models (LLMs) maintain during inference, which improves efficiency and enables longer context windows. Current research focuses on techniques such as low-rank matrix decomposition, adaptive merging of key-value states, and selective token dropping based on attention patterns, often combined with efficient attention kernels such as FlashAttention for additional speed. These advances matter for deploying LLMs on resource-constrained hardware and for scaling long-context applications such as complex question answering and code generation.
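To make the memory argument and the token-dropping idea concrete, here is a minimal pure-Python sketch. It is not taken from any of the papers listed below. The function names, the model configuration, and the scoring rule (keep the most recent tokens plus the older tokens with the highest accumulated attention) are all illustrative assumptions; published methods differ in how they score and retain tokens.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer.

    The leading factor 2 accounts for storing both K and V.
    bytes_per_value=2 assumes 16-bit storage (an assumption).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def select_tokens_to_keep(attention_scores, budget, num_recent=2):
    """Return sorted indices of cached tokens to retain.

    attention_scores[i] is a hypothetical per-token importance value,
    e.g. the attention mass token i has accumulated so far.
    The last `num_recent` tokens are always kept; the rest of the
    budget goes to the highest-scoring older tokens.
    """
    n = len(attention_scores)
    if budget >= n:
        return list(range(n))
    num_recent = min(num_recent, budget)
    recent = list(range(n - num_recent, n))
    older = range(n - num_recent)
    top_older = sorted(older, key=lambda i: attention_scores[i],
                       reverse=True)[: budget - num_recent]
    return sorted(top_older + recent)


def compress_cache(keys, values, attention_scores, budget, num_recent=2):
    """Drop cache entries that fall outside the retention budget."""
    keep = select_tokens_to_keep(attention_scores, budget, num_recent)
    return [keys[i] for i in keep], [values[i] for i in keep]


if __name__ == "__main__":
    # Hypothetical configuration: 32 layers, 32 KV heads, head size 128,
    # 4096 cached tokens, 16-bit values.
    size = kv_cache_bytes(32, 32, 128, 4096)
    print(f"uncompressed cache: {size / 2**30:.1f} GiB")

    scores = [0.9, 0.1, 0.5, 0.05, 0.2, 0.3]
    keys = [f"k{i}" for i in range(6)]
    values = [f"v{i}" for i in range(6)]
    print(compress_cache(keys, values, scores, budget=4))
```

With this toy scoring rule, a budget of 4 out of 6 tokens keeps the two most recent entries and the two older entries with the largest scores, so the cache shrinks in proportion to the budget. The trade-off is that a dropped token cannot be attended to again.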
Papers
ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty
Meizhi Zhong, Xikai Liu, Chen Zhang, Yikun Lei, Yan Gao, Yao Hu, Kehai Chen, Min Zhang
Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries
Junhyuck Kim, Jongho Park, Jaewoong Cho, Dimitris Papailiopoulos
Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression
Peiyu Liu, Ze-Feng Gao, Wayne Xin Zhao, Yipeng Ma, Tao Wang, Ji-Rong Wen
PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference
Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao