Key Value Cache

Key-value (KV) caching is a crucial technique for accelerating large language model (LLM) inference: the attention keys and values computed for earlier tokens are stored and reused at each decoding step instead of being recomputed. The drawback is that the cache's memory consumption grows linearly with sequence length, which hinders efficient deployment of long-context models. Current research focuses on reducing this cost through strategies such as low-rank compression, layer-wise cache allocation and offloading, sliding-window attention, and quantization, often combined with novel attention mechanisms or model architectures like MixAttention. These advances shrink the memory footprint and improve inference speed and throughput, directly affecting the scalability and cost-effectiveness of LLMs in practical applications.
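
The linear memory growth can be seen in a minimal sketch of cached autoregressive attention below. This is illustrative only: the names `KVCache` and `decode_step` are assumptions, a single attention head with toy dimensions is used, and real implementations batch across heads and layers and preallocate or page the cache memory.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Toy per-layer cache holding the keys/values of all previously decoded tokens."""
    def __init__(self, head_dim):
        self.keys = np.empty((0, head_dim))    # grows by one row per decoded token
        self.values = np.empty((0, head_dim))

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])
        return self.keys, self.values

def decode_step(x_t, w_q, w_k, w_v, cache):
    """One autoregressive step: project only the newest token and
    reuse the cached K/V of all earlier tokens."""
    q = x_t @ w_q                        # (1, d)
    k = x_t @ w_k                        # (1, d) computed once, then cached
    v = x_t @ w_v
    keys, values = cache.append(k, v)    # (t, d): linear growth with sequence length
    scores = q @ keys.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ values                 # (1, d) attention output for the new token

# Usage: decode a few toy tokens; cached K/V rows grow by one per step.
rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
cache = KVCache(d)
for t in range(5):
    x_t = rng.standard_normal((1, d))
    out = decode_step(x_t, w_q, w_k, w_v, cache)
print("cached K/V rows after 5 steps:", cache.keys.shape[0])  # -> 5
```

The compression, quantization, and eviction strategies mentioned above all target the `keys`/`values` arrays in this sketch, since they are the only state that scales with context length during decoding.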

Papers