Key-Value Cache
Key-value (KV) caching is a central technique for accelerating large language model (LLM) inference: it stores the keys and values computed for past tokens so they need not be recomputed at every decoding step, but its memory consumption scales linearly with sequence length, which hinders efficient deployment. Current research focuses on optimizing KV cache management through strategies such as low-rank compression, layer-wise allocation and offloading, sliding-window attention, and quantization, often combined with novel attention mechanisms or model architectures like MixAttention. These advances aim to reduce the memory footprint and improve inference speed and throughput, directly affecting the scalability and cost-effectiveness of LLMs in practical applications.
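To make the linear memory growth and the sliding-window idea above concrete, here is a minimal sketch (illustrative only; the class and parameter names such as `KVCache` and `max_tokens` are hypothetical, and NumPy arrays stand in for the device tensors a real inference stack would preallocate). It caches per-token keys and values for a single attention layer, optionally evicts the oldest entries once a window limit is exceeded, and reports a byte footprint that grows linearly with the number of cached tokens.

```python
from typing import Optional
import numpy as np


class KVCache:
    """Per-layer key/value cache for autoregressive decoding (illustrative sketch)."""

    def __init__(self, num_heads: int, head_dim: int, max_tokens: Optional[int] = None):
        self.max_tokens = max_tokens  # optional sliding-window limit (None = unbounded)
        self.keys = np.empty((0, num_heads, head_dim), dtype=np.float16)
        self.values = np.empty((0, num_heads, head_dim), dtype=np.float16)

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        """Store the key/value of one new token (each of shape [num_heads, head_dim])."""
        self.keys = np.concatenate([self.keys, k[None]], axis=0)
        self.values = np.concatenate([self.values, v[None]], axis=0)
        # Sliding-window eviction: keep only the most recent max_tokens entries.
        if self.max_tokens is not None and self.keys.shape[0] > self.max_tokens:
            self.keys = self.keys[-self.max_tokens:]
            self.values = self.values[-self.max_tokens:]

    def memory_bytes(self) -> int:
        """Cache size in bytes; grows linearly with the number of cached tokens."""
        return self.keys.nbytes + self.values.nbytes


# Example: one layer with 32 heads of dimension 128, fp16 keys and values.
cache = KVCache(num_heads=32, head_dim=128)
for _ in range(4096):
    k = np.zeros((32, 128), dtype=np.float16)
    v = np.zeros((32, 128), dtype=np.float16)
    cache.append(k, v)
print(f"{cache.memory_bytes() / 2**20:.1f} MiB for 4096 tokens in one layer")  # ~64 MiB
```

For readability the sketch grows the arrays by concatenation; production systems typically preallocate contiguous or paged per-layer buffers so appending a token is constant-time, and they apply the compression, offloading, or quantization strategies mentioned above on top of this basic structure.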