Key-Value Cache
Key-value (KV) caching is a crucial technique for accelerating large language model (LLM) inference: the attention keys and values computed for previously processed tokens are stored so they need not be recomputed at every decoding step. The cost is that the cache's memory consumption grows linearly with sequence length, which hinders efficient deployment at long contexts or high batch sizes. Current research focuses on optimizing KV cache management through strategies such as low-rank compression, layer-wise allocation and offloading, sliding-window attention, and quantization, often combined with novel attention mechanisms or model architectures like MixAttention. These advances aim to reduce memory footprint and improve inference speed and throughput, significantly impacting the scalability and cost-effectiveness of LLMs in practical applications.
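To make the linear memory growth and the sliding-window variant concrete, here is a minimal sketch of a per-head KV cache for a toy single-head attention layer, written in Python with NumPy. The names (d_model, window, decode_step) and toy dimensions are illustrative assumptions, not taken from any specific paper summarized on this page.

# Minimal single-head KV cache sketch (illustrative names and sizes).
import numpy as np

d_model = 64          # head dimension (toy value)
window = 128          # optional sliding-window cap on cached tokens

rng = np.random.default_rng(0)
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

k_cache = np.empty((0, d_model))   # grows by one row per generated token
v_cache = np.empty((0, d_model))

def decode_step(x):
    """Attend over all cached tokens plus the new one; the cache grows by one entry."""
    global k_cache, v_cache
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    # Append the new key/value; the [-window:] slice is the sliding-window cap.
    k_cache = np.vstack([k_cache, k])[-window:]
    v_cache = np.vstack([v_cache, v])[-window:]
    scores = k_cache @ q / np.sqrt(d_model)        # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache                       # attention output for the new token

for t in range(10):
    out = decode_step(rng.normal(size=d_model))

# Cache memory is O(sequence_length * d_model) per head per layer,
# bounded here by `window` when the sliding-window cap is applied.
print(k_cache.shape, k_cache.nbytes + v_cache.nbytes, "bytes cached")

In a full model this cache exists per layer and per head, which is why compression, quantization, and eviction strategies of the kind surveyed above matter for long-context serving.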