Cache Quantization
Cache quantization reduces the memory footprint and improves the inference speed of large language models (LLMs) by representing key-value (KV) cache activations with fewer bits. Current research focuses on novel quantization techniques, including low-rank projections, Johnson-Lindenstrauss transforms, and adaptive schemes that prioritize important tokens or channels, often applied to models such as Llama and Mistral. These techniques enable substantial memory compression and speed improvements, allowing longer sequences and larger batch sizes and thus making LLM deployment more scalable and efficient.
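To make the core idea concrete, the sketch below shows a generic per-token asymmetric min-max quantization of a KV cache slice to a low-bit integer grid. This is an illustrative assumption rather than any specific paper's method; the function names, the 4-bit setting, and the NumPy implementation are all hypothetical choices for demonstration.

```python
import numpy as np

def quantize_kv(cache: np.ndarray, num_bits: int = 4):
    """Per-token asymmetric min-max quantization of a KV cache slice.

    cache: float array of shape (num_tokens, head_dim).
    Returns integer codes plus the per-token scale and zero-point
    needed to dequantize.
    """
    qmax = 2 ** num_bits - 1
    lo = cache.min(axis=-1, keepdims=True)       # per-token minimum
    hi = cache.max(axis=-1, keepdims=True)       # per-token maximum
    scale = np.maximum(hi - lo, 1e-8) / qmax     # avoid division by zero
    zero_point = lo
    codes = np.clip(np.round((cache - zero_point) / scale), 0, qmax)
    return codes.astype(np.uint8), scale, zero_point

def dequantize_kv(codes, scale, zero_point):
    """Reconstruct an approximate float cache from the quantized codes."""
    return codes.astype(np.float32) * scale + zero_point

# Example: 4-bit codes shrink a float16 cache roughly 4x
# (ignoring the small per-token scale/zero-point overhead).
keys = np.random.randn(128, 64).astype(np.float32)   # (tokens, head_dim)
codes, scale, zp = quantize_kv(keys, num_bits=4)
recon = dequantize_kv(codes, scale, zp)
print("max abs error:", np.abs(keys - recon).max())
```

Adaptive schemes build on this baseline by, for example, assigning more bits to tokens or channels whose quantization error would most affect attention outputs.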