Cache Quantization

Cache quantization aims to reduce the memory footprint and improve the inference speed of large language models (LLMs) by representing key-value (KV) cache activations with fewer bits. Current research focuses on developing novel quantization techniques, including low-rank projections, Johnson-Lindenstrauss transforms, and adaptive quantization schemes that prioritize important tokens or channels, often applied to models such as Llama and Mistral. These advances enable substantial memory compression and speedups, allowing longer sequences and larger batch sizes and thereby improving both the scalability and efficiency of LLM deployment.
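
To make the core idea concrete, the sketch below shows one common ingredient shared by many of these methods: symmetric per-channel integer quantization of a KV-cache tensor with floating-point scales kept for dequantization. This is a minimal illustration, not the method of any specific paper; the PyTorch implementation, function names, tensor layout, and 4-bit setting are assumptions chosen for clarity.

```python
import torch

def quantize_kv_per_channel(kv: torch.Tensor, n_bits: int = 4):
    """Symmetric per-channel quantization of a KV-cache tensor (illustrative sketch).

    kv: [batch, heads, seq_len, head_dim] key or value activations.
    Returns integer codes plus per-(head, channel) scales for dequantization.
    """
    qmax = 2 ** (n_bits - 1) - 1  # e.g. 7 for 4-bit signed codes
    # One scale per (head, channel): reduce the absolute max over batch and sequence.
    scale = kv.abs().amax(dim=(0, 2), keepdim=True).clamp(min=1e-8) / qmax
    codes = torch.clamp(torch.round(kv / scale), -qmax - 1, qmax).to(torch.int8)
    return codes, scale

def dequantize_kv(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct approximate KV activations from codes and scales."""
    return codes.to(scale.dtype) * scale

if __name__ == "__main__":
    kv = torch.randn(1, 8, 128, 64)  # toy key (or value) cache
    codes, scale = quantize_kv_per_channel(kv, n_bits=4)
    kv_hat = dequantize_kv(codes, scale)
    print("mean abs reconstruction error:", (kv - kv_hat).abs().mean().item())
```

Adaptive schemes build on this skeleton, for example by keeping outlier channels or recent and high-attention tokens at higher precision while quantizing the rest more aggressively.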

Papers