Weight-Only Quantization

Weight-only quantization reduces the memory footprint and inference cost of large language models (LLMs) and other deep learning models by storing their weights in low precision, typically 2-4 bits, while keeping activations in higher precision (e.g., FP16) and usually requiring no retraining. Current research focuses on vector quantization, adaptive strategies that assign quantization parameters per channel or per group, and matrix-multiplication kernels optimized for mixed-precision operands, often using lookup tables for fast dequantization, all aimed at minimizing accuracy loss at extremely low bit-widths. This research is significant because it enables the deployment of increasingly large models on resource-constrained devices, improving both the efficiency and accessibility of advanced AI applications.
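To make the per-group idea concrete, here is a minimal NumPy sketch of asymmetric round-to-nearest 4-bit quantization with one (scale, zero-point) pair per contiguous group of weights; the function names, group size, and matrix shapes are illustrative, not from any particular paper or library.

```python
import numpy as np

def quantize_per_group(w, bits=4, group_size=128):
    """Asymmetric round-to-nearest quantization of a weight matrix,
    with one (scale, zero-point) pair per contiguous group along each row."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    groups = w.reshape(out_features, in_features // group_size, group_size)

    qmax = 2**bits - 1                          # e.g. 15 for 4-bit
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / qmax              # quantization step per group
    scale = np.where(scale == 0, 1e-8, scale)   # guard against constant groups
    zero = np.round(-w_min / scale)             # integer zero-point per group

    q = np.clip(np.round(groups / scale + zero), 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize_per_group(q, scale, zero):
    """Recover approximate float weights: w_hat = (q - zero) * scale."""
    out_features = q.shape[0]
    w_hat = (q.astype(np.float32) - zero) * scale
    return w_hat.reshape(out_features, -1)

# Usage: quantize a random stand-in for a weight matrix, measure the error.
w = np.random.randn(256, 512).astype(np.float32)
q, scale, zero = quantize_per_group(w, bits=4, group_size=128)
w_hat = dequantize_per_group(q, scale, zero)
print("mean abs error:", np.abs(w - w_hat).mean())
```

Lookup-table approaches follow the same per-group structure but replace the (scale, zero-point) pair with a small learned or vector-quantized codebook per group, so dequantization becomes an index into that table; optimized kernels then fuse this lookup into the matrix multiplication so the weights are never materialized in full precision.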

Papers