Weight-Only Quantization
Weight-only quantization aims to reduce the memory footprint and computational cost of large language models (LLMs) and other deep learning models by representing their weights using fewer bits, typically 2-4 bits, without retraining. Current research focuses on techniques like vector quantization, adaptive quantization strategies (e.g., per-channel or per-group), and optimized matrix multiplication kernels to minimize accuracy loss at extremely low bit-widths, often employing lookup tables for efficient dequantization. This research is significant because it enables the deployment of increasingly large models on resource-constrained devices, improving both the efficiency and accessibility of advanced AI applications.
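The core idea of per-group quantization mentioned above can be illustrated with a short sketch. The snippet below is a minimal, illustrative example (not taken from any particular paper or library): it assumes a symmetric 4-bit scheme, a group size of 128, and hypothetical helper names `quantize_per_group` / `dequantize_per_group`. Each group of weights shares one floating-point scale, and the stored integer codes are expanded back to floats at inference time.

```python
# Minimal sketch of per-group, weight-only quantization (assumed symmetric
# 4-bit scheme and group size; for illustration only).
import numpy as np

def quantize_per_group(w, bits=4, group_size=128):
    """Quantize a 2-D weight matrix row-wise, one scale per group of columns."""
    qmax = 2 ** (bits - 1) - 1                # e.g. 7 for symmetric 4-bit
    rows, cols = w.shape
    assert cols % group_size == 0
    groups = w.reshape(rows, cols // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / qmax   # per-group scale
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize_per_group(q, scales):
    """Reconstruct an approximate float matrix from integer codes and scales."""
    groups = q.astype(np.float32) * scales
    return groups.reshape(groups.shape[0], -1)

# Usage: quantize a random weight matrix and measure the reconstruction error.
w = np.random.randn(8, 256).astype(np.float32)
q, s = quantize_per_group(w)
w_hat = dequantize_per_group(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```

Lookup-table approaches follow the same pattern, except the integer codes index into a small per-group codebook instead of being multiplied by a single scale, which is what makes dequantization cheap inside optimized matrix-multiplication kernels.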