Weight-Only Quantization
Weight-only quantization aims to reduce the memory footprint and computational cost of large language models (LLMs) and other deep learning models by representing their weights with fewer bits, typically 2-4, without retraining. Current research focuses on techniques such as vector quantization, adaptive quantization strategies (e.g., per-channel or per-group), and optimized matrix multiplication kernels that minimize accuracy loss at extremely low bit-widths, often using lookup tables for efficient dequantization. This research is significant because it enables the deployment of increasingly large models on resource-constrained devices, improving both the efficiency and accessibility of advanced AI applications.
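To make the per-group strategy mentioned above concrete, the following is a minimal NumPy sketch of asymmetric per-group weight quantization and dequantization. It is illustrative only: the function names (quantize_per_group, dequantize_per_group), the 4-bit/group-of-64 settings, and the scale/zero-point scheme are assumptions for demonstration, not the method of any specific paper listed below.

```python
import numpy as np

def quantize_per_group(weights, bits=4, group_size=64):
    """Asymmetric per-group quantization of a 2-D weight matrix (illustrative sketch).

    Each row is split into groups of `group_size` columns; every group gets
    its own scale and zero-point, so outliers in one group do not degrade
    the resolution of the others.
    """
    qmax = (1 << bits) - 1                      # e.g. 15 for 4-bit
    rows, cols = weights.shape
    assert cols % group_size == 0, "cols must be divisible by group_size"

    groups = weights.reshape(rows, cols // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)

    scale = (w_max - w_min) / qmax
    scale = np.where(scale == 0, 1.0, scale)    # guard against constant groups
    zero_point = np.round(-w_min / scale)

    q = np.clip(np.round(groups / scale + zero_point), 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_per_group(q, scale, zero_point, shape):
    """Reconstruct an approximate float matrix from the quantized groups."""
    return ((q.astype(np.float32) - zero_point) * scale).reshape(shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 256)).astype(np.float32)

    q, scale, zp = quantize_per_group(w, bits=4, group_size=64)
    w_hat = dequantize_per_group(q, scale, zp, w.shape)

    # Smaller groups trade extra scale/zero-point storage for lower error
    print("max abs error:", np.abs(w - w_hat).max())
```

In practice, smaller group sizes reduce quantization error at the cost of storing more scales and zero-points, which is the trade-off adaptive per-channel and per-group schemes try to balance.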
Papers
Channel-Wise Mixed-Precision Quantization for Large Language Models
Zihan Chen, Bike Xie, Jundong Li, Cong Shen
DAQ: Density-Aware Post-Training Weight-Only Quantization For LLMs
Yingsong Luo, Ling Chen
COMET: Towards Practical W4A4KV4 LLMs Serving
Lian Liu, Haimeng Ren, Long Cheng, Zhaohui Xu, Yudong Pan, Mengdi Wang, Xiaowei Li, Yinhe Han, Ying Wang