Low-Bit Quantization
Low-bit quantization reduces the memory footprint and computational cost of large language models (LLMs) and other deep learning models by representing their weights and activations with fewer bits, which accelerates inference and enables deployment on resource-constrained devices. Current research focuses on developing new quantization algorithms, including post-training quantization (PTQ) and quantization-aware training (QAT) methods, often tailored to specific model architectures such as transformers and convolutional neural networks. These advances matter because they address a critical bottleneck in deploying large, computationally expensive models, improving both the efficiency of research and the accessibility of powerful AI applications.
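To make the core idea concrete, the sketch below shows a minimal form of post-training quantization: symmetric, per-row 4-bit quantization of a weight matrix, followed by dequantization to measure the approximation error. This is a simplified illustration, not the method of either paper listed below; the function names (`quantize_per_channel`, `dequantize`) and the NumPy-based setup are assumptions made for the example.

```python
# Minimal sketch of symmetric per-row low-bit (e.g. 4-bit) post-training quantization.
# Names and structure are illustrative, not taken from VPTQ or the survey below.
import numpy as np

def quantize_per_channel(weights: np.ndarray, num_bits: int = 4):
    """Quantize a 2-D float weight matrix row-wise to signed num_bits integers."""
    qmax = 2 ** (num_bits - 1) - 1                      # e.g. 7 for 4-bit signed
    scales = np.abs(weights).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)          # avoid division by zero
    q = np.clip(np.round(weights / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an approximate float weight matrix from integer codes and scales."""
    return q.astype(np.float32) * scales

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 8)).astype(np.float32)
    q, s = quantize_per_channel(w, num_bits=4)
    w_hat = dequantize(q, s)
    print("mean abs quantization error:", np.abs(w - w_hat).mean())
```

In practice, PTQ methods refine this basic scheme (e.g. with calibration data, vector or group-wise codebooks, and error-compensating updates), while QAT instead simulates quantization during training so the model learns to tolerate the reduced precision.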
Papers
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang
A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms
Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu