Post-Training Quantization
Post-training quantization (PTQ) aims to reduce the computational cost and memory footprint of large neural networks, particularly large language models (LLMs) and vision transformers (ViTs), without retraining. Current research focuses on improving PTQ accuracy at extremely low bit-widths (e.g., 2-4 bits) through techniques such as vector quantization, adaptive quantization schemes (e.g., per-channel scaling, mixed precision), and optimization strategies that minimize quantization error by handling outliers and skewed activation distributions. This work matters because efficient quantization is crucial for deploying large models on resource-constrained devices, enabling broader accessibility and reducing the environmental impact of AI.
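As a concrete illustration of one of the schemes mentioned above, the sketch below shows minimal symmetric per-channel weight quantization to 4 bits in NumPy. The function names and the toy weight matrix are illustrative, not from any of the papers listed here; real PTQ pipelines add calibration data, outlier handling, and error-minimizing rounding on top of this baseline.

```python
import numpy as np

def quantize_per_channel(w, bits=4):
    """Symmetric per-channel quantization: one scale per output channel (row).

    Illustrative baseline only; production PTQ methods refine the rounding
    and scales to minimize task-level quantization error.
    """
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4 bits
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate float weights."""
    return q.astype(np.float32) * scale

# Toy example: quantize a random weight matrix and inspect the error.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)
q, scale = quantize_per_channel(w, bits=4)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()
```

Because each row gets its own scale, the rounding error per weight is bounded by half a quantization step for that row, which is why per-channel schemes typically outperform a single per-tensor scale when channel magnitudes vary widely.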
Papers
QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks
Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, Christopher De Sa
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi