Post-Training Quantization
Post-training quantization (PTQ) aims to reduce the computational cost and memory footprint of large neural networks, particularly large language models (LLMs) and vision transformers (ViTs), without retraining. Current research focuses on preserving model accuracy at extremely low bit-widths (e.g., 2-4 bits) through techniques such as vector quantization, adaptive quantization schemes (e.g., per-channel and mixed-precision quantization), and optimization strategies that minimize quantization error by handling outliers and non-uniform activation distributions. This work is significant because efficient quantization is crucial for deploying large models on resource-constrained devices, broadening accessibility and reducing the environmental impact of AI.
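As a concrete illustration of the per-channel scheme mentioned above, the sketch below quantizes each output channel of a weight matrix to signed 4-bit integers with its own scale, so that an outlier in one channel does not inflate the quantization error of the others. The function names and the NumPy-based setup are illustrative assumptions, not code from any of the listed papers.

```python
import numpy as np

def quantize_per_channel(weights, n_bits=4):
    """Symmetric per-channel PTQ of a 2-D weight matrix (rows = output channels).

    Each channel gets its own scale derived from its maximum absolute value,
    which is a simple way to limit the influence of per-channel outliers.
    """
    qmax = 2 ** (n_bits - 1) - 1                       # e.g. 7 for signed 4-bit
    scales = np.abs(weights).max(axis=1, keepdims=True) / qmax
    scales = np.maximum(scales, 1e-8)                  # guard against all-zero channels
    q = np.clip(np.round(weights / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    """Map integer codes back to floating point for use at inference time."""
    return q.astype(np.float32) * scales

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(8, 64)).astype(np.float32)
    w[0, 0] = 12.0                                     # inject an outlier into one channel

    q, s = quantize_per_channel(w, n_bits=4)
    w_hat = dequantize(q, s)
    print("mean squared quantization error:", float(np.mean((w - w_hat) ** 2)))
```

Per-tensor quantization would instead use a single scale for the whole matrix; the per-channel variant shown here is one of the adaptive schemes the overview refers to, and mixed-precision methods go further by assigning different bit-widths to different layers or channels.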
Papers
Q-VLM: Post-training Quantization for Large Vision-Language Models
Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu
CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression
Wenyuan Liu, Xindian Ma, Peng Zhang, Yan Wang