Post-Training Quantization
Post-training quantization (PTQ) aims to reduce the computational cost and memory footprint of large neural networks, particularly large language models (LLMs) and vision transformers (ViTs), without retraining. Current research focuses on preserving accuracy at extremely low bit-widths (e.g., 2-4 bits) through techniques such as vector quantization, adaptive quantization schemes (e.g., per-channel or mixed-precision), and optimization strategies that minimize quantization error by handling outliers and skewed activation distributions. This work is significant because efficient quantization is crucial for deploying large models on resource-constrained devices, broadening accessibility and reducing the environmental impact of AI.
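For concreteness, the sketch below shows the simplest form of these ideas: symmetric, per-output-channel, round-to-nearest weight quantization in plain NumPy. This is a minimal baseline under stated assumptions, not the method of any paper listed below; the function names and the 4-bit setting are illustrative.

```python
# Minimal PTQ sketch: symmetric per-channel round-to-nearest weight quantization.
# Names and the 4-bit default are illustrative assumptions, not from a specific paper.
import numpy as np

def quantize_per_channel(weights: np.ndarray, n_bits: int = 4):
    """Quantize a 2-D weight matrix (out_channels x in_channels) to signed
    integers, choosing one scale per output channel."""
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 7 for signed 4-bit
    # One scale per output channel, set by that channel's maximum magnitude.
    scales = np.abs(weights).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)       # avoid division by zero
    q = np.clip(np.round(weights / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Map the integers back to floating point for error measurement."""
    return q.astype(np.float32) * scales

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(8, 16)).astype(np.float32)
    q, s = quantize_per_channel(w, n_bits=4)
    err = np.abs(w - dequantize(q, s)).mean()
    print(f"mean absolute quantization error: {err:.4f}")
```

Per-channel scales track each output channel's dynamic range, which is why they typically beat a single per-tensor scale; the outlier- and distribution-aware PTQ methods surveyed above refine exactly this choice of scale and rounding.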
Papers
Two Heads are Better Than One: Neural Networks Quantization with 2D Hilbert Curve-based Output Representation
Mykhailo Uss, Ruslan Yermolenko, Olena Kolodiazhna, Oleksii Shashko, Ivan Safonov, Volodymyr Savin, Yoonjae Yeo, Seowon Ji, Jaeyun Jeong
AdpQ: A Zero-shot Calibration Free Adaptive Post Training Quantization Method for LLMs
Alireza Ghaffari, Sharareh Younesian, Vahid Partovi Nia, Boxing Chen, Masoud Asgharian