Post-Training Quantization
Post-training quantization (PTQ) aims to reduce the computational cost and memory footprint of large neural networks, particularly large language models (LLMs) and vision transformers (ViTs), without retraining. Current research focuses on improving PTQ accuracy at extremely low bit-widths (e.g., 2-4 bits) through techniques such as vector quantization, adaptive quantization schemes (e.g., per-channel and mixed-precision quantization), and optimization strategies that minimize quantization error by handling outliers and non-uniform activation distributions. This work is significant because efficient quantization is crucial for deploying large models on resource-constrained devices, enabling broader accessibility and reducing the environmental impact of AI.
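To make the per-channel idea concrete, the following is a minimal sketch of symmetric per-channel weight quantization in NumPy. The function names, bit-width choices, and example data are illustrative assumptions and are not taken from any of the papers listed below; real PTQ pipelines typically add calibration, activation quantization, and outlier handling on top of this.

```python
import numpy as np

def quantize_per_channel(weights: np.ndarray, num_bits: int = 8):
    """Symmetric per-channel quantization of a 2-D weight matrix.

    Each output channel (row) gets its own scale, so a single outlier
    channel does not inflate the quantization error of all the others.
    """
    qmax = 2 ** (num_bits - 1) - 1                             # e.g. 127 for INT8
    max_abs = np.max(np.abs(weights), axis=1, keepdims=True)   # per-channel max magnitude
    scales = np.where(max_abs > 0, max_abs / qmax, 1.0)        # avoid division by zero
    q = np.clip(np.round(weights / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an approximate float tensor from integers and per-channel scales."""
    return q.astype(np.float32) * scales

# Example: quantize a random weight matrix and measure the reconstruction error.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 256)).astype(np.float32)
q, s = quantize_per_channel(w, num_bits=8)
err = np.mean((w - dequantize(q, s)) ** 2)
print(f"mean squared quantization error: {err:.2e}")
```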
Papers
Post-Training Quantization for Energy Efficient Realization of Deep Neural Networks
Cecilia Latotzke, Batuhan Balim, Tobias Gemmeke
Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed FP16-INT8 Post-Training Quantization
Manuele Rusci, Marco Fariselli, Martin Croome, Francesco Paci, Eric Flamand
Symmetry Regularization and Saturating Nonlinearity for Robust Quantization
Sein Park, Yeongsang Jang, Eunhyeok Park
CoNLoCNN: Exploiting Correlation and Non-Uniform Quantization for Energy-Efficient Low-precision Deep Convolutional Neural Networks
Muhammad Abdullah Hanif, Giuseppe Maria Sarda, Alberto Marchisio, Guido Masera, Maurizio Martina, Muhammad Shafique