Post-Training Quantization

Post-training quantization (PTQ) aims to reduce the computational cost and memory footprint of large neural networks, particularly large language models (LLMs) and vision transformers (ViTs), without retraining. Current research focuses on improving PTQ accuracy at extremely low bit-widths (e.g., 2-4 bits) through techniques such as vector quantization, adaptive quantization schemes (e.g., per-channel and mixed-precision), and optimization strategies that minimize quantization error caused by weight outliers and skewed activation distributions. This work is significant because efficient quantization is crucial for deploying large models on resource-constrained devices, enabling broader accessibility and reducing the environmental impact of AI.
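To make the basic idea concrete, the following is a minimal sketch of the simplest PTQ baseline: symmetric per-channel round-to-nearest quantization of a weight matrix, written with NumPy. All function names and parameters here are illustrative rather than taken from any specific paper; the optimization strategies surveyed above build on this kind of baseline by further compensating for the quantization error it leaves behind.

```python
import numpy as np

def quantize_per_channel(weights: np.ndarray, n_bits: int = 4):
    """Symmetric per-channel round-to-nearest quantization of a 2-D weight matrix.

    Each output channel (row) gets its own scale, so channels containing
    large-magnitude outliers do not inflate the quantization error of the rest.
    """
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 7 for signed 4-bit
    scales = np.abs(weights).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)       # guard against all-zero rows
    q = np.clip(np.round(weights / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Map integer codes back to floating point to measure reconstruction error."""
    return q.astype(np.float32) * scales

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(8, 64)).astype(np.float32)
    w[0, :4] *= 20.0                                  # inject outliers into one channel
    q, s = quantize_per_channel(w, n_bits=4)
    mse = np.mean((w - dequantize(q, s)) ** 2)
    print(f"4-bit per-channel reconstruction MSE: {mse:.6f}")
```

Because only the outlier-heavy channel receives a large scale, the remaining channels keep fine-grained quantization steps; a single per-tensor scale would instead spread that error across the whole matrix, which is why per-channel and outlier-aware schemes are common starting points for low-bit PTQ.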

Papers