Low-Precision Quantization

Low-precision quantization reduces the memory footprint and computational cost of deep neural networks by representing model weights and activations with fewer bits, enabling efficient deployment on resource-constrained devices. Current research centers on post-training quantization for large language models (LLMs) and vision transformers (ViTs), using techniques such as adaptive quantization, output-adaptive calibration, and bias compensation to limit accuracy loss at very low precision (e.g., 2-4 bits). These advances make it practical to run large, computationally expensive models on mobile and embedded hardware, lowering inference cost and broadening access to capable AI systems.
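As a concrete illustration, the sketch below shows uniform symmetric post-training quantization of a weight tensor to a given bit width, together with the bias-compensation idea mentioned above (absorbing the mean output shift caused by weight quantization into the layer bias). This is a minimal NumPy example, not any specific paper's method; the function names and the calibration statistic `x_mean` are illustrative assumptions.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, n_bits: int = 4):
    """Uniform symmetric per-tensor quantization to n_bits.

    Returns signed integer codes and the scale needed to
    dequantize: w_hat = codes * scale.
    """
    qmax = 2 ** (n_bits - 1) - 1           # e.g. 7 for signed 4-bit
    scale = np.abs(w).max() / qmax         # per-tensor scale
    codes = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return codes.astype(np.int8), scale

def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale

def bias_compensation(W, W_hat, b, x_mean):
    """Bias compensation (illustrative): for a layer y = x @ W.T + b,
    weight quantization shifts the expected output by
    x_mean @ (W - W_hat).T; fold that shift into the bias."""
    return b + x_mean @ (W - W_hat).T

# Example: mean quantization error at 4 bits vs. 2 bits
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(128, 256)).astype(np.float32)
for bits in (4, 2):
    codes, scale = quantize_symmetric(W, bits)
    err = np.abs(W - dequantize(codes, scale)).mean()
    print(f"{bits}-bit mean abs weight error: {err:.5f}")

# Bias compensation uses a mean activation from a small calibration set
b = np.zeros(128, dtype=np.float32)
x_mean = rng.normal(0.0, 1.0, size=256).astype(np.float32)
codes, scale = quantize_symmetric(W, 4)
b_corr = bias_compensation(W, dequantize(codes, scale), b, x_mean)
```

The printed errors make the accuracy/precision trade-off visible: halving the bit width roughly quadruples the reconstruction error for a Gaussian weight distribution, which is why calibration and compensation techniques become important below 4 bits.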

Papers