Quantization-Aware Training
Quantization-aware training (QAT) improves the efficiency of deep learning models by exposing them to low-precision numerical representations (e.g., 4-bit or 8-bit integers) during training, so that the model learns to compensate for quantization error and loses little accuracy compared to its full-precision counterpart. Current research focuses on applying QAT to transformer-based large language models (LLMs) and other resource-intensive architectures such as diffusion models, exploring techniques such as mixed-precision quantization, accumulator-aware quantization, and novel quantization functions and regularization methods that improve accuracy and training stability. This work is significant because it enables powerful deep learning models to run on resource-constrained devices, such as mobile phones and embedded systems, while also reducing energy consumption and computational cost.
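To make the core idea concrete, below is a minimal PyTorch sketch of the standard fake-quantization recipe with a straight-through estimator (STE) that underlies most QAT methods. It is illustrative only and not drawn from the papers listed here; the names `fake_quantize` and `QATLinear`, the symmetric per-tensor scaling, and the toy training loop are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Simulate symmetric uniform quantization of a tensor during training."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax            # per-tensor scale (illustrative choice)
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: the forward pass uses the quantized weights,
    # while the backward pass treats quantization as the identity so gradients flow.
    return w + (w_q - w).detach()

class QATLinear(nn.Module):
    """Linear layer that trains against its own low-precision (fake-quantized) weights."""
    def __init__(self, in_features: int, out_features: int, bits: int = 8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.bits = bits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, fake_quantize(self.weight, self.bits), self.bias)

# Toy training loop: the loss is computed with quantized weights, so the
# full-precision master weights learn to compensate for rounding error.
model = QATLinear(16, 4, bits=8)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(128, 16), torch.randn(128, 4)
for _ in range(20):
    loss = F.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The methods surveyed above refine this basic scheme, for example by learning the rounding or scaling behavior, mixing precisions across layers, or constraining accumulator bit-widths, rather than using a fixed per-tensor round-to-nearest step as in this sketch.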
Papers
Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding
Zi Yang, Samridhi Choudhary, Siegfried Kunzmann, Zheng Zhang
FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization
Jung Hyun Lee, Jeonghoon Kim, Se Jung Kwon, Dongsoo Lee