Quantization-Aware Training
Quantization-aware training (QAT) improves the efficiency of deep learning models by training them to operate directly on low-precision numerical representations (e.g., 4-bit or 8-bit integers) while minimizing the accuracy loss relative to full-precision baselines. Current research focuses on applying QAT to large language models (LLMs) and other resource-intensive architectures such as transformers and diffusion models, exploring techniques like mixed-precision quantization, accumulator-aware quantization, and novel quantization functions and regularization methods that improve accuracy and training stability. This work matters because it enables powerful deep learning models to be deployed on resource-constrained devices, such as mobile phones and embedded systems, while also reducing energy consumption and computational cost.
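As a minimal illustration of the core QAT mechanic described above (not drawn from any of the papers listed below), the following PyTorch sketch applies per-tensor fake quantization to a layer's weights during training and uses a straight-through estimator so gradients still reach the full-precision weights. The class names `FakeQuantize` and `QATLinear`, the 4-bit asymmetric scheme, and the per-tensor min/max scaling are illustrative assumptions, not a specific method from these papers.

```python
import torch
import torch.nn as nn


class FakeQuantize(torch.autograd.Function):
    """Uniform asymmetric fake quantization with a straight-through estimator (STE)."""

    @staticmethod
    def forward(ctx, x, num_bits):
        qmin, qmax = 0, 2 ** num_bits - 1
        # Per-tensor scale/zero-point from the observed min/max range.
        scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
        zero_point = qmin - torch.round(x.min() / scale)
        q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
        # Dequantize so downstream computation stays in floating point.
        return (q - zero_point) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pass gradients straight through the non-differentiable rounding step.
        return grad_output, None


class QATLinear(nn.Module):
    """Linear layer whose weights are fake-quantized on every forward pass."""

    def __init__(self, in_features, out_features, num_bits=4):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.num_bits = num_bits

    def forward(self, x):
        w_q = FakeQuantize.apply(self.linear.weight, self.num_bits)
        return nn.functional.linear(x, w_q, self.linear.bias)


# Training proceeds as usual: the loss is computed on quantized weights,
# while gradient updates are applied to the underlying full-precision weights.
layer = QATLinear(16, 8, num_bits=4)
out = layer(torch.randn(2, 16))
out.sum().backward()
```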
Papers
BinaryDM: Accurate Weight Binarization for Efficient Diffusion Models
Xingyu Zheng, Xianglong Liu, Haotong Qin, Xudong Ma, Mingyuan Zhang, Haojie Hao, Jiakai Wang, Zixiang Zhao, Jinyang Guo, Michele Magno
Investigating the Impact of Quantization on Adversarial Robustness
Qun Li, Yuan Meng, Chen Tang, Jiacheng Jiang, Zhi Wang
EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge
Xuan Shen, Zhenglun Kong, Changdi Yang, Zhaoyang Han, Lei Lu, Peiyan Dong, Cheng Lyu, Chih-hsiang Li, Xuehang Guo, Zhihao Shu, Wei Niu, Miriam Leeser, Pu Zhao, Yanzhi Wang
BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation
Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, Ningyi Xu