Multiplier-Free Quantization
Multiplier-free quantization aims to reduce the computational cost and memory footprint of deep learning models, particularly large language models (LLMs) and vision transformers, by representing weights and activations at low bit-widths, ideally so that inference can rely on cheap operations such as shifts and additions instead of full-precision multiplications, without significant accuracy loss. Current research focuses on novel quantization algorithms spanning post-training quantization (PTQ) and quantization-aware training (QAT), often combined with techniques such as activation smoothing and outlier management to limit the accuracy degradation that appears at low bit-widths. This research is crucial for deploying large, computationally expensive models on resource-constrained devices such as mobile phones and edge computing platforms, thereby broadening the accessibility and applicability of advanced AI systems.
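For orientation, the sketch below shows the basic mechanism the papers in this collection build on: a symmetric post-training quantizer that maps float weights to low-bit integers, with an optional power-of-two scale so that rescaling can be done with bit shifts rather than multiplications. This is a minimal illustrative example, not the method of any listed paper; the function and parameter names are hypothetical, and real systems add calibration data, activation smoothing, and outlier handling on top of this.

```python
# Minimal sketch of symmetric post-training quantization (PTQ) for a weight
# tensor. With power_of_two=True the scale is rounded to a power of two, so
# rescaling can be implemented as a shift (the "multiplier-free" variant).
# Names are illustrative, not taken from any specific paper.
import numpy as np


def quantize_weights(w: np.ndarray, n_bits: int = 4, power_of_two: bool = True):
    """Quantize a float tensor to signed integers with a symmetric scale."""
    qmax = 2 ** (n_bits - 1) - 1            # e.g. 7 for 4-bit signed
    scale = np.abs(w).max() / qmax          # real-valued scale factor
    if power_of_two:
        # Snap the scale to the nearest power of two so that applying it
        # (or its inverse) amounts to a bit shift rather than a multiply.
        scale = 2.0 ** np.round(np.log2(scale))
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map integer codes back to the real domain (q * scale)."""
    return q.astype(np.float32) * scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=(64, 64)).astype(np.float32)
    q, s = quantize_weights(w, n_bits=4, power_of_two=True)
    w_hat = dequantize(q, s)
    print("scale:", s, "max abs error:", np.abs(w - w_hat).max())
```

QAT methods differ from this PTQ sketch mainly in that the rounding step is simulated during training (with a straight-through or differentiable surrogate gradient) so the network learns to compensate for quantization error.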
Papers
Intriguing Properties of Quantization at Scale
Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Stephen Gou, Phil Blunsom, Ahmet Üstün, Sara Hooker
Low Precision Quantization-aware Training in Spiking Neural Networks with Differentiable Quantization Function
Ayan Shymyrbay, Mohammed E. Fouda, Ahmed Eltawil
Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective
Yuexiao Ma, Huixia Li, Xiawu Zheng, Xuefeng Xiao, Rui Wang, Shilei Wen, Xin Pan, Fei Chao, Rongrong Ji
Fighting over-fitting with quantization for learning deep neural networks on noisy labels
Gauthier Tallec, Edouard Yvinec, Arnaud Dapogny, Kevin Bailly