Neural Network Quantization
Neural network quantization aims to reduce the memory footprint and computational cost of deep learning models by representing their weights and activations with lower-precision numbers (e.g., 1-bit, 2-bit, or 4-bit values), thereby enabling deployment on resource-constrained devices. Current research focuses on efficient quantization algorithms, including mixed-precision techniques that assign different bitwidths to different layers, and on quantization schemes beyond uniform quantization that minimize accuracy loss. This work matters for the practical deployment of large language models and other computationally intensive neural networks, with applications ranging from mobile devices to energy-efficient edge computing.
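To make the idea concrete, here is a minimal sketch of symmetric uniform quantization applied with a different bitwidth per layer. It is an illustration under stated assumptions, not the method of any paper listed on this page: the function names (`quantize_uniform`, `mean_squared_error`), the layer names, the weight values, and the bitwidth assignment are all hypothetical, and the per-layer bitwidths are picked by hand rather than learned.

```python
def quantize_uniform(weights, bits):
    """Symmetric uniform quantization of a list of floats (illustrative sketch).

    Each weight is mapped to an integer code in [-qmax, qmax], where
    qmax = 2**(bits - 1) - 1, and then mapped back to a float.
    """
    if bits < 2:
        raise ValueError("this sketch assumes at least 2 bits")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    # The scale is the spacing between adjacent quantization levels.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    dequantized = [c * scale for c in codes]
    return codes, scale, dequantized


def mean_squared_error(original, approximated):
    """Average squared difference between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(original, approximated)) / len(original)


# Hypothetical toy "layers"; real models hold millions of weights per layer.
layers = {
    "embedding": [0.12, -0.53, 0.98, -0.07, 0.41],
    "output": [-0.88, 0.25, 0.02, 0.67, -0.31],
}

# Hand-picked mixed-precision assignment (an assumption for illustration).
bitwidths = {"embedding": 8, "output": 4}

for name, weights in layers.items():
    codes, scale, dequantized = quantize_uniform(weights, bitwidths[name])
    error = mean_squared_error(weights, dequantized)
    print(f"{name}: bits={bitwidths[name]} codes={codes} mse={error:.6f}")
```

Lower bitwidths leave fewer levels and therefore a coarser spacing between them, which is the source of the accuracy loss that mixed-precision and non-uniform schemes try to limit.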
Papers
Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition
Junhao Xu, Jianwei Yu, Shoukang Hu, Xunying Liu, Helen Meng
Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers
Junhao Xu, Xie Chen, Shoukang Hu, Jianwei Yu, Xunying Liu, Helen Meng