Mixed Precision Quantization
Mixed-precision quantization improves the efficiency of deep neural networks by assigning different numerical precisions (bit-widths) to individual layers or components, trading off accuracy against resource consumption. Current research focuses on algorithms that automatically determine bit-width allocations for architectures such as transformers (including Vision Transformers and Large Language Models) and convolutional neural networks, often combining quantization-aware training with hardware-aware optimization. The technique is crucial for deploying large models on resource-constrained devices such as embedded systems and mobile platforms, improving both the accessibility and the energy efficiency of AI applications.
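To make the idea concrete, the following is a minimal PyTorch sketch of per-layer bit-width assignment using simple symmetric uniform weight quantization. The model, layer names, and bit-width configuration are hypothetical illustrations, not drawn from the papers listed below, which propose more sophisticated allocation and system designs.

```python
# Minimal sketch: assign different weight bit-widths to different layers.
# Layer names and the bit_config mapping are hypothetical examples.
import torch
import torch.nn as nn

def quantize_weight(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                     # e.g. 7 for 4-bit, 127 for 8-bit
    scale = w.abs().max().clamp(min=1e-8) / qmax   # per-tensor scale
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(128, 256)
        self.head = nn.Linear(256, 10)

    def forward(self, x):
        return self.head(torch.relu(self.backbone(x)))

# Hypothetical sensitivity-driven allocation: a less sensitive layer gets
# 4 bits, a more accuracy-critical layer keeps 8 bits.
bit_config = {"backbone": 4, "head": 8}

model = TinyNet()
with torch.no_grad():
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and name in bit_config:
            module.weight.copy_(quantize_weight(module.weight, bit_config[name]))

x = torch.randn(1, 128)
print(model(x).shape)  # torch.Size([1, 10])
```

In practice, the bit-width map is not hand-picked as above but produced by a search or sensitivity analysis (e.g. measuring per-layer quantization error or hardware cost), which is the allocation problem the papers below address.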
Papers
Qua$^2$SeDiMo: Quantifiable Quantization Sensitivity of Diffusion Models
Keith G. Mills, Mohammad Salameh, Ruichen Chen, Negar Hassanpour, Wei Lu, Di Niu
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
Zhen Zheng, Xiaonan Song, Chuanjie Liu