Mixed Precision Quantization

Mixed-precision quantization improves the efficiency of deep neural networks by assigning different numerical precisions (bit-widths) to individual layers or operations, trading off accuracy against memory and compute cost. Current research focuses on algorithms that automatically determine optimal bit-width allocations for diverse architectures, including transformers (e.g., Vision Transformers and Large Language Models) and convolutional neural networks, often combined with quantization-aware training and hardware-aware optimization. The technique is crucial for deploying large models on resource-constrained devices such as embedded systems and mobile platforms, affecting both the accessibility and the energy efficiency of AI applications.
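
As a rough, self-contained illustration of the core idea (not a method from any specific paper listed below), the NumPy sketch below applies uniform symmetric fake quantization per layer and uses a toy sensitivity rule to pick each layer's bit-width: layers whose 4-bit quantization error exceeds an assumed threshold are promoted to 8 bits. The layer names, weight shapes, and threshold are illustrative assumptions, not part of any published allocation algorithm.

```python
import numpy as np

def fake_quantize(w, bits):
    """Uniform symmetric quantization to a signed integer grid of the given
    bit-width, then dequantization back to float (simulated quantization)."""
    qmax = 2 ** (bits - 1) - 1               # e.g. 127 for 8-bit, 7 for 4-bit
    scale = np.max(np.abs(w)) / qmax         # per-tensor scale factor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def quantization_mse(w, bits):
    """Mean-squared error introduced by quantizing w to the given bit-width."""
    return float(np.mean((w - fake_quantize(w, bits)) ** 2))

# Hypothetical layers of a small network (random weights stand in for real ones).
rng = np.random.default_rng(0)
layers = {
    "conv1": rng.standard_normal((64, 3, 3, 3)),
    "conv2": rng.standard_normal((128, 64, 3, 3)) * 0.02,
    "fc":    rng.standard_normal((10, 128)) * 0.5,
}

# Toy sensitivity-based allocation: a layer whose 4-bit error exceeds an
# (assumed) threshold is promoted to 8 bits; the rest stay at 4 bits.
THRESHOLD = 1e-3
allocation = {}
for name, w in layers.items():
    err4 = quantization_mse(w, 4)
    allocation[name] = 8 if err4 > THRESHOLD else 4
    print(f"{name}: 4-bit MSE = {err4:.2e} -> assigned {allocation[name]} bits")
```

In practice, sensitivity would be estimated from trained weights and calibration data (often with loss- or curvature-based measures rather than raw weight MSE), and the search over allocations is typically constrained by a hardware cost model, which is where quantization-aware training and hardware-aware optimization come in.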

Papers