Model Quantization

Model quantization reduces the computational cost and memory footprint of deep learning models, particularly large language models (LLMs) and vision transformers (ViTs), by representing weights and activations at lower numerical precision, which enables deployment on resource-constrained devices. Current research focuses on efficient quantization methods, including post-training quantization (PTQ) and quantization-aware training (QAT), often combined with knowledge distillation, mixed-precision quantization, and adaptive outlier handling to minimize accuracy loss. The work matters because it makes powerful models deployable in energy-efficient and privacy-preserving ways, with applications ranging from speech recognition and image processing to on-device AI on mobile hardware.
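To make the basic idea concrete, the following is a minimal sketch of symmetric, per-tensor 8-bit quantization, the simplest form of post-training quantization. It is an illustration under stated assumptions, not a method taken from any particular paper: the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical, and practical systems typically add per-channel scales, calibration data, and outlier handling.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] with one shared scale.

    Assumption: symmetric per-tensor scheme (no zero point), chosen for brevity.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, used only to exercise the functions.
weights = [0.42, -1.27, 0.05, 0.9]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the per-value error by about scale / 2.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each value is stored as an 8-bit code plus one shared floating-point scale, so the memory saving over 32-bit floats approaches 4x for large tensors, at the price of a rounding error of at most about half the scale per value.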

Papers