Quantization Operator
Quantization is a model compression technique that reduces the precision of numerical representations in neural networks, aiming to decrease computational costs and memory footprint while preserving model accuracy. Current research focuses on applying quantization to various deep learning architectures, including Vision Transformers (ViTs), large language models (LLMs), and diffusion models, often employing post-training quantization (PTQ) methods to avoid retraining the entire model. This work is significant because it enables the deployment of large, computationally expensive models on resource-constrained devices, impacting fields like healthcare, edge computing, and natural language processing by making advanced AI more accessible and efficient.
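As a rough illustration of the core idea (a sketch, not code from any paper listed below), the snippet implements plain uniform affine post-training quantization of a weight tensor in NumPy; the helper names quantize_uniform_affine and dequantize are hypothetical.

```python
import numpy as np

def quantize_uniform_affine(x, num_bits=8):
    """Map a float tensor to unsigned integer codes with a uniform affine scheme.

    Returns the integer codes plus the (scale, zero_point) needed to dequantize.
    Assumes num_bits <= 8 so the codes fit in uint8.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    # Guard against a zero range when the tensor is constant.
    scale = max(x_max - x_min, 1e-8) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return scale * (q.astype(np.float32) - zero_point)

if __name__ == "__main__":
    weights = np.random.randn(4, 4).astype(np.float32)  # stand-in for a layer's weights
    q, scale, zp = quantize_uniform_affine(weights, num_bits=8)
    error = np.abs(weights - dequantize(q, scale, zp)).max()
    print(f"max reconstruction error at 8 bits: {error:.5f}")
```

Mixed-precision PTQ methods, such as the first paper listed under Papers, build on this basic per-tensor scheme by assigning different bit widths to different parts of the network.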
Papers
A Practical Mixed Precision Algorithm for Post-Training Quantization
Nilesh Prasad Pandey, Markus Nagel, Mart van Baalen, Yin Huang, Chirag Patel, Tijmen Blankevoort
Feature Affinity Assisted Knowledge Distillation and Quantization of Deep Neural Networks on Label-Free Data
Zhijian Li, Biao Yang, Penghang Yin, Yingyong Qi, Jack Xin