Quantization Operator
Quantization is a model compression technique that lowers the numerical precision of a neural network's weights and activations (for example, from 32-bit floating point to 8-bit integers), reducing computational cost and memory footprint while aiming to preserve accuracy. Current research applies quantization to a range of deep learning architectures, including Vision Transformers (ViTs), large language models (LLMs), and diffusion models, often using post-training quantization (PTQ) to avoid retraining the model. The work matters because it allows large, computationally expensive models to run on resource-constrained devices, making advanced AI more accessible and efficient in areas such as healthcare, edge computing, and natural language processing.
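The precision reduction described above can be illustrated with a minimal sketch of symmetric, per-tensor 8-bit quantization applied after training. It is not taken from any of the papers listed here; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization of floats to signed 8-bit codes.

    Illustrative assumption: a single scale is shared by the whole tensor.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range used here.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate float values."""
    return [q * scale for q in quantized]


# Hypothetical weights standing in for a trained layer.
weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding error per value is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored as a one-byte integer plus one shared scale, a roughly fourfold memory saving over 32-bit floats. The cost is a rounding error bounded by half the scale. Practical PTQ methods typically refine this basic scheme, for instance with per-channel scales or calibration data.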
Papers
Intriguing Properties of Quantization at Scale
Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Stephen Gou, Phil Blunsom, Ahmet Üstün, Sara Hooker
Stochastic Gradient Langevin Dynamics Based on Quantization with Increasing Resolution
Jinwuk Seok, Changsik Cho
PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models
Zhuocheng Gong, Jiahao Liu, Qifan Wang, Yang Yang, Jingang Wang, Wei Wu, Yunsen Xian, Dongyan Zhao, Rui Yan