Quantization Operator
Quantization is a model compression technique that reduces the precision of numerical representations in neural networks, aiming to decrease computational cost and memory footprint while preserving model accuracy. Current research focuses on applying quantization to a range of deep learning architectures, including Vision Transformers (ViTs), large language models (LLMs), and diffusion models, often using post-training quantization (PTQ) methods to avoid retraining the entire model. This work matters because it enables large, computationally expensive models to run on resource-constrained devices, making advanced AI more accessible and efficient in areas such as healthcare, edge computing, and natural language processing.
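To make the basic idea concrete, the snippet below is a minimal sketch of uniform affine (asymmetric) quantization of a weight tensor to 8-bit integers in NumPy. The function names and the per-tensor min/max calibration are illustrative assumptions for this sketch, not the method of any paper listed here.

```python
import numpy as np

def quantize_uniform_affine(x: np.ndarray, num_bits: int = 8):
    """Map float values to unsigned integers using a scale and zero-point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    # Guard against a degenerate range (e.g., a constant tensor).
    scale = (x_max - x_min) / (qmax - qmin) if x_max > x_min else 1.0
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate float values from the integer representation."""
    return (q.astype(np.float32) - zero_point) * scale

if __name__ == "__main__":
    weights = np.random.randn(4, 4).astype(np.float32)  # stand-in for a layer's weights
    q, scale, zp = quantize_uniform_affine(weights, num_bits=8)
    reconstructed = dequantize(q, scale, zp)
    print("max abs reconstruction error:", np.abs(weights - reconstructed).max())
```

In post-training quantization, scales and zero-points like these are calibrated from an already-trained model (and, for activations, a small calibration set) rather than learned through retraining; the papers below study how to choose such parameters and representations more carefully for specific architectures and deployment settings.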
Papers
COMQ: A Backpropagation-Free Algorithm for Post-Training Quantization
Aozhong Zhang, Zi Yang, Naigang Wang, Yingyong Qi, Jack Xin, Xin Li, Penghang Yin
What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation
Zhuocheng Gong, Jiahao Liu, Jingang Wang, Xunliang Cai, Dongyan Zhao, Rui Yan
FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization
Shuai Tan, Bin Ji, Ye Pan
HEQuant: Marrying Homomorphic Encryption and Quantization for Communication-Efficient Private Inference
Tianshi Xu, Meng Li, Runsheng Wang
LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection
Sifan Zhou, Liang Li, Xinyu Zhang, Bo Zhang, Shipeng Bai, Miao Sun, Ziyu Zhao, Xiaobo Lu, Xiangxiang Chu