Quantization Operator
Quantization is a model compression technique that reduces the precision of numerical representations in neural networks, aiming to lower computational cost and memory footprint while preserving model accuracy. Current research focuses on applying quantization to a range of deep learning architectures, including Vision Transformers (ViTs), large language models (LLMs), and diffusion models, often through post-training quantization (PTQ) methods that avoid retraining the entire model. This work matters because it enables the deployment of large, computationally expensive models on resource-constrained devices, impacting fields such as healthcare, edge computing, and natural language processing by making advanced AI more accessible and efficient.
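To make the core idea concrete, below is a minimal sketch of symmetric per-tensor int8 post-training quantization of a weight matrix. It is illustrative only and not taken from any of the listed papers; the function names quantize_int8 and dequantize are assumptions for the example.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 PTQ: map float weights onto the integer range [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    scale = max(scale, 1e-12)  # guard against an all-zero tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

# Example: quantize a random weight matrix and measure the rounding error it introduces.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())
```

Real PTQ pipelines refine this basic scheme, for example with per-channel scales, activation calibration on a small dataset, or equalization between weights and activations as in the AWEQ paper below.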
Papers
AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models
Baisong Li, Xingwang Wang, Haixiao Xu
Effective Quantization for Diffusion Models on CPUs
Hanwen Chang, Haihao Shen, Yiyang Cai, Xinyu Ye, Zhenzhong Xu, Wenhua Cheng, Kaokao Lv, Weiwei Zhang, Yintong Lu, Heng Guo
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, Tuo Zhao
A Carbon Tracking Model for Federated Learning: Impact of Quantization and Sparsification
Luca Barbieri, Stefano Savazzi, Sanaz Kianoush, Monica Nicoli, Luigi Serio