Multiplier-Free Quantization
Multiplier-free quantization aims to reduce the computational cost and memory footprint of deep learning models, particularly large language models (LLMs) and vision transformers, by representing weights and activations at low bit-widths, often in hardware-friendly formats such as powers of two or small integers, so that expensive multiplications can be replaced by cheaper operations like bit-shifts and additions without significant accuracy loss. Current research focuses on novel quantization algorithms, spanning post-training quantization (PTQ) and quantization-aware training (QAT), and often incorporates techniques such as activation smoothing and outlier management to mitigate accuracy degradation at low bit-widths. This work is crucial for deploying large, computationally expensive models on resource-constrained devices, such as mobile phones and edge computing platforms, thereby broadening the accessibility and applicability of advanced AI systems.
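As a rough illustration of where the multiplier savings come from, the sketch below rounds a weight tensor to signed powers of two, so that each multiply-accumulate could in principle be replaced by a bit-shift and an add. It is a minimal NumPy sketch under simplified assumptions; the function name `quantize_pow2`, the 4-bit exponent budget, and the clipping scheme are illustrative choices and are not taken from the papers listed below.

```python
import numpy as np

def quantize_pow2(weights, bits=4):
    """Round weights to the nearest signed power of two (illustrative PTQ sketch).

    With power-of-two values, a hardware multiply can be replaced by a bit-shift.
    The exponent budget of 2**(bits-1) - 1 levels below the maximum is a
    simplifying assumption, not a scheme from any specific paper.
    """
    sign = np.sign(weights)
    magnitude = np.abs(weights)
    nonzero = magnitude > 0  # zeros stay exactly zero after quantization

    exponents = np.zeros_like(magnitude)
    exponents[nonzero] = np.round(np.log2(magnitude[nonzero]))

    # Clip exponents to a small range so they fit the assumed bit budget.
    max_exp = exponents[nonzero].max() if nonzero.any() else 0.0
    min_exp = max_exp - (2 ** (bits - 1) - 1)
    exponents = np.clip(exponents, min_exp, max_exp)

    return sign * np.where(nonzero, 2.0 ** exponents, 0.0)

# Usage: quantize a random weight matrix and inspect the reconstruction error.
w = np.random.randn(64, 64).astype(np.float32)
w_q = quantize_pow2(w, bits=4)
print("mean abs error:", np.abs(w - w_q).mean())
```

In practice, PTQ and QAT methods calibrate such scales or exponents per channel and combine them with activation smoothing and outlier handling; this toy version only shows why restricting values to shift-friendly formats removes the need for full multipliers.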
Papers
RepQ: Generalizing Quantization-Aware Training for Re-Parametrized Architectures
Anastasiia Prutianova, Alexey Zaytsev, Chung-Kuei Lee, Fengyu Sun, Ivan Koryakovskiy
Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization
Jangwhan Lee, Minsoo Kim, Seungcheol Baek, Seok Joong Hwang, Wonyong Sung, Jungwook Choi