Activation Quantization

Activation quantization aims to reduce the memory footprint and computational cost of large neural networks, particularly large language models (LLMs) and vision transformers (ViTs), by representing activations with fewer bits while preserving accuracy. The main obstacle is outlier activations: a few channels with very large magnitudes inflate the quantization range and leave little precision for the remaining values. Current research mitigates these outliers through techniques such as rotation, outlier preservation, and channel-wise quantization, often combined with weight quantization and parameter-efficient fine-tuning methods such as LoRA. These advances are crucial for deploying increasingly complex models on resource-constrained devices and for improving the efficiency of large-scale model training and inference.
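
To make the outlier problem concrete, the sketch below (an illustrative example, not taken from any particular paper; the array shapes, the injected outlier channel, and the symmetric INT8 scheme are assumptions) contrasts per-tensor and channel-wise activation quantization on a matrix where one channel has unusually large magnitudes, mirroring the activation patterns observed in LLMs.

```python
# Minimal NumPy sketch: symmetric INT8 quantization, per-tensor vs. per-channel,
# on activations with a single injected outlier channel.
import numpy as np

def fake_quantize(x, scale):
    """Quantize to INT8 with round-to-nearest, then dequantize for error measurement."""
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=(128, 64)).astype(np.float32)  # [tokens, channels]
acts[:, 7] *= 50.0  # hypothetical outlier channel with inflated magnitude

# Per-tensor: a single scale is driven by the outlier, coarsening all channels.
scale_tensor = np.abs(acts).max() / 127.0
err_tensor = np.abs(fake_quantize(acts, scale_tensor) - acts).mean()

# Channel-wise: one scale per channel, so the outlier no longer
# degrades the quantization step of well-behaved channels.
scale_channel = np.abs(acts).max(axis=0, keepdims=True) / 127.0
err_channel = np.abs(fake_quantize(acts, scale_channel) - acts).mean()

print(f"per-tensor  mean abs error: {err_tensor:.4f}")
print(f"per-channel mean abs error: {err_channel:.4f}")
```

Running the sketch shows a much lower mean absolute error for the channel-wise scheme, which is the motivation behind channel-wise quantization and, more generally, behind rotation and outlier-preservation methods that redistribute or isolate outliers before quantizing.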

Papers