Quantization Scale
Quantization scale optimization seeks the scaling factors that map full-precision neural network weights and activations to low-bit representations, decreasing model size and improving inference speed without significant accuracy loss. Current research focuses on efficient algorithms, such as evolutionary search and contrastive learning, for finding optimal quantization scales, particularly for large language models (LLMs) and vision transformers (ViTs), often in combination with techniques like vector quantization and lookup tables. These advances are crucial for deploying large models on resource-constrained devices and accelerating inference, improving both the efficiency of AI systems and their accessibility.
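To make the idea concrete, below is a minimal sketch of per-tensor symmetric int8 quantization, where a single scale maps the largest weight magnitude to the int8 range. The grid search over candidate scales that minimizes reconstruction error is only a simple stand-in for the evolutionary and contrastive search methods mentioned above; all function names and parameters here are illustrative assumptions, not any specific paper's method.

```python
import numpy as np

def symmetric_int8_scale(w: np.ndarray) -> float:
    """Per-tensor symmetric scale: map the largest magnitude to the int8 limit."""
    return float(np.abs(w).max()) / 127.0

def quantize(w: np.ndarray, scale: float) -> np.ndarray:
    """Round to the nearest int8 level and clip to the representable range."""
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

def search_scale(w: np.ndarray, num_candidates: int = 100) -> float:
    """Grid search for the scale minimizing mean-squared reconstruction error
    (a toy substitute for evolutionary or learned scale search)."""
    base = symmetric_int8_scale(w)
    best_scale, best_err = base, np.inf
    for frac in np.linspace(0.5, 1.0, num_candidates):
        s = base * frac
        err = np.mean((w - dequantize(quantize(w, s), s)) ** 2)
        if err < best_err:
            best_scale, best_err = s, err
    return best_scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in weight matrix
    s = search_scale(w)
    q = quantize(w, s)
    print("chosen scale:", s,
          "reconstruction MSE:", float(np.mean((w - dequantize(q, s)) ** 2)))
```

Shrinking the scale below the max-magnitude baseline clips a few outliers but reduces rounding error for the bulk of the weights, which is why searching over scales typically beats the naive choice.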
Papers
July 15, 2024
February 23, 2024
August 21, 2023
May 24, 2023
March 23, 2023