Extreme Quantization

Extreme quantization aims to drastically reduce the memory footprint and computational cost of neural networks by representing weights and activations with very few bits (e.g., 1 to 3 bits), and it is a rapidly developing area of research. Current efforts focus on improving the accuracy of extremely quantized models, particularly large language models (LLMs) and vision transformers (ViTs), through novel techniques such as vector quantization, trellis-coded quantization, and adaptive methods that account for the distinct properties of individual layers or even individual instances. This research is crucial for deploying large models on resource-constrained devices such as mobile phones and edge computing platforms, broadening accessibility and efficiency across applications ranging from scientific computing to machine learning at the edge.
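
To make the basic idea concrete, the sketch below shows one simple form of 1-bit weight quantization: each weight is reduced to its sign, and a single per-tensor scale (here the mean absolute value) preserves overall magnitude. This is only an illustrative example of the general technique, not the method of any particular paper in this collection; the function names and the choice of scale are assumptions for illustration.

```python
import numpy as np

def binarize_weights(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize a weight tensor to 1 bit per value (illustrative sketch).

    Each weight is replaced by its sign, and a single per-tensor scale
    (the mean absolute value) keeps the reconstruction close in magnitude.
    """
    alpha = float(np.abs(w).mean())   # per-tensor scale factor (assumed absmean scaling)
    w_bin = np.sign(w)                # values in {-1, 0, +1}
    w_bin[w_bin == 0] = 1.0           # map exact zeros to +1 so only one bit is needed
    return w_bin, alpha

def dequantize(w_bin: np.ndarray, alpha: float) -> np.ndarray:
    """Reconstruct an approximate full-precision tensor from the 1-bit codes."""
    return alpha * w_bin

# Example: measure the reconstruction error on a random weight matrix
w = np.random.randn(256, 256).astype(np.float32)
w_bin, alpha = binarize_weights(w)
w_hat = dequantize(w_bin, alpha)
print("mean absolute reconstruction error:", np.abs(w - w_hat).mean())
```

The papers below go well beyond this naive scheme (e.g., vector or trellis-coded codebooks instead of a single scalar scale, and layer- or instance-adaptive bit allocation), but the memory arithmetic is the same: 1-bit weights occupy roughly 1/16 of the space of FP16 weights, plus the small overhead of the scales or codebooks.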

Papers