Ternary Quantization

Ternary quantization is a model compression technique that reduces the memory footprint and computational cost of deep neural networks by restricting weights to three values, -1, 0, and +1, often combined with a shared scaling factor. Current research focuses on improving the accuracy of ternary-quantized models, particularly vision transformers (ViTs) and large language models (LLMs), through optimized quantization algorithms (e.g., residual error expansion or hyperspherical learning) and refined training methods (e.g., quantization-aware training). This research is significant because it enables powerful deep learning models to be deployed on resource-constrained devices, broadening the accessibility and applicability of AI across domains.
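
The core operation can be illustrated with a threshold-based ternarizer in the style of Ternary Weight Networks: weights whose magnitude falls below a threshold are zeroed, and the remaining weights keep only their sign plus a single shared scaling factor. The sketch below is a minimal NumPy illustration of this idea under those assumptions, not the implementation of any specific paper; the 0.7 threshold factor is the common heuristic from that line of work, and `ternarize` is a hypothetical helper name.

```python
import numpy as np

def ternarize(w: np.ndarray, delta_factor: float = 0.7):
    """Ternarize a weight tensor to {-alpha, 0, +alpha}.

    Threshold-based scheme: entries with |w| <= delta are zeroed; the rest
    keep their sign and share one scaling factor alpha. delta_factor = 0.7
    is a common heuristic approximation of the optimal threshold.
    """
    delta = delta_factor * np.mean(np.abs(w))           # per-tensor threshold
    ternary = np.sign(w) * (np.abs(w) > delta)          # values in {-1, 0, +1}
    mask = ternary != 0
    alpha = float(np.mean(np.abs(w[mask]))) if mask.any() else 0.0  # scale for nonzero entries
    return ternary.astype(np.int8), alpha               # 2-bit codes plus one float scale

# Example: quantize a random layer and inspect sparsity and approximation error.
w = np.random.randn(256, 256).astype(np.float32)
t, alpha = ternarize(w)
w_hat = alpha * t.astype(np.float32)
print("sparsity:", np.mean(t == 0), " MSE:", np.mean((w - w_hat) ** 2))
```

Storing the codes as int8 (or packing them into 2 bits) plus one floating-point scale is what yields the memory savings; at inference, multiplications by {-1, 0, +1} reduce to sign flips, skips, and a single rescale.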

Papers