Weight Quantization
Weight quantization is a model compression technique that reduces the memory footprint and computational cost of deep neural networks by representing weights at lower precision (e.g., 2-bit or 4-bit integers instead of 32-bit floats). Current research focuses on quantization methods for a range of architectures, including large language models (LLMs), vision transformers (ViTs), and spiking neural networks (SNNs), often combining techniques such as knowledge distillation, activation quantization, and loss-aware training to mitigate accuracy loss. This work matters because efficient compression is essential for deploying large models on resource-constrained devices and for reducing the environmental impact of AI, improving both the efficiency and the accessibility of AI systems.
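To make the core idea concrete, below is a minimal sketch of symmetric per-tensor weight quantization in NumPy. It is an illustrative toy example, not the method of any paper listed here; the function names and the 4-bit setting are assumptions chosen for clarity, and real systems typically use per-channel or group-wise scales plus the calibration or training tricks mentioned above.

```python
import numpy as np

def quantize_weights(w: np.ndarray, num_bits: int = 4):
    """Symmetric per-tensor quantization of a float weight array to signed integers.
    (Illustrative sketch; names and defaults are assumptions, not a library API.)"""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 7 for signed 4-bit
    scale = np.max(np.abs(w)) / qmax          # map the largest magnitude to qmax
    if scale == 0:
        scale = 1.0                           # guard against an all-zero tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_weights(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float weight array from the integers and the scale."""
    return q.astype(np.float32) * scale

# Example: quantize a random weight matrix and check the reconstruction error.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_weights(w, num_bits=4)
w_hat = dequantize_weights(q, scale)
print("mean abs error:", np.mean(np.abs(w - w_hat)))
```

The storage saving comes from keeping only the int8-packed codes (or, with bit-packing, two 4-bit codes per byte) and a single float scale per tensor, while the dequantization step shows why accuracy can drop and why the mitigation techniques above are needed.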
Papers
Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models
Aayush Saxena, Arit Kumar Bishwas, Ayush Ashok Mishra, Ryan Armstrong
Differentiable Product Quantization for Memory Efficient Camera Relocalization
Zakaria Laskar, Iaroslav Melekhov, Assia Benbihi, Shuzhe Wang, Juho Kannala