Layer-Wise Quantization

Layer-wise quantization improves the efficiency of deep neural networks by reducing the precision of each layer's weights and activations, cutting computational cost and memory footprint. Current research focuses on post-training quantization methods for architectures such as Vision Transformers and Large Language Models, often using techniques like accumulator-aware quantization and learned quantization schemes to limit accuracy loss. These advances are significant for deploying large models on resource-constrained devices and for accelerating inference, benefiting both machine learning research and its practical applications.
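
As a minimal illustration of the idea, the sketch below applies post-training, per-layer symmetric weight quantization to a small PyTorch model. The helper names (`quantize_layer_weights`, `quantize_model_layerwise`), the 8-bit setting, and the simulated (dequantize-back-to-float) inference path are assumptions for demonstration, not the method of any particular paper listed below.

```python
import torch
import torch.nn as nn

def quantize_layer_weights(weight: torch.Tensor, num_bits: int = 8):
    """Symmetric uniform quantization of one layer's weight tensor.

    Returns integer codes and the single per-layer scale needed to dequantize.
    """
    qmax = 2 ** (num_bits - 1) - 1              # e.g. 127 for 8 bits
    scale = weight.abs().max() / qmax           # one scale for the whole layer
    q_weight = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q_weight.to(torch.int8), scale

def quantize_model_layerwise(model: nn.Module, num_bits: int = 8) -> nn.Module:
    """Post-training, layer-wise weight quantization: each Linear/Conv layer
    gets its own scale, and its weights are replaced by the dequantized
    (simulated low-precision) values for inference."""
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            q_w, scale = quantize_layer_weights(module.weight.data, num_bits)
            module.weight.data = q_w.float() * scale
    return model

# Usage: quantize a small example network after training.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
quantize_model_layerwise(model, num_bits=8)
```

Real post-training schemes typically go further than this sketch, e.g. calibrating activation ranges on a small dataset, choosing per-layer bit-widths, or learning the quantization parameters, but the per-layer scale shown here is the basic building block.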

Papers