8-Bit Quantization

8-bit quantization aims to reduce the memory footprint and computational cost of deep neural networks by representing model parameters and activations with fewer bits, thereby improving efficiency for deployment on resource-constrained devices. Current research focuses on extending these techniques to sub-8-bit precision, particularly for challenging architectures like transformers and recurrent neural networks (RNNs), employing methods such as mixed-precision quantization and novel quantization-aware training algorithms. This work is crucial for enabling the deployment of large language models and other computationally intensive models on embedded systems and mobile devices, leading to significant improvements in energy efficiency and inference speed.
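
To make the core idea concrete, below is a minimal sketch of symmetric per-tensor int8 quantization and dequantization using NumPy. It is illustrative only: the function names, the toy weight matrix, and the choice of a single per-tensor scale are assumptions for the example, not a description of any specific method from the papers listed here.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization of a float tensor to int8.

    Returns the int8 tensor and the scale needed to dequantize it.
    """
    # Map the largest absolute value onto the int8 range [-127, 127].
    scale = float(np.max(np.abs(x))) / 127.0
    if scale == 0.0:          # all-zero tensor: avoid division by zero
        scale = 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from int8 values and a scale."""
    return q.astype(np.float32) * scale

# Toy example: quantize a small weight matrix and measure the rounding error.
w = np.random.randn(4, 4).astype(np.float32)
w_q, s = quantize_int8(w)
w_hat = dequantize(w_q, s)
print("max abs error:", np.max(np.abs(w - w_hat)))
```

Mixed-precision and quantization-aware training methods build on this basic round-to-nearest scheme, for example by learning the scales during training or by assigning different bit widths to different layers.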

Papers