Low-Bit Quantization
Low-bit quantization reduces the computational and memory demands of deep neural networks by representing weights and activations with fewer bits, improving efficiency with little loss of accuracy. Current research develops quantization techniques for a range of architectures, including transformers, convolutional neural networks, and large language models, often via data-free quantization, layer-wise quantization, and adaptive-precision strategies. The area is central to deploying large models on resource-constrained devices and to accelerating inference, affecting both the cost of machine learning research and the practical deployment of AI across domains.
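As a concrete illustration, the sketch below shows symmetric uniform quantization to a configurable bit-width, the basic building block behind many of the methods above. The per-tensor scale, tensor shape, and 4-bit default are illustrative assumptions, not any particular paper's scheme.

```python
import numpy as np

def quantize(x: np.ndarray, bits: int = 4):
    """Map floats to signed integers in [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax          # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale     # approximate reconstruction

if __name__ == "__main__":
    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize(w, bits=4)
    print("max reconstruction error:", np.abs(w - dequantize(q, s)).max())
```

Finer-grained variants (per-channel or per-group scales) trade a little metadata for lower reconstruction error; the single per-tensor scale here is chosen only for brevity.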
Papers
Speedup deep learning models on GPU by taking advantage of efficient unstructured pruning and bit-width reduction
Marcin Pietroń, Dominik Żurek
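The title names two complementary techniques, unstructured pruning and bit-width reduction. The sketch below combines magnitude pruning with uniform quantization under assumed sparsity and bit-width settings; it illustrates the general combination, not a reconstruction of the paper's method.

```python
import numpy as np

def prune_and_quantize(w: np.ndarray, sparsity: float = 0.5, bits: int = 8):
    # Unstructured (magnitude) pruning: zero the smallest-magnitude weights.
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) > threshold
    pruned = w * mask
    # Bit-width reduction: symmetric uniform quantization of the survivors.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(pruned).max() / qmax
    q = np.round(pruned / scale).astype(np.int32)
    return q, scale, mask

if __name__ == "__main__":
    w = np.random.randn(64, 64).astype(np.float32)
    q, s, m = prune_and_quantize(w)
    print(f"zeros: {(q == 0).mean():.2f}, scale: {s:.5f}")
```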
HiKonv: High Throughput Quantized Convolution With Novel Bit-wise Management and Computation
Xinheng Liu, Yao Chen, Prakhar Ganesh, Junhao Pan, Jinjun Xiong, Deming Chen
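The key idea behind high-throughput low-bit convolution of this kind is packing several narrow operands into one wide word so that a single hardware multiply yields several low-bit products at once. The sketch below demonstrates that general packing trick with assumed 4-bit unsigned operands and 16-bit guard slots; it is not HiKonv's exact bit-management scheme.

```python
# Two 4-bit activations share one wide word; one multiply then yields both
# 4-bit x 4-bit products. The 16-bit slot leaves guard bits, since each
# product needs at most 8 bits, so partial results cannot overlap.
SLOT = 16

def packed_mul(a0: int, a1: int, w: int) -> tuple[int, int]:
    """Compute a0*w and a1*w with one multiplication of packed operands."""
    assert 0 <= a0 < 16 and 0 <= a1 < 16 and 0 <= w < 16
    packed = a0 | (a1 << SLOT)      # pack both activations into one word
    wide = packed * w               # one multiply, two partial products
    return wide & ((1 << SLOT) - 1), wide >> SLOT  # unpack: mask and shift

if __name__ == "__main__":
    p0, p1 = packed_mul(13, 7, 9)
    assert (p0, p1) == (13 * 9, 7 * 9)
    print(p0, p1)  # 117 63
```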