Bit Vision Transformer

Bit vision transformers (BitViTs) reduce the computational cost and memory footprint of vision transformers (ViTs) by representing weights and activations with fewer bits, enabling efficient deployment on resource-constrained devices. Current research focuses on novel quantization techniques, such as learnable scaling factors and softmax-aware binarization, that minimize the accuracy loss incurred by this compression. Applied to architectures such as DeiT and Swin, these methods are steadily improving low-bit ViT performance: some match or exceed the accuracy of full-precision models, while others enable automated hardware acceleration for real-time applications.
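To make the learnable-scaling-factor idea concrete, below is a minimal PyTorch sketch of weight binarization with a per-tensor learnable scale and a straight-through estimator (STE). All names (`LearnableScaleBinarizer`, `BinaryLinear`) and the initialization choices are illustrative assumptions, not the implementation of any specific paper listed here.

```python
import torch
import torch.nn as nn

class LearnableScaleBinarizer(nn.Module):
    """Binarize a weight tensor to {-alpha, +alpha} with learnable alpha.

    A straight-through estimator passes gradients through sign() so the
    latent full-precision weights remain trainable.
    """
    def __init__(self, init_scale: float = 1.0):
        super().__init__()
        # Per-tensor learnable scaling factor (hypothetical initialization).
        self.scale = nn.Parameter(torch.tensor(init_scale))

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        w_bin = torch.sign(w)              # 1-bit values in {-1, +1}
        # STE trick: forward pass uses w_bin, backward pass treats the
        # binarization as an identity function of w.
        w_bin = w + (w_bin - w).detach()
        return self.scale * w_bin

class BinaryLinear(nn.Module):
    """Linear layer whose weights are binarized on the forward pass."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.binarizer = LearnableScaleBinarizer()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.binarizer(self.weight), self.bias)

# Usage: a drop-in replacement for nn.Linear inside a ViT MLP block.
layer = BinaryLinear(768, 3072)
x = torch.randn(4, 197, 768)   # (batch, tokens, dim) as in ViT-Base
y = layer(x)                   # computed with {-alpha, +alpha} weights
```

Because the scale is a trainable parameter rather than a fixed statistic, it can adapt during quantization-aware training, which is one way such methods recover accuracy lost to aggressive bit-width reduction.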

Papers