Visual Token
Visual tokens encode visual information as discrete units that vision-language models (VLMs) process alongside text, bridging the visual and textual modalities for multimodal understanding. Current research focuses on making visual tokens more efficient through techniques such as token sparsification, pruning, and adaptive granularity control, often building on transformer architectures and novel attention mechanisms to cut computational cost while preserving accuracy. These advances are crucial for deploying VLMs in resource-constrained environments and for improving applications such as autonomous driving, image captioning, and visual question answering.
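Most token-reduction methods in this line of work follow a common recipe: assign each visual token an importance score (often the attention it receives from the text query) and keep only the top fraction before the language model processes the sequence. The sketch below illustrates that idea in PyTorch; the function name, tensor shapes, and the choice of a precomputed attention score are illustrative assumptions rather than the exact procedure of any paper listed here.

```python
import torch


def prune_visual_tokens(visual_tokens: torch.Tensor,
                        attention_scores: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the top-k visual tokens ranked by an importance score.

    visual_tokens:    (batch, num_tokens, dim) patch embeddings from the vision encoder
    attention_scores: (batch, num_tokens) per-token importance, e.g. mean cross-attention
    keep_ratio:       fraction of tokens to retain
    """
    batch, num_tokens, dim = visual_tokens.shape
    k = max(1, int(num_tokens * keep_ratio))

    # Indices of the k highest-scoring tokens for each example in the batch.
    top_idx = attention_scores.topk(k, dim=1).indices           # (batch, k)

    # Gather the corresponding token embeddings.
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, dim)      # (batch, k, dim)
    return torch.gather(visual_tokens, 1, gather_idx)           # (batch, k, dim)


# Hypothetical usage: halve 576 patch tokens before passing them to the language model.
tokens = torch.randn(2, 576, 1024)
scores = torch.rand(2, 576)
pruned = prune_visual_tokens(tokens, scores, keep_ratio=0.5)
print(pruned.shape)  # torch.Size([2, 288, 1024])
```

In practice the scoring signal and the keep ratio vary by method; some approaches also merge discarded tokens into the retained ones rather than dropping them outright.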
Papers
Towards Interpreting Visual Information Processing in Vision-Language Models
Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, Fazl Barez
QuadMamba: Learning Quadtree-based Selective Scan for Visual State Space Model
Fei Xie, Weijia Zhang, Zhongdao Wang, Chao Ma
Break the Visual Perception: Adversarial Attacks Targeting Encoded Visual Tokens of Large Vision-Language Models
Yubo Wang, Chaohu Liu, Yanqiu Qu, Haoyu Cao, Deqiang Jiang, Linli Xu
Retrieval Replace Reduction: An effective visual token reduction method via semantic match
Yingen Liu, Fan Wu, Ruihui Li, Zhuo Tang, Kenli Li