Visual Token
Visual tokens represent visual information as discrete units for processing within vision-language models (VLMs), aiming to bridge the gap between visual and textual data for improved multimodal understanding. Current research focuses on optimizing visual token efficiency through techniques like token sparsification, pruning, and adaptive granularity control, often employing transformer architectures and novel attention mechanisms to reduce computational costs while maintaining accuracy. These advancements are crucial for deploying VLMs in resource-constrained environments and improving the performance of various applications, including autonomous driving, image captioning, and visual question answering.
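To make the idea of visual token pruning concrete, the minimal sketch below keeps only the highest-scoring visual tokens before they are passed to the language model. The function name prune_visual_tokens, the keep_ratio parameter, and the attention-score heuristic are illustrative assumptions, not the method of any paper listed below.

```python
# Illustrative sketch of attention-score-based visual token pruning.
# All names and the scoring heuristic are assumptions for illustration.
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        attention_scores: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the top-scoring fraction of visual tokens.

    visual_tokens:    (batch, num_tokens, dim) patch embeddings from the vision encoder
    attention_scores: (batch, num_tokens) importance scores, e.g. the average
                      attention each visual token receives from the text query
    keep_ratio:       fraction of tokens to retain (0 < keep_ratio <= 1)
    """
    batch, num_tokens, dim = visual_tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))

    # Indices of the highest-scoring tokens per example, re-sorted to
    # preserve the original spatial order of the patches.
    topk = attention_scores.topk(num_keep, dim=1).indices   # (batch, num_keep)
    topk, _ = topk.sort(dim=1)

    # Gather the retained tokens; this shorter sequence is what the LLM sees.
    index = topk.unsqueeze(-1).expand(-1, -1, dim)           # (batch, num_keep, dim)
    return visual_tokens.gather(dim=1, index=index)


if __name__ == "__main__":
    tokens = torch.randn(2, 576, 1024)   # e.g. a 24x24 patch grid per image
    scores = torch.rand(2, 576)          # placeholder importance scores
    pruned = prune_visual_tokens(tokens, scores, keep_ratio=0.25)
    print(pruned.shape)                  # torch.Size([2, 144, 1024])
```

Retaining a quarter of the tokens, as in this example, shrinks the sequence the language model must attend over by 4x; methods in the papers below differ mainly in how the importance scores are computed and whether the keep ratio is fixed or adapted per input.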
Papers
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, Licheng Yu
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, Yansong Tang