Visual Token
Visual tokens represent visual information as discrete units for processing within vision-language models (VLMs), aiming to bridge the gap between visual and textual data for improved multimodal understanding. Current research focuses on optimizing visual token efficiency through techniques like token sparsification, pruning, and adaptive granularity control, often employing transformer architectures and novel attention mechanisms to reduce computational costs while maintaining accuracy. These advancements are crucial for deploying VLMs in resource-constrained environments and improving the performance of various applications, including autonomous driving, image captioning, and visual question answering.
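To make the idea of token pruning concrete, here is a minimal, hedged sketch of attention-score-based visual token pruning in PyTorch. The function name `prune_visual_tokens` and the parameter `keep_ratio` are illustrative assumptions, not the method of any specific paper listed below; real systems differ in how they compute importance scores and where in the pipeline they prune.

```python
# Minimal sketch: keep only the most-attended visual tokens before passing
# them to the language model. Names and signatures are illustrative only.
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        attn_scores: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Retain the top-scoring fraction of visual tokens.

    visual_tokens: (batch, num_tokens, dim) patch embeddings from a vision encoder.
    attn_scores:   (batch, num_tokens) importance scores, e.g. attention weights
                   from a [CLS] token or from the text query, averaged over heads.
    keep_ratio:    fraction of tokens to keep.
    """
    batch, num_tokens, dim = visual_tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Indices of the highest-scoring tokens for each example in the batch.
    top_idx = attn_scores.topk(num_keep, dim=-1).indices           # (batch, num_keep)
    top_idx = top_idx.unsqueeze(-1).expand(-1, -1, dim)            # (batch, num_keep, dim)
    return visual_tokens.gather(1, top_idx)                        # (batch, num_keep, dim)

# Example: halve 576 ViT patch tokens, cutting the language model's input length.
tokens = torch.randn(2, 576, 1024)
scores = torch.rand(2, 576)
pruned = prune_visual_tokens(tokens, scores, keep_ratio=0.5)
print(pruned.shape)  # torch.Size([2, 288, 1024])
```

Because the language model's cost grows with sequence length, dropping half of the visual tokens in this way roughly halves the visual portion of the input, which is the efficiency lever the sparsification and pruning work above targets.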
Papers
Learning Visual Conditioning Tokens to Correct Domain Shift for Fully Test-time Adaptation
Yushun Tang, Shuoshuo Chen, Zhehan Kan, Yi Zhang, Qinghai Guo, Zhihai He
DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
Jiaxin Zhang, Wentao Yang, Songxuan Lai, Zecheng Xie, Lianwen Jin
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zuxuan Wu, Jianfeng Gao, Yu-Gang Jiang
VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval
Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, Yongping Xiong