Visual Token
Visual tokens represent visual information as discrete units for processing within vision-language models (VLMs), aiming to bridge the gap between visual and textual data for improved multimodal understanding. Current research focuses on optimizing visual token efficiency through techniques like token sparsification, pruning, and adaptive granularity control, often employing transformer architectures and novel attention mechanisms to reduce computational costs while maintaining accuracy. These advancements are crucial for deploying VLMs in resource-constrained environments and improving the performance of various applications, including autonomous driving, image captioning, and visual question answering.
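The text-guided pruning idea shared by several of the papers below can be sketched in a few lines. This is a minimal illustration, not the method of any listed paper: the function name, the cross-attention scoring, and the mean-over-text importance heuristic are all illustrative assumptions.

```python
import numpy as np

def prune_visual_tokens(visual_tokens, text_tokens, keep_ratio=0.5):
    """Keep the visual tokens most attended to by the text tokens.

    visual_tokens: (N_v, d) array of visual token embeddings
    text_tokens:   (N_t, d) array of text token embeddings
    Returns the kept visual tokens and their original indices.
    (Illustrative sketch; real methods differ per paper.)
    """
    d = visual_tokens.shape[1]
    # Scaled dot-product scores: each text token vs. each visual token.
    scores = text_tokens @ visual_tokens.T / np.sqrt(d)
    # Softmax over the visual-token axis.
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    # Importance of a visual token = average attention it receives from text.
    importance = attn.mean(axis=0)
    k = max(1, int(round(keep_ratio * visual_tokens.shape[0])))
    keep = np.sort(np.argsort(importance)[-k:])  # top-k, original order kept
    return visual_tokens[keep], keep

# Example: keep a quarter of 16 visual tokens given 4 text tokens.
rng = np.random.default_rng(0)
kept, idx = prune_visual_tokens(rng.normal(size=(16, 8)),
                                rng.normal(size=(4, 8)),
                                keep_ratio=0.25)
```

Dropping low-importance tokens this way shrinks the sequence the language model must process, which is the efficiency lever these papers target; recovery-based methods (e.g. the first paper below) additionally keep a path to restore pruned tokens when the text later needs them.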
Papers
Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information
Yi Chen, Jian Xu, Xu-Yao Zhang, Wen-Zhuo Liu, Yang-Yang Liu, Cheng-Lin Liu
Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction
Gaotong Yu, Yi Chen, Jian Xu
Learning Visual Conditioning Tokens to Correct Domain Shift for Fully Test-time Adaptation
Yushun Tang, Shuoshuo Chen, Zhehan Kan, Yi Zhang, Qinghai Guo, Zhihai He
DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
Jiaxin Zhang, Wentao Yang, Songxuan Lai, Zecheng Xie, Lianwen Jin