Vision Language Understanding
Vision-language understanding (VLU) research aims to enable computers to comprehend and interact with visual and textual information simultaneously. Current efforts focus on improving the robustness of large vision-language models (LVLMs), addressing issues such as susceptibility to misleading prompts ("sycophancy"), and on enhancing their ability to perceive fine-grained visual details. This involves developing novel architectures and training methods, such as combined contrastive and reconstruction learning, instruction tuning, and efficient retrieval mechanisms for handling long videos. Advances in VLU have significant implications for applications including robotics, image analysis, and multimodal interaction systems.
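As a rough illustration of the contrastive-plus-reconstruction training idea mentioned above, the sketch below pairs a CLIP-style symmetric contrastive loss over image and text embeddings with a simple pixel reconstruction term. The function name, the mean-squared-error reconstruction term, and the loss weighting are illustrative assumptions, not the formulation used by any paper listed here.

```python
import torch
import torch.nn.functional as F

def contrastive_reconstruction_loss(image_emb, text_emb,
                                    recon_pixels, target_pixels,
                                    temperature=0.07, recon_weight=1.0):
    """Hedged sketch: contrastive alignment plus reconstruction.

    image_emb, text_emb: (batch, dim) embeddings from the vision and text encoders.
    recon_pixels, target_pixels: decoder output and ground-truth pixels/patches.
    """
    # Normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits; matching image-text pairs lie on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2

    # Illustrative reconstruction term (e.g. over masked patches); the choice
    # of MSE here is an assumption, not taken from the listed papers.
    reconstruction = F.mse_loss(recon_pixels, target_pixels)

    return contrastive + recon_weight * reconstruction
```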
Papers
How Well Can Vision Language Models See Image Details?
Chenhui Gou, Abdulwahab Felemban, Faizan Farooq Khan, Deyao Zhu, Jianfei Cai, Hamid Rezatofighi, Mohamed Elhoseiny
MoExtend: Tuning New Experts for Modality and Task Extension
Shanshan Zhong, Shanghua Gao, Zhongzhan Huang, Wushao Wen, Marinka Zitnik, Pan Zhou