Vision Language Understanding

Vision-language understanding (VLU) research aims to enable computers to jointly interpret and reason over visual and textual information. Current efforts focus on making large vision-language models (LVLMs) more robust and more attentive to detail: reducing their susceptibility to misleading prompts ("sycophancy") and improving their perception of fine-grained visual details. Approaches include novel architectures and training methods, such as contrastive and reconstruction learning and instruction tuning, as well as efficient retrieval mechanisms for handling long videos. Advances in VLU bear directly on applications such as robotics, image analysis, and multimodal interaction systems.

Papers