Vision-Language Understanding
Vision-language understanding (VLU) research aims to enable computers to comprehend and reason over visual and textual information jointly. Current efforts focus on improving the robustness and fine-grained perception of large vision-language models (LVLMs), for example by reducing their susceptibility to misleading prompts ("sycophancy") and strengthening their ability to perceive fine-grained visual details. This involves developing novel architectures and training methods, such as contrastive and reconstruction learning, instruction tuning, and efficient retrieval mechanisms for handling long videos. Advances in VLU have significant implications for applications including robotics, image analysis, and multimodal interaction systems.
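As a point of reference for the contrastive training methods mentioned above, the sketch below shows a minimal CLIP-style image-text contrastive (InfoNCE) objective in PyTorch. The function name, tensor shapes, and temperature value are illustrative assumptions, not the approach of any particular paper listed here.

```python
# Minimal sketch of a symmetric image-text contrastive (InfoNCE) loss,
# as used in CLIP-style pretraining. Embedding dimensions and the
# temperature are placeholder choices for illustration only.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # Normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Example usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    batch, dim = 8, 512
    img = torch.randn(batch, dim)
    txt = torch.randn(batch, dim)
    print(contrastive_loss(img, txt).item())
```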
Papers
CoPL: Contextual Prompt Learning for Vision-Language Understanding
Koustava Goswami, Srikrishna Karanam, Prateksha Udhayanan, K J Joseph, Balaji Vasan Srinivasan
UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding
Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang, Shih-Fu Chang