Visual Language Model
Visual Language Models (VLMs) integrate visual and textual information, enabling machines to understand and reason about the world multimodally. Current research focuses on improving VLMs' performance on complex reasoning tasks, such as resolving ambiguities, interpreting occluded objects, and handling inconsistent information across modalities. These efforts often rely on architectures that pair large language models with visual encoders, together with techniques such as contrastive learning and prompt engineering, as sketched below. Such advances matter because they enable more robust and reliable applications in fields including robotics, medical imaging, and social media analysis. Ongoing work also addresses ethical concerns such as bias mitigation and hallucination reduction to support responsible development and deployment.
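To make the contrastive-learning technique mentioned above concrete, here is a minimal sketch of a CLIP-style symmetric contrastive objective that aligns image and text embeddings. The function names, tensor shapes, and the stand-in random embeddings are illustrative assumptions, not the method of any paper listed here; in practice the embeddings would come from a visual encoder and a language model's text encoder.

```python
# Minimal sketch (assumed setup): CLIP-style contrastive alignment between
# image and text embeddings, using a symmetric InfoNCE loss.
import torch
import torch.nn.functional as F


def contrastive_loss(image_embeds: torch.Tensor,
                     text_embeds: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot products below are cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity matrix scaled by temperature: shape (batch, batch).
    logits = image_embeds @ text_embeds.t() / temperature

    # The matching caption for image i sits at index i of the batch.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Random embeddings stand in for encoder outputs in this sketch.
    batch, dim = 8, 512
    img = torch.randn(batch, dim)   # would come from a visual encoder
    txt = torch.randn(batch, dim)   # would come from a text/LLM encoder
    print(contrastive_loss(img, txt).item())
```

Training with this objective pulls matched image-text pairs together and pushes mismatched pairs apart, which is the alignment signal many VLM pipelines build on before or alongside LLM-based reasoning.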
Papers
Rethinking VLMs and LLMs for Image Classification
Avi Cooper, Keizo Kato, Chia-Hsien Shih, Hiroaki Yamane, Kasper Vinken, Kentaro Takemoto, Taro Sunagawa, Hao-Wei Yeh, Jin Yamanaka, Ian Mason, Xavier Boix
NL-Eye: Abductive NLI for Images
Mor Ventura, Michael Toker, Nitay Calderon, Zorik Gekhman, Yonatan Bitton, Roi Reichart
Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!
Jiwan Chung, Seungwon Lim, Jaehyun Jeon, Seungbeen Lee, Youngjae Yu
Removing Distributional Discrepancies in Captions Improves Image-Text Alignment
Yuheng Li, Haotian Liu, Mu Cai, Yijun Li, Eli Shechtman, Zhe Lin, Yong Jae Lee, Krishna Kumar Singh