Visual Language Model
Visual Language Models (VLMs) aim to integrate visual and textual information, enabling machines to understand and reason about the world multimodally. Current research focuses on improving VLMs' abilities on complex reasoning tasks, such as resolving ambiguities, understanding occluded objects, and handling inconsistent information across modalities, often by combining large language models with visual encoders and employing techniques such as contrastive learning and prompt engineering. These advances matter because they pave the way for more robust and reliable applications in diverse fields, including robotics, medical imaging, and social media analysis. Ongoing work also addresses ethical concerns such as bias mitigation and hallucination reduction to support responsible development and deployment.
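To make the contrastive-learning ingredient mentioned above concrete, the sketch below shows a CLIP-style objective that aligns projected image and text features in a shared embedding space. It is a minimal illustration, not the method of any paper listed here; the module names, feature dimensions, and the stand-in encoders are assumptions for demonstration.

```python
# Minimal sketch of CLIP-style contrastive alignment between a visual
# encoder and a text encoder. Names and dimensions are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVLM(nn.Module):
    def __init__(self, image_dim=512, text_dim=512, embed_dim=256):
        super().__init__()
        # Stand-ins for real encoders (e.g. a ViT for images, an LLM for text).
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ln(1/0.07)

    def forward(self, image_feats, text_feats):
        # Project both modalities into a shared space and L2-normalize.
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        # Pairwise cosine similarities, scaled by a learned temperature.
        return self.logit_scale.exp() * img @ txt.t()

def contrastive_loss(logits):
    # Matching image/text pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)      # image -> text
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i + loss_t) / 2

# Usage with random tensors standing in for encoder outputs.
model = ToyVLM()
images = torch.randn(8, 512)  # batch of image features
texts = torch.randn(8, 512)   # batch of matching caption features
loss = contrastive_loss(model(images, texts))
loss.backward()
```

The symmetric cross-entropy pulls each image toward its own caption and away from the other captions in the batch, which is the basic mechanism behind many VLM visual encoders.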
Papers
Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning
Huabin Liu, Filip Ilievski, Cees G. M. Snoek
UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation
Oleg Sautenkov, Yasheerah Yaqoot, Artem Lykov, Muhammad Ahsan Mustafa, Grik Tadevosyan, Aibek Akhmetkazy, Miguel Altamirano Cabrera, Mikhail Martynov, Sausar Karaf, Dzmitry Tsetserukou