Generative Vision-Language Model

Generative Vision-Language Models (VLMs) are systems that jointly understand and generate visual and textual information, bridging the gap between image and text modalities. Current research focuses on improving the quality and reliability of VLM outputs, addressing issues such as hallucination and bias through techniques like direct preference optimization and multi-modal mutual information decoding, typically within transformer-based architectures. These advances matter because they enable more robust and reliable applications in diverse fields, including robotics, medical image analysis, and content creation, while also providing useful tools for analyzing and improving the explainability of other AI models.
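To make the preference-alignment idea concrete, the sketch below shows the standard direct preference optimization (DPO) loss computed over paired responses to the same image-and-prompt input, where the "chosen" response is the preferred one (e.g. faithful to the image) and the "rejected" one is dispreferred (e.g. hallucinated). This is a minimal illustration of the general technique, not the recipe of any specific paper listed here; the function name, tensor arguments, and the default `beta` value are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct preference optimization over paired VLM responses.

    Each *_logps tensor holds the summed log-probability of one full response
    to the same image-and-prompt input, scored under either the trainable
    policy VLM or the frozen reference VLM.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Reward the policy for widening its preference margin relative to the
    # reference model; beta controls how tightly it stays anchored to it.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Illustrative batch of two preference pairs (log-probabilities are made up).
loss = dpo_loss(torch.tensor([-12.0, -15.0]), torch.tensor([-14.0, -15.5]),
                torch.tensor([-13.0, -15.2]), torch.tensor([-13.5, -15.4]))
```

Decoding-time approaches such as multi-modal mutual information decoding work differently: rather than fine-tuning on preferences, they re-score candidate tokens by contrasting the image-conditioned distribution against an image-free (or degraded-image) distribution, down-weighting tokens the language prior would emit regardless of the visual evidence.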

Papers