Joint Vision Language
Joint vision-language research integrates visual and textual information so that computer systems can understand and reason about the world in a more human-like way. Current work concentrates on improving the robustness and explainability of multimodal large language models (MLLMs): transformer-based architectures and contrastive learning are commonly used to align visual and textual representations, while complementary efforts reduce spurious biases and make model decisions more interpretable. The field matters for advancing artificial intelligence, with direct impact on applications such as visual question answering and image captioning and, more broadly, on building more capable and reliable multimodal AI systems.
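As a rough illustration of the contrastive alignment mentioned above, the sketch below computes a symmetric InfoNCE (CLIP-style) loss over a batch of paired image and text embeddings. It is a minimal assumption-laden example, not the method of any particular paper: the stand-in random features, tensor shapes, and the temperature value are all illustrative choices, and real systems would produce the embeddings with a vision encoder and a text encoder.

```python
# Minimal sketch of CLIP-style contrastive alignment between image and text
# embeddings (assumed setup; real encoders would replace the random features).
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # Normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with caption j.
    logits = image_emb @ text_emb.t() / temperature

    # Matching pairs sit on the diagonal; score each row/column as a
    # classification over the batch.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Toy usage: random features stand in for encoder outputs.
    batch, dim = 8, 512
    image_features = torch.randn(batch, dim)  # e.g. from a vision transformer
    text_features = torch.randn(batch, dim)   # e.g. from a text transformer
    print(contrastive_alignment_loss(image_features, text_features))
```

Minimizing this loss pulls each image embedding toward its paired caption and pushes it away from the other captions in the batch, which is the basic mechanism behind aligning the two modalities in a shared representation space.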