Visual Entailment
Visual entailment (VE) is a multimodal reasoning task that asks whether an image semantically entails a given textual statement. Current research focuses on improving the accuracy and robustness of VE models, particularly through architectures that exploit object-level alignment between images and text, and through uncertainty modeling and hierarchical alignment strategies. Accurate VE systems matter for applications such as fact verification and image captioning, and more broadly for assessing the reliability of information presented jointly in images and text.
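To make the task setup concrete, the sketch below shows the typical input/output interface of a VE classifier over the three labels used by standard benchmarks such as SNLI-VE (entailment, neutral, contradiction). This is a minimal illustration of the task structure only: the DummyVEModel, its uniform scores, and the example image path are hypothetical placeholders, not any specific published system.

```python
from dataclasses import dataclass
from typing import Dict, Protocol

# The three labels used by standard visual-entailment benchmarks such as SNLI-VE.
VE_LABELS = ("entailment", "neutral", "contradiction")


@dataclass
class VEExample:
    image_path: str   # premise: an image
    hypothesis: str   # hypothesis: a natural-language statement about the image


class VisualEntailmentModel(Protocol):
    def predict(self, example: VEExample) -> Dict[str, float]:
        """Return a probability for each of the three VE labels."""
        ...


class DummyVEModel:
    """Hypothetical stand-in that returns a uniform distribution over the labels.

    A real VE model would encode the image and the hypothesis text and score
    their semantic alignment (e.g., via object-level cross-modal attention).
    """

    def predict(self, example: VEExample) -> Dict[str, float]:
        p = 1.0 / len(VE_LABELS)
        return {label: p for label in VE_LABELS}


if __name__ == "__main__":
    model = DummyVEModel()
    example = VEExample(
        image_path="dog_in_park.jpg",  # hypothetical image file
        hypothesis="A dog is playing outdoors.",
    )
    scores = model.predict(example)
    prediction = max(scores, key=scores.get)
    print(scores, "->", prediction)
```

The key design point is that the image plays the role of the premise and the sentence the role of the hypothesis, so the output space mirrors textual natural language inference rather than binary image-text matching.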
18 papers
Papers
Probing Multimodal Large Language Models for Global and Local Semantic Representations
Mingxu Tao, Quzhe Huang, Kun Xu, Liwei Chen, Yansong Feng, Dongyan Zhao

ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks
Yang Liu, Xiaomin Yu, Gongyu Zhang, Zhen Zhu, Christos Bergeles, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin