Visual Entailment
Visual entailment (VE) is a multimodal reasoning task that asks whether an image semantically entails a given textual statement, with the image serving as the premise and the text as the hypothesis. Current research aims to improve the accuracy and robustness of VE models, chiefly through architectures that align objects in the image with the corresponding parts of the text, and through uncertainty modeling and hierarchical alignment strategies. Accurate VE systems matter for applications such as fact verification and image captioning and, more generally, for making information presented in image-text form more reliable and easier to interpret.
Papers
Probing Multimodal Large Language Models for Global and Local Semantic Representations
Mingxu Tao, Quzhe Huang, Kun Xu, Liwei Chen, Yansong Feng, Dongyan Zhao
ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks
Yang Liu, Xiaomin Yu, Gongyu Zhang, Zhen Zhu, Christos Bergeles, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin