Visual Question Answering
Visual Question Answering (VQA) aims to enable computers to answer natural-language questions about images, which requires tightly integrating visual and linguistic understanding. Current research emphasizes model robustness and reliability, addressing issues such as inconsistent responses, hallucinations, and unanswerable questions, often building on multimodal large language models (MLLMs) such as BLIP-2 and LLaVA. The field is central to advancing AI systems that interact with the world in a more human-like way, with applications ranging from assistive technologies for visually impaired people to medical image analysis and automated evaluation of data visualizations.
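As a concrete illustration of the task, the sketch below asks a question about an image using an off-the-shelf BLIP-2 checkpoint (Salesforce/blip2-opt-2.7b on the Hugging Face Hub) via the transformers library. The checkpoint, example image, question, and prompt wording are illustrative choices, not drawn from the papers listed here.

```python
# Minimal VQA sketch: answer a natural-language question about an image with BLIP-2.
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor (image preprocessing + tokenizer) and the BLIP-2 model.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

# Example image (a COCO validation image) and a question about it.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are in the picture?"

# BLIP-2 is typically prompted in a "Question: ... Answer:" format for VQA.
inputs = processor(images=image, text=f"Question: {question} Answer:", return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_new_tokens=20)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(answer)  # e.g. "two"
```

The same pattern applies to other MLLMs (e.g. LLaVA) by swapping in the corresponding processor and model classes and their expected prompt format.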
Papers
MM-R$^3$: On (In-)Consistency of Multi-modal Large Language Models (MLLMs)
Shih-Han Chou, Shivam Chandhok, James J. Little, Leonid Sigal
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models
Ziyue Wang, Chi Chen, Fuwen Luo, Yurui Dong, Yuanchi Zhang, Yuzhuang Xu, Xiaolong Wang, Peng Li, Yang Liu
BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data
Xuwu Wang, Qiwen Cui, Yunzhe Tao, Yiran Wang, Ziwei Chai, Xiaotian Han, Boyi Liu, Jianbo Yuan, Jing Su, Guoyin Wang, Tingkai Liu, Liyu Chen, Tianyi Liu, Tao Sun, Yufeng Zhang, Sirui Zheng, Quanzeng You, Yang Yang, Hongxia Yang
A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning
Niki Maria Foteinopoulou, Enjie Ghorbel, Djamila Aouada
Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies between Model Predictions and Human Responses in VQA
Jian Lan, Diego Frassinelli, Barbara Plank
OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities
Bilal Faye, Hanane Azzag, Mustapha Lebbah
CAST: Cross-modal Alignment Similarity Test for Vision Language Models
Gautier Dagan, Olga Loginova, Anil Batra