Visual Question Answering
Visual Question Answering (VQA) aims to enable computers to answer questions about images, requiring sophisticated integration of visual and linguistic understanding. Current research emphasizes improving model robustness and reliability, focusing on addressing issues like inconsistencies in responses, hallucinations, and the handling of unanswerable questions, often using large multimodal language models (MLLMs) like BLIP-2 and LLaVA. This field is crucial for advancing AI's ability to interact with the world in a more human-like way, with applications ranging from assistive technologies for visually impaired individuals to medical image analysis and automated data visualization evaluation.
Papers
Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task
Stan Weixian Lei, Difei Gao, Jay Zhangjie Wu, Yuxuan Wang, Wei Liu, Mengmi Zhang, Mike Zheng Shou
Bidirectional Contrastive Split Learning for Visual Question Answering
Yuwei Sun, Hideya Ochiai
FashionVQA: A Domain-Specific Visual Question Answering System
Min Wang, Ata Mahjoubfar, Anupama Joshi