Visual Question Answering
Visual Question Answering (VQA) aims to enable computers to answer natural-language questions about images, which requires tightly integrating visual and linguistic understanding. Current research emphasizes model robustness and reliability, addressing issues such as inconsistent responses, hallucinations, and unanswerable questions, often building on multimodal large language models (MLLMs) such as BLIP-2 and LLaVA. The field is central to advancing AI's ability to interact with the world in a more human-like way, with applications ranging from assistive technologies for visually impaired users to medical image analysis and automated evaluation of data visualizations.
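As a concrete illustration of the zero-shot VQA setting several of the papers below build on, the following minimal sketch queries a pretrained BLIP-2 checkpoint with an image and a question. It assumes the Hugging Face transformers library, the Salesforce/blip2-opt-2.7b checkpoint, and a local image file named example.jpg; these are illustrative choices, not specifics taken from the papers listed here.

```python
# Minimal zero-shot VQA sketch with BLIP-2 via Hugging Face transformers.
# Assumptions: the Salesforce/blip2-opt-2.7b checkpoint and a local file
# "example.jpg" -- both illustrative, not taken from the papers below.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Processor handles image preprocessing and question tokenization.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

image = Image.open("example.jpg").convert("RGB")
# BLIP-2's documented VQA prompt format: "Question: ... Answer:"
prompt = "Question: how many dogs are in the picture? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(**inputs, max_new_tokens=20)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(answer)
```

The same pattern (image + question in, short free-form answer out) is the starting point that the robustness, in-context-learning, and visual-programming papers below modify or analyze.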
Papers
Recursive Visual Programming
Jiaxin Ge, Sanjay Subramanian, Baifeng Shi, Roei Herzig, Trevor Darrell
Unleashing the Potential of Large Language Model: Zero-shot VQA for Flood Disaster Scenario
Yimin Sun, Chao Wang, Yan Peng
CLAMP: Contrastive LAnguage Model Prompt-tuning
Piotr Teterwak, Ximeng Sun, Bryan A. Plummer, Kate Saenko, Ser-Nam Lim
Good Questions Help Zero-Shot Image Reasoning
Kaiwen Yang, Tao Shen, Xinmei Tian, Xiubo Geng, Chongyang Tao, Dacheng Tao, Tianyi Zhou
How to Configure Good In-Context Sequence for Visual Question Answering
Li Li, Jiawei Peng, Huiyi Chen, Chongyang Gao, Xu Yang