Visual Question Answering
Visual Question Answering (VQA) aims to develop systems that can accurately answer natural language questions about the content of images or videos. Current research focuses on improving model robustness and accuracy, particularly for complex questions requiring spatial reasoning, multi-modal fusion (combining visual and textual information), and handling of diverse question types, often employing large language models (LLMs) and vision transformers (ViTs) within various architectures. The field's significance lies in applications ranging from assisting visually impaired individuals to supporting medical diagnosis and autonomous driving, and its progress drives advances in multimodal learning and reasoning.
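To make the fusion idea concrete, the following is a minimal sketch (in PyTorch) of one common pattern: question tokens attend to image tokens via cross-attention, and the fused representation is classified over a fixed answer vocabulary. The module name, dimensions, and answer-vocabulary size are illustrative assumptions, not the architecture of any specific paper listed below.

```python
import torch
import torch.nn as nn

class SimpleVQAFusion(nn.Module):
    """Toy VQA head: fuse image and question token features via
    cross-attention, then classify over a fixed answer vocabulary."""

    def __init__(self, dim=256, num_answers=1000, num_heads=4):
        super().__init__()
        # Cross-attention: question tokens (queries) attend to image tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, image_tokens, question_tokens):
        # image_tokens:    (batch, num_patches, dim) from a vision encoder (e.g. ViT)
        # question_tokens: (batch, seq_len, dim)     from a text encoder
        fused, _ = self.cross_attn(query=question_tokens,
                                   key=image_tokens,
                                   value=image_tokens)
        fused = self.norm(fused + question_tokens)  # residual connection
        pooled = fused.mean(dim=1)                  # pool over question tokens
        return self.classifier(pooled)              # logits over answer vocabulary

# Random features stand in for real encoder outputs.
model = SimpleVQAFusion()
img = torch.randn(2, 196, 256)   # e.g. 14x14 ViT patch embeddings
txt = torch.randn(2, 12, 256)    # e.g. 12 question-token embeddings
logits = model(img, txt)         # shape: (2, 1000)
```

In practice, both encoders are typically pretrained (a ViT for images, an LLM or text transformer for questions), and generative models may decode free-form answers instead of classifying over a closed vocabulary.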
Papers
Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge
Haibo Wang, Weifeng Ge
Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering
Haibo Wang, Chenghang Lai, Yixuan Sun, Weifeng Ge