Visual Question Answering
Visual Question Answering (VQA) aims to develop systems that can accurately answer natural language questions about the content of images or videos. Current research focuses on improving model robustness and accuracy, particularly for complex questions requiring spatial reasoning, multi-modal fusion (combining visual and textual information), and handling of diverse question types, often employing large language models (LLMs) and vision transformers (ViTs) within various architectures. The field's significance lies in its potential applications, ranging from assisting visually impaired individuals to enhancing medical diagnosis and autonomous driving, and its progress pushes forward multimodal learning and reasoning more broadly.
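As a minimal sketch of the multi-modal fusion idea mentioned above: a common baseline concatenates a pre-extracted image feature vector with a question embedding and scores a fixed answer vocabulary with a linear head. Everything here is illustrative — the dimensions, random weights, and toy answer set are assumptions, not taken from any of the papers below; in practice the features would come from a ViT and an LLM, and the head would be trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-extracted features (in a real system these would come
# from a vision transformer and a language model, respectively).
image_features = rng.standard_normal(512)      # e.g. a ViT [CLS] embedding
question_embedding = rng.standard_normal(256)  # e.g. a sentence embedding

# Late fusion: concatenate the two modalities into a single vector.
fused = np.concatenate([image_features, question_embedding])  # shape (768,)

# A toy linear "answer head" over a small closed answer vocabulary.
answers = ["yes", "no", "red", "two"]
W = rng.standard_normal((len(answers), fused.shape[0]))  # untrained weights
scores = W @ fused

predicted = answers[int(np.argmax(scores))]
print(predicted)
```

With untrained weights the prediction is arbitrary; the point is only the shape of the pipeline — encode each modality, fuse, then classify over answers. More recent architectures replace the concatenation with cross-attention and the closed vocabulary with free-form generation by an LLM.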
Papers
Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation
Malvina Nikandrou, Georgios Pantazopoulos, Ioannis Konstas, Alessandro Suglia
FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts
Shubhankar Singh, Purvi Chaurasia, Yerram Varun, Pranshu Pandya, Vatsal Gupta, Vivek Gupta, Dan Roth
Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA
Elham J. Barezi, Parisa Kordjamshidi