Visual Question Answering
Visual Question Answering (VQA) aims to develop systems that can accurately answer natural language questions about the content of images or videos. Current research focuses on improving model robustness and accuracy, particularly for complex questions that require spatial reasoning, multi-modal fusion (combining visual and textual information), and handling of diverse question types, often employing large language models (LLMs) and vision transformers (ViTs) within various architectures. The field's significance lies in its potential applications, ranging from assisting visually impaired individuals to enhancing medical diagnosis and autonomous driving, and it continues to drive advances in multimodal learning and reasoning.
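As a rough illustration of the multi-modal fusion step mentioned above, the sketch below shows one common pattern: image features (e.g., from a ViT) and question features (e.g., from a text encoder) are projected into a shared space, concatenated, and classified over a fixed answer vocabulary. The module and dimension names are hypothetical, and this is a minimal sketch rather than the architecture of any specific paper listed here.

```python
# Minimal sketch of late multi-modal fusion for VQA.
# Encoder outputs are assumed to be precomputed; dimensions are illustrative.
import torch
import torch.nn as nn

class SimpleVQAFusion(nn.Module):
    def __init__(self, img_dim=768, txt_dim=768, hidden=512, num_answers=3000):
        super().__init__()
        # Project each modality into a shared space, then fuse by concatenation.
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_answers),  # scores over a fixed answer vocabulary
        )

    def forward(self, img_feat, txt_feat):
        # img_feat: (batch, img_dim), e.g. a ViT [CLS] embedding
        # txt_feat: (batch, txt_dim), e.g. a question embedding from a text encoder
        fused = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return self.classifier(fused)

# Example with random tensors standing in for real encoder outputs.
model = SimpleVQAFusion()
logits = model(torch.randn(2, 768), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 3000])
```

More recent LLM-based approaches replace the classifier with a language model that generates free-form answers conditioned on the fused visual features, but the fusion idea is the same.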
Papers
Goal-Oriented Semantic Communication for Wireless Visual Question Answering with Scene Graphs
Sige Liu, Nan Li, Yansha Deng
A Visual Question Answering Method for SAR Ship: Breaking the Requirement for Multimodal Dataset Construction and Model Fine-Tuning
Fei Wang, Chengcheng Chen, Hongyu Chen, Yugang Chang, Weiming Zeng