VQA Task

Visual Question Answering (VQA) aims to build systems that answer natural-language questions about images by combining visual and textual information. Current research emphasizes improving the accuracy and robustness of VQA models, particularly by addressing reliance on spurious correlations (shortcuts), integrating external knowledge sources (e.g., Wikipedia), and handling diverse question types and visual modalities (including medical images and video). Work in this area develops novel architectures, such as retrieval-augmented generation models and modular frameworks that combine large language models with visual grounding modules, while focusing on mitigating biases and improving generalization to out-of-distribution data. Advances in VQA have significant implications for applications including image retrieval, medical diagnosis, and intelligent transportation systems.
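To make the retrieval-augmented idea concrete, here is a deliberately minimal sketch (all names and data are hypothetical, not from any paper above): visual evidence is approximated by a precomputed image caption, and an external knowledge snippet is retrieved by simple token overlap before forming an answer. Real systems use learned dense retrievers and a generative language model instead of these stubs.

```python
# Toy retrieval-augmented VQA sketch. Everything here is illustrative:
# the caption stands in for visual features, and retrieval is plain
# token overlap rather than a learned dense retriever.

KNOWLEDGE_BASE = [
    "The Eiffel Tower is located in Paris, France.",
    "A stethoscope is a medical instrument for listening to the heart and lungs.",
    "Traffic lights use red, yellow, and green signals.",
]

def tokenize(text):
    """Lowercase, split on whitespace, strip trailing punctuation."""
    return {w.strip(".,?") for w in text.lower().split()}

def retrieve(question, caption, kb):
    """Return the knowledge snippet with the highest token overlap
    against the question plus the image caption."""
    query = tokenize(question) | tokenize(caption)
    return max(kb, key=lambda fact: len(query & tokenize(fact)))

def answer(question, caption, kb=KNOWLEDGE_BASE):
    fact = retrieve(question, caption, kb)
    # A full system would condition a language model on
    # (caption, retrieved fact, question); here we simply
    # surface the retrieved evidence as the answer.
    return fact
```

The key design point mirrored here is that the question alone is often insufficient: combining it with visual context (the caption) before retrieval is what lets external knowledge fill in facts the image itself cannot provide.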

Papers