Visual Question Answering
Visual Question Answering (VQA) aims to develop systems that can accurately answer natural language questions about the content of images or videos. Current research focuses on improving model robustness and accuracy, particularly for complex questions that require spatial reasoning, multi-modal fusion (combining visual and textual information), and handling of diverse question types; many approaches employ large language models (LLMs) and vision transformers (ViTs) within a variety of architectures. The field's significance lies in applications ranging from assisting visually impaired individuals to supporting medical diagnosis and autonomous driving, and it continues to drive advances in multimodal learning and reasoning.
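To make the fusion step concrete, below is a minimal sketch of a late-fusion VQA head in PyTorch. The module name, feature dimensions, and answer-vocabulary size are illustrative assumptions rather than values from any specific paper; in practice the image and question features would come from a pretrained ViT and an LLM-style text encoder.

```python
import torch
import torch.nn as nn

class SimpleVQAFusion(nn.Module):
    """Minimal late-fusion VQA head (illustrative sketch): project image
    and question embeddings into a shared space, fuse them, and classify
    over a fixed answer vocabulary."""

    def __init__(self, img_dim=768, txt_dim=768, hidden=512, num_answers=1000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, img_feats, txt_feats):
        # Element-wise product is one common fusion operator;
        # concatenation or cross-attention are alternatives.
        fused = self.img_proj(img_feats) * self.txt_proj(txt_feats)
        return self.classifier(fused)  # logits over the answer vocabulary

# Usage with random features standing in for encoder outputs.
model = SimpleVQAFusion()
img = torch.randn(4, 768)   # e.g. pooled ViT [CLS] embeddings
txt = torch.randn(4, 768)   # e.g. pooled question embeddings
logits = model(img, txt)    # shape: (4, 1000)
```

Element-wise product is only one of several common fusion operators; concatenation followed by an MLP, bilinear pooling, and cross-attention between visual and textual token sequences are frequent alternatives in the literature.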