Visual Reasoning
Visual reasoning aims to give artificial intelligence systems the ability to understand and reason over visual information, mirroring human cognitive abilities. Current research centers on developing and evaluating large vision-language models (VLMs) and multimodal large language models (MLLMs), typically built on transformer architectures and paired with techniques such as chain-of-thought prompting and active perception, to improve performance on tasks like visual question answering and object manipulation. These advances matter because they address limitations in how existing AI systems interpret visual scenes, and they hold promise for robotics, medical image analysis, and other fields that require complex visual interpretation and decision-making.
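The overview mentions chain-of-thought prompting as one technique for improving VLM performance on tasks such as visual question answering. The sketch below illustrates the common two-pass pattern in Python: first elicit a step-by-step rationale grounded in the image, then ask for a short final answer conditioned on that rationale. It is a minimal illustration, not any listed paper's method; the `query_vlm` helper, the prompt wording, and the `kitchen.jpg` example are assumptions, and the helper is a stub so the script runs without any model installed.

```python
"""Minimal sketch of zero-shot chain-of-thought prompting for visual
question answering. `query_vlm` is a hypothetical placeholder for
whatever VLM backend is available (a hosted multimodal API or a local
open-weight model); the two-pass prompting pattern is the point."""

from dataclasses import dataclass


@dataclass
class VQAExample:
    image_path: str   # path or URL of the image to reason about
    question: str     # natural-language question about the image


def query_vlm(image_path: str, prompt: str) -> str:
    """Hypothetical stand-in for a vision-language model call.

    Replace the body with a real backend. Here it just echoes its inputs
    so the script runs end to end without downloading a model.
    """
    return f"[model output for image={image_path!r}, prompt={prompt[:60]!r}...]"


def answer_with_cot(example: VQAExample) -> tuple[str, str]:
    """Two-pass chain-of-thought prompting for VQA."""
    # Pass 1: elicit explicit step-by-step visual reasoning.
    rationale_prompt = (
        f"Question: {example.question}\n"
        "Look at the image carefully and think step by step, describing "
        "the relevant objects, attributes, and relations before answering."
    )
    rationale = query_vlm(example.image_path, rationale_prompt)

    # Pass 2: extract a concise final answer from the model's own rationale.
    answer_prompt = (
        f"Question: {example.question}\n"
        f"Reasoning so far: {rationale}\n"
        "Therefore, the answer (a short phrase) is:"
    )
    answer = query_vlm(example.image_path, answer_prompt)
    return rationale, answer


if __name__ == "__main__":
    ex = VQAExample(image_path="kitchen.jpg", question="How many mugs are on the table?")
    rationale, answer = answer_with_cot(ex)
    print("Rationale:", rationale)
    print("Answer:", answer)
```

Separating rationale generation from answer extraction keeps the final answer short and easy to score against VQA ground truth, which is a common motivation for the two-pass form.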
Papers
Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos
Zhimin Shao, Jialang Xu, Danail Stoyanov, Evangelos B. Mazomenos, Yueming Jin
Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding
Jiwan Chung, Sungjae Lee, Minseo Kim, Seungju Han, Ashkan Yousefpour, Jack Hessel, Youngjae Yu
Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA
Elham J. Barezi, Parisa Kordjamshidi