Visual Reasoning
Visual reasoning research aims to enable artificial intelligence systems to understand and reason about visual information, mirroring human cognitive abilities. Current work focuses on developing and evaluating large vision-language models (VLMs) and multimodal large language models (MLLMs), typically built on transformer architectures and combined with techniques such as chain-of-thought prompting and active perception, to improve performance on visual reasoning tasks such as visual question answering and object manipulation. These advances matter because they address limitations of existing AI systems and have potential applications in robotics, medical image analysis, and other fields that require complex visual interpretation and decision-making.
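To make the chain-of-thought prompting idea mentioned above concrete, here is a minimal sketch of applying it to visual question answering with an off-the-shelf VLM. It assumes the Hugging Face transformers library and the publicly available llava-hf/llava-1.5-7b-hf checkpoint as an example model, and assumes a GPU is available; none of these choices come from the papers listed below, and the image URL and question are illustrative only.

```python
# Minimal sketch: chain-of-thought prompting for visual question answering.
# Assumptions (not from the papers below): transformers with LLaVA support,
# the llava-hf/llava-1.5-7b-hf checkpoint, and a CUDA-capable GPU.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed example checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

# Load an example image (any RGB image works here).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Chain-of-thought style prompt: ask the model to describe the relevant
# visual evidence step by step before committing to a final answer.
question = "How many animals are in the picture, and what are they doing?"
prompt = (
    "USER: <image>\n"
    f"{question} Think step by step: first describe the relevant parts of "
    "the image, then give a short final answer.\nASSISTANT:"
)

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

The key design choice is in the prompt rather than the model: asking the VLM to articulate intermediate visual observations before answering is the essence of chain-of-thought prompting for visual reasoning, and benchmarks such as those below evaluate whether such reasoning actually holds up on harder tasks.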
Papers
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, Arman Cohan
VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning
Jingkun Ma, Runzhe Zhan, Derek F. Wong, Yang Li, Di Sun, Hou Pong Chan, Lidia S. Chao