Visual Reasoning
Visual reasoning aims to enable artificial intelligence systems to understand and reason about visual information, mirroring human cognitive abilities. Current research focuses on developing and evaluating large vision-language models (VLMs) and multimodal large language models (MLLMs). These models are often built on transformer architectures and use techniques such as chain-of-thought prompting and active perception to improve performance on visual reasoning tasks such as visual question answering and object manipulation. These advances matter because they address limitations of existing AI systems and could support applications in robotics, medical image analysis, and other fields that require complex visual interpretation and decision-making.
Papers
Large Language Models are Visual Reasoning Coordinators
Liangyu Chen, Bo Li, Sheng Shen, Jingkang Yang, Chunyuan Li, Kurt Keutzer, Trevor Darrell, Ziwei Liu
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou