Multimodal Reasoning
Multimodal reasoning focuses on developing AI systems that can understand and reason over information from multiple modalities, such as text, images, and other sensory data. Current research emphasizes improving the ability of large language models and vision-language models to perform complex reasoning tasks across modalities, often using techniques such as chain-of-thought prompting, knowledge graph integration, and multi-agent debate frameworks. The field is crucial for applications such as healthcare diagnostics, robotics, and fact-checking, where integrating diverse information sources is essential for accurate and reliable decision-making. Developing new benchmark datasets specifically designed to stress multimodal reasoning abilities is another significant area of focus.
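As a concrete illustration of the chain-of-thought prompting mentioned above, the minimal Python sketch below shows one common pattern: ask a vision-language model to lay out its visual evidence step by step before committing to a final answer. The query_vlm function, the prompt template, and the example image and question are hypothetical placeholders rather than any particular paper's method; swap in whichever model client you actually use.

from dataclasses import dataclass

@dataclass
class VQAExample:
    image_path: str
    question: str

def query_vlm(image_path: str, prompt: str) -> str:
    """Hypothetical placeholder for a vision-language model call.
    Replace with a real client (e.g., a local open-weights VLM or a hosted
    multimodal API); here it returns a canned response so the sketch runs
    end to end."""
    return (
        "Step 1: The chart shows monthly energy use for two buildings.\n"
        "Step 2: Building B's bars are lower in every month.\n"
        "Answer: Building B"
    )

COT_TEMPLATE = (
    "You are answering a question about the attached image.\n"
    "Question: {question}\n"
    "First, describe the relevant visual evidence step by step.\n"
    "Then give your final answer on a line starting with 'Answer:'."
)

def answer_with_cot(example: VQAExample) -> str:
    """Run one multimodal chain-of-thought pass and extract the final answer."""
    prompt = COT_TEMPLATE.format(question=example.question)
    response = query_vlm(example.image_path, prompt)
    for line in response.splitlines():
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return response.strip()  # fall back to the raw model response

if __name__ == "__main__":
    example = VQAExample("energy_chart.png", "Which building uses less energy?")
    print(answer_with_cot(example))  # prints "Building B"

The same scaffolding extends naturally to the retrieval-augmented and multi-agent variants studied in the papers below: retrieved evidence or other agents' rationales are simply appended to the prompt before the final "Answer:" step.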
Papers
Progressive Multimodal Reasoning via Active Retrieval
Guanting Dong, Chenghao Zhang, Mengjie Deng, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen
FiVL: A Framework for Improved Vision-Language Alignment
Estelle Aflalo, Gabriela Ben Melech Stan, Tiep Le, Man Luo, Shachar Rosenman, Sayak Paul, Shao-Yen Tseng, Vasudev Lal
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, Xiang Yue
Question Answering for Decisionmaking in Green Building Design: A Multimodal Data Reasoning Method Driven by Large Language Models
Yihui Li, Xiaoyue Yan, Hao Zhou, Borong Lin