Multimodal
Multimodal research integrates and analyzes data from multiple sources (e.g., text, images, audio, sensor data) to achieve a more comprehensive understanding than any single modality allows. Current work emphasizes building robust models, often based on large language models (LLMs) and graph neural networks (GNNs), that handle the complexity of multimodal data and address challenges such as error detection in mathematical reasoning, long-horizon inference, and efficient data fusion. By enabling more nuanced and accurate interpretation of complex real-world scenarios, the field advances AI capabilities across diverse applications, including recommendation systems, assistive robotics, medical diagnosis, and autonomous driving.
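To make the fusion idea concrete, below is a minimal late-fusion sketch in PyTorch: each modality's embedding is projected into a shared space and combined with a small MLP. All module names, dimensions, and the overall design are illustrative assumptions, not the method of any paper listed below.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Minimal late-fusion sketch: project each modality into a shared
    space, then combine by concatenation plus a small MLP. Dimensions
    and names are illustrative assumptions, not from a specific paper."""

    def __init__(self, text_dim: int = 768, image_dim: int = 512, fused_dim: int = 256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, fused_dim)    # e.g. an LLM sentence embedding
        self.image_proj = nn.Linear(image_dim, fused_dim)  # e.g. a vision-encoder embedding
        self.fuse = nn.Sequential(
            nn.Linear(2 * fused_dim, fused_dim),
            nn.ReLU(),
            nn.Linear(fused_dim, fused_dim),
        )

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # Map both modalities into the shared space, concatenate, and fuse.
        t = self.text_proj(text_emb)
        v = self.image_proj(image_emb)
        return self.fuse(torch.cat([t, v], dim=-1))

# Usage: fuse a batch of 4 (text, image) embedding pairs.
model = LateFusion()
fused = model(torch.randn(4, 768), torch.randn(4, 512))
print(fused.shape)  # torch.Size([4, 256])
```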
Papers
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
Himanshu Gupta, Shreyas Verma, Ujjwala Anantheswaran, Kevin Scaria, Mihir Parmar, Swaroop Mishra, Chitta Baral
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
Yibo Yan, Shen Wang, Jiahao Huo, Hang Li, Boyan Li, Jiamin Su, Xiong Gao, Yi-Fan Zhang, Tianlong Xu, Zhendong Chu, Aoxiao Zhong, Kun Wang, Hui Xiong, Philip S. Yu, Xuming Hu, Qingsong Wen
GAMMA-PD: Graph-based Analysis of Multi-Modal Motor Impairment Assessments in Parkinson's Disease
Favour Nerrise, Alice Louise Heiman, Ehsan Adeli (Stanford University, Stanford, CA, USA)
Multimodal Auto Validation For Self-Refinement in Web Agents
Ruhana Azam, Tamer Abuelsaad, Aditya Vempaty, Ashish Jagmohan