Multimodal
Multimodal research focuses on integrating and analyzing data from multiple modalities (e.g., text, images, audio, sensor data) to achieve a more comprehensive understanding than any single modality allows. Current work emphasizes building robust models, often based on large language models (LLMs) and graph neural networks (GNNs), that handle the complexity of multimodal data and address challenges such as error detection in mathematical reasoning, long-horizon inference, and efficient data fusion. The field is significant for advancing AI across diverse applications, including recommendation systems, assistive robotics, medical diagnosis, and autonomous driving, by enabling more nuanced and accurate interpretation of complex real-world scenarios.
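As a rough illustration of the kind of data fusion referred to above, the sketch below combines pre-computed text and image embeddings with a simple late-fusion classifier in PyTorch. The embedding dimensions, module names, and concatenation-based fusion strategy are illustrative assumptions and are not taken from any of the papers listed here.

```python
# Minimal late-fusion sketch (illustrative only): pre-computed text and image
# embeddings are projected to a shared size, concatenated, and classified.
# Dimensions and names are assumptions, not drawn from the papers below.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, image_dim=512, hidden_dim=256, num_classes=10):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)    # project text features
        self.image_proj = nn.Linear(image_dim, hidden_dim)  # project image features
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden_dim, num_classes),         # fuse by concatenation
        )

    def forward(self, text_emb, image_emb):
        fused = torch.cat([self.text_proj(text_emb), self.image_proj(image_emb)], dim=-1)
        return self.classifier(fused)

# Usage with random stand-in embeddings (batch of 4)
model = LateFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 10])
```

More sophisticated approaches (e.g., cross-modal attention or GNN-based fusion) replace the concatenation step, but the overall pattern of projecting each modality into a shared space before combining them is common.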
Papers
Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants
Lixiong Qin, Shilong Ou, Miaoxuan Zhang, Jiangning Wei, Yuhang Zhang, Xiaoshuai Song, Yuchen Liu, Mei Wang, Weiran Xu
Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning
Jian Lang, Zhangtao Cheng, Ting Zhong, Fan Zhou
GROOT-2: Weakly Supervised Multi-Modal Instruction Following Agents
Shaofei Cai, Bowei Zhang, Zihao Wang, Haowei Lin, Xiaojian Ma, Anji Liu, Yitao Liang
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action
Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Caiming Xiong, Ranjay Krishna, Silvio Savarese