Multimodal Input
Multimodal input processing enables artificial intelligence systems to understand and integrate information from multiple modalities, such as text, images, audio, and video, with the goal of achieving a more comprehensive, human-like understanding. Current research emphasizes improving the robustness and efficiency of multimodal large language models (MLLMs), addressing issues such as hallucination, knowledge conflicts between modalities, and missing or incomplete data through techniques including causal inference, active perception evaluation, and masked modality projection. The field is significant because, by enabling more natural and effective human-computer interaction, it underpins advances in applications ranging from robotics and personalized healthcare to improved accessibility of information.
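One recurring idea behind handling missing or incomplete modalities is to project each modality into a shared space and substitute a learned placeholder embedding when a modality is absent. The sketch below illustrates that general pattern in PyTorch; the module name, feature dimensions, and fusion scheme are illustrative assumptions, not the specific method of any paper listed here.

```python
import torch
import torch.nn as nn


class MaskedModalityFusion(nn.Module):
    """Toy fusion module that tolerates a missing modality by substituting
    a learned mask embedding for the absent input (generic illustration)."""

    def __init__(self, text_dim=768, image_dim=512, hidden_dim=256):
        super().__init__()
        # Per-modality projections into a shared hidden space.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # Learned placeholder embeddings used when a modality is missing.
        self.text_mask = nn.Parameter(torch.zeros(hidden_dim))
        self.image_mask = nn.Parameter(torch.zeros(hidden_dim))
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, text_feats=None, image_feats=None):
        batch = (text_feats if text_feats is not None else image_feats).size(0)
        t = (self.text_proj(text_feats) if text_feats is not None
             else self.text_mask.expand(batch, -1))
        v = (self.image_proj(image_feats) if image_feats is not None
             else self.image_mask.expand(batch, -1))
        return self.fuse(torch.cat([t, v], dim=-1))


if __name__ == "__main__":
    model = MaskedModalityFusion()
    text = torch.randn(4, 768)    # e.g. pooled text-encoder outputs
    image = torch.randn(4, 512)   # e.g. pooled image-encoder outputs
    full = model(text_feats=text, image_feats=image)      # both present
    text_only = model(text_feats=text, image_feats=None)  # image missing
    print(full.shape, text_only.shape)  # torch.Size([4, 256]) twice
```

In practice the placeholder (or a dedicated projection from the available modalities) can be trained by randomly masking modalities during training, so the model learns to produce useful fused representations even when inputs are incomplete.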
Papers
Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality
Guanyu Zhou, Yibo Yan, Xin Zou, Kun Wang, Aiwei Liu, Xuming Hu
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models
Ziyue Wang, Chi Chen, Fuwen Luo, Yurui Dong, Yuanchi Zhang, Yuzhuang Xu, Xiaolong Wang, Peng Li, Yang Liu