Multimodal Input
Multimodal input processing focuses on enabling artificial intelligence systems to understand and integrate information from multiple sources, such as text, images, audio, and video, toward a more comprehensive, human-like understanding. Current research emphasizes improving the robustness and efficiency of multimodal large language models (MLLMs), addressing issues such as hallucination, knowledge conflicts between modalities, and missing or incomplete inputs through techniques like causal inference, active perception evaluation, and masked modality projection. The field is significant because it underpins advances in applications such as robotics, personalized healthcare, and information accessibility by enabling more natural and effective human-computer interaction.
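To make the missing-modality idea mentioned above concrete, the following is a minimal, hypothetical sketch (not the method of any paper listed here): each modality is projected into a shared space, and a learned placeholder embedding is substituted when a modality is absent, so the same fusion path handles complete and incomplete inputs. All class names, dimensions, and design choices are illustrative assumptions.

```python
# Illustrative sketch only: tolerate a missing image modality by swapping in a
# learned placeholder embedding before fusion. Names and sizes are hypothetical.
import torch
import torch.nn as nn


class MaskedModalityFusion(nn.Module):
    def __init__(self, text_dim=768, image_dim=512, shared_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared_dim)    # text -> shared space
        self.image_proj = nn.Linear(image_dim, shared_dim)  # image -> shared space
        # Learned stand-in used whenever the image modality is missing.
        self.missing_image = nn.Parameter(torch.zeros(shared_dim))
        self.fusion = nn.Sequential(
            nn.Linear(2 * shared_dim, shared_dim),
            nn.ReLU(),
        )

    def forward(self, text_feat, image_feat=None):
        t = self.text_proj(text_feat)                       # (batch, shared_dim)
        if image_feat is None:
            # Substitute the learned placeholder for the absent modality.
            v = self.missing_image.expand(t.size(0), -1)
        else:
            v = self.image_proj(image_feat)
        return self.fusion(torch.cat([t, v], dim=-1))       # fused representation


# Usage: the same code path serves complete and image-free inputs.
model = MaskedModalityFusion()
text = torch.randn(4, 768)
fused_full = model(text, torch.randn(4, 512))   # both modalities present
fused_missing = model(text, None)               # image modality missing
```

A real system would typically mask at the token level inside the MLLM rather than at a single pooled vector, but the substitution principle is the same.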
Papers
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yugang Jiang, Xipeng Qiu
Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion
Ziyue Wang, Chi Chen, Yiqi Zhu, Fuwen Luo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Maosong Sun, Yang Liu