Multimodal Interaction
Multimodal interaction research focuses on systems that integrate and interpret information from multiple sensory modalities (e.g., text, audio, vision) to enable more natural and effective human-computer interaction. Current work emphasizes robust architectures such as transformers, paired with training methods such as contrastive learning, to fuse multimodal data and infer user intent or emotion, often leveraging large language models for higher-level reasoning. This field is significant for advancing human-robot interaction, improving assistive technologies, and creating more intuitive interfaces for applications including autonomous driving and healthcare.
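To make the contrastive-fusion idea concrete, the sketch below shows a CLIP-style symmetric InfoNCE objective that aligns image and text embeddings in a shared space. This is a minimal illustration under stated assumptions (random tensors stand in for encoder outputs; the function name, dimensions, and temperature value are illustrative), not the method of any paper listed here.

```python
# Minimal sketch: contrastive alignment of two modalities (image/text).
# Encoder outputs are simulated; names and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def contrastive_fusion_loss(image_emb: torch.Tensor,
                            text_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j).
    logits = image_emb @ text_emb.t() / temperature

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Usage with random embeddings standing in for encoder outputs.
batch, dim = 8, 512
loss = contrastive_fusion_loss(torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```

Training with this objective pulls embeddings of paired inputs together and pushes mismatched pairs apart, which is one common way to fuse modalities into a shared representation before higher-level reasoning.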
Papers
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Shaoqi Dong, Xiong Wang, Di Yin, Long Ma, Xiawu Zheng, Ran He, Rongrong Ji, Yunsheng Wu, Caifeng Shan, Xing Sun
DeepInteraction++: Multi-Modality Interaction for Autonomous Driving
Zeyu Yang, Nan Song, Wei Li, Xiatian Zhu, Li Zhang, Philip H. S. Torr