Multimodal Perception
Multimodal perception research aims to build systems that integrate information from multiple sensory modalities (e.g., vision, audio, touch) for improved understanding of and interaction with the environment. Current work focuses on unified model architectures, often transformer-based and incorporating techniques such as cross-modal attention and mixture-of-experts layers, to efficiently process and fuse diverse data streams for tasks such as object detection, segmentation, and robot control. The field is central to advancing artificial intelligence, particularly in robotics and autonomous systems, by enabling more robust, adaptable, and human-like perception in complex real-world scenarios.
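To make the fusion idea concrete, below is a minimal sketch (not taken from the papers listed here) of cross-modal attention in PyTorch: pre-extracted vision tokens query a second modality (e.g., touch or audio) after both are projected into a shared embedding space. The class name, dimensions, and pooling choice are illustrative assumptions, not a reference implementation.

```python
# Minimal cross-modal fusion sketch (assumes PyTorch and pre-extracted
# per-modality token features; all names and sizes are illustrative).
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuse vision and touch/audio token sequences via cross-attention."""

    def __init__(self, vision_dim=768, other_dim=512, fused_dim=256, num_heads=4):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.vision_proj = nn.Linear(vision_dim, fused_dim)
        self.other_proj = nn.Linear(other_dim, fused_dim)
        # Vision tokens act as queries; the other modality provides keys/values.
        self.cross_attn = nn.MultiheadAttention(fused_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(fused_dim)

    def forward(self, vision_tokens, other_tokens):
        q = self.vision_proj(vision_tokens)        # (B, Nv, fused_dim)
        kv = self.other_proj(other_tokens)         # (B, No, fused_dim)
        attended, _ = self.cross_attn(q, kv, kv)   # cross-modal attention
        fused = self.norm(q + attended)            # residual connection
        return fused.mean(dim=1)                   # pooled joint representation


# Example usage with dummy features: batch of 2, 196 vision tokens, 50 touch tokens.
fusion = CrossModalFusion()
vision = torch.randn(2, 196, 768)
touch = torch.randn(2, 50, 512)
print(fusion(vision, touch).shape)  # torch.Size([2, 256])
```

Using one modality as the query and another as keys/values is only one of several fusion patterns; early concatenation or mixture-of-experts routing are common alternatives explored in this line of work.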
Papers
Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset
Ning Cheng, You Li, Jing Gao, Bin Fang, Jinan Xu, Wenjuan Han
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan, Yousong Zhu, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang