Multimodal Agent
Multimodal agents are AI systems designed to perceive and interact with the world through multiple modalities, such as vision and language, with the aim of accomplishing complex tasks beyond the reach of unimodal agents. Current research focuses on building robust agent architectures that couple large language models (LLMs) and vision-language models (VLMs) with memory and planning mechanisms, often trained with reinforcement learning and imitation learning. The field is significant because it pushes AI toward more general-purpose intelligence, with potential applications ranging from automating complex computer tasks and improving human-computer interaction to transforming fields such as healthcare and robotics.
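To make the architectural pattern above concrete, the following is a minimal, illustrative sketch of a perceive-plan-act agent loop with a simple episodic memory. It is not taken from any of the papers listed below: the perceive, plan, and act callables and the EpisodicMemory class are hypothetical placeholders standing in for a VLM perception backend, an LLM planner, and an environment interface (e.g. a mobile or desktop UI).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical type aliases: in practice these would wrap a VLM/LLM API
# and a concrete environment (e.g. a mobile UI or desktop), respectively.
Observation = str   # e.g. a caption or UI description produced by a VLM
Action = str        # e.g. "tap(button_id=3)" or "type('hello')"


@dataclass
class EpisodicMemory:
    """Stores past (observation, action) pairs so the planner can condition on them."""
    episodes: List[Tuple[Observation, Action]] = field(default_factory=list)

    def add(self, obs: Observation, action: Action) -> None:
        self.episodes.append((obs, action))

    def recent(self, k: int = 5) -> List[Tuple[Observation, Action]]:
        return self.episodes[-k:]


def run_agent(
    goal: str,
    perceive: Callable[[], Observation],                 # VLM: image/screen -> text description
    plan: Callable[[str, Observation, list], Action],    # LLM: (goal, obs, memory) -> next action
    act: Callable[[Action], bool],                       # environment: executes action, returns done flag
    max_steps: int = 10,
) -> EpisodicMemory:
    """Generic perceive-plan-act loop shared by many multimodal agent designs."""
    memory = EpisodicMemory()
    for _ in range(max_steps):
        obs = perceive()
        action = plan(goal, obs, memory.recent())
        memory.add(obs, action)
        if act(action):   # environment signals task completion
            break
    return memory


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without any model calls.
    screens = iter(["home screen", "settings open", "wifi menu open"])
    trace = run_agent(
        goal="enable wifi",
        perceive=lambda: next(screens, "done"),
        plan=lambda goal, obs, mem: f"tap relevant item given '{obs}'",
        act=lambda action: action.endswith("'wifi menu open'"),
    )
    for obs, action in trace.episodes:
        print(obs, "->", action)
```

The papers below vary widely in how they instantiate each component (screen parsing versus raw pixels for perception, learned versus prompted planners, different memory formats), but most can be read as variations on this loop.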
Papers
Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions
Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao
CMR-Agent: Learning a Cross-Modal Agent for Iterative Image-to-Point Cloud Registration
Gongxin Yao, Yixin Xuan, Xinyang Li, Yu Pan
AppAgent v2: Advanced Agent for Flexible Mobile Interactions
Yanda Li, Chi Zhang, Wanqi Yang, Bin Fu, Pei Cheng, Xin Chen, Ling Chen, Yunchao Wei
VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs
Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki
LLaSA: Large Multimodal Agent for Human Activity Analysis Through Wearable Sensors
Sheikh Asif Imran, Mohammad Nur Hossain Khan, Subrata Biswas, Bashima Islam