Embodied Vision
Embodied vision research aims to enable artificial agents to understand and act in the world by integrating vision with physical action. Current efforts concentrate on robust, generalizable models, often built around large language models (LLMs) and vision-language models (VLMs) combined with chain-of-thought reasoning and hierarchical skill decomposition to improve the planning and execution of complex tasks. The field is advancing robotics, autonomous driving, and human-computer interaction by producing agents capable of more nuanced perception, reasoning, and interaction in dynamic environments.
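The planning pattern mentioned above is commonly realized by prompting an LLM to reason step by step and then emit a sequence of predefined low-level skills. The sketch below is a minimal illustration of that idea, not code from any of the listed papers: the `query_llm` helper, the skill vocabulary, and the prompt format are all hypothetical, and the LLM call is stubbed with a canned response so the example runs on its own.

```python
from dataclasses import dataclass
from typing import List

# Minimal sketch of hierarchical skill decomposition with an LLM planner.
# `query_llm` is a hypothetical stand-in for any chat-completion API; here it
# returns a canned chain-of-thought plan so the example runs without
# external dependencies.

PLAN_PROMPT = (
    "You are a robot task planner. Think step by step, then output one "
    "low-level skill per line, chosen from: {skills}.\n"
    "Task: {task}\nPlan:"
)

# Illustrative skill vocabulary; a real system would expose its own primitives.
LOW_LEVEL_SKILLS = ["navigate_to", "pick", "place", "open", "close"]


def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real chat-completion request."""
    # Canned reasoning and plan, for illustration only.
    return (
        "Reasoning: the cup is on the table, the shelf is across the room.\n"
        "navigate_to(table)\n"
        "pick(cup)\n"
        "navigate_to(shelf)\n"
        "place(cup, shelf)"
    )


@dataclass
class SkillCall:
    name: str
    args: List[str]


def decompose(task: str) -> List[SkillCall]:
    """Ask the LLM to break a high-level task into low-level skill calls."""
    response = query_llm(PLAN_PROMPT.format(skills=LOW_LEVEL_SKILLS, task=task))
    plan = []
    for line in response.splitlines():
        line = line.strip()
        # Keep only lines that look like skill calls, e.g. "pick(cup)";
        # the free-form reasoning lines are ignored.
        if "(" in line and line.split("(")[0] in LOW_LEVEL_SKILLS:
            name, rest = line.split("(", 1)
            args = [a.strip() for a in rest.rstrip(")").split(",") if a.strip()]
            plan.append(SkillCall(name, args))
    return plan


if __name__ == "__main__":
    for step in decompose("Put the cup from the table onto the shelf"):
        print(step)
```

The key design point this pattern captures is the separation of concerns: the LLM handles high-level reasoning and sequencing, while execution is delegated to a fixed set of low-level skills that the robot already knows how to perform.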
Papers
WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making
Zhilong Zhang, Ruifeng Chen, Junyin Ye, Yihao Sun, Pengyuan Wang, Jingcheng Pang, Kaiyuan Li, Tianshuo Liu, Haoxin Lin, Yang Yu, Zhi-Hua Zhou
MIPD: A Multi-sensory Interactive Perception Dataset for Embodied Intelligent Driving
Zhiwei Li, Tingzhen Zhang, Meihua Zhou, Dandan Tang, Pengwei Zhang, Wenzhuo Liu, Qiaoning Yang, Tianyu Shen, Kunfeng Wang, Huaping Liu