Vision-Language-Action
Vision-Language-Action (VLA) models integrate computer vision, natural language processing, and robotics to enable robots to understand and execute complex tasks specified through natural language commands and visual input. Current research focuses on improving the robustness and generalization of these models, often employing transformer-based architectures and techniques like chain-of-thought prompting to enhance reasoning capabilities, as well as developing efficient training methods and evaluation platforms. This field is significant for advancing embodied AI, with potential applications ranging from surgical assistance and household robotics to autonomous driving and industrial automation.
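As a rough illustration of the shared interface these models expose (visual observation plus a language instruction in, a robot action out), the sketch below fuses ViT-style patch features and tokenized instruction IDs with a small transformer encoder and predicts a discretized action vector. All module names, dimensions, and the action-binning scheme are illustrative assumptions for a toy example, not the architecture of any specific paper listed here.

```python
# Minimal sketch of a VLA-style policy, under assumed toy dimensions:
# patch features + instruction tokens -> transformer fusion -> binned actions.
import torch
import torch.nn as nn


class ToyVLAPolicy(nn.Module):
    def __init__(self, vocab_size=32000, patch_dim=768, d_model=512,
                 n_action_dims=7, n_action_bins=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)   # instruction tokens -> embeddings
        self.patch_proj = nn.Linear(patch_dim, d_model)       # vision patch features -> embeddings
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # One classifier over discretized bins per action dimension,
        # mirroring the common action-tokenization recipe at toy scale.
        self.action_head = nn.Linear(d_model, n_action_dims * n_action_bins)
        self.n_action_dims = n_action_dims
        self.n_action_bins = n_action_bins

    def forward(self, patch_feats, instr_ids):
        # patch_feats: (B, n_patches, patch_dim); instr_ids: (B, n_tokens)
        tokens = torch.cat([self.patch_proj(patch_feats),
                            self.text_embed(instr_ids)], dim=1)
        fused = self.fusion(tokens)          # joint vision-language context
        pooled = fused.mean(dim=1)           # simple pooled summary token
        logits = self.action_head(pooled)
        return logits.view(-1, self.n_action_dims, self.n_action_bins)


if __name__ == "__main__":
    policy = ToyVLAPolicy()
    patches = torch.randn(1, 196, 768)               # e.g. ViT patch features for one camera frame
    instruction = torch.randint(0, 32000, (1, 12))   # tokenized "pick up the red block" (hypothetical)
    action_logits = policy(patches, instruction)
    print(action_logits.shape)                       # torch.Size([1, 7, 256])
```

At inference, taking the argmax bin per action dimension and de-quantizing it back to continuous joint or end-effector commands yields an executable action; training reduces to cross-entropy over the bins, which is one reason discretized action tokenization is a popular design choice in this line of work.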
30 papers
Papers - Page 3
January 16, 2025
GeoManip: Geometric Constraints as General Interfaces for Robot Manipulation
Weiliang Tang, Jia-Hui Pan, Yun-Hui Liu, Masayoshi Tomizuka, Li Erran Li, Chi-Wing Fu, Mingyu Ding
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, Sergey Levine
November 29, 2024
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
Jianping Jiang, Weiye Xiao, Zhengyu Lin, Huaizhong Zhang, Tianxiang Ren, Yang Gao, Zhiqian Lin, Zhongang Cai, Lei Yang, Ziwei Liu
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, Xiaofan Wang, Bei Liu, and 6 more authors