Vision-Language-Action
Vision-Language-Action (VLA) models integrate computer vision, natural language processing, and robotics so that robots can understand and execute complex tasks specified through natural-language instructions and visual input. Current research focuses on improving the robustness and generalization of these models, often employing transformer-based architectures and techniques such as chain-of-thought prompting to enhance reasoning, as well as developing efficient training methods and evaluation platforms. The field is significant for advancing embodied AI, with potential applications ranging from surgical assistance and household robotics to autonomous driving and industrial automation.
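To make the vision-language-action integration concrete, the sketch below shows a toy VLA-style policy: a small vision encoder turns a camera image into tokens, the instruction is embedded token by token, a transformer encoder fuses the two streams, and an action head regresses a continuous end-effector command. Every detail here (the class name ToyVLAPolicy, module sizes, vocabulary size, and the 7-dimensional action) is an illustrative assumption, not the architecture of any paper listed below.

```python
# Illustrative sketch only: a minimal VLA-style policy that fuses image tokens
# and instruction-token embeddings with a transformer encoder, then regresses a
# continuous action. All names, dimensions, and the action format are assumptions.
import torch
import torch.nn as nn


class ToyVLAPolicy(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, action_dim=7):
        super().__init__()
        # Vision encoder: a single strided convolution that turns a 224x224 image
        # into a 14x14 grid of patch-like tokens.
        self.vision = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Language encoder: embeddings for the tokenized instruction.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Fusion: a transformer encoder over the concatenated vision + text tokens.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Action head: e.g., a 6-DoF end-effector delta plus a gripper command.
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, image, instruction_ids):
        # image: (B, 3, 224, 224); instruction_ids: (B, T) integer token ids.
        vis = self.vision(image)                       # (B, d_model, 14, 14)
        vis_tokens = vis.flatten(2).transpose(1, 2)    # (B, 196, d_model)
        txt_tokens = self.text_embed(instruction_ids)  # (B, T, d_model)
        fused = self.fusion(torch.cat([vis_tokens, txt_tokens], dim=1))
        # Pool the fused tokens and map them to an action vector.
        return self.action_head(fused.mean(dim=1))


if __name__ == "__main__":
    policy = ToyVLAPolicy()
    image = torch.randn(1, 3, 224, 224)
    instruction = torch.randint(0, 1000, (1, 12))  # e.g., "pick up the red block"
    print(policy(image, instruction).shape)        # torch.Size([1, 7])
```

Real VLA models typically swap the toy encoders for pretrained vision and language backbones and may emit discretized action tokens instead of a single regression; the sketch is only meant to convey the overall pattern of fusing visual and language tokens before an action head.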
Papers
Towards Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation: An Empirical Study
Zhijie Wang, Zhehua Zhou, Jiayang Song, Yuheng Huang, Zhan Shu, Lei Ma
TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, Jian Tang