Vision-Language-Action
Vision-Language-Action (VLA) models integrate computer vision, natural language processing, and robotics so that robots can understand and execute complex tasks specified through natural language commands and visual input. Current research focuses on improving the robustness and generalization of these models, often with transformer-based architectures and techniques such as chain-of-thought prompting to strengthen reasoning, and on developing efficient training methods and evaluation platforms. The field is significant for advancing embodied AI, with potential applications ranging from surgical assistance and household robotics to autonomous driving and industrial automation.
Papers
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang
Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks
Pranav Guruprasad, Harshvardhan Sikka, Jaewoo Song, Yangyue Wang, Paul Pu Liang
Diffusion Transformer Policy
Zhi Hou, Tianyi Zhang, Yuwen Xiong, Hengjun Pu, Chengyang Zhao, Ronglei Tong, Yu Qiao, Jifeng Dai, Yuntao Chen
How to Build a Pre-trained Multimodal model for Simultaneously Chatting and Decision-making?
Zuojin Tang, Bin Hu, Chenyang Zhao, De Ma, Gang Pan, Bin Liu
A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM
ByungOk Han, Jaehong Kim, Jinhyeok Jang
Towards Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation: An Empirical Study
Zhijie Wang, Zhehua Zhou, Jiayang Song, Yuheng Huang, Zhan Shu, Lei Ma
TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, Jian Tang