Paper ID: 2404.19664 • Published Apr 30, 2024
Towards Generalist Robot Learning from Internet Video: A Survey
Robert McCarthy, Daniel C.H. Tan, Dominik Schmidt, Fernando Acero, Nathan Herr, Yilun Du, Thomas G. Thuruthel, Zhibin Li
Scaling deep learning to massive, diverse internet data has yielded
remarkably general capabilities in visual and natural language understanding
and generation. However, data remains scarce and difficult to collect in
robotics, so robot learning has struggled to attain similarly general
capabilities. Promising Learning from Videos (LfV) methods aim to address the
robotics data bottleneck by augmenting traditional robot data with large-scale
internet video data. This video data offers broad foundational information
regarding physical behaviour and the underlying physics of the world, and thus
can be highly informative for a generalist robot.
In this survey, we present a thorough overview of the emerging field of LfV.
We outline fundamental concepts, including the benefits and challenges of LfV.
We provide a comprehensive review of current methods for extracting knowledge
from large-scale internet video, addressing key challenges in LfV, and boosting
downstream robot and reinforcement learning via the use of video data. The
survey concludes with a critical discussion of challenges and opportunities in
LfV. Here, we advocate for scalable foundation model approaches that can
leverage the full range of available internet video to improve the learning of
robot policies and dynamics models. We hope this survey can inform and catalyse
further LfV research, driving progress towards the development of
general-purpose robots.