Paper ID: 2404.19664

Towards Generalist Robot Learning from Internet Video: A Survey

Robert McCarthy, Daniel C.H. Tan, Dominik Schmidt, Fernando Acero, Nathan Herr, Yilun Du, Thomas G. Thuruthel, Zhibin Li

Scaling deep learning to huge internet-scraped datasets has yielded remarkably general capabilities in natural language processing and in visual understanding and generation. In contrast, data is scarce and expensive to collect in robotics. As a result, robot learning has struggled to match the generality of capabilities observed in other domains. Learning from Videos (LfV) methods seek to address this data bottleneck by augmenting traditional robot data with large internet-scraped video datasets. Such video data may provide models with foundational information regarding physical behaviours and the physics of the world. This holds great promise for improving the generality of our robots. In this survey, we present an overview of the emerging field of LfV. We outline fundamental concepts, including the benefits and challenges of LfV. We provide a comprehensive review of current methods for: extracting knowledge from large-scale internet video; tackling key LfV challenges; and boosting downstream reinforcement learning and robot learning via the use of video data. LfV datasets and benchmarks are also reviewed. The survey closes with a critical discussion of challenges and opportunities, in which we advocate for scalable foundation model approaches that can leverage the full range of available internet video to aid the learning of robot policies and dynamics models. We hope this survey can inform and catalyse further LfV research, facilitating progress towards the development of general-purpose robots.

Submitted: Apr 30, 2024