Vision-Language Planning

Vision-language planning (VLP) aims to build AI systems that can understand and act on instructions combining visual and textual information, bridging the gap between perception and action. Current research focuses on integrating large language and multi-modal models with computer vision techniques, often employing diffusion models and egocentric perspectives to improve task completion in complex, real-world scenarios such as autonomous driving and robotic manipulation. This interdisciplinary work matters for advancing AI in robotics, autonomous systems, and human-computer interaction, ultimately yielding more robust and adaptable intelligent agents.
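
The perception-to-plan-to-action loop these systems share can be sketched concretely. The snippet below is a minimal, hypothetical illustration rather than any specific paper's method: the `Observation` and `Action` types, the `query_vlm` stub, and the replanning budget are all assumptions introduced here, with the model call and the executor left as placeholders.

```python
from dataclasses import dataclass

# --- Hypothetical types for this sketch; real systems use richer state. ---

@dataclass
class Observation:
    """An egocentric camera frame plus the task instruction."""
    image: bytes          # raw frame from the agent's camera
    instruction: str      # natural-language goal, e.g. "put the cup in the sink"

@dataclass
class Action:
    """A single low-level step the robot or vehicle can execute."""
    name: str             # e.g. "navigate", "grasp", "place"
    argument: str         # target object or location

def query_vlm(obs: Observation) -> list[Action]:
    """Placeholder for a vision-language model call. In a real system this
    would prompt a multi-modal model with the image and instruction and
    parse its textual plan into structured actions."""
    # Stubbed plan so the sketch runs end to end.
    return [Action("navigate", "counter"),
            Action("grasp", "cup"),
            Action("place", "sink")]

def execute(action: Action) -> bool:
    """Placeholder executor; a real agent would call a controller here."""
    print(f"executing {action.name}({action.argument})")
    return True  # pretend the step succeeded

def vlp_episode(obs: Observation, max_replans: int = 3) -> bool:
    """Closed-loop planning: generate a plan, execute it, and re-query the
    model on failure so perception feedback can repair the plan."""
    for _ in range(max_replans):
        plan = query_vlm(obs)
        if all(execute(step) for step in plan):
            return True
        # On failure, a real system would capture a fresh observation here.
    return False

if __name__ == "__main__":
    task = Observation(image=b"", instruction="put the cup in the sink")
    print("task completed:", vlp_episode(task))
```

The closed loop, rather than the one-shot plan, is the design point: re-querying the model with fresh observations after a failed step is how these systems use perception feedback to stay robust in changing environments.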

Papers