Paper ID: 2409.15867 • Published Sep 24, 2024
In-Context Ensemble Learning from Pseudo Labels Improves Video-Language Models for Low-Level Workflow Understanding
Moucheng Xu, Evangelos Chatzaroulas, Luc McCutcheon, Abdul Ahad, Hamzah Azeem, Janusz Marecki, Ammar Anwar
A Standard Operating Procedure (SOP) is a low-level, step-by-step
written guide for a business software workflow. SOP generation is a crucial
step towards automating end-to-end software workflows. Manually creating SOPs
can be time-consuming. Recent advancements in large video-language models offer
the potential for automating SOP generation by analyzing recordings of human
demonstrations. However, current large video-language models face challenges
with zero-shot SOP generation. In this work, we first explore in-context
learning with video-language models for SOP generation. We then propose an
exploration-focused strategy called In-Context Ensemble Learning, to aggregate
pseudo labels of multiple possible paths of SOPs. The proposed in-context
ensemble learning also enables the models to learn beyond their context window
limit with an implicit consistency regularisation. We report that in-context
learning helps video-language models to generate more temporally accurate SOPs,
and the proposed in-context ensemble learning can consistently enhance the
capabilities of the video-language models in SOP generation.
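
The abstract describes the aggregation idea only at a high level, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes the in-context examples are split into batches that each fit the context window, that each batch yields one pseudo-label SOP for the query video, and that a final call sees only those pseudo labels. The names `build_prompt`, `in_context_ensemble`, `generate`, and `batch_size` are hypothetical, and videos are stood in for by plain string identifiers.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in: (video identifier, reference SOP text).
Example = Tuple[str, str]


def build_prompt(examples: Sequence[Example],
                 pseudo_labels: Sequence[str],
                 query_video: str) -> str:
    """Assemble a text prompt from examples and/or candidate SOPs (assumed format)."""
    parts: List[str] = []
    for video_id, sop in examples:
        parts.append(f"Example video: {video_id}\nSOP: {sop}")
    for i, label in enumerate(pseudo_labels, 1):
        parts.append(f"Candidate SOP {i}: {label}")
    parts.append(f"Write the SOP for video: {query_video}")
    return "\n\n".join(parts)


def in_context_ensemble(generate: Callable[[str], str],
                        examples: Sequence[Example],
                        query_video: str,
                        batch_size: int) -> Tuple[str, List[str]]:
    """Two-stage sketch: per-batch pseudo labels, then one aggregating call."""
    # Stage 1: each batch of in-context examples produces one pseudo label,
    # so no single call has to hold every example in its context window.
    pseudo_labels: List[str] = []
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        pseudo_labels.append(generate(build_prompt(batch, [], query_video)))

    # Stage 2: the final call conditions only on the pseudo labels.
    final_sop = generate(build_prompt([], pseudo_labels, query_video))
    return final_sop, pseudo_labels
```

Here `generate` is any callable that maps a prompt to model output, so a real video-language model client or a stub can be passed in. Under this reading, the final call draws on all examples indirectly through the pseudo labels, which is one way a model could use more demonstrations than fit in a single context window.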