Video Scene Graph

Video scene graphs (VSGs) represent a video as a structured graph in which objects appear as nodes and their actions and relationships form temporally grounded edges. Current research focuses on improving VSG generation across video types, including egocentric and instructional videos, often employing self-supervised learning and multi-modal approaches that leverage audio narration or other auxiliary signals. These advances aim to overcome limitations of existing methods, such as reliance on noisy proposals or incomplete annotations, yielding more accurate and interpretable video understanding. Better VSG representations, in turn, benefit applications such as video synthesis, activity summarization, and action anticipation.
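
As a rough illustration only (not tied to any particular paper above), a VSG can be thought of as a set of object nodes plus subject-predicate-object relations grounded in frame intervals. The minimal Python sketch below uses hypothetical `ObjectNode`, `RelationEdge`, and `VideoSceneGraph` classes to show this structure and a simple temporal query.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectNode:
    """An object instance tracked across frames (e.g. "person", "cup")."""
    node_id: int
    category: str


@dataclass(frozen=True)
class RelationEdge:
    """A subject-predicate-object triplet grounded in a frame interval."""
    subject_id: int
    predicate: str          # e.g. "picks_up", "drinks_from"
    object_id: int
    start_frame: int
    end_frame: int


@dataclass
class VideoSceneGraph:
    """Illustrative container: objects as nodes, temporal relations as edges."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_object(self, node_id: int, category: str) -> None:
        self.nodes[node_id] = ObjectNode(node_id, category)

    def add_relation(self, subj: int, pred: str, obj: int,
                     start: int, end: int) -> None:
        self.edges.append(RelationEdge(subj, pred, obj, start, end))

    def relations_at(self, frame: int):
        """Return (subject, predicate, object) triplets active at a frame."""
        return [
            (self.nodes[e.subject_id].category, e.predicate,
             self.nodes[e.object_id].category)
            for e in self.edges
            if e.start_frame <= frame <= e.end_frame
        ]


# Usage: a person picks up a cup, then drinks from it.
vsg = VideoSceneGraph()
vsg.add_object(0, "person")
vsg.add_object(1, "cup")
vsg.add_relation(0, "picks_up", 1, start=10, end=25)
vsg.add_relation(0, "drinks_from", 1, start=26, end=80)

print(vsg.relations_at(15))   # [('person', 'picks_up', 'cup')]
print(vsg.relations_at(40))   # [('person', 'drinks_from', 'cup')]
```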

Papers