Video Grounding

Video grounding locates the temporal segment of a video that corresponds to a given text or spoken-language query. Current research focuses on improving the scalability and accuracy of grounding models, particularly for long videos and complex queries, using techniques such as late fusion, efficient sampling, and transformer architectures with learnable tokens or dynamic moment queries. These advances support applications such as video retrieval, summarization, and question answering, and they feed into the development of more robust and efficient multimodal learning models.
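
To make the task concrete, here is a minimal sketch of late-fusion grounding on toy data. It is not the method of any particular paper. It assumes the video has already been split into fixed-length clips and that clips and query have been embedded independently by some encoder; the names `cosine` and `ground_query`, the `clip_seconds` and `threshold` parameters, and the use of a maximum-sum contiguous run to pick the segment are all illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ground_query(
    clip_embeddings: List[Sequence[float]],
    query_embedding: Sequence[float],
    clip_seconds: float = 2.0,   # assumed clip length, hypothetical
    threshold: float = 0.5,      # assumed relevance cutoff, hypothetical
) -> Tuple[float, float]:
    """Return (start_sec, end_sec) of the contiguous clip run that best matches the query.

    Late fusion: clips and query are embedded separately and interact
    only through the similarity score computed here.
    """
    if not clip_embeddings:
        raise ValueError("need at least one clip")

    # Shift scores so that clips below the threshold count against a segment.
    scores = [cosine(c, query_embedding) - threshold for c in clip_embeddings]

    # Kadane's algorithm: find the maximum-sum contiguous run of clips.
    best_sum, best_start, best_end = scores[0], 0, 0
    run_sum, run_start = scores[0], 0
    for i in range(1, len(scores)):
        if run_sum <= 0:
            run_sum, run_start = scores[i], i
        else:
            run_sum += scores[i]
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i

    return best_start * clip_seconds, (best_end + 1) * clip_seconds


# Toy 2-D embeddings: clips 2 to 4 point in the same direction as the query.
clips = [(1, 0), (1, 0), (0, 1), (0, 1), (0, 1), (1, 0)]
query = (0, 1)
print(ground_query(clips, query))  # (4.0, 10.0)
```

Because the clip embeddings do not depend on the query, they can be computed once per video and reused across queries, which is one reason late fusion is attractive for long videos.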

Papers