Video-Language Pre-Training
Video-language pre-training (VLP) aims to learn shared representations between video and text data through self-supervised learning, improving performance on downstream tasks such as video retrieval and video question answering. Current research emphasizes efficiency, using techniques such as hierarchical representations, fine-grained spatio-temporal alignment, and parameter-efficient adaptation to reduce computational cost and improve generalization. These advances matter because they enable more robust and efficient video understanding systems, with applications ranging from better search to more capable AI assistants.
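To make the shared-representation idea concrete, the sketch below shows one common VLP objective: projecting video and text features into a joint embedding space and aligning paired clips and captions with a symmetric InfoNCE (contrastive) loss. This is a minimal illustration of the general technique, not the method of any paper listed here; the module name, feature dimensions, and pooling choice are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextContrastive(nn.Module):
    """Minimal video-text contrastive alignment head (illustrative sketch).

    Assumes precomputed frame features (e.g. from a visual backbone) and
    sentence features (e.g. pooled transformer outputs); both are projected
    into a shared space and aligned with a symmetric InfoNCE loss.
    """

    def __init__(self, video_dim=768, text_dim=512, embed_dim=256, temperature=0.07):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Learnable temperature, stored in log space for stability.
        self.log_temp = nn.Parameter(torch.log(torch.tensor(1.0 / temperature)))

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, video_dim) frame features; mean-pool over time.
        # text_feats:  (B, text_dim) caption features.
        v = F.normalize(self.video_proj(video_feats.mean(dim=1)), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)

        logits = self.log_temp.exp() * v @ t.t()            # (B, B) similarities
        targets = torch.arange(v.size(0), device=v.device)  # matched pairs on diagonal

        # Symmetric InfoNCE: video-to-text and text-to-video cross-entropy.
        loss_v2t = F.cross_entropy(logits, targets)
        loss_t2v = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_v2t + loss_t2v)

if __name__ == "__main__":
    model = VideoTextContrastive()
    video = torch.randn(8, 16, 768)   # batch of 8 clips, 16 frames each
    text = torch.randn(8, 512)        # 8 paired captions
    print(model(video, text))         # scalar contrastive loss
```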
Papers
Egocentric Video-Language Pretraining @ Ego4D Challenge 2022
Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou
Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022
Kevin Qinghong Lin, Alex Jinpeng Wang, Rui Yan, Eric Zhongcong Xu, Rongcheng Tu, Yanru Zhu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Wei Liu, Mike Zheng Shou