Video Language Model
Video Language Models (VLMs) bridge visual and textual information in videos, enabling machines to understand, describe, and reason about video content. Current research focuses on improving VLM performance through larger pretraining datasets, more efficient architectures (such as transformer-based models and those incorporating memory mechanisms), and training strategies such as contrastive learning and instruction tuning. These advances are crucial for applications ranging from automated video captioning and question answering to robotic control and unusual-activity detection, driving significant progress in both computer vision and natural language processing.
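To make the contrastive pretraining objective mentioned above concrete, the sketch below computes a symmetric InfoNCE loss between pooled video-frame features and text features, in the spirit of CLIP-style video-text alignment. The module names, feature dimensions, and temperature are illustrative assumptions, not details taken from the papers listed here.

```python
# Minimal sketch of video-text contrastive pretraining (symmetric InfoNCE).
# Dimensions, pooling choice, and temperature are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextContrastive(nn.Module):
    def __init__(self, video_dim=768, text_dim=512, embed_dim=256, temperature=0.07):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.temperature = temperature

    def forward(self, video_feats, text_feats):
        # video_feats: (batch, num_frames, video_dim) frame-level features
        # text_feats:  (batch, text_dim) pooled sentence features
        video_emb = self.video_proj(video_feats.mean(dim=1))  # mean-pool over frames
        text_emb = self.text_proj(text_feats)

        video_emb = F.normalize(video_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # Similarity matrix: matched video-text pairs lie on the diagonal.
        logits = video_emb @ text_emb.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)

        # Symmetric loss: video-to-text and text-to-video directions.
        loss_v2t = F.cross_entropy(logits, targets)
        loss_t2v = F.cross_entropy(logits.t(), targets)
        return (loss_v2t + loss_t2v) / 2

# Usage with random placeholder features.
model = VideoTextContrastive()
videos = torch.randn(8, 16, 768)   # 8 clips, 16 frames each
texts = torch.randn(8, 512)        # 8 paired captions
loss = model(videos, texts)
```

In practice the video and text features would come from trained encoders (e.g., a frame-level visual backbone and a text transformer), and the in-batch negatives make each caption a negative example for every non-matching clip.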
Papers
Egocentric Video-Language Pretraining @ Ego4D Challenge 2022
Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou
Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022
Kevin Qinghong Lin, Alex Jinpeng Wang, Rui Yan, Eric Zhongcong Xu, Rongcheng Tu, Yanru Zhu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Wei Liu, Mike Zheng Shou