Video Language Model
Video Language Models (VLMs) bridge visual and textual information in videos, enabling machines to understand, describe, and reason about video content. Current research focuses on improving VLM performance through larger pretraining datasets, more efficient architectures (such as transformer-based models and those incorporating memory mechanisms), and training strategies such as contrastive learning and instruction tuning. These advances are crucial for applications ranging from automated video captioning and question answering to robotic control and unusual-activity detection, driving significant progress in both computer vision and natural language processing.
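To make the contrastive pretraining objective mentioned above concrete, the sketch below computes a symmetric InfoNCE loss between pooled video-frame features and text features, in the spirit of CLIP-style video-text alignment. The module names, feature dimensions, and temperature are illustrative assumptions, not details taken from the papers listed here.

```python
# Minimal sketch of video-text contrastive pretraining (symmetric InfoNCE).
# Dimensions, pooling choice, and temperature are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextContrastive(nn.Module):
    def __init__(self, video_dim=768, text_dim=512, embed_dim=256, temperature=0.07):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.temperature = temperature

    def forward(self, video_feats, text_feats):
        # video_feats: (batch, num_frames, video_dim) frame-level features
        # text_feats:  (batch, text_dim) pooled sentence features
        video_emb = self.video_proj(video_feats.mean(dim=1))  # mean-pool over frames
        text_emb = self.text_proj(text_feats)

        video_emb = F.normalize(video_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # Similarity matrix: matched video-text pairs lie on the diagonal.
        logits = video_emb @ text_emb.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)

        # Symmetric loss: video-to-text and text-to-video directions.
        loss_v2t = F.cross_entropy(logits, targets)
        loss_t2v = F.cross_entropy(logits.t(), targets)
        return (loss_v2t + loss_t2v) / 2

# Usage with random placeholder features.
model = VideoTextContrastive()
videos = torch.randn(8, 16, 768)   # 8 clips, 16 frames each
texts = torch.randn(8, 512)        # 8 paired captions
loss = model(videos, texts)
```

In practice the video and text features would come from trained encoders (e.g., a frame-level visual backbone and a text transformer), and the in-batch negatives make each caption a negative example for every non-matching clip.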
Papers
Egocentric Video-Language Pretraining @ Ego4D Challenge 2022
Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou
Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022
Kevin Qinghong Lin, Alex Jinpeng Wang, Rui Yan, Eric Zhongcong Xu, Rongcheng Tu, Yanru Zhu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Wei Liu, Mike Zheng Shou