Video Language
Video language research focuses on enabling computers to understand and generate natural-language descriptions of videos, bridging the gap between visual and textual information. Current research emphasizes efficient model architectures, often transformer-based, that address the computational cost of processing long videos and complex language queries; techniques such as mixture-of-depths and masked autoencoders are used to improve efficiency and performance. The field is significant because it underpins applications including video retrieval, question answering, captioning, and robotics, driving progress in both computer vision and natural language processing. Improved video-language models are crucial for building more intuitive human-computer interfaces and more capable AI systems.
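The retrieval setting mentioned above usually reduces to comparing a video embedding against a text embedding. As a minimal sketch (with random vectors standing in for real transformer encoder outputs, which are an assumption here), frame features can be mean-pooled into one video vector and scored against a query by cosine similarity:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale a vector (or batch of vectors) to unit length.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def video_text_similarity(frame_embeddings, text_embedding):
    """Score one video against one text query.

    Frame embeddings are mean-pooled into a single video vector,
    then compared to the text embedding by cosine similarity --
    the simplest form of video-text matching used in retrieval.
    """
    video_vec = l2_normalize(frame_embeddings.mean(axis=0))
    text_vec = l2_normalize(text_embedding)
    return float(video_vec @ text_vec)

# Toy example: random "encoder" outputs, 2 videos of 8 frames each,
# 64-d features. In practice these would come from trained encoders.
rng = np.random.default_rng(0)
videos = [rng.normal(size=(8, 64)) for _ in range(2)]
query = rng.normal(size=64)
scores = [video_text_similarity(v, query) for v in videos]
best = int(np.argmax(scores))  # index of the best-matching video
```

In a real system the query would be encoded once and compared against precomputed video vectors, so retrieval over a large corpus is a single matrix-vector product followed by a top-k selection.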
Papers
Object-aware Video-language Pre-training for Retrieval
Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, Xiaohu Qie, Mike Zheng Shou
MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
Mattia Soldan, Alejandro Pardo, Juan León Alcázar, Fabian Caba Heilbron, Chen Zhao, Silvio Giancola, Bernard Ghanem