Text-to-Video Generation
Text-to-video generation aims to create videos from textual descriptions, bridging the gap between human language and visual media. Current research relies heavily on diffusion models, often built around 3D U-Nets or transformer architectures, and focuses on improving video quality, temporal consistency, controllability (including camera movement and object manipulation), and compositionality, i.e., the ability to synthesize videos with multiple interacting elements. These advances have significant implications for film production, animation, and virtual reality, because they automate video creation and enable more precise control over generated content.
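Since the overview above centers on diffusion-based video models, a toy sketch may help make the mechanics concrete. The PyTorch code below is a minimal, hypothetical illustration, not any specific paper's method: Tiny3DDenoiser stands in for a text-conditioned 3D U-Net (its Conv3d layers mix information across space and time), and sample_video runs a DDPM-style reverse loop over a latent video tensor shaped (batch, channels, frames, height, width). All class names, shapes, and hyperparameters here are invented for illustration.

```python
# Illustrative sketch only: a DDPM-style reverse loop over a video tensor,
# denoised by a toy stand-in for a text-conditioned 3D U-Net.
import torch
import torch.nn as nn

class Tiny3DDenoiser(nn.Module):
    """Placeholder for a text-conditioned 3D U-Net: Conv3d layers mix
    information across space *and* time, the ingredient behind the
    temporal consistency of video diffusion models."""
    def __init__(self, channels=4, text_dim=32):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, channels)
        self.net = nn.Sequential(
            nn.Conv3d(channels, 64, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(64, channels, kernel_size=3, padding=1),
        )

    def forward(self, x, t, text_emb):
        # Inject the text condition as a per-channel bias; real models use
        # cross-attention, and also embed the timestep t (unused here).
        cond = self.text_proj(text_emb)[:, :, None, None, None]
        return self.net(x + cond)

@torch.no_grad()
def sample_video(model, text_emb, steps=50, shape=(1, 4, 16, 32, 32)):
    """DDPM-style ancestral sampling with a linear beta schedule."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)  # start from pure noise over all frames at once
    for t in reversed(range(steps)):
        eps = model(x, t, text_emb)  # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # a (latent) video: all frames share one denoising trajectory

model = Tiny3DDenoiser()
text_emb = torch.randn(1, 32)  # stand-in for a frozen text-encoder output
video = sample_video(model, text_emb)
print(video.shape)  # torch.Size([1, 4, 16, 32, 32])
```

Denoising all frames jointly through spatiotemporal (3D) convolutions or attention, rather than generating each frame independently, is what distinguishes these models from image diffusion and is the main source of their temporal consistency.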