Text-to-Video Generation
Text-to-video generation aims to create videos from textual descriptions, bridging the gap between human language and visual media. Current research heavily utilizes diffusion models, often incorporating 3D U-Nets or transformer architectures, and focuses on improving video quality, temporal consistency, controllability (including camera movement and object manipulation), and compositional capabilities—the ability to synthesize videos with multiple interacting elements. These advancements hold significant implications for various fields, including film production, animation, and virtual reality, by automating video creation and enabling more precise control over generated content.
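To make the diffusion-based approach described above concrete, the sketch below shows how a pretrained text-to-video diffusion pipeline (one built around a 3D U-Net denoiser) is typically invoked with the Hugging Face diffusers library. The specific checkpoint name and prompt are illustrative assumptions, not tied to the papers listed here.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load a pretrained text-to-video diffusion pipeline (3D U-Net denoiser).
# The checkpoint name is an example; any diffusers text-to-video model works.
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate a short clip from a text prompt; the denoiser iteratively refines
# a block of latent frames jointly, which is what enforces temporal consistency.
prompt = "a panda playing guitar on a beach at sunset"
frames = pipe(prompt, num_inference_steps=25, num_frames=16).frames[0]

# Write the generated frames out as an mp4 file.
export_to_video(frames, output_video_path="panda.mp4")
```

This is a minimal usage sketch; the papers below build on this kind of pipeline with auto-regressive frame generation and divide-and-conquer decompositions rather than replacing it.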
Papers
ART·V: Auto-Regressive Text-to-Video Generation with Diffusion Models
Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, Chong Luo, Yueyi Zhang, Zhiwei Xiong
MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
Yanhui Wang, Jianmin Bao, Wenming Weng, Ruoyu Feng, Dacheng Yin, Tao Yang, Jingxu Zhang, Qi Dai, Zhiyuan Zhao, Chunyu Wang, Kai Qiu, Yuhui Yuan, Chuanxin Tang, Xiaoyan Sun, Chong Luo, Baining Guo