Text-to-Video
Text-to-video (T2V) generation aims to create realistic videos from textual descriptions; current work focuses on improving temporal consistency, handling multiple objects and actions, and enhancing controllability. Research in this area relies heavily on diffusion models, often building on pre-trained text-to-image models and incorporating architectures such as Diffusion Transformers (DiT) and spatial-temporal attention mechanisms to improve video quality and coherence. This rapidly evolving field has significant implications for content creation, education, and other applications, and it is driving advances in both model architectures and evaluation methodologies to address challenges such as hallucination and compositional generation.
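Because spatial-temporal attention is central to how T2V models extend image backbones to video, a minimal sketch may help. The block below factorizes attention into a spatial pass within each frame followed by a temporal pass across frames at each spatial location; all class and variable names are illustrative and not drawn from any specific paper.

```python
# A minimal sketch of factorized spatial-temporal attention, a common way
# T2V models extend pre-trained text-to-image backbones. Names here are
# illustrative assumptions, not any particular paper's implementation.
import torch
import torch.nn as nn


class SpatialTemporalBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, dim)
        b, t, s, d = x.shape

        # Spatial attention: each frame attends over its own tokens.
        xs = x.reshape(b * t, s, d)
        xs_n = self.norm1(xs)
        attn_out, _ = self.spatial_attn(xs_n, xs_n, xs_n)
        xs = xs + attn_out

        # Temporal attention: each spatial location attends across frames.
        xt = xs.reshape(b, t, s, d).permute(0, 2, 1, 3).reshape(b * s, t, d)
        xt_n = self.norm2(xt)
        attn_out, _ = self.temporal_attn(xt_n, xt_n, xt_n)
        xt = xt + attn_out

        # Restore (batch, frames, tokens_per_frame, dim).
        return xt.reshape(b, s, t, d).permute(0, 2, 1, 3)


# Usage: 2 clips, 8 frames, 16x16 latent tokens per frame, 64-dim features.
block = SpatialTemporalBlock(dim=64)
video_tokens = torch.randn(2, 8, 256, 64)
out = block(video_tokens)
print(out.shape)  # torch.Size([2, 8, 256, 64])
```

The appeal of this factorization is cost: instead of full attention over all t·s tokens of a clip at once, the model attends over s tokens per frame and t tokens per location, which makes it practical to bolt temporal layers onto pre-trained text-to-image weights.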