Head Generation
Head generation, here primarily audio-driven talking head synthesis, aims to create realistic and expressive video portraits from audio input. Current research relies heavily on diffusion models and transformer variants, often incorporating techniques such as disentangling facial attributes (e.g., expression, pose, and lip movement) and hierarchical diffusion processes to improve both fidelity and control over the generated video. The field matters for applications in film, virtual reality, and digital human creation, and it drives advances in both computer vision and generative modeling. Ongoing work also emphasizes realism, including high-frequency detail and natural head motion, and extends these systems to multiple languages and emotional expression.
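As a rough illustration of how audio conditioning typically enters such diffusion models, the PyTorch sketch below shows one denoising module in which disentangled face-motion latents cross-attend to per-frame audio features. This is a minimal sketch under stated assumptions, not the method of any paper listed here; every module name, dimension, and the expression/pose/lip latent split is an illustrative assumption.

```python
# Minimal sketch of an audio-conditioned diffusion denoiser over disentangled
# face-motion latents. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class AudioConditionedDenoiser(nn.Module):
    def __init__(self, latent_dim=256, audio_dim=128, n_heads=4):
        super().__init__()
        # Embed the scalar diffusion timestep into the latent space.
        self.time_embed = nn.Sequential(
            nn.Linear(1, latent_dim), nn.SiLU(), nn.Linear(latent_dim, latent_dim)
        )
        # Cross-attention lets motion latents attend to audio features; this is
        # the usual mechanism by which audio drives the generated motion.
        self.cross_attn = nn.MultiheadAttention(
            latent_dim, n_heads, kdim=audio_dim, vdim=audio_dim, batch_first=True
        )
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.SiLU(), nn.Linear(latent_dim, latent_dim)
        )

    def forward(self, noisy_latents, t, audio_feats):
        # noisy_latents: (B, T, latent_dim) noised motion codes per video frame
        # t:             (B,) diffusion timestep per sample
        # audio_feats:   (B, T, audio_dim) per-frame audio embeddings
        h = noisy_latents + self.time_embed(t.view(-1, 1, 1).float())
        attn_out, _ = self.cross_attn(h, audio_feats, audio_feats)
        return self.mlp(h + attn_out)  # predicted noise

# Toy usage: predict the noise added to concatenated expression|pose|lip codes.
B, T = 2, 25
model = AudioConditionedDenoiser()
latents = torch.randn(B, T, 256)   # hypothetical disentangled motion latents
audio = torch.randn(B, T, 128)     # e.g. features from a pretrained audio encoder
t = torch.randint(0, 1000, (B,))   # diffusion timestep per sample
eps_hat = model(latents, t, audio)
print(eps_hat.shape)               # torch.Size([2, 25, 256])
```

In a full pipeline this denoiser would run inside a standard diffusion sampling loop, with the resulting motion latents decoded into video frames by a separate renderer; both of those stages are omitted here.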
Papers
EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion
Haotian Wang, Yuzhe Weng, Yueyan Li, Zilu Guo, Jun Du, Shutong Niu, Jiefeng Ma, Shan He, Xiaoyan Wu, Qiming Hu, Bing Yin, Cong Liu, Qingfeng Liu
ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance
Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang