Co-Speech Gesture Generation

Co-speech gesture generation aims to synthesize realistic, natural hand and body movements synchronized with spoken language, primarily for virtual agents and avatars. Current research relies heavily on diffusion models and transformers, often conditioning on multimodal inputs such as text and emotion alongside audio, to improve gesture realism, semantic alignment with speech, and controllability. These advances are enabling more engaging and believable virtual interactions, with applications ranging from virtual assistants to video game characters and the Metaverse. The field is also exploring efficient inference strategies and the impact of different gesture representations (2D vs. 3D) on generated motion quality.
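To make the diffusion-based approach concrete, the following is a minimal, hypothetical sketch of ancestral (DDPM-style) sampling for a gesture sequence conditioned on audio features. All sizes, the noise schedule, and the "denoiser" (here a fixed random linear map standing in for a trained conditional transformer) are illustrative assumptions, not any specific paper's method.

```python
import numpy as np

# Hypothetical sketch: gestures are a (T, J) array of per-frame joint
# values, sampled by reversing a DDPM noising process conditioned on
# per-frame audio features. The denoiser is a placeholder, not a
# trained network.

rng = np.random.default_rng(0)

T_FRAMES, N_JOINTS, N_AUDIO = 30, 15, 8  # toy sizes (assumptions)
N_STEPS = 50                             # diffusion steps

# Linear noise schedule
betas = np.linspace(1e-4, 0.02, N_STEPS)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Stand-in for a trained conditional noise predictor eps_theta(x_t, t, audio):
# a fixed random projection mixing noisy gestures with audio features.
W_x = rng.normal(scale=0.1, size=(N_JOINTS, N_JOINTS))
W_a = rng.normal(scale=0.1, size=(N_AUDIO, N_JOINTS))

def predict_noise(x_t, t, audio):
    # Placeholder for the learned transformer denoiser.
    return x_t @ W_x + audio @ W_a

def sample_gestures(audio):
    """Ancestral sampling: start from pure noise, denoise step by step."""
    x = rng.normal(size=(T_FRAMES, N_JOINTS))
    for t in reversed(range(N_STEPS)):
        eps = predict_noise(x, t, audio)
        # DDPM posterior mean update
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.normal(size=x.shape)
    return x

audio_feats = rng.normal(size=(T_FRAMES, N_AUDIO))
motion = sample_gestures(audio_feats)
print(motion.shape)
```

In a real system the placeholder `predict_noise` would be a transformer trained on paired speech-and-motion data, and the audio features would come from a pretrained speech encoder; extra conditions (text, emotion labels) are typically concatenated or injected via cross-attention in the same way the audio features are here.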

Papers