Audio Gesture

Audio-gesture research aims to generate realistic, semantically meaningful human gestures that are synchronized with speech, primarily to drive lifelike virtual avatars and improve human-computer interaction. Current work centers on deep learning models, including transformers and diffusion models, often with hierarchical structures that capture the interplay between speech rhythm, semantics, and gesture articulation. These advances are improving the quality and naturalness of synthesized gestures, with applications ranging from virtual assistants to more expressive communication technologies. The field is also comparing 2D and 3D representations of gesture data and exploring efficient multimodal fusion techniques to improve performance and support real-time use.

Papers