Gesture Synthesis

Gesture synthesis aims to automatically generate realistic, contextually appropriate 3D gestures from input modalities such as speech and text. Current research relies heavily on diffusion models and transformers, often with multimodal conditioning (e.g., audio, text, emotion) to improve the semantic and rhythmic alignment of the synthesized gestures with the input. The field is central to human-computer interaction, virtual reality, and animation, enabling more natural and engaging digital characters and robots. Building large, high-quality datasets is also a major focus, since they allow the training of more robust and generalizable models.
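To make the diffusion-based approach concrete, the sketch below shows a toy DDPM-style forward-noising and conditional reverse-denoising step over a simplified 1-D pose vector. It is illustrative only: `fake_denoiser` is a hypothetical stand-in for a trained transformer that would predict noise from the noisy pose plus audio/text conditioning features, and real systems operate on full 3D joint rotations, not scalars.

```python
import math
import random

def make_noise_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule with cumulative alpha products (standard DDPM)."""
    betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, noise):
    """Forward diffusion: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    s1 = math.sqrt(alpha_bars[t])
    s2 = math.sqrt(1.0 - alpha_bars[t])
    return [s1 * x + s2 * n for x, n in zip(x0, noise)]

def fake_denoiser(x_t, t, cond):
    # Hypothetical stand-in for a trained transformer that predicts the
    # added noise conditioned on speech/text features (`cond`).
    # An untrained stub: it simply predicts zero noise.
    return [0.0 for _ in x_t]

def p_sample_step(x_t, t, cond, betas, alphas, alpha_bars):
    """One conditional reverse (denoising) step of DDPM sampling."""
    eps_hat = fake_denoiser(x_t, t, cond)
    coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
    mean = [(x - coef * e) / math.sqrt(alphas[t]) for x, e in zip(x_t, eps_hat)]
    if t > 0:
        sigma = math.sqrt(betas[t])  # fixed-variance choice
        return [m + sigma * random.gauss(0.0, 1.0) for m in mean]
    return mean

if __name__ == "__main__":
    T = 10
    betas, alphas, alpha_bars = make_noise_schedule(T)
    pose = [0.5, -0.2, 0.1]                # toy "gesture" pose vector
    noise = [random.gauss(0.0, 1.0) for _ in pose]
    x_t = q_sample(pose, T - 1, alpha_bars, noise)
    x_prev = p_sample_step(x_t, T - 1, cond=None, betas=betas,
                           alphas=alphas, alpha_bars=alpha_bars)
    print(len(x_prev))
```

In a full system, the loop would run from `t = T - 1` down to `0`, with the denoiser attending to per-frame audio and text embeddings so that the generated motion stays rhythmically and semantically aligned with the speech.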

Papers