Text to Music Diffusion Model

Text-to-music diffusion models generate realistic music from textual descriptions, using iterative denoising to synthesize high-quality audio. Current research focuses on improving controllability through techniques such as fine-tuning with audio prompts, subtractive training for stem insertion, and inference-time optimization, often built on attention-based adapters or cascaded diffusion architectures. These advances enable finer control over musical elements, including genre, timbre, rhythm, and the addition or modification of individual instrument parts, with applications ranging from music composition assistance to personalized music generation.
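The core mechanism shared by these models is a denoising loop that starts from noise and is steered toward the text prompt, typically via classifier-free guidance. The sketch below is a minimal, illustrative version of that loop: `dummy_denoiser` is a hypothetical stand-in for a trained text-conditioned noise-prediction network, and the latent dimension, schedule, and guidance scale are assumptions chosen for brevity, not taken from any specific model.

```python
import numpy as np

def dummy_denoiser(x, t, text_emb):
    # Hypothetical noise predictor; a real model would be a neural
    # network conditioned on the timestep t and a text embedding.
    if text_emb is None:                      # unconditional branch
        return 0.1 * x
    return 0.1 * x + 0.01 * text_emb          # conditional branch

def sample(denoiser, text_emb, steps=50, guidance=3.0, dim=16, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)              # start from pure noise
    betas = np.linspace(1e-4, 0.02, steps)    # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    for t in reversed(range(steps)):
        # Classifier-free guidance: blend conditional and unconditional
        # noise predictions to strengthen adherence to the text prompt.
        eps_cond = denoiser(x, t, text_emb)
        eps_uncond = denoiser(x, t, None)
        eps = eps_uncond + guidance * (eps_cond - eps_uncond)
        # DDPM-style posterior mean update
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                             # add noise except at the last step
            x += np.sqrt(betas[t]) * rng.standard_normal(dim)
    return x

latent = sample(dummy_denoiser, text_emb=np.ones(16))
print(latent.shape)  # (16,)
```

In practice the output would be a latent audio representation decoded to a waveform by a separate vocoder or autoencoder; the guidance scale trades prompt fidelity against diversity.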

Papers