Masked Diffusion Transformer

Masked Diffusion Transformers (MDTs) are a class of generative models that combine the strengths of diffusion models and transformer architectures to synthesize high-quality data across diverse modalities, including images, music, and co-speech gestures. During training, a fraction of the latent tokens is masked, forcing the model to reconstruct them from the remaining context and thereby learn stronger contextual relationships within the data. Current research focuses on improving training efficiency through techniques such as data pruning and reweighting, and on enhancing model architectures to better capture these contextual relationships, particularly in time-series and image generation. The resulting gains in speed, data efficiency, and generation quality have significant implications for applications ranging from efficient image synthesis and music generation to anomaly detection in time-series data.
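The masking step at the heart of this training scheme can be sketched as follows. This is a minimal, generic illustration, not code from any particular MDT implementation: the function name `mask_tokens`, the 30% mask ratio, and the token dimensions are all illustrative choices.

```python
import numpy as np

def mask_tokens(tokens: np.ndarray, mask_ratio: float, rng: np.random.Generator):
    """Randomly hide a fraction of latent tokens.

    Returns the visible tokens and a boolean mask (True = hidden),
    so a transformer denoiser can be trained to reconstruct the
    hidden tokens from the visible context.
    """
    n = tokens.shape[0]
    n_mask = int(round(n * mask_ratio))
    perm = rng.permutation(n)          # random subset of token positions
    mask = np.zeros(n, dtype=bool)
    mask[perm[:n_mask]] = True
    visible = tokens[~mask]            # only unmasked tokens reach the encoder
    return visible, mask

# Illustrative usage: 256 latent tokens of dimension 16, 30% masked.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(256, 16))
visible, mask = mask_tokens(tokens, 0.3, rng)
```

In an actual training loop, the model would predict the masked tokens (alongside the usual diffusion denoising objective), which is what encourages the network to exploit contextual structure rather than treating tokens independently.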

Papers