Diffusion Transformer
Diffusion Transformers (DiTs) are a class of generative models that use a transformer as the denoising backbone of a diffusion model, aiming for efficient, high-quality generation across data modalities including images, audio, and video. Current research focuses on optimizing DiT architectures for speed and efficiency through techniques such as dynamic computation, token caching, and quantization, and on applying them to diverse tasks such as image super-resolution, text-to-speech synthesis, and medical image segmentation. The improved efficiency and scalability of DiTs, together with their ability to model complex data dependencies, are making them influential in generative modeling across many scientific fields and practical applications.
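One of the acceleration techniques mentioned above, feature (token) caching, exploits the fact that intermediate features change slowly across adjacent denoising steps, so an expensive block's output can be reused for several steps instead of recomputed. The sketch below is purely illustrative and is not taken from any of the papers listed here: `denoiser_block` is a hypothetical stand-in for one expensive DiT block, and the caching policy (recompute every `refresh_every` steps) is a deliberately simple assumption.

```python
import math

def denoiser_block(x, t):
    # Hypothetical stand-in for one expensive DiT transformer block.
    # A real block would run attention + MLP over a token sequence.
    return [math.tanh(v + 0.1 * t) for v in x]

def sample_with_cache(x, steps, refresh_every=4):
    """Toy denoising loop with feature caching: the block is recomputed
    only every `refresh_every` steps; in between, the cached output is
    reused, trading a small approximation error for fewer evaluations."""
    cached = None
    evaluations = 0
    for t in range(steps, 0, -1):
        if cached is None or t % refresh_every == 0:
            cached = denoiser_block(x, t)  # expensive path
            evaluations += 1
        # Cheap update step blending the (possibly stale) cached feature.
        x = [0.9 * xv + 0.1 * cv for xv, cv in zip(x, cached)]
    return x, evaluations

sample, n_evals = sample_with_cache([0.0, 1.0], steps=16, refresh_every=4)
# With 16 steps and refresh_every=4, the block runs only 4 times.
```

Methods such as HarmoniCa (listed below) go further by learning, at training time, when cached features remain accurate enough to reuse, rather than using a fixed schedule like this one.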
Papers
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, Saining Xie
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, Xie Chen
Presto! Distilling Steps and Layers for Accelerating Music Generation
Zachary Novack, Ge Zhu, Jonah Casebeer, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan
Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer
Siyuan Hou, Shansong Liu, Ruibin Yuan, Wei Xue, Ying Shan, Mangsuo Zhao, Chao Zhang
EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing
Haotian Sun, Bowen Zhang, Yanghao Li, Haoshuo Huang, Tao Lei, Ruoming Pang, Bo Dai, Nan Du
HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration
Yushi Huang, Zining Wang, Ruihao Gong, Jing Liu, Xinjie Zhang, Jinyang Guo, Xianglong Liu, Jun Zhang