Diffusion Transformer
Diffusion Transformers (DiTs) are generative models that use a transformer as the denoising backbone of a diffusion model, aiming for efficient, high-quality generation of images, audio, and video. Current research focuses on making DiT architectures faster and cheaper through techniques such as dynamic computation, token caching, and quantization, and on applying them to tasks such as image super-resolution, text-to-speech synthesis, and medical image segmentation. Their efficiency, scalability, and ability to model complex data dependencies have made DiTs increasingly influential in generative modeling, in both scientific research and practical applications.
Papers
World-consistent Video Diffusion with Explicit 3D Modeling
Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista, Kevin Miao, Alexander Toshev, Joshua Susskind, Jiatao Gu
CPA: Camera-pose-awareness Diffusion Transformer for Video Generation
Yuelei Wang, Jian Zhang, Pengtao Jiang, Hao Zhang, Jinwei Chen, Bo Li
TinyFusion: Diffusion Transformers Learned Shallow
Gongfan Fang, Kunjun Li, Xinyin Ma, Xinchao Wang
Accelerating Vision Diffusion Transformers with Skip Branches
Guanjie Chen, Xinyu Zhao, Yucheng Zhou, Tianlong Chen, Cheng Yu
On Statistical Rates of Conditional Diffusion Transformers: Approximation, Estimation and Minimax Optimality
Jerry Yao-Chieh Hu, Weimin Wu, Yi-Chen Lee, Yu-Chao Huang, Minshuo Chen, Han Liu
Identity-Preserving Text-to-Video Generation by Frequency Decomposition
Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan
TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On
Zhenchen Wan, Yanwu Xu, Zhaoqing Wang, Feng Liu, Tongliang Liu, Mingming Gong
OminiControl: Minimal and Universal Control for Diffusion Transformer
Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, Xinchao Wang
HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads
Yu Xu, Fan Tang, Juan Cao, Yuxin Zhang, Xiaoyu Kong, Jintao Li, Oliver Deussen, Tong-Yee Lee
Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction
Huiwon Jang, Sihyun Yu, Jinwoo Shin, Pieter Abbeel, Younggyo Seo