Cross-Modal Generation
Cross-modal generation focuses on creating data in one modality (e.g., audio, images, haptic feedback) from input in another, aiming to bridge the gap between different sensory experiences and to improve multimodal understanding. Current research relies heavily on diffusion models, often enhanced with techniques such as attention mechanisms and normalizing flows, to perform this cross-modal translation, with applications ranging from image-to-music generation to synthesizing tactile sensations from visual data. The field matters for creative applications, robotics, and medical imaging, where generating one modality from another enables richer data analysis and more natural interaction with the environment.
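To make the conditional-diffusion recipe above concrete, the following is a minimal, hedged sketch of a denoiser for a target modality (e.g., audio latents) conditioned on source-modality features (e.g., image embeddings) via cross-attention, trained with a simplified DDPM noise-prediction loss. All names (CrossModalDenoiser, ddpm_loss), tensor shapes, and hyperparameters are illustrative assumptions, not taken from any of the papers listed below.

# Toy cross-modal conditional diffusion sketch (PyTorch).
# Assumptions: target latents of shape (B, T, 64), source features of
# shape (B, S, 128), a standard DDPM noise schedule.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalDenoiser(nn.Module):
    """Predicts the noise added to a target-modality latent, conditioned on
    source-modality tokens through cross-attention."""
    def __init__(self, target_dim=64, cond_dim=128, hidden=256, heads=4):
        super().__init__()
        self.in_proj = nn.Linear(target_dim, hidden)
        self.cond_proj = nn.Linear(cond_dim, hidden)
        self.time_emb = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(),
                                      nn.Linear(hidden, hidden))
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.out_proj = nn.Sequential(nn.SiLU(), nn.Linear(hidden, target_dim))

    def forward(self, x_t, t, cond_tokens):
        # x_t: (B, T, target_dim) noisy target latents
        # t:   (B,) normalized diffusion timesteps
        # cond_tokens: (B, S, cond_dim) source-modality features
        h = self.in_proj(x_t) + self.time_emb(t[:, None, None].float())
        ctx = self.cond_proj(cond_tokens)
        attn_out, _ = self.cross_attn(query=h, key=ctx, value=ctx)
        return self.out_proj(h + attn_out)  # predicted noise

def ddpm_loss(model, x0, cond_tokens, alphas_cumprod):
    """One simplified DDPM training step: noise the clean latents at a random
    timestep and regress the model output onto the injected noise."""
    B = x0.shape[0]
    t_idx = torch.randint(0, len(alphas_cumprod), (B,))
    a_bar = alphas_cumprod[t_idx].view(B, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    t_norm = t_idx.float() / len(alphas_cumprod)
    return F.mse_loss(model(x_t, t_norm, cond_tokens), noise)

if __name__ == "__main__":
    torch.manual_seed(0)
    model = CrossModalDenoiser()
    x0 = torch.randn(8, 32, 64)      # toy target-modality latents
    cond = torch.randn(8, 16, 128)   # toy source-modality features
    betas = torch.linspace(1e-4, 0.02, 1000)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    loss = ddpm_loss(model, x0, cond, alphas_cumprod)
    loss.backward()
    print(f"toy cross-modal diffusion loss: {loss.item():.4f}")

In practice the denoiser would be a U-Net or transformer over modality-specific latents (often from a pretrained autoencoder), but the conditioning pattern, injecting source-modality tokens through cross-attention at each denoising step, is the common thread across the image-to-music and visual-to-tactile settings mentioned above.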
Papers
DiffX: Guide Your Layout to Cross-Modal Generative Modeling
Zeyu Wang, Jingyu Lin, Yifei Qian, Yi Huang, Shicen Tian, Bosong Chai, Juncan Deng, Qu Yang, Lan Du, Cunjian Chen, Kejie Huang
Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models
Xiao Liu, Xiaoliu Guan, Yu Wu, Jiaxu Miao