Multimodal Generation
Multimodal generation focuses on AI systems that can generate outputs across multiple data types (e.g., text, images, video, audio) in a coherent and contextually relevant manner. Current research emphasizes unified model architectures, such as transformers and diffusion models, often incorporating techniques like contrastive learning and cross-modal refinement to improve alignment across modalities and generation quality. The field matters because it enables more realistic and versatile AI systems, with applications ranging from data augmentation and synthetic data generation to personalized content creation and richer human-computer interaction.
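Of the techniques mentioned above, contrastive learning is commonly used to align representations across modalities. The sketch below shows a generic CLIP-style symmetric contrastive loss in PyTorch; it is an illustrative example only, and the function name, temperature value, and toy random embeddings are assumptions, not details drawn from the papers listed below.

```python
# Minimal sketch of a CLIP-style contrastive alignment loss, a common
# approach for aligning embeddings from two modalities (e.g., image and text).
# Names and values here are illustrative assumptions, not taken from the papers.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix scaled by temperature: shape (batch, batch).
    logits = image_emb @ text_emb.t() / temperature

    # Matching image/text pairs sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    # Toy usage with random embeddings standing in for encoder outputs.
    batch, dim = 8, 256
    img = torch.randn(batch, dim)
    txt = torch.randn(batch, dim)
    print(contrastive_alignment_loss(img, txt).item())
```

In practice the two embedding tensors would come from modality-specific encoders trained jointly, and the learned alignment is what lets a downstream generator condition one modality on another.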
Papers
Characterizing and Efficiently Accelerating Multimodal Generation Model Inference
Yejin Lee, Anna Sun, Basil Hosmer, Bilge Acun, Can Balioglu, Changhan Wang, Charles David Hernandez, Christian Puhrsch, Daniel Haziza, Driss Guessous, Francisco Massa, Jacob Kahn, Jeffrey Wan, Jeremy Reizenstein, Jiaqi Zhai, Joe Isaacson, Joel Schlosser, Juan Pino, Kaushik Ram Sadagopan, Leonid Shamis, Linjian Ma, Min-Jae Hwang, Mingda Chen, Mostafa Elhoushi, Pedro Rodriguez, Ram Pasunuru, Scott Yih, Sravya Popuri, Xing Liu, Carole-Jean Wu
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering
Jiacong Wang, Bohong Wu, Haiyong Jiang, Xun Zhou, Xin Xiao, Haoyuan Guo, Jun Xiao