Multimodal Training
Multimodal training aims to improve machine learning models by training them on data spanning multiple modalities, such as text, images, audio, and video, so that they build a more comprehensive understanding of information. Current research focuses on efficient training frameworks for large language and multimodal models, architectures such as transformers and encoder-decoder networks, and strategies for data fusion and modality alignment. By leveraging the complementary information carried by different data types, this approach promises to improve the robustness and performance of AI systems across diverse applications, including machine translation, image captioning, and medical diagnosis.
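As a concrete illustration of the data-fusion idea mentioned above, the following PyTorch sketch shows a minimal late-fusion classifier that projects text and image features into a shared space before combining them. The dimensions, encoder choices, and class names are illustrative assumptions, not drawn from any of the papers listed below.

import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Projects per-modality features into a shared space, then fuses by concatenation."""

    def __init__(self, text_dim=768, image_dim=512, hidden_dim=256, num_classes=10):
        super().__init__()
        # Modality-specific projection heads align the feature dimensions.
        self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        self.image_proj = nn.Sequential(nn.Linear(image_dim, hidden_dim), nn.ReLU())
        # Fusion head operates on the concatenated, aligned representations.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, text_feats, image_feats):
        t = self.text_proj(text_feats)     # (batch, hidden_dim)
        v = self.image_proj(image_feats)   # (batch, hidden_dim)
        fused = torch.cat([t, v], dim=-1)  # simple late fusion by concatenation
        return self.classifier(fused)

# Toy usage with random features standing in for pooled encoder outputs.
model = LateFusionClassifier()
text_feats = torch.randn(4, 768)   # e.g., pooled text-encoder outputs
image_feats = torch.randn(4, 512)  # e.g., pooled image-encoder outputs
logits = model(text_feats, image_feats)
print(logits.shape)  # torch.Size([4, 10])

This is only one point in the design space: other work fuses earlier (e.g., cross-attention between token sequences) or aligns modalities with contrastive objectives rather than a shared classifier head.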
Papers
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji
Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models
Zijun Chen, Wenbo Hu, Guande He, Zhijie Deng, Zheng Zhang, Richang Hong
Revealing Vision-Language Integration in the Brain with Multimodal Networks
Vighnesh Subramaniam, Colin Conwell, Christopher Wang, Gabriel Kreiman, Boris Katz, Ignacio Cases, Andrei Barbu
PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents
Junjie Wang, Yin Zhang, Yatai Ji, Yuxiang Zhang, Chunyang Jiang, Yubo Wang, Kang Zhu, Zekun Wang, Tiezhen Wang, Wenhao Huang, Jie Fu, Bei Chen, Qunshu Lin, Minghao Liu, Ge Zhang, Wenhu Chen