Multimodal Training
Multimodal training aims to improve machine learning models by training them on data encompassing multiple modalities, such as text, images, audio, and video, to achieve a more comprehensive understanding of information. Current research focuses on developing efficient training frameworks for large language and multimodal models, exploring various architectures like transformers and encoder-decoder networks, and investigating optimal strategies for data fusion and modality alignment. This approach holds significant promise for enhancing the robustness and performance of AI systems across diverse applications, including machine translation, image captioning, and medical diagnosis, by leveraging the complementary information provided by different data types.
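To make the idea of modality alignment concrete, below is a minimal sketch of CLIP-style contrastive image-text alignment, the kind of objective used by models such as MobileCLIP. The encoder classes (ToyImageEncoder, ToyTextEncoder), the function contrastive_alignment_loss, and the temperature value are illustrative assumptions, not code from any of the listed papers; real systems use much larger backbones and web-scale paired data.

```python
# A minimal sketch of contrastive image-text alignment (CLIP-style).
# All names and sizes here are toy assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyImageEncoder(nn.Module):
    """Hypothetical stand-in for an image backbone; maps images to embeddings."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, embed_dim),
        )

    def forward(self, x):
        return self.net(x)


class ToyTextEncoder(nn.Module):
    """Hypothetical stand-in for a text backbone; mean-pools token embeddings."""
    def __init__(self, vocab_size=1000, embed_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, token_ids):
        return self.proj(self.embed(token_ids).mean(dim=1))


def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matched image-text pairs attract, mismatched pairs repel."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))            # diagonal entries are the positives
    loss_i2t = F.cross_entropy(logits, targets)       # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text-to-image direction
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    images = torch.randn(8, 3, 32, 32)                # dummy batch of images
    captions = torch.randint(0, 1000, (8, 12))        # dummy batch of token ids
    img_enc, txt_enc = ToyImageEncoder(), ToyTextEncoder()
    loss = contrastive_alignment_loss(img_enc(images), txt_enc(captions))
    print(f"alignment loss: {loss.item():.4f}")
```

Training both encoders to minimize this loss pulls paired image and text embeddings together in a shared space, which is one common way the "modality alignment" mentioned above is realized in practice.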
Papers
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao