Multimodal Foundation Model
Multimodal foundation models integrate multiple data modalities, such as text, images, and audio, into AI systems capable of complex tasks. Current research emphasizes improving model performance through better data preprocessing (e.g., caption generation for image-text alignment), reuse of existing models for new data types (e.g., representing time series as plots), and mitigation of limitations such as hallucination and bias. These advances matter because they enable more robust and versatile AI applications across diverse fields, including healthcare, autonomous driving, and scientific discovery, while also addressing resource efficiency and ethical concerns.
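To make the "time series as plots" idea concrete, the sketch below renders a numeric series to a PNG image that could then be attached, alongside a text prompt, to a vision-capable multimodal model. This is a minimal illustration of the general technique, not the papers' own pipeline; the library choices (matplotlib, NumPy), figure settings, and file name are assumptions for the example.

```python
# Minimal sketch: render a 1-D time series as a plot image so a
# vision-language model can consume it as an image input.
# Library choices and parameters here are illustrative assumptions.
import io

import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt
import numpy as np


def series_to_plot_png(values: np.ndarray, title: str = "") -> bytes:
    """Render a 1-D time series as PNG bytes."""
    fig, ax = plt.subplots(figsize=(6, 3), dpi=100)
    ax.plot(np.arange(len(values)), values, linewidth=1.5)
    ax.set_title(title)
    ax.set_xlabel("time step")
    ax.set_ylabel("value")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    return buf.getvalue()


if __name__ == "__main__":
    # Synthetic example: a noisy sine wave stands in for sensor readings.
    t = np.linspace(0, 4 * np.pi, 200)
    series = np.sin(t) + 0.1 * np.random.default_rng(0).standard_normal(t.shape)
    png_bytes = series_to_plot_png(series, title="synthetic sensor reading")
    # The resulting image could be passed to a multimodal model together
    # with a question about the series (trend, anomalies, periodicity, ...).
    with open("series_plot.png", "wb") as f:
        f.write(png_bytes)
```

The point of this approach is that it reuses an existing vision-language model's image pathway instead of training a dedicated time-series encoder.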
Papers
Fortify Your Foundations: Practical Privacy and Security for Foundation Model Deployments In The Cloud
Marcin Chrapek, Anjo Vahldiek-Oberwagner, Marcin Spoczynski, Scott Constable, Mona Vij, Torsten Hoefler
ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition
Mohammadreza Salehi, Jae Sung Park, Tanush Yadav, Aditya Kusupati, Ranjay Krishna, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
Zhengfeng Lai, Vasileios Saveris, Chen Chen, Hong-You Chen, Haotian Zhang, Bowen Zhang, Juan Lao Tebar, Wenze Hu, Zhe Gan, Peter Grasch, Meng Cao, Yinfei Yang
Plots Unlock Time-Series Understanding in Multimodal Models
Mayank Daswani, Mathias M.J. Bellaiche, Marc Wilson, Desislav Ivanov, Mikhail Papkov, Eva Schnider, Jing Tang, Kay Lamerigts, Gabriela Botea, Michael A. Sanchez, Yojan Patel, Shruthi Prabhakara, Shravya Shetty, Umesh Telang