Multimodal Foundation Model
Multimodal foundation models integrate multiple data modalities, such as text, images, and audio, into a single system that can understand and reason across them. Current research emphasizes improving performance through optimized data preprocessing (e.g., generating captions to strengthen image-text alignment), reusing existing models for new data types (e.g., rendering time series data as plots so that vision-language models can interpret them, as sketched below), and mitigating limitations such as hallucination and bias. These advances are significant because they enable more robust and versatile AI applications across diverse fields, including healthcare, autonomous driving, and scientific discovery, while also addressing resource efficiency and ethical concerns.
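To make the "time series as plots" idea above concrete, here is a minimal sketch of how a numeric series could be rendered into an image that any off-the-shelf vision-language model can then consume. The helper name `series_to_image` and the use of matplotlib and Pillow are illustrative assumptions, not taken from any of the papers listed below; the actual model call is left open.

```python
# Minimal sketch: render a 1-D time series as a plot image so it can be fed
# to an image-text model. Assumes matplotlib, numpy, and Pillow are installed.
import io

import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image


def series_to_image(values: np.ndarray, size_px: int = 336) -> Image.Image:
    """Render a 1-D time series as a line plot and return it as a PIL image."""
    fig, ax = plt.subplots(figsize=(size_px / 100, size_px / 100), dpi=100)
    ax.plot(np.arange(len(values)), values, linewidth=1.5)
    ax.set_xlabel("time step")
    ax.set_ylabel("value")
    fig.tight_layout()

    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    buf.seek(0)
    return Image.open(buf).convert("RGB")


if __name__ == "__main__":
    # Synthetic example: a noisy sine wave.
    t = np.linspace(0, 4 * np.pi, 200)
    series = np.sin(t) + 0.1 * np.random.randn(200)

    image = series_to_image(series)
    # `image` can now be paired with a text prompt (e.g., "Describe the trend
    # in this chart") and passed to a vision-language model of your choice;
    # that step is deliberately omitted here.
    image.save("series_plot.png")
```

The point of the sketch is only the modality conversion: once the series is an image, the existing image-text pipeline of a multimodal foundation model can be reused without retraining on raw numeric data.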
Papers
ViT-Lens: Towards Omni-modal Representations
Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen
Source-Free Domain Adaptation with Frozen Multimodal Foundation Model
Song Tang, Wenxin Su, Mao Ye, Xiatian Zhu