Multimodal Model
Multimodal models integrate information from multiple sources like text, images, audio, and video to achieve a more comprehensive understanding than unimodal approaches. Current research focuses on improving model interpretability, addressing biases, enhancing robustness against adversarial attacks and missing data, and developing efficient architectures like transformers and state-space models for various tasks including image captioning, question answering, and sentiment analysis. These advancements are significant for applications ranging from healthcare and robotics to more general-purpose AI systems, driving progress in both fundamental understanding and practical deployment of AI.
Papers
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen
Source-Free Domain Adaptation with Frozen Multimodal Foundation Model
Song Tang, Wenxin Su, Mao Ye, Xiatian Zhu
MAIRA-1: A specialised large multimodal model for radiology report generation
Stephanie L. Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Mercy Ranjit, Anton Schwaighofer, Fernando Pérez-García, Valentina Salvatelli, Shaury Srivastav, Anja Thieme, Noel Codella, Matthew P. Lungren, Maria Teodora Wetscherek, Ozan Oktay, Javier Alvarez-Valle
Multimodal Large Language Models: A Survey
Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, Philip S. Yu
Non-Intrusive Adaptation: Input-Centric Parameter-efficient Fine-Tuning for Versatile Multimodal Modeling
Yaqing Wang, Jialin Wu, Tanmaya Dabral, Jiageng Zhang, Geoff Brown, Chun-Ta Lu, Frederick Liu, Yi Liang, Bo Pang, Michael Bendersky, Radu Soricut
BanglaAbuseMeme: A Dataset for Bengali Abusive Meme Classification
Mithun Das, Animesh Mukherjee
MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks
Xiaocui Yang, Wenfang Wu, Shi Feng, Ming Wang, Daling Wang, Yang Li, Qi Sun, Yifei Zhang, Xiaoming Fu, Soujanya Poria
EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs
Xiangyu Zhao, Bo Liu, Qijiong Liu, Guangyuan Shi, Xiao-Ming Wu