Multimodal Task
Multimodal tasks involve integrating information from multiple sources, such as text, images, and audio, to perform complex reasoning and generation. Current research focuses on developing and evaluating multimodal large language models (MLLMs) using techniques such as next-token prediction, prompt tuning, and mixture-of-experts architectures to improve efficiency and performance across diverse tasks, including visual question answering and image captioning. These advances matter for fields that depend on interpreting and generating multimodal data, such as healthcare and insurance. Mitigating hallucination and improving the explainability of these models remain key open challenges.
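To make the mixture-of-experts idea mentioned above concrete, the following is a minimal, illustrative sketch of a top-k routed MoE feed-forward layer over a sequence of (text or visual) token embeddings. It is not the implementation from CuMo or any paper listed below; the class name, dimensions, and routing details are assumptions chosen for brevity.

```python
# Illustrative top-k routed mixture-of-experts layer (not from any listed paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Each token is routed to its top-k experts; outputs are gate-weighted sums."""
    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) token embeddings (text or visual tokens).
        gate_logits = self.router(x)                          # (batch, seq, num_experts)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                  # normalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., slot] == e                # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

if __name__ == "__main__":
    layer = MoELayer(dim=64)
    tokens = torch.randn(2, 16, 64)   # dummy multimodal token sequence
    print(layer(tokens).shape)        # torch.Size([2, 16, 64])
```

Only the selected experts run per token, which is why MoE layers can scale model capacity without a proportional increase in per-token compute; production systems typically add load-balancing losses and fused expert kernels that this sketch omits.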
Papers
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen
Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference
Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao
What Makes Multimodal In-Context Learning Work?
Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, Benjamin Piwowarski