Multimodal LLM
Multimodal Large Language Models (MLLMs) aim to integrate diverse data modalities, such as text, images, and video, into a unified framework for enhanced understanding and generation. Current research emphasizes efficient fusion of visual and textual information, often employing techniques like early fusion mechanisms and specialized adapters within transformer-based architectures, as well as exploring the use of Mixture-of-Experts (MoE) models. This field is significant due to its potential to improve various applications, including image captioning, visual question answering, and more complex tasks requiring cross-modal reasoning, while also addressing challenges like hallucinations and bias.
Papers
Probing Multimodal LLMs as World Models for Driving
Shiva Sreeram, Tsun-Hsuan Wang, Alaa Maalouf, Guy Rosman, Sertac Karaman, Daniela Rus
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
Kunpeng Song, Yizhe Zhu, Bingchen Liu, Qing Yan, Ahmed Elgammal, Xiao Yang