Multimodal Language Model
Multimodal language models (MLLMs) aim to integrate and process information from multiple modalities, such as text, images, and video, to achieve a more comprehensive understanding of the world. Current research focuses on improving MLLM performance through techniques like fine-grained reward models, knowledge distillation to create smaller, more efficient models, and data augmentation strategies to address data scarcity and biases. These advancements are significant because they enhance the reliability and applicability of MLLMs across diverse fields, including medical diagnosis, video summarization, and autonomous driving, by enabling more accurate and nuanced interpretations of complex multimodal data.
Papers
Multimodal Latent Language Modeling with Next-Token Diffusion
Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei
Template Matters: Understanding the Role of Instruction Templates in Multimodal Language Model Evaluation and Training
Shijian Wang, Linxin Song, Jieyu Zhang, Ryotaro Shimizu, Ao Luo, Li Yao, Cunjian Chen, Julian McAuley, Hanqian Wu
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles
OpenMU: Your Swiss Army Knife for Music Understanding
Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, Yuki Mitsufuji