Multi-Modal Large Language Model
Multi-modal large language models (MLLMs) integrate visual and textual information in a single model to perform tasks that require reasoning over both, with the aim of narrowing the gap between human-like understanding and machine intelligence. Current research focuses on improving the consistency and fairness of MLLMs, exploring efficient fusion mechanisms (such as early fusion and Mixture-of-Experts architectures), and developing benchmarks that evaluate performance across diverse tasks, including medical image analysis and autonomous driving. The field is evolving rapidly, with potential applications ranging from healthcare diagnostics to robotics, where robust and reliable AI systems are needed to handle real-world complexity.
Papers
TG-LMM: Enhancing Medical Image Segmentation Accuracy through Text-Guided Large Multi-Modal Model
Yihao Zhao, Enhao Zhong, Cuiyun Yuan, Yang Li, Man Zhao, Chunxia Li, Jun Hu, Chenbin Liu
OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving
Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, Wenchao Ding
Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models
Kengo Nakata, Daisuke Miyashita, Youyang Ng, Yasuto Hoshi, Jun Deguchi
M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation
Jonggwon Park, Soobum Kim, Byungmu Yoon, Jihun Hyun, Kyoyun Choi
Revisiting Multi-Modal LLM Evaluation
Jian Lu, Shikhar Srivastava, Junyu Chen, Robik Shrestha, Manoj Acharya, Kushal Kafle, Christopher Kanan
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model
Zhaowei Li, Wei Wang, YiQing Cai, Xu Qi, Pengyu Wang, Dong Zhang, Hang Song, Botian Jiang, Zhida Huang, Tao Wang
Infusing Environmental Captions for Long-Form Video Language Grounding
Hyogun Lee, Soyeon Hong, Mujeen Sung, Jinwoo Choi