Multimodal Large Language Model
Multimodal large language models (MLLMs) integrate multiple data modalities, such as text, images, and audio, to enhance understanding and reasoning capabilities beyond those of unimodal models. Current research emphasizes improving MLLM performance through refined architectures (e.g., incorporating visual grounding, chain-of-thought prompting), mitigating biases and hallucinations, and developing robust evaluation benchmarks that assess various aspects of multimodal understanding, including active perception and complex reasoning tasks. This work is significant because it pushes the boundaries of AI capabilities, leading to advancements in diverse applications like medical diagnosis, financial analysis, and robotic manipulation.
Papers
FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models
Zhikai Zhang, Yitang Li, Haofeng Huang, Mingxian Lin, Li Yi
Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly
Yexin Liu, Zhengyang Liang, Yueze Wang, Xianfeng Wu, Feilong Tang, Muyang He, Jian Li, Zheng Liu, Harry Yang, Sernam Lim, Bo Zhao
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation
Wei Chen, Lin Li, Yongqi Yang, Bin Wen, Fan Yang, Tingting Gao, Yu Wu, Long Chen
First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models
Enming Zhang, Ruobing Yao, Huanyong Liu, Junhui Yu, Jiale Wang
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding
Yiqi Wu, Xiaodan Hu, Ziming Fu, Siling Zhou, Jiangong Li
Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection
Shruti Palaskar, Oggi Rudovic, Sameer Dharur, Florian Pesce, Gautam Krishna, Aswin Sivaraman, Jack Berkowitz, Ahmed Hussen Abdelaziz, Saurabh Adya, Ahmed Tewfik
Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
Frank Seide, Morrie Doulaty, Yangyang Shi, Yashesh Gaur, Junteng Jia, Chunyang Wu
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs
Zijia Zhao, Haoyu Lu, Yuqi Huo, Yifan Du, Tongtian Yue, Longteng Guo, Bingning Wang, Weipeng Chen, Jing Liu
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Matthieu Futeral, Armel Zebaze, Pedro Ortiz Suarez, Julien Abadji, Rémi Lacroix, Cordelia Schmid, Rachel Bawden, Benoît Sagot
LLM-assisted Concept Discovery: Automatically Identifying and Explaining Neuron Functions
Nhat Hoang-Xuan, Minh Vu, My T. Thai
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Ping Luo, Yu Qiao, Jifeng Dai
LVBench: An Extreme Long Video Understanding Benchmark
Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang
Grounding Multimodal Large Language Models in Actions
Andrew Szot, Bogdan Mazoure, Harsh Agrawal, Devon Hjelm, Zsolt Kira, Alexander Toshev
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models
Tianle Gu, Zeyang Zhou, Kexin Huang, Dandan Liang, Yixu Wang, Haiquan Zhao, Yuanqi Yao, Xingge Qiao, Keqing Wang, Yujiu Yang, Yan Teng, Yu Qiao, Yingchun Wang
Needle In A Multimodal Haystack
Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, Wenhai Wang
MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models
Yichi Zhang, Yao Huang, Yitong Sun, Chang Liu, Zhe Zhao, Zhengwei Fang, Yifan Wang, Huanran Chen, Xiao Yang, Xingxing Wei, Hang Su, Yinpeng Dong, Jun Zhu
Large Language Model-empowered multimodal strain sensory system for shape recognition, monitoring, and human interaction of tensegrity
Zebing Mao, Ryota Kobayashi, Hiroyuki Nabae, Koichi Suzumori
Eyeballing Combinatorial Problems: A Case Study of Using Multimodal Large Language Models to Solve Traveling Salesman Problems
Mohammed Elhenawy, Ahmed Abdelhay, Taqwa I. Alhadidi, Huthaifa I Ashqar, Shadi Jaradat, Ahmed Jaber, Sebastien Glaser, Andry Rakotonirainy