Multi-Modal Large Language Models
Multi-modal large language models (MLLMs) integrate visual and textual information to perform complex tasks, narrowing the gap between machine perception and human-like understanding. Current research emphasizes improving the consistency and fairness of MLLMs, exploring efficient fusion mechanisms (such as early fusion and Mixture-of-Experts architectures), and developing benchmarks to evaluate performance across diverse tasks, including medical image analysis and autonomous driving. This rapidly evolving field holds significant potential for applications ranging from healthcare diagnostics to robotics, by enabling more robust and reliable AI systems capable of handling real-world complexity.
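To make the early-fusion idea above concrete, here is a minimal NumPy sketch: image patch features from a vision encoder are projected into the language model's embedding space and prepended to the text token embeddings, so the transformer attends over one fused sequence. All dimensions, names, and the random projection are illustrative assumptions, not any specific model's design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
d_vision, d_model = 512, 768   # vision feature dim, LLM embedding dim
n_patches, n_tokens = 16, 8    # image patches, text tokens

# In a real MLLM this projection is learned; here it is a random stand-in.
W_proj = rng.standard_normal((d_vision, d_model)) * 0.02

patch_feats = rng.standard_normal((n_patches, d_vision))  # from a vision encoder
text_embeds = rng.standard_normal((n_tokens, d_model))    # from the LLM embedding table

# Early fusion: map patches to "visual tokens" and concatenate them
# with the text embeddings before any transformer layer runs.
visual_tokens = patch_feats @ W_proj
fused_sequence = np.concatenate([visual_tokens, text_embeds], axis=0)

print(fused_sequence.shape)  # (24, 768): 16 visual + 8 text tokens
```

The key design choice in early fusion is that visual and textual tokens share the same sequence from the first layer onward, in contrast to late-fusion approaches that combine modality-specific representations only at the output.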
Papers
The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
Zhen Qin, Daoyuan Chen, Wenhao Zhang, Liuyi Yao, Yilun Huang, Bolin Ding, Yaliang Li, Shuiguang Deng
Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding
Minghui Wu, Chenxu Zhao, Anyang Su, Donglin Di, Tianyu Fu, Da An, Min He, Ya Gao, Meng Ma, Kun Yan, Ping Wang
Multi-Modal Retrieval For Large Language Model Based Speech Recognition
Jari Kolehmainen, Aditya Gourav, Prashanth Gurunath Shivakumar, Yile Gu, Ankur Gandhe, Ariya Rastrow, Grant Strimel, Ivan Bulyko
MMRel: A Relation Understanding Benchmark in the MLLM Era
Jiahao Nie, Gongjie Zhang, Wenbin An, Yap-Peng Tan, Alex C. Kot, Shijian Lu
Robustness of Structured Data Extraction from In-plane Rotated Documents using Multi-Modal Large Language Models (LLM)
Anjanava Biswas, Wrick Talukdar