Multimodal Phenomenon
Multimodal research focuses on developing artificial intelligence systems that can effectively process and integrate information from multiple data sources (e.g., text, images, audio, video). Current efforts concentrate on improving the robustness and accuracy of multimodal large language models (MLLMs) through techniques like chain-of-thought prompting, contrastive learning, and multimodal masked autoencoders, often addressing challenges such as hallucination mitigation and efficient resource utilization on edge devices. The field is significant because it enables a more comprehensive and nuanced understanding of complex phenomena, with applications ranging from improved medical diagnosis and drug discovery to enhanced human-computer interaction and more effective educational tools. The development of robust benchmarks and open-source tools is also a key area of focus, as it facilitates collaborative research and development.
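Of the techniques mentioned above, contrastive learning is the most common way of aligning image and text representations. The sketch below shows a minimal CLIP-style symmetric InfoNCE objective in PyTorch; the function name, tensor shapes, and temperature value are illustrative assumptions, not drawn from any specific paper in this list.

# Minimal sketch of an image-text contrastive (InfoNCE) loss, assuming paired
# image/text embeddings where matching pairs share the same row index.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of (batch, dim) image/text embeddings."""
    # Normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) similarity matrix; the diagonal holds the positive pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

if __name__ == "__main__":
    # Random embeddings stand in for encoder outputs in this toy example.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(contrastive_loss(img, txt).item())

In practice the two embeddings come from separate image and text encoders trained jointly, and the in-batch negatives (the off-diagonal entries) are what push mismatched image-text pairs apart.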
Papers
M$^{3}$D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-level Information Extraction
Jiang Liu, Bobo Li, Xinran Yang, Na Yang, Hao Fei, Mingyao Zhang, Fei Li, Donghong Ji
MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models
Ming-Chang Chiu, Shicheng Wen, Pin-Yu Chen, Xuezhe Ma
VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding
Chaoyu Li, Eun Woo Im, Pooyan Fazli
Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis
Davide Bucciarelli, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Video-Guided Foley Sound Generation with Multimodal Controls
Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, Justin Salamon
A Topic-level Self-Correctional Approach to Mitigate Hallucinations in MLLMs
Lehan He, Zeren Chen, Zhelun Shi, Tianyu Yu, Jing Shao, Lu Sheng
Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach
Shijian Deng, Wentian Zhao, Yu-Jhe Li, Kun Wan, Daniel Miranda, Ajinkya Kale, Yapeng Tian
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
Chaoyou Fu, Yi-Fan Zhang, Shukang Yin, Bo Li, Xinyu Fang, Sirui Zhao, Haodong Duan, Xing Sun, Ziwei Liu, Liang Wang, Caifeng Shan, Ran He
Evaluating and Advancing Multimodal Large Language Models in Ability Lens
Feng Chen, Chenhui Gou, Jing Liu, Yang Yang, Zhaoyang Li, Jiyuan Zhang, Zhenbang Sun, Bohan Zhuang, Qi Wu
DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving
Xianda Guo, Ruijun Zhang, Yiqun Duan, Yuhang He, Chenming Zhang, Shuai Liu, Long Chen
MEGL: Multimodal Explanation-Guided Learning
Yifei Zhang, Tianxu Jiang, Bo Pan, Jingyu Wang, Guangji Bai, Liang Zhao