MLLM Training

Multimodal large language model (MLLM) training focuses on developing AI systems that can understand and generate content across multiple modalities, such as text, images, and video. Current research emphasizes improving MLLM efficiency through techniques such as knowledge distillation and model compression, and improving performance on specific tasks such as visual question answering and embodied agent control, often through instruction tuning and preference learning. The field matters because MLLMs could change applications ranging from healthcare diagnostics to robotics by enabling more human-like interaction with complex data.

Papers

October 8, 2024

$\textit{X}^2$-DFD: A framework for e${X}$plainable and e${X}$tendable Deepfake Detection
Yize Chen, Zhiyuan Yan, Siwei Lyu, Baoyuan Wu
New Framework Deepfake Detection Deep Fake MLLM Training

October 6, 2024

MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration
Lai Wei, Wenkai Wang, Xiaoyu Shen, Yu Xie, Zhihao Fan, Xiaojin Zhang, Zhongyu Wei, Wei Chen
Zero Shot Medical LLM Multimodal Large Language Model Chain of Thought MLLM Training

October 4, 2024

MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents
Junpeng Yue, Xinru Xu, Börje F. Karlsson, Zongqing Lu
Embodied Agent MLLM Training Multimodal Retrieval MLLM Attention Hybrid Retriever Multimodal Trajectory Prediction MLLM Security

September 26, 2024

Advancing Object Detection in Transportation with Multimodal Large Language Models (MLLMs): A Comprehensive Review and Empirical Testing
Huthaifa I. Ashqar, Ahmed Jaber, Taqwa I. Alhadidi, Mohammed Elhenawy
Multimodal Large Language Model Detection Task Large Vision Model Transportation System MLLM Training

September 21, 2024

A Survey on Multimodal Benchmarks: In the Era of Large AI Models
Lin Li, Guikun Chen, Hanrong Shi, Jun Xiao, Long Chen
Timely Survey Multimodal Large Language Model New Era Multimodal Benchmark Model Architecture MLLM Training Large AI Model Multimodal Content

September 17, 2024

Surveying the MLLM Landscape: A Meta-Review of Current Surveys
Ming Li, Keyu Chen, Ziqian Bi, Ming Liu, Benji Peng, Qian Niu, Junyu Liu, Jinlang Wang, Sen Zhang, Xuanhe Pan, Jiawei Xu, Pohsun Feng
Timely Survey Multimodal Large Language Model Multi Modality Unimodal Model MLLM Training MLLM Attention Meta Review

September 14, 2024

From Text to Multimodality: Exploring the Evolution and Impact of Large Language Models in Medical Practice
Qian Niu, Keyu Chen, Ming Li, Pohsun Feng, Ziqian Bi, Lawrence KQ Yan, Yichao Zhang, Caitlyn Heqi Yin, Cheng Fei, Junyu Liu, Benji Peng, Tianyang Wang, Yunze Wang, Silin Chen, Ming Liu
Global Impact Text Modality Multimodal Large Language Model Modality Alignment MLLM Training

September 6, 2024

Question-Answering Dense Video Events
Hangyu Qin, Junbin Xiao, Angela Yao
Multimodal Large Language Model MLLM Training Dense Video

September 2, 2024

Democratizing MLLMs in Healthcare: TinyLLaVA-Med for Efficient Healthcare Diagnostics in Resource-Constrained Settings
Aya El Mir, Lukelo Thadei Luoga, Boyuan Chen, Muhammad Abdullah Hanif, Muhammad Shafique
Healthcare System Multi Modal Large Language Model Resource Constrained MLLM Training Modal Large Language Model Low Computational

August 28, 2024

August 9, 2024

Revisiting Multi-Modal LLM Evaluation
Jian Lu, Shikhar Srivastava, Junyu Chen, Robik Shrestha, Manoj Acharya, Kushal Kafle, Christopher Kanan
Multi Modal Large Language Model 3d Vqa MLLM Training Expression Comprehension Multi Modal LLM

August 7, 2024

August 3, 2024

MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun
Multimodal Large Language Model GPT 4 Artificial Intelligence Research MLLM Training

July 1, 2024

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan
Global Evaluation Multimodal Large Language Model Multimodal LLM MT Bench MLLM Training Complex Instruction Diverse Instruction Instruction Quality

June 27, 2024

DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
Jiaxin Zhang, Wentao Yang, Songxuan Lai, Zecheng Xie, Lianwen Jin
Multimodal Large Language Model Large Multimodal Model Document Understanding Visual Token MLLM Training

June 18, 2024

RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding
Linrui Xu, Ling Zhao, Wang Guo, Qiujun Li, Kewang Long, Kaiqi Zou, Yuhan Wang, Haifeng Li
Data Set Foundation Model GPT 4 Remote Sensing Image Image Understanding MLLM Training

June 17, 2024

AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation
Chuyan Xiong, Chengyu Shen, Xiaoqi Li, Kaichen Zhou, Jiaming Liu, Ruiping Wang, Hao Dong
MLLM Training MLLM Attention Robust Manipulation Stable Manipulation

May 17, 2024

Efficient Multimodal Large Language Models: A Survey
Yizhang Jin, Jian Li, Yexin Liu, Tianjun Gu, Kai Wu, Zhengkai Jiang, Muyang He, Bo Zhao, Xin Tan, Zhenye Gan, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Timely Survey Multimodal Large Language Model MLLM Training

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation

A Survey on Evaluation of Multimodal Large Language Models

NatLan: Native Language Prompting Facilitates Knowledge Elicitation Through Language Trigger Provision and Domain Trigger Retention

Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
