MLLM Training
Multimodal large language model (MLLM) training focuses on developing AI systems that can understand and generate content across multiple modalities such as text, images, and video. Current research emphasizes improving MLLM efficiency through techniques such as knowledge distillation and model compression, and improving performance on tasks such as visual question answering and embodied agent control, often via instruction tuning and preference learning. The field matters because MLLMs could transform applications ranging from healthcare diagnostics to robotics by enabling more natural interaction with complex, multimodal data.
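To make the knowledge-distillation theme concrete, below is a minimal sketch of the standard temperature-scaled distillation loss commonly used to compress a large teacher model into a smaller student. The tensor shapes and the `temperature`/`alpha` values are illustrative assumptions, not details taken from any of the listed papers.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend a soft KL term against the teacher with the usual hard-label
    cross-entropy. Shapes: logits are (batch, num_classes), labels are (batch,)."""
    # Soft targets: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_term = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Hard targets: standard cross-entropy on the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term

# Toy usage with random tensors standing in for model outputs.
if __name__ == "__main__":
    batch, vocab = 4, 32
    student = torch.randn(batch, vocab, requires_grad=True)
    teacher = torch.randn(batch, vocab)
    labels = torch.randint(0, vocab, (batch,))
    loss = distillation_loss(student, teacher, labels)
    loss.backward()
    print(f"distillation loss: {loss.item():.4f}")
```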
Papers
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Fangxun Shu, Yue Liao, Le Zhuo, Chenning Xu, Lei Zhang, Guanghao Zhang, Haonan Shi, Long Chen, Tao Zhong, Wanggui He, Siming Fu, Haoyuan Li, Bolin Li, Zhelun Yu, Si Liu, Hongsheng Li, Hao Jiang
A Survey on Evaluation of Multimodal Large Language Models
Jiaxing Huang, Jingyi Zhang
NatLan: Native Language Prompting Facilitates Knowledge Elicitation Through Language Trigger Provision and Domain Trigger Retention
Baixuan Li, Yunlong Fan, Tianyi Ma, Zhiqiang Gao
Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
Weiqi Feng, Yangrui Chen, Shaoyu Wang, Yanghua Peng, Haibin Lin, Minlan Yu