MLLM Training
Multimodal large language model (MLLM) training focuses on developing AI systems that can understand and generate content across multiple modalities, such as text, images, and video. Current research emphasizes improving MLLM efficiency through techniques like knowledge distillation and model compression, and raising performance on specific tasks such as visual question answering and embodied agent control, typically via instruction tuning and preference learning. The field is significant because MLLMs could transform applications ranging from healthcare diagnostics to robotics by enabling more natural interaction with complex, multimodal data. A minimal sketch of the instruction-tuning recipe mentioned above is given below.
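The sketch below illustrates one multimodal instruction-tuning step in a LLaVA-style setup, which is one common MLLM training recipe: frozen vision features are projected into the LLM's token-embedding space, prepended to the text embeddings, and the model is trained with a next-token loss on the response tokens only. All module names, dimensions, and the tiny Transformer stand-in are illustrative assumptions, not the method of any paper listed here.

```python
# Hedged sketch of one multimodal instruction-tuning step.
# Dimensions are kept tiny so the example runs as-is on CPU; in practice the
# vision encoder and LLM backbone would be large pretrained models.
import torch
import torch.nn as nn
import torch.nn.functional as F

VISION_DIM, LLM_DIM, VOCAB = 512, 256, 1000   # illustrative sizes

class VisionProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(VISION_DIM, LLM_DIM), nn.GELU(), nn.Linear(LLM_DIM, LLM_DIM)
        )

    def forward(self, patch_feats):           # (B, num_patches, VISION_DIM)
        return self.proj(patch_feats)          # (B, num_patches, LLM_DIM)

# Stand-ins for a pretrained LLM backbone (toy Transformer with a causal mask).
embed = nn.Embedding(VOCAB, LLM_DIM)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(LLM_DIM, nhead=8, batch_first=True), num_layers=2
)
lm_head = nn.Linear(LLM_DIM, VOCAB)
projector = VisionProjector()

# Dummy batch: 16 frozen image-patch features plus a tokenized prompt/response.
patch_feats = torch.randn(1, 16, VISION_DIM)
text_ids = torch.randint(0, VOCAB, (1, 32))
labels = text_ids.clone()
labels[:, :8] = -100                           # mask the prompt; supervise the response only

# Prepend projected image tokens to the text embeddings and run the backbone.
inputs = torch.cat([projector(patch_feats), embed(text_ids)], dim=1)   # (1, 48, LLM_DIM)
causal_mask = nn.Transformer.generate_square_subsequent_mask(inputs.size(1))
hidden = backbone(inputs, mask=causal_mask)
logits = lm_head(hidden[:, 16:])               # predictions at the text positions

# Standard next-token cross-entropy on unmasked (response) tokens.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB), labels[:, 1:].reshape(-1), ignore_index=-100
)
loss.backward()
print(f"instruction-tuning loss: {loss.item():.3f}")
```

In real pipelines the projector (and often the LLM, via full fine-tuning or adapters) is updated while the vision encoder stays frozen; preference learning would follow as a separate stage on top of an instruction-tuned checkpoint.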
Papers
Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models
Yifang Xu, Yunzhuo Sun, Benxiang Zhai, Ming Li, Wenxin Liang, Yang Li, Sidan Du
3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding
Haomiao Xiong, Yunzhi Zhuge, Jiawen Zhu, Lu Zhang, Huchuan Lu
Libra: Leveraging Temporal Images for Biomedical Radiology Analysis
Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks
Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Shahbaz Khan, Paolo Fraccaro, Alexandre Lacoste, Salman Khan