Multimodal LLM
Multimodal Large Language Models (MLLMs) integrate diverse data modalities, such as text, images, and video, into a unified framework for joint understanding and generation. Current research emphasizes efficient fusion of visual and textual information, often through early-fusion mechanisms and specialized adapters within transformer-based architectures, as well as Mixture-of-Experts (MoE) designs. The field is significant because it can improve applications such as image captioning, visual question answering, and more complex cross-modal reasoning tasks, while also addressing challenges like hallucination and bias.
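To make the adapter-based early-fusion idea concrete, below is a minimal sketch of the common pattern in which patch features from a vision encoder are projected by a small adapter into the language model's embedding space and prepended to the text tokens. All module names, dimensions, and the two-layer MLP design here are illustrative assumptions, not the API of any specific model.

```python
# Sketch of adapter-based visual-token fusion for an MLLM (assumed design,
# not a specific model's implementation).
import torch
import torch.nn as nn


class VisionAdapter(nn.Module):
    """Two-layer MLP that maps vision-encoder features to the LLM hidden size."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)


def fuse_visual_and_text(patch_features, text_embeddings, adapter):
    """Early fusion: prepend projected visual tokens to the text token sequence."""
    visual_tokens = adapter(patch_features)                    # (B, P, D)
    return torch.cat([visual_tokens, text_embeddings], dim=1)  # (B, P + T, D)


if __name__ == "__main__":
    batch, patches, tokens = 2, 256, 32
    vision_dim, llm_dim = 1024, 4096  # assumed encoder/LLM widths
    adapter = VisionAdapter(vision_dim, llm_dim)
    patch_features = torch.randn(batch, patches, vision_dim)
    text_embeddings = torch.randn(batch, tokens, llm_dim)
    fused = fuse_visual_and_text(patch_features, text_embeddings, adapter)
    print(fused.shape)  # torch.Size([2, 288, 4096])
```

In this style of fusion, the fused sequence is fed to the transformer decoder as ordinary token embeddings, so the language model attends over visual and textual tokens jointly; only the adapter (and optionally the LLM) needs to be trained.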