Multimodal Understanding
Multimodal understanding aims to enable machines to comprehend and integrate information from multiple modalities, such as text, images, audio, and video, much as humans combine their senses. Current research centers on multimodal large language models (MLLMs) built on a range of architectures, including transformers and diffusion models, and often uses techniques such as instruction tuning and knowledge fusion to improve performance across diverse tasks. The field is widely regarded as important for progress toward artificial general intelligence, with applications ranging from robotics and human-computer interaction to scientific discovery and creative content generation.
Papers
January 22, 2024
January 2, 2024
December 20, 2023
December 6, 2023
December 1, 2023
November 27, 2023
November 15, 2023
November 8, 2023
November 7, 2023
October 14, 2023
October 13, 2023
October 6, 2023
September 11, 2023
August 27, 2023
August 19, 2023
July 14, 2023
July 13, 2023
July 12, 2023
July 11, 2023