Multimodal Understanding
Multimodal understanding focuses on enabling machines to comprehend and integrate information from multiple sources such as text, images, audio, and video, mirroring human cognitive abilities. Current research emphasizes developing multimodal large language models (MLLMs) built on architectures such as transformers and diffusion models, often incorporating techniques like instruction tuning and knowledge fusion to improve performance on diverse tasks. This field is crucial for advancing artificial general intelligence and has significant implications for applications ranging from robotics and human-computer interaction to scientific discovery and creative content generation.
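To make the architectural pattern concrete, the sketch below illustrates, in a minimal and simplified form, how many MLLMs fuse modalities: a vision encoder produces image features, a learned projector maps them into the language model's embedding space, and the projected image tokens are concatenated with text tokens before the transformer. This is not the method of any paper listed here; all module names, sizes, and the use of a small encoder-only transformer in place of a full causal LLM are illustrative assumptions.

```python
# Minimal, self-contained sketch of vision-token projection and fusion in an MLLM.
# All dimensions and module names are illustrative assumptions, not a real system.
import torch
import torch.nn as nn


class ToyVisionEncoder(nn.Module):
    """Stand-in for a pretrained vision backbone (e.g., a ViT) over flattened patches."""

    def __init__(self, patch_dim=768):
        super().__init__()
        self.embed = nn.Linear(3 * 16 * 16, patch_dim)  # flattened 16x16 RGB patches

    def forward(self, patches):            # patches: (batch, num_patches, 3*16*16)
        return self.embed(patches)         # -> (batch, num_patches, patch_dim)


class ToyMLLM(nn.Module):
    """Projects image features into the LM embedding space and fuses them with text tokens."""

    def __init__(self, vocab_size=32000, d_model=512, patch_dim=768):
        super().__init__()
        self.vision = ToyVisionEncoder(patch_dim)
        self.projector = nn.Linear(patch_dim, d_model)     # vision -> LM token space
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)  # simplified; real MLLMs use a causal LLM
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, text_ids):
        img_tokens = self.projector(self.vision(patches))      # (B, P, d_model)
        txt_tokens = self.tok_embed(text_ids)                  # (B, T, d_model)
        fused = torch.cat([img_tokens, txt_tokens], dim=1)     # prepend image tokens to the prompt
        hidden = self.lm(fused)
        return self.head(hidden[:, img_tokens.size(1):])       # logits over text positions only


if __name__ == "__main__":
    model = ToyMLLM()
    patches = torch.randn(2, 16, 3 * 16 * 16)        # dummy image patches
    text_ids = torch.randint(0, 32000, (2, 10))      # dummy instruction tokens
    print(model(patches, text_ids).shape)            # torch.Size([2, 10, 32000])
```

Instruction tuning, in this pattern, typically means fine-tuning the projector (and often the language model) on image-plus-instruction pairs so the fused sequence yields helpful text responses; the papers below each benchmark or extend variations of this idea.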
Papers
Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!
Jiwan Chung, Seungwon Lim, Jaehyun Jeon, Seungbeen Lee, Youngjae Yu
BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data
Xuwu Wang, Qiwen Cui, Yunzhe Tao, Yiran Wang, Ziwei Chai, Xiaotian Han, Boyi Liu, Jianbo Yuan, Jing Su, Guoyin Wang, Tingkai Liu, Liyu Chen, Tianyi Liu, Tao Sun, Yufeng Zhang, Sirui Zheng, Quanzeng You, Yang Yang, Hongxia Yang