Multimodal Understanding
Multimodal understanding focuses on enabling machines to comprehend and integrate information from multiple sources such as text, images, audio, and video, mirroring human cognitive abilities. Current research emphasizes developing multimodal large language models (MLLMs) built on architectures such as transformers and diffusion models, often combined with techniques like instruction tuning and knowledge fusion to improve performance across diverse tasks. The field is widely viewed as crucial for advancing artificial general intelligence, with applications ranging from robotics and human-computer interaction to scientific discovery and creative content generation.
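The recipe mentioned above, pairing a vision encoder with a language model through a learned projection and then instruction tuning the combination, can be sketched in a few lines. The following is a minimal, illustrative PyTorch toy: the class name, dimensions, and layer choices are assumptions for illustration, not the architecture of any paper listed below.

```python
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    """Toy sketch of a common MLLM pattern: vision features are projected into
    the language model's embedding space and fused with text token embeddings.
    All names and sizes here are illustrative assumptions."""

    def __init__(self, vision_dim=768, text_vocab=32000, hidden=512):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, hidden)   # bridges the two modalities
        self.text_embed = nn.Embedding(text_vocab, hidden)
        self.decoder = nn.TransformerEncoder(               # stand-in for an LLM decoder stack
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(hidden, text_vocab)

    def forward(self, image_feats, text_ids):
        # image_feats: (B, num_patches, vision_dim) from any vision backbone
        # text_ids:    (B, seq_len) instruction tokens
        vis = self.vision_proj(image_feats)
        txt = self.text_embed(text_ids)
        fused = torch.cat([vis, txt], dim=1)                 # simple early fusion by concatenation
        hidden_states = self.decoder(fused)
        return self.lm_head(hidden_states[:, vis.size(1):])  # next-token logits over the text positions

model = ToyMultimodalLM()
logits = model(torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 32000])
```

The papers below vary this basic recipe; for example, judging by its title, TokenFlow replaces continuous feature projection with a unified image tokenizer shared between understanding and generation.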
Papers
Hallucination Elimination and Semantic Enhancement Framework for Vision-Language Models in Traffic Scenarios
Jiaqi Fan, Jianhua Wu, Hongqing Chu, Quanbo Ge, Bingzhao Gao
Driving with InternVL: Outstanding Champion in the Track on Driving with Language of the Autonomous Grand Challenge at CVPR 2024
Jiahan Li, Zhiqi Li, Tong Lu
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K. Du, Zehuan Yuan, Xinglong Wu
Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy
Te Yang, Jian Jia, Xiangyu Zhu, Weisong Zhao, Bo Wang, Yanhua Cheng, Yan Li, Shengyuan Liu, Quan Chen, Peng Jiang, Kun Gai, Zhen Lei
Exploring Large Language Models for Multimodal Sentiment Analysis: Challenges, Benchmarks, and Future Directions
Shezheng Song