Multimodal Comprehension
Multimodal comprehension focuses on enabling artificial intelligence systems to understand and reason over information from multiple sources, such as text and images or video and audio. Current research emphasizes improving the accuracy and robustness of large vision-language models (LVLMs) by addressing issues such as hallucination (the generation of inaccurate information) and by strengthening their ability to handle long, complex multimodal inputs, often through novel training-free methods or enhanced attention mechanisms. The field is significant because it underpins advances in applications ranging from medical image analysis to educational tools and, more broadly, the development of more human-like AI capable of understanding rich, real-world information.
Papers
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing
Needle In A Multimodal Haystack
Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, Wenhai Wang