Vision-Language Tasks
Vision-language tasks aim to bridge visual and textual information, enabling machines to understand and generate image descriptions, answer questions about images, and perform complex reasoning over combined image and text data. Current research focuses on improving model efficiency and robustness, particularly through new pre-training strategies, parameter-efficient fine-tuning methods, and more interpretable architectures built on transformers and multimodal large language models (MLLMs). These advances matter for assistive technologies, for making AI systems more accessible and usable across domains, and for deepening our understanding of multimodal learning.
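The two task families named above, description generation and visual question answering, can be illustrated in a few lines of code. The sketch below uses off-the-shelf BLIP checkpoints via the Hugging Face Transformers library; the checkpoint names and image URL are illustrative assumptions and are unrelated to the papers listed below.

```python
# Minimal sketch of two vision-language tasks (captioning and VQA) with
# off-the-shelf BLIP checkpoints from Hugging Face Transformers.
# The image URL is a placeholder; the model names are assumptions for illustration.
import requests
from PIL import Image
from transformers import (
    BlipProcessor,
    BlipForConditionalGeneration,
    BlipForQuestionAnswering,
)

IMAGE_URL = "https://example.com/street_scene.jpg"  # placeholder image
image = Image.open(requests.get(IMAGE_URL, stream=True).raw).convert("RGB")

# Task 1: image captioning -- generate a textual description of the image.
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
cap_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
cap_inputs = cap_processor(images=image, return_tensors="pt")
caption_ids = cap_model.generate(**cap_inputs, max_new_tokens=30)
print("caption:", cap_processor.decode(caption_ids[0], skip_special_tokens=True))

# Task 2: visual question answering -- answer a free-form question about the image.
vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
vqa_inputs = vqa_processor(images=image, text="How many people are in the picture?", return_tensors="pt")
answer_ids = vqa_model.generate(**vqa_inputs, max_new_tokens=10)
print("answer:", vqa_processor.decode(answer_ids[0], skip_special_tokens=True))
```

Swapping in a multimodal LLM typically changes only the checkpoint and prompt format; the task framing of image-plus-text in, text out stays the same.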
Papers
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, Kai Chen
iWISDM: Assessing instruction following in multimodal models at scale
Xiaoxuan Lei, Lucas Gomez, Hao Yuan Bai, Pouya Bashivan