Vision-Language Tasks
Vision-language tasks aim to bridge visual and textual information, enabling machines to understand and describe visual content, answer questions, and perform complex reasoning over combined image and text data. Current research focuses on improving model efficiency and robustness, particularly through new pre-training strategies, parameter-efficient fine-tuning methods, and the development of more interpretable architectures built on transformers and multimodal large language models (MLLMs). These advances matter for assistive technologies, for improving the accessibility and usability of AI systems across domains, and for furthering our understanding of multimodal learning.
Papers
Dynamic Prompting: A Unified Framework for Prompt Tuning
Xianjun Yang, Wei Cheng, Xujiang Zhao, Wenchao Yu, Linda Petzold, Haifeng Chen
Naming Objects for Vision-and-Language Manipulation
Tokuhiro Nishikawa, Kazumi Aoyama, Shunichi Sekiguchi, Takayoshi Takayanagi, Jianing Wu, Yu Ishihara, Tamaki Kojima, Jerry Jun Yokono