Vision-Language Task
Vision-language tasks aim to bridge the gap between visual and textual information, enabling machines to understand images, generate descriptions, answer questions, and perform complex reasoning over combined image and text data. Current research focuses on improving model efficiency and robustness, particularly through new pre-training strategies, parameter-efficient fine-tuning methods, and more interpretable architectures built on transformers and multimodal large language models (MLLMs). These advances matter for assistive technologies, improve the accessibility and usability of AI systems across domains, and further our understanding of multimodal learning.
Papers
Scaling 4D Representations
João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Erdogan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Moing, Sjoerd van Steenkiste, Daniel Zoran, Drew A. Hudson, Pedro Vélez, Luisa Polanía, Luke Friedman, Chris Duvarney, Ross Goroshin, Kelsey Allen, Jacob Walker, Rishabh Kabra, Eric Aboussouan, Jennifer Sun, Thomas Kipf, Carl Doersch, Viorica Pătrăucean, Dima Damen, Pauline Luc, Mehdi S. M. Sajjadi, Andrew Zisserman
HarmonicEval: Multi-modal, Multi-task, Multi-criteria Automatic Evaluation Using a Vision Language Model
Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue