Visual Representation
Visual representation research focuses on creating effective ways for computers to understand and use visual information, aiming to bridge the gap between raw image data and higher-level semantic understanding. Current work emphasizes learning robust, efficient visual representations through techniques such as contrastive learning, masked image modeling, and the integration of vision models with large language models (LLMs), typically built on transformer architectures. These advances have significant implications for applications such as robotic control, medical image analysis, and multimodal AI systems.
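To make the contrastive-learning technique mentioned above concrete, below is a minimal sketch of an InfoNCE-style loss of the kind used in SimCLR-like representation learning. It is an illustrative example under stated assumptions, not the method of any paper listed here: the function name `info_nce_loss`, the batch size, embedding dimension, and temperature value are all hypothetical choices.

```python
# Minimal InfoNCE-style contrastive loss sketch (SimCLR-like setup).
# Assumes two augmented "views" of the same image batch have already
# been passed through an encoder to produce embeddings z1 and z2.
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (batch, dim) embeddings of two views of the same images."""
    z1 = F.normalize(z1, dim=1)          # project embeddings onto the unit hypersphere
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature     # (batch, batch) cosine-similarity matrix
    labels = torch.arange(z1.size(0))    # matching pairs sit on the diagonal
    # Symmetrize: each view must retrieve its counterpart from the batch.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# Toy usage: random tensors stand in for encoder outputs.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(info_nce_loss(z1, z2))
```

The key design choice is treating every other image in the batch as a negative, so larger batches tighten the learned representation without requiring explicit negative mining.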
Papers
MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding
Yue Cao, Yangzhou Liu, Zhe Chen, Guangchen Shi, Wenhai Wang, Danhuai Zhao, Tong Lu
Unveiling the Mystery of Visual Attributes of Concrete and Abstract Concepts: Variability, Nearest Neighbors, and Challenging Categories
Tarun Tater, Sabine Schulte im Walde, Diego Frassinelli
RoboKoop: Efficient Control Conditioned Representations from Visual Input in Robotics using Koopman Operator
Hemant Kumawat, Biswadeep Chakraborty, Saibal Mukhopadhyay
PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation
Aneta Pawelec, Victoria Sara Wesołowska, Zuzanna Bączek, Piotr Sankowski