Vision Paper
Vision research currently focuses on developing robust and efficient methods for processing and understanding visual information, often integrating it with other modalities such as language and touch. Key areas include improving the accuracy and efficiency of transformer-based models and exploring alternatives such as Mamba and other structured state space models, for tasks ranging from object detection and segmentation to navigation and scene understanding. This work is driven by the need for better performance in applications such as robotics, autonomous systems, medical image analysis, and assistive technologies, with a strong emphasis on challenges such as limited data, computational cost, and generalization to unseen scenarios.
Papers
Fully-attentive and interpretable: vision and video vision transformers for pain detection
Giacomo Fiorentini, Itir Onal Ertugrul, Albert Ali Salah
Learning on the Job: Self-Rewarding Offline-to-Online Finetuning for Industrial Insertion of Novel Connectors from Vision
Ashvin Nair, Brian Zhu, Gokul Narayanan, Eugen Solowjow, Sergey Levine
Robot to Human Object Handover using Vision and Joint Torque Sensor Modalities
Mohammadhadi Mohandes, Behnam Moradi, Kamal Gupta, Mehran Mehrandezh
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Gregor Geigle, Chen Cecilia Liu, Jonas Pfeiffer, Iryna Gurevych
Text-Derived Knowledge Helps Vision: A Simple Cross-modal Distillation for Video-based Action Anticipation
Sayontan Ghosh, Tanvi Aggarwal, Minh Hoai, Niranjan Balasubramanian