Paper ID: 2401.11311
A Novel Benchmark for Few-Shot Semantic Segmentation in the Era of Foundation Models
Reda Bensaid, Vincent Gripon, François Leduc-Primeau, Lukas Mauch, Ghouthi Boukli Hacene, Fabien Cardinaux
In recent years, the rapid evolution of computer vision has seen the emergence of various foundation models, each tailored to specific data types and tasks. In this study, we explore the adaptation of these models for few-shot semantic segmentation. Specifically, we conduct a comprehensive comparative analysis of four prominent foundation models (DINO V2, Segment Anything, CLIP, and Masked AutoEncoders), along with a straightforward ResNet50 pre-trained on the COCO dataset. We also evaluate five adaptation methods, ranging from linear probing to fine-tuning. Our findings show that DINO V2 outperforms the other models by a large margin across various datasets and adaptation methods. On the other hand, the choice of adaptation method makes little difference in the obtained results, suggesting that simple linear probing can compete with more advanced, computationally intensive alternatives.
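To make the linear-probing baseline concrete, the sketch below trains a single linear layer (a 1x1 convolution) on top of frozen DINOv2 patch features for per-pixel classification. This is a minimal illustration of the general technique, not the authors' exact protocol: the choice of backbone variant (ViT-S/14 via torch.hub), image size, dataset (NUM_CLASSES), optimizer, and upsampling scheme are all assumptions for the example.

```python
# Minimal sketch of linear probing for semantic segmentation on frozen
# DINOv2 features. Hyperparameters and feature post-processing here are
# illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 21   # e.g. PASCAL VOC; dataset-dependent (assumption)
IMG_SIZE = 448     # must be a multiple of the ViT patch size (14)

# Frozen foundation-model backbone (DINOv2 ViT-S/14 via torch.hub).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

# The "linear probe": a single 1x1 convolution over patch features.
probe = nn.Conv2d(backbone.embed_dim, NUM_CLASSES, kernel_size=1)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

def segment(images):
    """images: (B, 3, IMG_SIZE, IMG_SIZE) -> per-pixel class logits."""
    with torch.no_grad():  # backbone stays frozen
        feats = backbone.forward_features(images)["x_norm_patchtokens"]
    b, n, c = feats.shape
    side = int(n ** 0.5)  # square patch grid for square inputs
    feats = feats.permute(0, 2, 1).reshape(b, c, side, side)
    logits = probe(feats)  # (B, NUM_CLASSES, side, side)
    # Upsample coarse patch-level logits to full image resolution.
    return F.interpolate(logits, size=images.shape[-2:], mode="bilinear")

def train_step(images, masks):
    """One gradient step on a (few-shot) support batch; only the probe
    parameters are updated."""
    optimizer.zero_grad()
    loss = F.cross_entropy(segment(images), masks)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the 1x1 convolution is trained, this baseline is cheap even in the few-shot regime, which is what makes the abstract's observation notable: such a probe can reportedly compete with far more computationally intensive adaptation schemes.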
Submitted: Jan 20, 2024