Visual Word

A visual word is a discrete unit that stands for a group of similar local image features, analogous to a word in natural language. Quantizing features into a finite vocabulary makes image analysis more efficient and helps bridge the semantic gap between low-level features and high-level understanding. Current research uses visual words in several architectures, including bag-of-visual-words (BoVW) models and auto-regressive approaches integrated with Large Language Models (LLMs), to improve tasks such as image annotation, retrieval, and weakly-supervised semantic segmentation. The aim is more robust and interpretable image understanding, with applications ranging from medical image analysis to multimedia search and retrieval.
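To make the bag-of-visual-words idea concrete, the following is a minimal sketch in plain Python. It assumes the vocabulary is built with k-means clustering, which is a common choice but is not specified here. The function names (`build_vocabulary`, `bovw_histogram`, `nearest_word`) are hypothetical, and the toy 2-D points stand in for real local descriptors, which are usually much higher-dimensional.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the visual word (centroid) closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: squared_distance(descriptor, vocabulary[i]))


def build_vocabulary(descriptors, k, iterations=20, seed=0):
    """Cluster descriptors with plain k-means (Lloyd's algorithm).

    Assumption: k-means is used here only as an illustrative quantizer.
    """
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest_word(d, centroids)].append(d)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def bovw_histogram(descriptors, vocabulary):
    """Normalized count of how often each visual word occurs in one image."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Toy descriptors for two hypothetical images.
image_a = [(0.1, 0.2), (0.0, 0.3), (9.8, 10.1), (10.2, 9.9)]
image_b = [(10.0, 10.0), (9.9, 10.3), (10.1, 9.7), (0.2, 0.1)]

vocabulary = build_vocabulary(image_a + image_b, k=2)
hist_a = bovw_histogram(image_a, vocabulary)
hist_b = bovw_histogram(image_b, vocabulary)
print(hist_a, hist_b)
```

Each image is reduced to a fixed-length histogram regardless of how many descriptors it contains, so images can be compared or indexed with standard vector methods, much as text documents are compared by word counts.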

Papers