Visual Query

Visual query research focuses on enabling computers to understand and respond to visual information used as a query, mirroring how humans use images to search for information. Current efforts concentrate on improving the accuracy and efficiency of large vision-language models (LVLMs) when processing visual queries, particularly in egocentric videos and complex diagrams. These efforts often employ transformer-based architectures together with novel training strategies, such as aggregate query sculpting or multi-axis querying, to address challenges like sparsity and scalability. This field is crucial for advancing multimodal AI, with applications ranging from improved search engines and augmented reality systems to more efficient data exploration tools and enhanced explainability in AI models.
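The core retrieval idea behind many visual query systems can be sketched simply: encode the query image and candidate video frames into a shared embedding space, then rank frames by similarity to the query. The sketch below is illustrative only and does not reproduce any specific paper's method; the `embed` function is a hypothetical stand-in for a learned LVLM or CLIP-style encoder, using deterministic random vectors instead of real image features.

```python
import numpy as np

def embed(image_id: str) -> np.ndarray:
    # Placeholder encoder: derives a deterministic unit vector from the
    # image identifier. A real system would run the image through a
    # learned vision encoder instead.
    seed = abs(hash(image_id)) % (2**32)
    v = np.random.default_rng(seed).standard_normal(64)
    return v / np.linalg.norm(v)

def rank_frames(query_image: str, frames: list[str]) -> list[str]:
    # Score each candidate frame by cosine similarity (dot product of
    # unit vectors) against the query embedding, highest first.
    q = embed(query_image)
    scores = {f: float(embed(f) @ q) for f in frames}
    return sorted(frames, key=lambda f: scores[f], reverse=True)

frames = ["frame_001", "frame_002", "frame_003"]
ranking = rank_frames("query.png", frames)
```

In egocentric-video settings the candidate set is long and sparse (the queried object appears in few frames), which is why the papers below focus on efficient query mechanisms rather than brute-force scoring of every frame.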

Papers