Vision-Language Tracking
Vision-language tracking (VLT) augments visual object tracking in videos with natural language descriptions of the target, with the aim of improving robustness and accuracy over purely visual methods. Current research follows two main directions. The first is building more comprehensive benchmarks with diverse, multi-granularity textual annotations, often generated with large language models (LLMs). The second is designing unified architectures that fuse visual and linguistic information, including approaches based on transformers and convolutional neural networks. The field matters because it advances multimodal learning and could benefit applications such as autonomous driving and video understanding, where finer-grained and more robust object tracking is needed.
Papers
How Texts Help? A Fine-grained Evaluation to Reveal the Role of Language in Vision-Language Tracking
Xuchen Li, Shiyu Hu, Xiaokun Feng, Dailing Zhang, Meiqi Wu, Jing Zhang, Kaiqi Huang
MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking
Xinqi Liu, Li Zhou, Zikun Zhou, Jianqiu Chen, Zhenyu He