Auxiliary Caption

Auxiliary captions are additional textual descriptions that accompany multimedia data (audio, video, images), and they are increasingly used to improve cross-modal retrieval and grounding. Current research integrates these captions into models through hierarchical cross-modal interaction, attention mechanisms that align text with multimedia features, and contrastive learning for stronger representations. By exploiting the semantic information that captions carry, these methods aim to improve accuracy and robustness, particularly when annotations are sparse or noisy. Applications that stand to benefit include video grounding, audio retrieval, and sketch-based image retrieval.
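As a rough illustration of the contrastive-learning idea mentioned above, the sketch below computes a symmetric InfoNCE-style loss between media embeddings and the embeddings of their auxiliary captions. It is not taken from any particular paper in this area. The function names, the temperature of 0.07, and the assumption that row i of each input forms a matched pair are all illustrative choices.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def log_softmax_at(scores, target):
    """Log-probability of scores[target] under a softmax over scores."""
    peak = max(scores)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return scores[target] - log_sum


def symmetric_info_nce(media_emb, caption_emb, temperature=0.07):
    """Hypothetical helper: average of media-to-caption and
    caption-to-media InfoNCE losses.

    Assumes media_emb[i] and caption_emb[i] describe the same item,
    so the matched pairs lie on the diagonal of the similarity matrix.
    """
    media = [l2_normalize(v) for v in media_emb]
    captions = [l2_normalize(v) for v in caption_emb]
    n = len(media)

    # logits[i][j] = cosine similarity of media i and caption j, scaled.
    logits = [
        [sum(a * b for a, b in zip(m, c)) / temperature for c in captions]
        for m in media
    ]

    # Each media item should pick out its own caption (row-wise softmax).
    media_to_caption = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n

    # Each caption should pick out its own media item (column-wise softmax).
    caption_to_media = -sum(
        log_softmax_at([logits[r][j] for r in range(n)], j) for j in range(n)
    ) / n

    return 0.5 * (media_to_caption + caption_to_media)
```

When the captions match their media items the loss is close to zero, and it grows when the pairing is shuffled. This is the training signal that pulls matched caption and media representations together.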

Papers