Based Audio Retrieval

Based audio retrieval focuses on efficiently finding specific audio segments within large datasets from queries such as text descriptions or example audio snippets. Current research emphasizes improving retrieval accuracy and robustness, particularly in challenging scenarios such as noisy audio or rare words, and often relies on transformer-based architectures and contrastive learning to produce effective audio embeddings. These advances support applications ranging from audio indexing and search to more sophisticated multimodal tasks such as audio-visual video segmentation and direct speech translation.
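As a rough illustration of embedding-based retrieval, the sketch below ranks audio segments by cosine similarity to a query embedding. It assumes the embeddings have already been produced by some encoder (for example, one trained with a contrastive objective so that text and audio share a space). The three-dimensional vectors, the segment names, and the function names are hypothetical and chosen only for readability. They are not taken from any specific paper or library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve(query_embedding, segment_embeddings, top_k=2):
    """Return the top_k (segment_id, score) pairs, highest similarity first."""
    scored = [
        (segment_id, cosine_similarity(query_embedding, embedding))
        for segment_id, embedding in segment_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy index: in practice these vectors would come from an
# audio encoder and have hundreds of dimensions.
TOY_INDEX = {
    "dog_bark": [0.9, 0.1, 0.0],
    "rainfall": [0.1, 0.9, 0.2],
    "speech": [0.0, 0.2, 0.9],
}

# Hypothetical query embedding, e.g. from a text encoder given "a dog barking".
TOY_QUERY = [0.8, 0.2, 0.1]

if __name__ == "__main__":
    for segment_id, score in retrieve(TOY_QUERY, TOY_INDEX):
        print(f"{segment_id}: {score:.3f}")
```

The same ranking step applies whether the query is a text description or an example audio snippet, provided both are mapped into the same embedding space. For large datasets, the exhaustive comparison shown here would typically be replaced by an approximate nearest-neighbor index.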

Papers