Text to Audio Grounding

Text-to-audio grounding (TAG) aligns a textual description with the segments of an audio recording that correspond to it, which supports applications such as audio search and retrieval. Current research emphasizes weakly supervised approaches that train on readily available audio-text pairs without detailed sound event annotations, and it explores pooling strategies and negative sampling techniques to improve model accuracy. TAG also matters for evaluating automatic audio captioning: TAG-based metrics assess caption quality by the semantic alignment between the text and the audio content, which gives a more nuanced assessment than traditional metrics that compare text only.
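As a rough illustration of the idea, the sketch below scores each audio frame against a text query, pools the frame scores into one clip-level score (the quantity a weakly supervised model can be trained on when only audio-text pairs are available), and thresholds the frame scores to recover time segments. Everything here is an assumption made for illustration: the function names, the toy 2-D embeddings, the cosine-similarity scoring, and the three pooling choices (mean, max, linear softmax) are not taken from any specific TAG paper or library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def frame_scores(frame_embs, text_emb):
    """Score every audio frame against the text query, rescaled to [0, 1]."""
    return [(cosine(f, text_emb) + 1.0) / 2.0 for f in frame_embs]


def pool(scores, mode="mean"):
    """Collapse frame-level scores into one clip-level score.

    With weak supervision only the clip-level label (does this text match
    this clip?) is known, so the loss is applied to the pooled value.
    """
    if mode == "mean":
        return sum(scores) / len(scores)
    if mode == "max":
        return max(scores)
    if mode == "linear_softmax":
        # Each frame is weighted by its own score, so confident frames dominate.
        total = sum(scores)
        return sum(s * s for s in scores) / total if total else 0.0
    raise ValueError(f"unknown pooling mode: {mode}")


def segments(scores, threshold=0.5, hop_s=0.02):
    """Turn frame scores into (onset, offset) pairs in seconds by thresholding."""
    found, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i                      # segment opens
        elif s < threshold and start is not None:
            found.append((start * hop_s, i * hop_s))
            start = None                   # segment closes
    if start is not None:                  # segment runs to the end of the clip
        found.append((start * hop_s, len(scores) * hop_s))
    return found


# Toy example: frames 1 and 2 point in the same direction as the query.
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
query = [0.0, 1.0]
scores = frame_scores(frames, query)
grounded = segments(scores, threshold=0.75, hop_s=1.0)
```

In a real system the embeddings would come from trained audio and text encoders, and negative sampling would supply mismatched audio-text pairs whose pooled score is pushed down during training; neither is modeled in this sketch.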

Papers