Paper ID: 2410.04091

Cross-Lingual Query-by-Example Spoken Term Detection: A Transformer-Based Approach

Allahdadi Fatemeh, Mahdian Toroghi Rahil, Zareian Hassan

Query-by-example spoken term detection (QbE-STD) is typically constrained by the scarcity of transcribed data and by language specificity. This paper introduces a novel, language-agnostic QbE-STD model that combines image processing techniques with a transformer architecture. By employing a pre-trained XLSR-53 network for feature extraction and a Hough transform for detection, our model searches for user-defined spoken terms within any audio file. Experimental results across four languages demonstrate significant performance gains (19-54%) over a CNN-based baseline. Our model is faster than dynamic time warping (DTW), although its accuracy remains lower than DTW's. Notably, our model can also accurately count how many times the query term is repeated within the target audio.
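
The detection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: random vectors stand in for XLSR-53 frame features, the query-versus-audio comparison is a cosine similarity matrix treated as an image, and the Hough transform is reduced to a single-angle accumulator that votes only along 45-degree diagonals (which presumes the query and its occurrences are spoken at the same rate). The function names, `threshold`, and `min_fraction` are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(query, audio):
    """Frame-by-frame cosine similarity, shape (n_query_frames, n_audio_frames)."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    return q @ a.T


def count_diagonal_matches(sim, threshold=0.9, min_fraction=0.8):
    """Return start frames of diagonals that collect enough votes.

    Assumption: a Hough-style accumulator restricted to slope-1 lines,
    parametrized only by the column offset (the candidate start frame).
    """
    n_q, n_a = sim.shape
    binary = sim >= threshold              # "edge" pixels of the similarity image
    rows = np.arange(n_q)
    n_offsets = n_a - n_q + 1
    votes = np.zeros(n_offsets, dtype=int)
    for offset in range(n_offsets):
        # Count thresholded pixels lying on the diagonal starting at `offset`.
        votes[offset] = binary[rows, rows + offset].sum()
    hits = votes >= min_fraction * n_q
    # Merge runs of adjacent offsets into one detection each.
    starts, prev = [], False
    for offset, hit in enumerate(hits):
        if hit and not prev:
            starts.append(offset)
        prev = bool(hit)
    return starts


# Synthetic stand-in for extracted features: 16-dimensional random frames.
rng = np.random.default_rng(0)
query = rng.normal(size=(10, 16))
audio = rng.normal(size=(60, 16))
audio[5:15] = query    # first occurrence of the query
audio[40:50] = query   # second occurrence of the query

sim = cosine_similarity_matrix(query, audio)
starts = count_diagonal_matches(sim)
print("detected start frames:", starts)
print("repetition count:", len(starts))
```

In this sketch the number of detected diagonals doubles as the repetition count, which mirrors the counting capability the abstract mentions. A full Hough transform would also vote over line angle, which would tolerate differences in speaking rate.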

Submitted: Oct 5, 2024