Paper ID: 2410.13328

Enhancing 1-Second 3D SELD Performance with Filter Bank Analysis and SCConv Integration in CST-Former

Zhehui Zhang

Recent sound event localization and detection (SELD) research has predominantly focused on long segments (typically 5 to 10 seconds, occasionally 2 seconds). This has improved benchmark performance but lacks the temporal granularity needed for real-world applications. To bridge this gap, this paper investigates SELD with distance estimation (3D SELD) under short segments, specifically a 1-second window, and establishes a new baseline for practical 3D SELD applicability. We further explore the impact of three filter banks (Bark, Mel, and Gammatone) on audio feature extraction; experimental results show that the Gammatone filter bank achieves the highest overall accuracy in this setting. Finally, we propose replacing the convolutional modules within the CST-Former, a competitive SELD architecture, with the SCConv module. This change yields measurable F-score gains in short-segment scenarios, underscoring SCConv's potential to improve spatial and channel feature representation. The experimental results highlight our approach as a significant step towards the real-world deployment of 3D SELD systems under low-latency constraints.
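
For readers unfamiliar with how Bark, Mel, and Gammatone filter banks differ, the sketch below places band center frequencies uniformly on each perceptual scale. It is illustrative only and is not the paper's implementation. It assumes the HTK-style Mel formula, Traunmüller's Bark approximation, and the Glasberg and Moore ERB-rate scale (commonly used to space Gammatone filters). The band count and frequency range in the usage lines, and all function names, are hypothetical choices.

```python
import math

# Common textbook scale conversions (assumed variants; the paper may use others).
def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def hz_to_bark(f):
    # Traunmueller (1990) approximation
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_to_hz(z):
    return 1960.0 * (z + 0.53) / (26.28 - z)

def hz_to_erb_rate(f):
    # Glasberg & Moore (1990) ERB-rate scale, typical for Gammatone spacing
    return 21.4 * math.log10(1.0 + 0.00437 * f)

def erb_rate_to_hz(e):
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

SCALES = {
    "mel": (hz_to_mel, mel_to_hz),
    "bark": (hz_to_bark, bark_to_hz),
    "gammatone": (hz_to_erb_rate, erb_rate_to_hz),
}

def center_frequencies(scale, n_bands, f_min, f_max):
    """Return n_bands center frequencies (Hz), evenly spaced on the given scale."""
    to_scale, to_hz = SCALES[scale]
    lo, hi = to_scale(f_min), to_scale(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [to_hz(lo + step * (i + 1)) for i in range(n_bands)]

if __name__ == "__main__":
    # Hypothetical settings: 64 bands between 50 Hz and 12 kHz.
    for name in SCALES:
        cf = center_frequencies(name, 64, 50.0, 12000.0)
        print(name, [round(f) for f in cf[:4]], "...", round(cf[-1]))
```

The three scales allocate resolution differently across low and high frequencies, which is the kind of difference the filter bank comparison in the abstract examines.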

Submitted: Oct 17, 2024