Audio Spectrogram Transformer
Audio Spectrogram Transformers (ASTs) are a class of deep learning models that process audio by converting it into spectrograms, which are split into patches and fed to a transformer encoder for feature extraction and classification, much as a Vision Transformer handles image patches. Current research focuses on improving AST efficiency (e.g., through token merging and alternative architectures such as state space models), enhancing robustness to noise and to variation across recording devices, and developing effective pre-training and fine-tuning strategies for downstream tasks such as sound event detection, speech synthesis, and respiratory sound classification. This work is significant because it pushes the boundaries of audio analysis, enabling more accurate and efficient applications in fields ranging from environmental monitoring to medical diagnostics.
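To make the patch-based pipeline concrete, here is a minimal PyTorch sketch of an AST-style classifier. It is an illustration, not the published AST: the class name MiniAST and all hyperparameters (128 mel bins, 16x16 non-overlapping patches, a 4-layer encoder) are assumptions chosen for brevity, whereas the original AST uses overlapping patches and an ImageNet-pretrained DeiT backbone.

```python
import torch
import torch.nn as nn

class MiniAST(nn.Module):
    """AST-style sketch: spectrogram patches -> transformer encoder -> class logits."""

    def __init__(self, n_mels=128, n_frames=1024, patch=16,
                 dim=192, depth=4, heads=4, n_classes=50):
        super().__init__()
        # Split the (n_mels x n_frames) spectrogram into non-overlapping
        # patch x patch tiles and project each tile to a dim-dim token
        # (ViT-style patch embedding via a strided convolution).
        self.embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        n_tokens = (n_mels // patch) * (n_frames // patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))           # learnable [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, n_tokens + 1, dim))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, spec):                        # spec: (batch, n_mels, n_frames)
        x = self.embed(spec.unsqueeze(1))           # (batch, dim, n_mels/p, n_frames/p)
        x = x.flatten(2).transpose(1, 2)            # (batch, n_tokens, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos   # prepend [CLS], add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                   # classify from the [CLS] token

model = MiniAST()
logits = model(torch.randn(2, 128, 1024))  # two random stand-ins for log-mel spectrograms
print(logits.shape)                        # torch.Size([2, 50])
```

Treating the spectrogram as an image is what lets ASTs reuse vision-transformer machinery wholesale; the token count grows with both clip length and frequency resolution, which is why efficiency work such as token merging targets exactly this stage.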
Papers
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
Mehmet Hamza Erol, Arda Senocak, Jiu Feng, Joon Son Chung
A Frame-based Attention Interpretation Method for Relevant Acoustic Feature Extraction in Long Speech Depression Detection
Qingkun Deng, Saturnino Luz, Sofia de la Fuente Garcia