Audio Spectrogram Transformer

Audio Spectrogram Transformers (ASTs) are deep learning models that convert audio into spectrograms and apply transformer architectures to them for feature extraction and classification. Current research focuses on improving AST efficiency (for example, through token merging or alternative architectures such as state space models), increasing robustness to noise and to variation across recording devices, and developing effective pre-training and fine-tuning strategies for downstream tasks such as sound event detection, speech synthesis, and respiratory sound classification. These advances aim at more accurate and efficient audio analysis in fields ranging from environmental monitoring to medical diagnostics.
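
To make the "spectrogram in, transformer tokens out" step concrete, here is a minimal sketch of how a spectrogram might be cut into overlapping patches that then serve as the transformer's input tokens. The function names (patch_grid, extract_patches) are hypothetical, and the 16x16 patch size, stride of 10, and 128 x 1024 input shape are illustrative assumptions rather than values stated in this overview. A real model would also apply a learned linear projection and positional embeddings to each patch, which is omitted here.

```python
from typing import List, Tuple


def patch_grid(n_mels: int, n_frames: int,
               patch: int = 16, stride: int = 10) -> Tuple[int, int]:
    """Number of patches along the frequency and time axes (no padding)."""
    if n_mels < patch or n_frames < patch:
        raise ValueError("spectrogram is smaller than one patch")
    rows = (n_mels - patch) // stride + 1    # patches along frequency
    cols = (n_frames - patch) // stride + 1  # patches along time
    return rows, cols


def extract_patches(spec: List[List[float]],
                    patch: int = 16, stride: int = 10) -> List[List[float]]:
    """Flatten each patch of a [n_mels][n_frames] spectrogram into one token."""
    rows, cols = patch_grid(len(spec), len(spec[0]), patch, stride)
    tokens = []
    for r in range(rows):
        for c in range(cols):
            f0, t0 = r * stride, c * stride  # top-left corner of this patch
            tokens.append([spec[f][t]
                           for f in range(f0, f0 + patch)
                           for t in range(t0, t0 + patch)])
    return tokens


if __name__ == "__main__":
    # Assumed input: 128 mel bins by 1024 time frames, filled with zeros.
    dummy = [[0.0] * 1024 for _ in range(128)]
    tokens = extract_patches(dummy)
    print(len(tokens), len(tokens[0]))  # 1212 tokens of 256 values each
```

Because self-attention cost grows quadratically with the number of tokens, the sequence length computed here is the quantity that efficiency techniques such as token merging try to reduce.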

Papers