Hierarchical Token Semantic Audio Transformer

Hierarchical Token Semantic Audio Transformers (HTS-ATs) are a class of deep learning models for audio processing tasks, particularly sound classification, emotion recognition, and audio-language alignment. Current research focuses on making HTS-AT more robust in challenging acoustic environments (e.g., reverberation) by using multi-microphone inputs, and on combining it with other modalities such as video or text via large language models. By processing audio hierarchically and working with discrete audio representations, these models are reported to improve accuracy and efficiency in applications including speech recognition, speaker verification, and multimodal content generation.
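To make "hierarchical processing" concrete, here is a minimal sketch of the token-merging idea, assuming a Swin-style scheme in which each 2x2 block of neighbouring spectrogram-patch tokens is merged into one token between stages. The function names `merge_patches` and `stage_shapes` are hypothetical, and the learned linear projection that such models typically apply after merging is deliberately left out, so this shows only how the token grid shrinks while the per-token feature size grows.

```python
def merge_patches(grid):
    """Merge each 2x2 block of neighbouring tokens into a single token.

    `grid` is a list of rows; each row is a list of tokens; each token is a
    list of floats. Features of the four neighbours are concatenated, so the
    grid halves in each direction and the feature length grows fourfold.
    (Assumption: a real model would follow this with a learned projection.)
    """
    rows, cols = len(grid), len(grid[0])
    if rows % 2 or cols % 2:
        raise ValueError("grid height and width must both be even")
    merged = []
    for r in range(0, rows, 2):
        out_row = []
        for c in range(0, cols, 2):
            # Concatenate top-left, top-right, bottom-left, bottom-right.
            token = (grid[r][c] + grid[r][c + 1]
                     + grid[r + 1][c] + grid[r + 1][c + 1])
            out_row.append(token)
        merged.append(out_row)
    return merged


def stage_shapes(rows, cols, dim, num_merges):
    """Return (rows, cols, feature_dim) before and after each merge step."""
    grid = [[[0.0] * dim for _ in range(cols)] for _ in range(rows)]
    shapes = [(rows, cols, dim)]
    for _ in range(num_merges):
        grid = merge_patches(grid)
        shapes.append((len(grid), len(grid[0]), len(grid[0][0])))
    return shapes


if __name__ == "__main__":
    # An 8x8 grid of 4-dimensional tokens, merged twice.
    print(stage_shapes(8, 8, 4, 2))
```

For an 8x8 grid of 4-dimensional tokens merged twice, `stage_shapes(8, 8, 4, 2)` returns `[(8, 8, 4), (4, 4, 16), (2, 2, 64)]`: later stages attend over fewer, coarser tokens, which is where the efficiency of a hierarchical design comes from.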

Papers