Hierarchical Token-Semantic Audio Transformer
The Hierarchical Token-Semantic Audio Transformer (HTS-AT) is a transformer-based audio model that processes spectrogram tokens in hierarchical stages. It is used mainly for sound classification and detection, and as an audio encoder for emotion recognition and audio-language alignment. Current research focuses on making HTS-AT robust in challenging acoustic environments such as reverberation by using multi-microphone inputs, and on combining it with other modalities such as video or text through large language models. Reported benefits are improved accuracy and efficiency in applications including speech recognition, speaker verification, and multimodal content generation, which this line of work attributes to hierarchical processing and discrete audio representations.
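To make "hierarchical processing" concrete, here is a minimal NumPy sketch of the token-merging step commonly used in hierarchical vision-style transformers: each 2x2 neighborhood of tokens is concatenated and projected, so every stage has a quarter of the tokens and twice the channels of the previous one. The function names (patch_merge, hierarchy_shapes), the grid size, the channel width, and the random projection weights are illustrative assumptions, not the published HTS-AT implementation, and attention layers are omitted entirely.

```python
import numpy as np


def patch_merge(tokens, weight):
    """Merge each 2x2 neighborhood of tokens into a single token.

    tokens: (H, W, C) grid of token embeddings; H and W must be even.
    weight: (4 * C, C_out) projection applied to the concatenated neighbors.
    """
    h, w, c = tokens.shape
    # Split the grid into 2x2 blocks: (H/2, 2, W/2, 2, C).
    blocks = tokens.reshape(h // 2, 2, w // 2, 2, c)
    # Move the two within-block axes next to the channel axis, then flatten
    # them so each output position holds its four neighbors concatenated.
    merged = blocks.transpose(0, 2, 1, 3, 4).reshape(h // 2, w // 2, 4 * c)
    return merged @ weight


def hierarchy_shapes(h, w, c, stages, seed=0):
    """Return the token-grid shape at each stage of a toy hierarchy.

    Weights are random placeholders (hypothetical); a trained model would
    learn them and interleave attention blocks between merges.
    """
    rng = np.random.default_rng(seed)
    tokens = rng.standard_normal((h, w, c))
    shapes = [tokens.shape]
    for _ in range(stages - 1):
        c_in = tokens.shape[-1]
        weight = rng.standard_normal((4 * c_in, 2 * c_in))
        tokens = patch_merge(tokens, weight)
        shapes.append(tokens.shape)
    return shapes


if __name__ == "__main__":
    for stage, shape in enumerate(hierarchy_shapes(64, 64, 96, stages=4), 1):
        print(f"stage {stage}: grid {shape[0]}x{shape[1]}, channels {shape[2]}")
```

Because the token count shrinks by a factor of four at each merge, later stages attend over far fewer tokens, which is one plausible source of the efficiency gains mentioned above.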