Audio Representation
Audio representation research focuses on developing effective ways to encode audio signals for machine understanding, aiming to build models that can process and interpret diverse sounds such as speech, music, and environmental audio. Current work emphasizes self-supervised learning, often with transformer-based architectures or more efficient alternatives such as state space models, to learn robust representations from large unlabeled datasets. These advances enable more accurate and efficient audio processing across applications including speech recognition, music information retrieval, sound event detection, and healthcare tasks such as heart murmur detection. Developing general-purpose audio representations that perform well across diverse audio domains remains a key focus.
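To make the self-supervised setup described above concrete, below is a minimal sketch of pre-training by masked-patch reconstruction on log-mel spectrograms, in the spirit of masked autoencoder approaches. The module name, patch size, and hyperparameters are illustrative assumptions rather than the method of any paper listed below, and the example assumes PyTorch and torchaudio are available.

```python
# Minimal sketch of masked-patch reconstruction on log-mel spectrograms with a
# small Transformer encoder. All names and hyperparameters are illustrative
# assumptions, not the method of any paper listed below.
import torch
import torch.nn as nn
import torchaudio


class MaskedSpectrogramEncoder(nn.Module):
    """Embeds log-mel patches, masks a fraction of them, and reconstructs the masked ones."""

    def __init__(self, n_mels=80, patch_frames=4, d_model=192,
                 n_layers=4, n_heads=4, max_patches=512):
        super().__init__()
        self.patch_frames = patch_frames
        patch_dim = n_mels * patch_frames
        self.embed = nn.Linear(patch_dim, d_model)                     # patch -> token
        self.pos = nn.Parameter(torch.zeros(1, max_patches, d_model))  # learned positions
        self.mask_token = nn.Parameter(torch.zeros(d_model))           # placeholder for masked patches
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.decode = nn.Linear(d_model, patch_dim)                    # token -> reconstructed patch

    def patchify(self, logmel):
        # (batch, n_mels, frames) -> (batch, n_patches, n_mels * patch_frames)
        b, m, t = logmel.shape
        t = (t // self.patch_frames) * self.patch_frames
        x = logmel[:, :, :t].reshape(b, m, -1, self.patch_frames)
        return x.permute(0, 2, 1, 3).reshape(b, -1, m * self.patch_frames)

    def forward(self, logmel, mask_ratio=0.75):
        patches = self.patchify(logmel)
        tokens = self.embed(patches) + self.pos[:, :patches.size(1)]
        # Randomly replace a fraction of tokens with the shared mask token.
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        encoded = self.encoder(tokens)
        recon = self.decode(encoded)
        # Reconstruction loss on masked patches only; the encoder output serves
        # as the learned audio representation for downstream use.
        loss = ((recon - patches) ** 2)[mask].mean()
        return loss, encoded


if __name__ == "__main__":
    wave = torch.randn(2, 16000)  # two 1-second dummy clips at 16 kHz
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)(wave)
    logmel = torch.log(mel + 1e-6)
    model = MaskedSpectrogramEncoder()
    loss, representation = model(logmel)
    loss.backward()
    print(f"pre-training loss: {loss.item():.3f}, tokens: {tuple(representation.shape)}")
```

After pre-training on unlabeled audio, the encoder's token sequence would typically be pooled and passed to a lightweight classifier for downstream tasks such as sound event detection or keyword spotting.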
Papers
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment
Sarah Ibrahimi, Xiaohang Sun, Pichao Wang, Amanmeet Garg, Ashutosh Sanan, Mohamed Omar
Online Continual Learning in Keyword Spotting for Low-Resource Devices via Pooling High-Order Temporal Statistics
Umberto Michieli, Pablo Peso Parada, Mete Ozay
Sound reconstruction from human brain activity via a generative model with brain-like auditory features
Jong-Yun Park, Mitsuaki Tsukamoto, Misato Tanaka, Yukiyasu Kamitani
Align, Adapt and Inject: Sound-guided Unified Image Generation
Yue Yang, Kaipeng Zhang, Yuying Ge, Wenqi Shao, Zeyue Xue, Yu Qiao, Ping Luo
Stuttering Detection Using Speaker Representations and Self-supervised Contextual Embeddings
Shakeel A. Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni
Masked Autoencoders with Multi-Window Local-Global Attention Are Better Audio Learners
Sarthak Yadav, Sergios Theodoridis, Lars Kai Hansen, Zheng-Hua Tan