Audio Representation
Audio representation research focuses on developing effective ways to encode audio signals for machine understanding, with the goal of building models that can process and interpret diverse sounds such as speech, music, and environmental sounds. Current work emphasizes self-supervised learning, often using transformer-based architectures or more efficient alternatives such as state space models, to learn robust representations from large unlabeled datasets. These advances enable more accurate and efficient audio processing across applications including speech recognition, music information retrieval, sound event detection, and healthcare tasks such as heart murmur detection. Developing general-purpose audio representations that perform well across diverse audio domains remains a key focus.
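To make the self-supervised setup concrete, below is a minimal sketch of masked-prediction pretraining on log-mel spectrograms, the general idea behind masked-modeling approaches in this area. It is not the method of any specific paper listed here; the model name, patch size, mask ratio, and the small Transformer encoder are illustrative assumptions.

```python
# Illustrative sketch: masked spectrogram modeling for self-supervised
# audio representation learning. All hyperparameters are assumptions.
import torch
import torch.nn as nn

class MaskedSpectrogramModel(nn.Module):
    def __init__(self, n_mels=80, patch_frames=4, dim=192, mask_ratio=0.6):
        super().__init__()
        self.patch_frames = patch_frames
        self.mask_ratio = mask_ratio
        patch_dim = n_mels * patch_frames           # each patch = a few frames, flattened
        self.embed = nn.Linear(patch_dim, dim)      # patch -> token embedding
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.decode = nn.Linear(dim, patch_dim)     # reconstruct masked patches

    def forward(self, spec):
        # spec: (batch, n_mels, frames) log-mel spectrogram
        b, m, t = spec.shape
        t = t - t % self.patch_frames               # drop ragged tail frames
        patches = spec[:, :, :t].reshape(b, m, -1, self.patch_frames)
        patches = patches.permute(0, 2, 1, 3).flatten(2)   # (b, n_patches, n_mels*patch_frames)
        tokens = self.embed(patches)

        # randomly mask a fraction of patches, replacing them with a learned mask token
        mask = torch.rand(b, tokens.size(1), device=spec.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand_as(tokens), tokens)

        recon = self.decode(self.encoder(tokens))
        # reconstruction loss is computed only on masked positions
        num = ((recon - patches) ** 2 * mask.unsqueeze(-1).float()).sum()
        den = (mask.float().sum() * recon.size(-1)).clamp(min=1)
        return num / den

# usage: one pretraining step on a batch of random "spectrograms"
model = MaskedSpectrogramModel()
loss = model(torch.randn(8, 80, 128))
loss.backward()
```

After pretraining, the encoder's token outputs (or their pooled average) would serve as the audio representation for downstream tasks such as tagging or retrieval.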
Papers
Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations
Sarthak Yadav, Zheng-Hua Tan
M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation
Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Masahiro Yasuda, Shunsuke Tsubaki, Keisuke Imoto