Speech Representation
Speech representation research aims to build numerical encodings of spoken language that capture both linguistic content and speaker-specific characteristics for downstream tasks such as speech recognition and voice conversion. Current work relies heavily on transformer-based architectures and self-supervised learning, using objectives such as masked prediction and contrastive learning to learn robust representations from large unlabeled datasets. These advances are improving efficiency and accuracy in automatic speech recognition, speaker identification, and speech synthesis, while analyses of the trained models shed light on their internal workings. Researchers are also working to better disentangle content and speaker information within these representations, with the aim of more robust and versatile models.
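To make the contrastive objective mentioned above concrete, here is a minimal sketch of an InfoNCE-style loss for a single anchor frame. It is an illustration under stated assumptions, not the method of any paper listed here: the function names (`cosine`, `info_nce`), the temperature of 0.1, and the two-dimensional toy embeddings are all hypothetical, and real systems compute this over batches of learned features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax score of the positive."""
    # The positive is placed at index 0; negatives follow.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]

# Toy frame embeddings (hypothetical values, not from any real model).
anchor = [1.0, 0.0]
positive = [0.9, 0.1]                   # close to the anchor
negatives = [[0.0, 1.0], [-1.0, 0.0]]   # far from the anchor

# Correct pairing: the true positive is scored against the negatives.
loss_good = info_nce(anchor, positive, negatives)
# Wrong pairing: a dissimilar vector is treated as the positive.
loss_bad = info_nce(anchor, negatives[0], [positive, negatives[1]])
```

In this sketch the loss is small when the anchor is more similar to its positive than to any negative, and large when the pairing is wrong, which is the pressure that pushes representations of matching segments together and mismatched ones apart.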
Papers
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
Jaeyeon Kim, Jaeyoon Jung, Jinjoo Lee, Sang Hoon Woo
What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis
Takanori Ashihara, Marc Delcroix, Takafumi Moriya, Kohei Matsuura, Taichi Asami, Yusuke Ijima
Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective
Alexander H. Liu, Sung-Lin Yeh, James Glass
Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval
Yimin Deng, Huaizhen Tang, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang