Speech Representation
Speech representation research seeks effective numerical encodings of spoken language that capture both linguistic content and speaker-specific characteristics for downstream tasks such as speech recognition and voice conversion. Current work relies heavily on transformer-based architectures and self-supervised learning, using techniques such as masked prediction and contrastive learning to learn robust representations from large unlabeled datasets. These advances are improving efficiency and accuracy across applications including automatic speech recognition, speaker identification, and speech synthesis, while also shedding light on the internal workings of these models. In parallel, efforts to better disentangle content from speaker information aim to yield more robust and versatile representations.
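To make the contrastive-learning idea above concrete, here is a minimal NumPy sketch of an InfoNCE-style objective over speech-frame embeddings: each anchor frame should be more similar to its own positive view (e.g. an augmented or masked-then-reconstructed version of the same frame) than to the positives of other frames in the batch. All names, shapes, and the toy data are illustrative assumptions, not the method of any specific paper listed below.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss.

    anchors, positives: (N, D) arrays of frame embeddings; row i of
    `positives` is the positive view for row i of `anchors`.
    """
    # L2-normalize so dot products become cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for anchor i is positive i, i.e. the diagonal.
    return -np.mean(np.diag(log_probs))

# Toy example: 8 frames with 16-dim features; positives are light
# augmentations (small additive noise) of the same frames.
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))
noisy = frames + 0.05 * rng.normal(size=frames.shape)

matched_loss = info_nce_loss(frames, noisy)        # low: pairs line up
shuffled_loss = info_nce_loss(frames, noisy[::-1]) # high: pairs mismatched
```

Minimizing this loss pulls each frame toward its own augmented view and pushes it away from the rest of the batch, which is the basic mechanism behind contrastive self-supervised speech models.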
Papers
Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing
Yonggan Fu, Yang Zhang, Kaizhi Qian, Zhifan Ye, Zhongzhi Yu, Cheng-I Lai, Yingyan Lin
data2vec-aqc: Search for the right Teaching Assistant in the Teacher-Student training setup
Vasista Sai Lodagala, Sreyan Ghosh, S. Umesh
Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations
Themos Stafylakis, Ladislav Mošner, Sofoklis Kakouros, Oldřich Plchot, Lukáš Burget, Jan Černocký
Learning Invariant Representation and Risk Minimized for Unsupervised Accent Domain Adaptation
Chendong Zhao, Jianzong Wang, Xiaoyang Qu, Haoqian Wang, Jing Xiao