Speech Representation
Speech representation research focuses on creating effective numerical encodings of spoken language that capture both linguistic content and speaker-specific characteristics, supporting downstream tasks such as speech recognition and voice conversion. Current work relies heavily on transformer-based architectures and self-supervised learning, using techniques such as masked prediction and contrastive learning to learn robust representations from large unlabeled datasets. These advances are improving efficiency and accuracy across applications including automatic speech recognition, speaker identification, and speech synthesis, while also shedding light on the internal workings of these models. In parallel, ongoing efforts to better disentangle content from speaker information within these representations are yielding more robust and versatile models.
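To make the contrastive-learning idea mentioned above concrete, here is a minimal sketch of an InfoNCE-style objective, as used (in spirit) by self-supervised speech models: given a context vector predicted at a masked timestep, the model must identify the true latent representation among sampled distractors. This is a simplified numpy illustration, not the implementation of any specific system; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def info_nce_loss(context, positive, negatives, temperature=0.1):
    """Contrastive loss: score the true latent (positive) against
    distractors (negatives, shape (K, D)) sampled from other timesteps.
    Lower loss means the context vector better identifies the positive."""
    sims = np.array([cosine(context, positive)] +
                    [cosine(context, n) for n in negatives]) / temperature
    sims -= sims.max()                          # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()   # softmax over candidates
    return -np.log(probs[0])                    # positive is at index 0

# Illustrative usage with random latents:
rng = np.random.default_rng(0)
d, k = 16, 10
positive = rng.normal(size=d)
negatives = rng.normal(size=(k, d))
aligned = info_nce_loss(positive, positive, negatives)       # context matches target
mismatched = info_nce_loss(rng.normal(size=d), positive, negatives)
```

A context vector aligned with the true latent yields a much lower loss than a random one, which is the training signal that shapes the learned representations.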
Papers
Disentangled Latent Speech Representation for Automatic Pathological Intelligibility Assessment
Tobias Weise, Philipp Klumpp, Kubilay Can Demir, Andreas Maier, Elmar Noeth, Bjoern Heismann, Maria Schuster, Seung Hee Yang
Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning
Eesung Kim, Jae-Jin Jeon, Hyeji Seo, Hoon Kim
Analyzing the factors affecting usefulness of Self-Supervised Pre-trained Representations for Speech Recognition
Ashish Seth, Lodagala V S V Durga Prasad, Sreyan Ghosh, S. Umesh
SpeechPrompt: An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks
Kai-Wei Chang, Wei-Cheng Tseng, Shang-Wen Li, Hung-yi Lee