Speech Representation
Speech representation research focuses on creating effective numerical encodings of spoken language, aiming to capture both linguistic content and speaker-specific characteristics for downstream tasks such as speech recognition and voice conversion. Current research relies heavily on transformer-based architectures and self-supervised learning, using techniques such as masked prediction and contrastive learning to learn robust representations from large unlabeled datasets. These advances are driving improvements in efficiency and accuracy across applications including automatic speech recognition, speaker identification, and speech synthesis, while also yielding insights into the internal workings of these models. Efforts are also underway to better disentangle content and speaker information within these representations, leading to more robust and versatile models.
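To make the contrastive-learning idea mentioned above concrete, here is a minimal sketch of an InfoNCE-style objective on toy "speech frame" embeddings. This is an illustrative example, not the loss from any specific paper listed below: the array shapes, temperature value, and the way positives are generated (noisy copies of anchors) are all assumptions for demonstration.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Contrastive (InfoNCE) loss: each anchor should match its own
    positive, with the other positives in the batch acting as negatives."""
    # L2-normalise so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # the correct pairing for each anchor lies on the diagonal
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
# toy embeddings: positives are noisy copies of their anchors (an assumption
# standing in for two augmented views of the same speech segment)
anchors = rng.normal(size=(8, 16))
positives = anchors + 0.05 * rng.normal(size=(8, 16))
mismatched = rng.normal(size=(8, 16))

aligned_loss = info_nce_loss(anchors, positives)
random_loss = info_nce_loss(anchors, mismatched)
```

Matched anchor/positive pairs yield a lower loss than mismatched pairs, which is the signal a self-supervised encoder is trained to exploit when no labels are available.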
Papers
Label Aware Speech Representation Learning For Language Identification
Shikhar Vashishth, Shikhar Bharadwaj, Sriram Ganapathy, Ankur Bapna, Min Ma, Wei Han, Vera Axelrod, Partha Talukdar
A study on the impact of Self-Supervised Learning on automatic dysarthric speech assessment
Xavier F. Cadet, Ranya Aloufi, Sara Ahmadi-Abhari, Hamed Haddadi
BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models
Marvin Lavechin, Yaya Sy, Hadrien Titeux, María Andrea Cruz Blandón, Okko Räsänen, Hervé Bredin, Emmanuel Dupoux, Alejandrina Cristia
Task-Agnostic Structured Pruning of Speech Representation Models
Haoyu Wang, Siyuan Wang, Wei-Qiang Zhang, Hongbin Suo, Yulong Wan
Speech Translation with Foundation Models and Optimal Transport: UPC at IWSLT23
Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, Marta R. Costa-jussà