Speaker Embeddings
Speaker embeddings are fixed-dimensional numerical representations of a speaker's voice, designed to capture the vocal characteristics that distinguish one speaker from another for tasks such as speaker recognition, diarization, and speech synthesis. Current research focuses on making embeddings robust to noise and recording variation (e.g., through disentanglement techniques and adversarial training), on improving their utility in multi-speaker scenarios (e.g., via recursive attention pooling and demultiplexing), and on integrating them with other models (e.g., large language models and speech enhancement systems). These advances improve the accuracy and efficiency of speech processing applications, enabling stronger privacy-preserving techniques and more natural-sounding speech synthesis.
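A common use of such embeddings is speaker verification: embeddings extracted from two utterances are compared with cosine similarity, and a calibrated threshold decides whether they come from the same speaker. The sketch below illustrates only this comparison step with random vectors standing in for real embeddings; the 192-dimensional size, the 0.7 threshold, and the perturbation model are illustrative assumptions, not values from any specific system.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(emb1: np.ndarray, emb2: np.ndarray, threshold: float = 0.7) -> bool:
    # The threshold is illustrative; real systems calibrate it on held-out trials.
    return cosine_similarity(emb1, emb2) >= threshold

# Stand-ins for embeddings produced by a trained speaker encoder.
rng = np.random.default_rng(0)
anchor = rng.normal(size=192)                 # embedding of an enrollment utterance
same = anchor + 0.1 * rng.normal(size=192)    # small perturbation: same speaker
diff = rng.normal(size=192)                   # independent draw: different speaker

print(same_speaker(anchor, same))
print(same_speaker(anchor, diff))
```

In practice the random vectors would be replaced by outputs of a pretrained speaker encoder, and the threshold would be tuned to trade off false accepts against false rejects.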
Papers
Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders
Jason Fong, Yun Wang, Prabhav Agrawal, Vimal Manohar, Jilong Wu, Thilo Köhler, Qing He
Hierarchical speaker representation for target speaker extraction
Shulin He, Huaiwen Zhang, Wei Rao, Kanghao Zhang, Yukai Ju, Yang Yang, Xueliang Zhang