Speaker Embeddings

Speaker embeddings are numerical representations of speakers' voices that aim to capture unique vocal characteristics for tasks such as speaker recognition, diarization, and speech synthesis. Current research focuses on making embeddings more robust to noise and other variation (e.g., through disentanglement techniques and adversarial training), making them more useful in multi-speaker scenarios (e.g., using recursive attention pooling and demultiplexing), and integrating them with other models (e.g., large language models and speech enhancement systems). These advances could improve the accuracy and efficiency of speech processing applications, including privacy-preserving techniques and more natural-sounding speech synthesis.
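As a rough illustration of how such numerical representations are commonly used for speaker recognition, the sketch below compares two embedding vectors with cosine similarity and accepts a match above a threshold. It is not taken from any of the papers listed here: the function names, the 4-dimensional toy vectors, and the 0.7 threshold are all assumptions chosen for readability. Real embeddings typically have hundreds of dimensions and come from a trained encoder.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)

def same_speaker(enrolled, probe, threshold=0.7):
    """Accept the probe if it is close enough to the enrolled embedding.

    The threshold is a hypothetical placeholder; in practice it would be
    tuned on held-out verification trials.
    """
    return cosine_similarity(enrolled, probe) >= threshold

# Toy vectors standing in for real embeddings (made-up values).
alice_enrolled = [0.9, 0.1, 0.3, 0.2]
alice_probe = [0.8, 0.2, 0.35, 0.1]
bob_probe = [0.1, 0.9, 0.2, 0.7]

print(same_speaker(alice_enrolled, alice_probe))
print(same_speaker(alice_enrolled, bob_probe))
```

Cosine scoring ignores vector length and compares direction only, which is one common reason it is used as a simple baseline for comparing embeddings.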

Papers

October 26, 2023

Controllable Generation of Artificial Speaker Embeddings through Discovery of Principal Directions
Florian Lux, Pascal Tilli, Sarina Meyer, Ngoc Thang Vu
Jina Embeddings Speech Synthesis Scientific Discovery Speaker Embeddings Interpretable Direction Controllable Generation Controllable Network Learning

October 25, 2023

Learning Repeatable Speech Embeddings Using An Intra-class Correlation Regularizer
Jianwei Zhang, Suren Jayasuriya, Visar Berisha
Contrastive Loss Speaker Verification Speaker Embeddings Attribute Correlation Inter Class Feature

October 18, 2023

DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification
Yuanyuan Wang, Yang Zhang, Zhiyong Wu, Zhihan Yang, Tao Wei, Kun Zou, Helen Meng
Speaker Verification Speaker Embeddings Speaker Similarity Semantic Augmentation Optimal Embeddings

October 10, 2023

Temporally Aligning Long Audio Interviews with Questions: A Case Study in Multimodal Data Integration
Piyush Singh Pasi, Karthikeya Battepati, Preethi Jyothi, Ganesh Ramakrishnan, Tanmay Mahapatra, Manoj Singh
Case Study Yes No Question Speaker Embeddings Automatic Speech Recognition Model Spoken Text Independent Phone to Audio Alignment Multimodal Data Integration

October 7, 2023

Conditional Diffusion Model for Target Speaker Extraction
Theodor Nguyen, Guangzhi Sun, Xianrui Zheng, Chao Zhang, Philip C Woodland
Conditional Diffusion Model Speaker Embeddings Score Based Generative Target Speaker Extraction

September 29, 2023

Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features
Yuxiang Zhang, Zhuo Li, Jingze Lu, Wenchao Wang, Pengyuan Zhang
Synthesized Speech Speaker Embeddings Product Distribution Temporal Consistency Synthetic Speech Detection

September 26, 2023

Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification
Hee-Soo Heo, KiHyun Nam, Bong-Jin Lee, Youngki Kwon, Minjae Lee, You Jin Kim, Joon Son Chung
Speaker Verification Speaker Embeddings Conversation Disentanglement Session Representation

September 23, 2023

Contrastive Speaker Embedding With Sequential Disentanglement
Youzhi Tu, Man-Wai Mak, Jen-Tzung Chien
Contrastive Learning Contrastive Loss Speaker Embeddings Contrastive Example Disentanglement Framework Discriminative Speaker

September 14, 2023

StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings
Arnab Das, Suhita Ghosh, Tim Polzehl, Sebastian Stober
Voice Conversion Speaker Embeddings Feature Embeddings Aware Loss

September 11, 2023

Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach
Tae Jin Park, Kunal Dhawan, Nithin Koluguri, Jagadeesh Balam
Speaker Diarization Speaker Embeddings Beam Search Diarization System

September 5, 2023

Personalized Adaptation with Pre-trained Speech Encoders for Continuous Emotion Recognition
Minh Tran, Yufeng Yin, Mohammad Soleymani
Pre Trained Emotion Recognition Speech Emotion Recognition Speaker Embeddings Robust Speaker Representation Personalized Adaptation

June 9, 2023

Speaker Embeddings as Individuality Proxy for Voice Stress Detection
Zihan Wu, Neil Scheidwasser-Clow, Karl El Hajal, Milos Cernak
Speaker Embeddings Audio Embeddings Voice Stress

June 1, 2023

May 23, 2023

Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization
Marc Delcroix, Naohiro Tawara, Mireia Diez, Federico Landini, Anna Silnova, Atsunori Ogawa, Tomohiro Nakatani, Lukas Burget, Shoko Araki
End to End Speaker Embeddings Speaker Identity Diarization System

May 9, 2023

Zero-shot personalized lip-to-speech synthesis with face image based voice control
Zheng-Yan Sheng, Yang Ai, Zhen-Hua Ling
Zero Shot Face Image Speaker Embeddings Voice Based Lip to Speech Synthesis Voice Identity Lip to Speech

May 4, 2023

SI-LSTM: Speaker Hybrid Long-short Term Memory and Cross Modal Attention for Emotion Recognition in Conversation
Xingwei Liang, You Zou, Ruifeng Xu
Emotion Recognition Multimodal Data Speaker Embeddings Cross Modal Attention Cross Modality Potential Conversation Outcome

April 25, 2023

Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge
Chenpeng Du, Yiwei Guo, Feiyu Shen, Kai Yu
Challenge Task Speaker Embeddings Language Representation Text to Speech Model High Fidelity Vocoder

March 7, 2023

TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings
Christoph Boeddeker, Aswin Shanmugam Subramanian, Gordon Wichern, Reinhold Haeb-Umbach, Jonathan Le Roux
Speaker Diarization Speaker Embeddings Separation Performance Target Speaker Voice Activity Detection Meeting Transcription

Stuttering Detection Using Speaker Representations and Self-supervised Contextual Embeddings

Encoder-decoder multimodal speaker change detection

A Teacher-Student approach for extracting informative speaker embeddings from speech mixtures
