Speech Driven

Speech-driven research develops computational models that process and understand spoken language, covering tasks such as speech recognition, speaker identification, and emotion detection. Current work emphasizes multi-task learning frameworks, often built on transformer-based architectures and diffusion models, to make these models more robust and efficient across diverse scenarios and languages. The field matters for advancing human-computer interaction, improving accessibility for people with communication challenges, and enabling applications such as personalized healthcare and virtual assistants.

Papers

September 20, 2023

FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion
Stefan Stan, Kazi Injamamul Haque, Zerrin Yumak
Diffusion Explainer Facial Animation Speech Driven 3D Facial Animation Synthesis

September 19, 2023

Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition
Krishna C. Puvvada, Nithin Rao Koluguri, Kunal Dhawan, Jagadeesh Balam, Boris Ginsburg
Speech Recognition Mel Spectrogram Speech Driven Audio Token Speech Tokenization Hierarchical Token Semantic Audio Transformer Discrete Audio Representation

July 19, 2023

An analysis on the effects of speaker embedding choice in non auto-regressive TTS
Adriana Stan, Johannah O'Mahony
General Analysis Mixed Effect Speech Synthesis Speech Quality Speaker Identity Value Laden Choice Speech Foundation Model Speech Driven Non Autoregressive Text to Speech

April 9, 2023

An investigation of phrase break prediction in an End-to-End TTS system
Anandaswarup Vadapalli
Language Model End to End Comprehensive Investigation Speech Driven

March 1, 2023

DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments
Shikha Baghel, Shreyas Ramoji, Sidharth, Ranjana H, Prachi Singh, Somil Jain, Pratik Roy Chowdhuri, Kaustubh Kulkarni, Swapnil Padhi, Deepu Vijayasenan, Sriram Ganapathy
Human Language Speaker Diarization Code Mixed Speech Driven Refined Diarization Core Challenge Language Diarization

February 24, 2023

Phone and speaker spatial organization in self-supervised speech representations
Pablo Riera, Manuela Cerdeiro, Leonardo Pepino, Luciana Ferrer
Self Supervised Speech Representation Representational Similarity Speech Driven Speech Segment Spatial Structure

February 23, 2023

Incorporating Uncertainty from Speaker Embedding Estimation to Speaker Verification
Qiongqiong Wang, Kong Aik Lee, Tianchi Liu
High Uncertainty Anticipation Estimation Task Speaker Verification Discriminant Analysis Uncertainty Propagation Speech Driven Posterior Covariance

February 20, 2023

Personalized speech enhancement combining band-split RNN and speaker attentive module
Xiaohuai Le, Li Chen, Chao He, Yiqing Guo, Cheng Chen, Xianjun Xia, Jing Lu
Speech Enhancement Attention Module Speaker Information Speech Driven Speech Enhancement Model Personalized Speech Enhancement Signal Processing Grand Challenge

February 18, 2023

Speaker and Language Change Detection using Wav2vec2 and Whisper
Tijn Berns, Nik Vaessen, David A. van Leeuwen
Automatic Speech Recognition Transformer Network Speaker Recognition Speaker Identity State of the Art Whisper Wav2vec U Speech Driven Pre Trained Network Speaker Change Detection

January 29, 2023

Time out of Mind: Generating Rate of Speech conditioned on emotion and speaker
Navjot Kaur, Paige Tuttosi
Speech Analysis Text to Speech Speech Synthesis Human Mind Underlying Emotion Expressive Speech Speech Driven Utterance Length

December 13, 2022

InferEM: Inferring the Speaker's Intention for Empathetic Dialogue Generation
Guoqing Lv, Jiang Li, Xiaoping Wang, Zhigang Zeng
Response Generation Human Intent Empathetic Dialogue Empathetic Response Generation Speech Driven Empathetic Response

November 9, 2022

Absolute decision corrupts absolutely: conservative online speaker diarisation
Youngki Kwon, Hee-Soo Heo, Bong-Jin Lee, You Jin Kim, Jee-weon Jung
Speaker Diarization Speech Driven Cluster Label

October 31, 2022

Wespeaker: A Research and Production oriented Speaker Embedding Learning Toolkit
Hongji Wang, Chengdong Liang, Shuai Wang, Zhengyang Chen, Binbin Zhang, Xu Xiang, Yanlei Deng, Yanmin Qian
Speaker Verification DH Research Speaker Diarization Speaker Recognition Easy to Use Toolkit Speech Driven Production Incident Speaker Modeling

October 19, 2022

Speaker- and Age-Invariant Training for Child Acoustic Modeling Using Adversarial Multi-Task Learning
Mostafa Shahin, Beena Ahmed, Julien Epps
Multi Task Speech Recognition System Speech Driven Child Speech Age Invariant Face Recognition

October 15, 2022

Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations
Themos Stafylakis, Ladislav Mosner, Sofoklis Kakouros, Oldrich Plchot, Lukas Burget, Jan Cernocky
Self Supervised Learning Speech Representation Self Supervised Speech Model Speech Driven Speech Processing Task Channel Correlation

June 26, 2022

Transport-Oriented Feature Aggregation for Speaker Embedding Learning
Yusheng Tian, Jingyu Li, Tan Lee
LeArning Abstract Speaker Verification Speaker Embeddings Feature Aggregation Speech Driven Speaker Modeling

April 4, 2022

An Initialization Scheme for Meeting Separation with Spatial Mixture Models

On The Model Size Selection For Speaker Identification

March 17, 2022

TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding
Ruiteng Zhang, Jianguo Wei, Xugang Lu, Wenhuan Lu, Di Jin, Junhai Xu, Lin Zhang, Yantao Ji, Jianwu Dang
Speaker Verification Speaker Embeddings Time Scale Dual Branch Speech Driven Discriminative Reply

December 23, 2021

S+PAGE: A Speaker and Position-Aware Graph Neural Network Model for Emotion Recognition in Conversation
Chen Liang, Chong Yang, Jing Xu, Juyang Huang, Yongliang Wang, Yang Dong
Graph Neural Network Emotion Recognition Potential Conversation Outcome Speech Driven Stream Transformer Cross Speaker Emotion Transfer Conversation Graph