Overlapped Speech
Overlapped speech, the simultaneous utterance of multiple speakers, remains a significant challenge for automatic speech recognition (ASR). Current research focuses on end-to-end models, typically built on Connectionist Temporal Classification (CTC) or attention-based encoder-decoder architectures with serialized output training (SOT), that jointly separate and transcribe overlapping speech, often in combination with speaker diarization (determining "who spoke when"). These advances aim to improve ASR accuracy in real-world scenarios such as meetings and conversations, benefiting applications from human-computer interaction to social science research through more reliable transcription and analysis of multi-speaker audio.
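To make the SOT idea mentioned above concrete, the sketch below shows one common way of constructing serialized training targets for overlapped speech: each speaker's reference transcript is ordered by utterance start time and concatenated with a speaker-change token, so a single decoder can emit every speaker's words in one output sequence. This is a minimal illustration, not code from any of the listed papers; the Utterance dataclass and the token names <sc> and <eos> are assumptions chosen for clarity.

```python
from dataclasses import dataclass
from typing import List

SC = "<sc>"    # speaker-change token separating speakers in the serialized target
EOS = "<eos>"  # end-of-sequence token

@dataclass
class Utterance:
    speaker: str
    start_time: float   # seconds from the start of the mixed recording
    tokens: List[str]   # reference transcription, already tokenized

def build_sot_target(utterances: List[Utterance]) -> List[str]:
    """Serialize overlapping references into one target sequence.

    Utterances are sorted by start time (a first-in, first-out convention)
    and joined with a speaker-change token, producing a single label
    sequence for an encoder-decoder model trained on overlapped audio.
    """
    ordered = sorted(utterances, key=lambda u: u.start_time)
    target: List[str] = []
    for i, utt in enumerate(ordered):
        if i > 0:
            target.append(SC)
        target.extend(utt.tokens)
    target.append(EOS)
    return target

if __name__ == "__main__":
    mixture = [
        Utterance("spk2", 1.3, ["sure", "go", "ahead"]),
        Utterance("spk1", 0.0, ["can", "you", "hear", "me"]),
    ]
    print(build_sot_target(mixture))
    # ['can', 'you', 'hear', 'me', '<sc>', 'sure', 'go', 'ahead', '<eos>']
```

Because the serialization order is fixed by start time, the model learns a deterministic output order for overlapping speakers, avoiding the permutation ambiguity that separation-based approaches must otherwise resolve.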
Papers
Multitask Detection of Speaker Changes, Overlapping Speech and Voice Activity Using wav2vec 2.0
Marie Kunešová, Zbyněk Zajíc
In search of strong embedding extractors for speaker diarisation
Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesung Huh, Andrew Brown, Youngki Kwon, Shinji Watanabe, Joon Son Chung