Multi-Speaker Automatic Speech Recognition

Multi-speaker automatic speech recognition (ASR) aims to accurately transcribe recordings in which multiple speakers overlap, a challenging problem with significant real-world applications. Current research focuses on making ASR models more robust to overlapping speech and noise, often by using speech separation, attention mechanisms such as cross-channel attention, and non-autoregressive architectures such as Paraformer to improve speed and accuracy. This work is driven by the need for more efficient and accurate transcription of meetings and multi-party conversations, with applications ranging from voice assistants to meeting summarization.

Papers