Speaker-Attributed Automatic Speech Recognition

Speaker-attributed automatic speech recognition (SA-ASR) aims to recognize not only the words spoken in a conversation but also *who* spoke each word, which makes transcripts of multi-speaker recordings more natural and informative. Current research focuses largely on end-to-end models that integrate speaker diarization and speech recognition into a single system, often using Transformer-based architectures or non-autoregressive approaches such as Paraformer to improve speed and accuracy. These advances are improving the accuracy of multi-speaker transcription in real-world settings such as meetings, supporting more robust and efficient applications in areas such as meeting summarization and human-computer interaction.

Papers