Neural Diarization

Neural diarization aims to determine automatically who spoke when in an audio recording, a key step in applications such as meeting transcription and analysis. Current research focuses heavily on end-to-end neural models, often encoder-decoder architectures that use attractors to represent speakers. It also explores techniques such as self-supervised learning and powerset multi-class formulations to improve accuracy and robustness, particularly for overlapping speech and a varying number of speakers. These advances make automated speaker segmentation more efficient and accurate across the applications that depend on it.
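
To illustrate the powerset multi-class idea mentioned above, here is a minimal sketch in Python. It assumes frame-level binary speaker activity as input and a cap on how many speakers may overlap in one frame. The function names, the cap, and the toy data are illustrative assumptions, not taken from any particular paper or toolkit. Each allowed subset of speakers becomes a single class, so overlapping speech is predicted as one label per frame rather than as several independent per-speaker decisions.

```python
from itertools import combinations


def build_powerset_classes(num_speakers, max_overlap):
    """Enumerate every speaker subset with at most `max_overlap` members.

    Class 0 is the empty set (silence); single speakers follow, then pairs, etc.
    """
    classes = []
    for size in range(max_overlap + 1):
        classes.extend(combinations(range(num_speakers), size))
    return classes


def multilabel_to_powerset(frames, classes):
    """Map per-frame binary speaker activity to one class index per frame."""
    index = {subset: i for i, subset in enumerate(classes)}
    labels = []
    for frame in frames:
        active = tuple(s for s, on in enumerate(frame) if on)
        if active not in index:
            # Assumption: frames with more simultaneous speakers than the cap
            # are rejected here; a real system would need a policy for them.
            raise ValueError(f"frame {frame} exceeds the overlap cap")
        labels.append(index[active])
    return labels


def powerset_to_multilabel(labels, classes, num_speakers):
    """Invert the mapping: class index back to binary speaker activity."""
    frames = []
    for label in labels:
        frame = [0] * num_speakers
        for speaker in classes[label]:
            frame[speaker] = 1
        frames.append(frame)
    return frames


# Toy example: 3 speakers, at most 2 speaking at once.
classes = build_powerset_classes(num_speakers=3, max_overlap=2)
frames = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
labels = multilabel_to_powerset(frames, classes)
recovered = powerset_to_multilabel(labels, classes, num_speakers=3)
```

With 3 speakers and at most 2 overlapping, this yields 7 classes (silence, 3 single speakers, 3 pairs), and the round trip through class indices recovers the original activity matrix. A model trained under this formulation would output one softmax over those classes per frame.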

Papers