Diarization System

Speaker diarization aims to identify "who spoke when" in an audio recording, a crucial preprocessing step for various speech applications. Current research pursues more efficient and accurate systems along two lines: modular approaches (combining embedding extraction, clustering, and other modules) and end-to-end neural models (such as transformers and models based on the Mask2Former architecture) that predict speaker labels directly. These advances are improving diarization accuracy, particularly on overlapping speech and recordings with multiple speakers, which in turn leads to better performance in downstream tasks such as speech recognition and meeting transcription.
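To make the modular approach concrete, here is a minimal sketch of its clustering stage in Python. Everything in it is an illustrative assumption rather than any particular system's method: the 2-D "embeddings" are made-up toy vectors (a real pipeline would extract them with a neural speaker encoder), the greedy threshold clustering stands in for the agglomerative or spectral clustering such pipelines typically use, and the names `cluster_segments`, `merge_turns`, and the `threshold` value are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_segments(embeddings, threshold=0.8):
    """Greedy clustering: assign each embedding to the most similar
    existing speaker, or open a new speaker if none is similar enough.
    (Hypothetical helper; the threshold is an arbitrary toy value.)"""
    centroids = []  # running sum of the embeddings assigned to each speaker
    labels = []
    for emb in embeddings:
        sims = [cosine(emb, c) for c in centroids]
        if sims and max(sims) >= threshold:
            k = sims.index(max(sims))
            centroids[k] = [c + e for c, e in zip(centroids[k], emb)]
        else:
            k = len(centroids)
            centroids.append(list(emb))
        labels.append(k)
    return labels


def merge_turns(segments, labels):
    """Merge back-to-back segments with the same label into speaker turns."""
    turns = []
    for (start, end), label in zip(segments, labels):
        if turns and turns[-1][2] == label and turns[-1][1] == start:
            turns[-1] = (turns[-1][0], end, label)
        else:
            turns.append((start, end, label))
    return turns


# Toy input: four fixed-length segments (start, end in seconds) and one
# made-up embedding per segment.
segments = [(0.0, 1.5), (1.5, 3.0), (3.0, 4.5), (4.5, 6.0)]
embeddings = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (1.0, 0.0)]

labels = cluster_segments(embeddings)
turns = merge_turns(segments, labels)

for start, end, speaker in turns:
    print(f"{start:.1f}-{end:.1f}s: speaker {speaker}")
```

Because each segment receives exactly one label, a sketch like this cannot represent overlapping speech, which is one of the limitations that end-to-end models predicting per-frame, per-speaker activity are meant to address.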

Papers