Diarization Performance

Speaker diarization is the task of determining who spoke when in an audio recording, and research in this area aims to make it both more accurate and more efficient. Current work focuses on refining end-to-end neural diarization (EEND) models, often combining them with techniques such as vector clustering and attractors. It also explores generative methods and the joint use of speech separation and voice activity detection. Diarization accuracy matters for downstream applications such as speech recognition. Recent studies point to two further directions: using large language models to correct diarization output in post-processing, and using multimodal data (e.g., video) to improve performance.
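Diarization accuracy is usually reported as the diarization error rate (DER): missed speech, false-alarm speech, and speaker confusion, summed and divided by the total reference speech. DER is not defined in the text above. The sketch below is a minimal, hypothetical illustration of how it is computed. It assumes per-frame labels with at most one speaker per frame (`None` for non-speech), no scoring collar, and no overlapping speech. It also uses a brute-force search for the best hypothesis-to-reference speaker mapping. The function name `frame_der` is illustrative only. Standard scoring tools work on time segments, handle overlap, and typically use the Hungarian algorithm for the mapping.

```python
from itertools import permutations
from typing import Hashable, Optional, Sequence

Label = Optional[Hashable]  # None marks a non-speech frame (assumption of this sketch)


def frame_der(ref: Sequence[Label], hyp: Sequence[Label]) -> float:
    """Hypothetical frame-level DER: (miss + false alarm + confusion) / reference speech."""
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    total = sum(r is not None for r in ref)
    if total == 0:
        raise ValueError("reference contains no speech")

    pairs = list(zip(ref, hyp))
    miss = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)
    both_speech = sum(r is not None and h is not None for r, h in pairs)

    # Distinct speaker labels, in order of first appearance.
    ref_spk = list(dict.fromkeys(r for r in ref if r is not None))
    hyp_spk = list(dict.fromkeys(h for h in hyp if h is not None))

    # Pad with None so a hypothesis speaker may stay unmapped; each reference
    # speaker can receive at most one hypothesis speaker (one-to-one mapping).
    targets = ref_spk + [None] * len(hyp_spk)
    best_correct = 0
    for perm in permutations(targets, len(hyp_spk)):  # brute force: small speaker counts only
        mapping = dict(zip(hyp_spk, perm))
        correct = sum(
            r is not None and h is not None and mapping[h] == r for r, h in pairs
        )
        best_correct = max(best_correct, correct)

    # Frames where both sides detect speech but the mapped speaker is wrong.
    confusion = both_speech - best_correct
    return (miss + false_alarm + confusion) / total


# Usage: 4 reference speech frames, 1 false alarm, 1 confused frame.
ref = ["A", "A", "B", "B", None]
hyp = ["x", "x", "x", "y", "y"]
print(frame_der(ref, hyp))  # 0.5
```

In this usage example, the best mapping is `x -> A`, `y -> B`. The third frame is then a speaker confusion and the fifth is a false alarm, giving (0 + 1 + 1) / 4 = 0.5.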

Papers