Audio-Visual Diarization

Audio-visual diarization (AVD) aims to determine who is speaking and when in an audio-visual recording, a task that is especially challenging in unconstrained, "in-the-wild" settings. Recent research seeks to improve AVD accuracy with techniques such as heterogeneous graph learning and pre-trained video models such as VideoMAE, often combined with late fusion of the audio and visual streams to handle recordings with multiple speakers and varying audio-visual conditions. These advances are moving the field toward more robust and accurate AVD systems, with implications for applications such as meeting transcription, video indexing, and accessibility technologies.
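
To make the late-fusion idea concrete, the sketch below combines per-frame speaking scores from two independent streams for a single candidate speaker. It is only an illustration under stated assumptions: the function names (`late_fusion`, `active_frames`), the weighted-average rule, the `audio_weight` parameter, and the 0.5 threshold are all hypothetical and are not taken from any particular AVD system, which may fuse scores in a learned or more elaborate way.

```python
from typing import List


def late_fusion(
    audio_scores: List[float],
    visual_scores: List[float],
    audio_weight: float = 0.5,
) -> List[float]:
    """Fuse per-frame speaking scores from two separately computed streams.

    Assumption: each score is a probability-like value in [0, 1] for the same
    speaker and the same frame index. The weighted average used here is one
    simple fusion rule chosen for illustration.
    """
    if len(audio_scores) != len(visual_scores):
        raise ValueError("audio and visual streams must cover the same frames")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return [
        audio_weight * a + visual_weight * v
        for a, v in zip(audio_scores, visual_scores)
    ]


def active_frames(fused_scores: List[float], threshold: float = 0.5) -> List[int]:
    """Return the indices of frames whose fused score reaches the threshold."""
    return [i for i, score in enumerate(fused_scores) if score >= threshold]


# Hypothetical scores for one speaker over four frames.
audio = [0.9, 0.8, 0.2, 0.1]
visual = [0.7, 0.4, 0.6, 0.0]
fused = late_fusion(audio, visual, audio_weight=0.5)
speaking = active_frames(fused, threshold=0.5)
```

Because each stream is scored on its own before the scores are combined, a design like this can fall back on one modality when the other is unreliable, for example by lowering `audio_weight` in noisy audio or raising it when the face is occluded.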

Papers