Audio Visual Diarization
Audio-visual diarization (AVD) aims to determine who is speaking and when in audio-visual recordings, a task that is particularly challenging in unconstrained "in-the-wild" settings. Recent research seeks to improve AVD accuracy through techniques such as heterogeneous graph learning and pre-trained video models such as VideoMAE, often combined with late fusion of the audio and visual streams to handle multiple speakers and varying audio-visual conditions. These advances are making AVD systems more robust and accurate, with implications for applications such as meeting transcription, video indexing, and accessibility technologies.
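To make the idea of late fusion concrete, the following is a minimal sketch rather than the method of any particular paper. It assumes each modality has already produced a per-frame speaking score in [0, 1] for every tracked speaker. The function names (`late_fuse`, `scores_to_segments`, `diarize`), the weighted-average fusion rule, and the 0.5 threshold are all illustrative assumptions.

```python
from typing import Dict, List, Tuple


def late_fuse(audio: List[float], visual: List[float],
              audio_weight: float = 0.5) -> List[float]:
    """Fuse per-frame scores from two independently trained streams.

    Assumption: a simple weighted average; real systems may instead
    learn the fusion weights or use a small classifier.
    """
    if len(audio) != len(visual):
        raise ValueError("audio and visual scores must cover the same frames")
    return [audio_weight * a + (1.0 - audio_weight) * v
            for a, v in zip(audio, visual)]


def scores_to_segments(scores: List[float],
                       threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Turn per-frame scores into (start_frame, end_frame) speech segments.

    end_frame is exclusive. The threshold is a hypothetical default.
    """
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i                      # a speech segment begins
        elif score < threshold and start is not None:
            segments.append((start, i))    # the segment ends before frame i
            start = None
    if start is not None:                  # segment runs to the last frame
        segments.append((start, len(scores)))
    return segments


def diarize(audio: Dict[str, List[float]], visual: Dict[str, List[float]],
            audio_weight: float = 0.5,
            threshold: float = 0.5) -> Dict[str, List[Tuple[int, int]]]:
    """Answer 'who spoke when' as frame-index segments per speaker."""
    return {
        speaker: scores_to_segments(
            late_fuse(audio[speaker], visual[speaker], audio_weight),
            threshold,
        )
        for speaker in audio
    }


if __name__ == "__main__":
    # Made-up scores for two speakers over four video frames.
    audio_scores = {"A": [0.9, 0.8, 0.2, 0.1], "B": [0.1, 0.2, 0.7, 0.9]}
    visual_scores = {"A": [0.7, 0.6, 0.1, 0.1], "B": [0.1, 0.1, 0.9, 0.8]}
    print(diarize(audio_scores, visual_scores))
```

Because each stream is scored separately and combined only at the decision level, one modality can compensate when the other is unreliable, for example when a speaker's face is occluded or the audio is noisy.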