Active Speaker Detection
Active speaker detection (ASD) aims to identify, at each moment, which visible person in a video is speaking, a crucial task for applications like video conferencing, human-robot interaction, and automatic video editing. Current research emphasizes improving the robustness and efficiency of ASD systems, particularly in noisy environments and with multiple speakers, and often employs deep learning models such as transformers, convolutional recurrent neural networks (CRNNs), and graph neural networks (GNNs) that integrate audio and visual information. These advances are driving progress in real-time processing, handling occlusions and off-screen speakers, and improving accuracy across diverse datasets and challenging conditions, with impact in fields ranging from assistive technologies to media analysis.
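To make the audio-visual fusion idea concrete, below is a minimal sketch of a per-frame ASD classifier in PyTorch: a small audio encoder over log-mel features, a small CNN over face crops, and a transformer encoder over the concatenated streams. This is an illustrative skeleton under assumed input shapes, not the architecture of any paper listed here; all names (e.g., AVSpeakerDetector) are hypothetical.

```python
# A minimal sketch of audio-visual fusion for active speaker detection,
# assuming PyTorch. Architecture and names are illustrative only.
import torch
import torch.nn as nn


class AVSpeakerDetector(nn.Module):
    """Per-frame binary classifier: is the on-screen face speaking?

    Audio: log-mel frames -> 1D conv encoder.
    Video: face crops -> small 2D CNN, applied per frame.
    Fusion: concatenate both streams, model temporal context with a
    transformer encoder, and emit one speaking/not-speaking logit per frame.
    """

    def __init__(self, n_mels: int = 40, d_model: int = 128):
        super().__init__()
        # Audio encoder: treats mel bins as channels over time.
        self.audio_enc = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Video encoder: applied to each grayscale face crop independently.
        self.video_enc = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, d_model),
        )
        # Temporal fusion over the concatenated audio+video features.
        layer = nn.TransformerEncoderLayer(
            d_model=2 * d_model, nhead=4, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(2 * d_model, 1)

    def forward(self, mel: torch.Tensor, faces: torch.Tensor) -> torch.Tensor:
        # mel:   (batch, n_mels, T)   log-mel features, one column per frame
        # faces: (batch, T, 1, H, W)  face crops aligned with the audio frames
        a = self.audio_enc(mel).transpose(1, 2)                  # (B, T, d)
        b, t = faces.shape[:2]
        v = self.video_enc(faces.flatten(0, 1)).view(b, t, -1)   # (B, T, d)
        fused = self.fusion(torch.cat([a, v], dim=-1))           # (B, T, 2d)
        return self.head(fused).squeeze(-1)                      # per-frame logits


if __name__ == "__main__":
    model = AVSpeakerDetector()
    mel = torch.randn(2, 40, 25)           # 2 clips, 25 aligned frames
    faces = torch.randn(2, 25, 1, 64, 64)  # matching face crops
    print(model(mel, faces).shape)         # torch.Size([2, 25])
```

In a multi-speaker scene, a model like this would be run once per detected face track, with scores compared across tracks to pick the active speaker; real systems add stronger backbones, cross-attention fusion, or graph modeling over speaker tracks.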
Papers
Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech
Junjie Li, Ruijie Tao, Zexu Pan, Meng Ge, Shuai Wang, Haizhou Li
A Real-Time Active Speaker Detection System Integrating an Audio-Visual Signal with a Spatial Querying Mechanism
Ilya Gurvich, Ido Leichter, Dharmendar Reddy Palle, Yossi Asher, Alon Vinnikov, Igor Abramovski, Vishak Gopal, Ross Cutler, Eyal Krupka