Speaker Diarization
Speaker diarization is the task of identifying "who spoke when" in an audio recording, a crucial preprocessing step for many speech applications. Current research focuses on improving accuracy and efficiency, particularly in challenging scenarios such as multi-speaker conversations and noisy environments, using techniques such as end-to-end neural networks, spectral clustering, and the integration of audio-visual or semantic information. These advances are driving progress in areas like meeting transcription and multilingual speech processing, and improving the performance of downstream tasks such as automatic speech recognition.
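To make the clustering-based approach mentioned above concrete, here is a minimal sketch of spectral clustering applied to segment-level speaker embeddings, so that each cluster corresponds to one speaker. The embeddings are random stand-ins (in practice they would come from a speaker-embedding model such as x-vectors or ECAPA-TDNN), and the number of speakers is assumed known here for simplicity; this is an illustration of the general technique, not any specific system from the papers listed below.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)

# Fake embeddings for 12 one-second segments from 3 hypothetical speakers.
n_speakers = 3
segments_per_speaker = 4
centers = rng.normal(size=(n_speakers, 128))
embeddings = np.vstack([
    center + 0.1 * rng.normal(size=(segments_per_speaker, 128))
    for center in centers
])

# Cosine-similarity affinity matrix between segments.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
affinity = np.clip(normed @ normed.T, 0.0, 1.0)  # keep affinities non-negative

# Spectral clustering on the precomputed affinity assigns a speaker label
# ("who") to each segment ("when").
clusterer = SpectralClustering(
    n_clusters=n_speakers,
    affinity="precomputed",
    random_state=0,
)
labels = clusterer.fit_predict(affinity)

for i, label in enumerate(labels):
    print(f"segment {i:2d} ({i}.0s-{i + 1}.0s): speaker {label}")
```

In a real pipeline the number of clusters is usually not known in advance and is estimated, for example, from the eigenvalue spectrum of the affinity matrix; the fixed `n_speakers` here is only to keep the sketch short.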
Papers
Generation of Speaker Representations Using Heterogeneous Training Batch Assembly
Yu-Huai Peng, Hung-Shin Lee, Pin-Tuan Huang, Hsin-Min Wang
Multi-target Extractor and Detector for Unknown-number Speaker Diarization
Chin-Yi Cheng, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang
Multi-scale Speaker Diarization with Dynamic Scale Weighting
Tae Jin Park, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg
Using Active Speaker Faces for Diarization in TV shows
Rahul Sharma, Shrikanth Narayanan
The USTC-Ximalaya system for the ICASSP 2022 multi-channel multi-party meeting transcription (M2MeT) challenge
Maokui He, Xiang Lv, Weilin Zhou, JingJing Yin, Xiaoqi Zhang, Yuxuan Wang, Shutong Niu, Yuhang Cao, Heng Lu, Jun Du, Chin-Hui Lee
Royalflush Speaker Diarization System for ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge
Jingguang Tian, Xinhui Hu, Xinkang Xu