Paper ID: 2307.15400

The FlySpeech Audio-Visual Speaker Diarization System for MISP Challenge 2022

Li Zhang, Huan Zhao, Yue Li, Bowen Pang, Yannan Wang, Hongji Wang, Wei Rao, Qing Wang, Lei Xie

This paper describes the FlySpeech speaker diarization system submitted to the second \textbf{M}ultimodal \textbf{I}nformation Based \textbf{S}peech \textbf{P}rocessing~(\textbf{MISP}) Challenge held in ICASSP 2022. We develop an end-to-end audio-visual speaker diarization~(AVSD) system, which consists of a lip encoder, a speaker encoder, and an audio-visual decoder. Specifically, to mitigate the degradation of diarization performance caused by separate training, we jointly train the speaker encoder and the audio-visual decoder. In addition, we leverage the large-data pretrained speaker extractor to initialize the speaker encoder.

Submitted: Jul 28, 2023