WavLM Speech Encoder
WavLM is a large-scale, self-supervised speech encoder that learns general-purpose representations directly from raw audio waveforms. Current research applies its pre-trained features to downstream tasks such as speaker diarization, speech spoofing detection, and speech emotion recognition. It is often combined with other models such as Conformers, or used with techniques such as attentive merging of hidden embeddings, which combines the outputs of several encoder layers rather than relying on a single layer. Because the pre-trained encoder is readily available and robust, it is widely used in speech processing research and has been reported to improve accuracy and efficiency across a range of applications, particularly where labeled data is scarce.
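As a rough illustration of what merging hidden embeddings can look like, the sketch below combines per-layer feature vectors using softmax-normalized weights. This is a simplified, assumed form of layer-wise weighted pooling, not the method of any paper listed on this page. The names `softmax` and `merge_layers` and the toy inputs are hypothetical; a real system would learn the scores and operate on WavLM's actual hidden states.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def merge_layers(layer_embeddings, layer_scores):
    """Weighted sum of per-layer embeddings.

    layer_embeddings: list of equal-length vectors, one per encoder layer.
    layer_scores: one scalar score per layer (learned in practice; fixed here).
    """
    if len(layer_embeddings) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    dim = len(layer_embeddings[0])
    merged = [0.0] * dim
    for weight, embedding in zip(weights, layer_embeddings):
        for i in range(dim):
            merged[i] += weight * embedding[i]
    return merged

# Toy stand-ins for three layers of 4-dimensional hidden states (assumed data).
layers = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
# Equal scores give equal weights, so the result is a plain average of the layers.
merged = merge_layers(layers, [0.0, 0.0, 0.0])
```

With equal scores the merge reduces to averaging; raising one layer's score shifts the merged vector toward that layer, which is the basic idea behind letting a model decide which layers matter for a given task.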
Papers
SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models
Chun Yin, Tai-Shih Chi, Yu Tsao, Hsin-Min Wang
Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection
Zihan Pan, Tianchi Liu, Hardik B. Sailor, Qiongqiong Wang