Speech Foundation Model
Speech foundation models are large neural networks pre-trained on vast amounts of unlabeled speech data to learn generalizable representations that adapt efficiently to downstream tasks. Current research emphasizes improving performance in challenging scenarios such as child speech, noisy environments, and low-resource languages, often through techniques like parameter-efficient fine-tuning and model ensembles, with architectures such as Whisper, Wav2Vec2, and HuBERT playing prominent roles. This work is significant for its potential to advance applications in healthcare (mental-health diagnosis, speech-disorder assessment), accessibility (improved speech recognition for diverse populations), and security (deepfake detection).
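To make the idea of parameter-efficient fine-tuning concrete, the sketch below shows a LoRA-style low-rank adapter on a single frozen projection matrix. This is a minimal illustration, not any specific paper's method: the dimensions, initialization, and layer are assumptions, and in a real speech foundation model (e.g., Whisper or Wav2Vec2) such adapters would be attached to many attention and feed-forward projections while the pretrained weights stay frozen.

```python
import numpy as np

# Illustrative LoRA-style adapter on one linear layer (assumed sizes).
rng = np.random.default_rng(0)
d, k, r = 768, 768, 8            # hidden size d, output size k, low rank r

W = rng.standard_normal((d, k))          # pretrained weight: frozen
A = rng.standard_normal((d, r)) * 0.01   # adapter factor: trainable
B = np.zeros((r, k))                     # adapter factor: trainable, zero-init

def forward(x):
    # Adapted projection: frozen base plus low-rank update x @ (A @ B).
    # With B initialized to zero, this starts out identical to the
    # pretrained layer, and only A and B receive gradient updates.
    return x @ W + x @ A @ B

x = rng.standard_normal((1, d))
y = forward(x)

full_params = W.size
adapter_params = A.size + B.size
# Only ~2% of the layer's parameters are actually tuned.
print(f"trainable fraction: {adapter_params / full_params:.4f}")
```

The zero initialization of `B` is the standard LoRA trick: fine-tuning begins exactly at the pretrained model's behavior, so adaptation cannot degrade the starting point before training moves the adapter weights.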
Papers
SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models
Chun Yin, Tai-Shih Chi, Yu Tsao, Hsin-Min Wang
Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions
Anfeng Xu, Kevin Huang, Tiantian Feng, Lue Shen, Helen Tager-Flusberg, Shrikanth Narayanan
Retrieval Augmented End-to-End Spoken Dialog Models
Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey
Speech foundation models in healthcare: Effect of layer selection on pathological speech feature prediction
Daniela A. Wiepert, Rene L. Utianski, Joseph R. Duffy, John L. Stricker, Leland R. Barnard, David T. Jones, Hugo Botha