Speech Foundation Model
Speech foundation models are large, pre-trained neural networks that learn generalizable representations from vast amounts of unlabeled speech, enabling efficient adaptation to a wide range of downstream tasks. Current research focuses on improving performance in challenging conditions such as child speech, noisy environments, and low-resource languages, often through parameter-efficient fine-tuning and model ensembles, with architectures such as Whisper, Wav2Vec2, and HuBERT playing prominent roles. This work is significant for its potential to advance applications in healthcare (mental health diagnosis, speech disorder assessment), accessibility (improved speech recognition for diverse populations), and security (deepfake detection).
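The parameter-efficient fine-tuning mentioned above typically adapts a frozen foundation model by training only a tiny set of added parameters. A minimal NumPy sketch of one popular variant, low-rank adaptation (LoRA), illustrates the idea on a single hypothetical projection layer (the dimensions, `rank`, and `alpha` here are illustrative choices, not values from any specific model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen weight matrix standing in for one projection
# layer of a pre-trained speech model (illustrative sizes only).
d_in, d_out, rank = 512, 512, 8
W = rng.standard_normal((d_in, d_out))        # frozen pre-trained weights

# Low-rank adapter factors: the ONLY trainable parameters.
# B starts at zero so the adapted layer initially equals the frozen one.
A = rng.standard_normal((d_in, rank)) * 0.01
B = np.zeros((rank, d_out))

def forward(x, W, A, B, alpha=16):
    """Adapted forward pass: frozen W plus a scaled low-rank update."""
    return x @ W + (alpha / rank) * (x @ A @ B)

x = rng.standard_normal((4, d_in))

full = d_in * d_out                  # parameters in the frozen layer
adapter = rank * (d_in + d_out)      # parameters actually trained

# With B = 0 the adapted layer reproduces the frozen layer exactly.
assert np.allclose(forward(x, W, A, B), x @ W)
print(f"trainable fraction: {adapter / full:.3%}")  # → 3.125%
```

Only `A` and `B` receive gradient updates during adaptation, so the trainable parameter count scales with `rank * (d_in + d_out)` rather than `d_in * d_out`, which is what makes fine-tuning large speech models tractable on modest hardware.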
Papers
Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection
Beomseok Lee, Marco Gaido, Ioan Calapodescu, Laurent Besacier, Matteo Negri
Towards a Speech Foundation Model for Singapore and Beyond
Muhammad Huzaifah, Tianchi Liu, Hardik B. Sailor, Kye Min Tan, Tarun K. Vangani, Qiongqiong Wang, Jeremy H. M. Wong, Nancy F. Chen, Ai Ti Aw