Speech Foundation Model
Speech foundation models are large neural networks pre-trained on vast amounts of unlabeled speech to learn generalizable representations that adapt efficiently to a range of downstream tasks. Current research emphasizes improving performance in challenging scenarios such as child speech, noisy environments, and low-resource languages, often through parameter-efficient fine-tuning and model ensembles, with architectures such as Whisper, Wav2Vec2, and HuBERT playing prominent roles. This work is significant for its potential to advance applications in healthcare (mental-health diagnosis, speech-disorder assessment), accessibility (improved speech recognition for diverse populations), and security (deepfake detection).
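The parameter-efficient fine-tuning mentioned above can be sketched in a few lines: freeze the pre-trained backbone and train only a small bottleneck adapter. The sketch below uses a toy two-layer stand-in for the speech encoder and hypothetical dimensions; it is not the method of any specific paper listed here, just an illustration of why so few parameters end up trainable.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter with a residual connection (hypothetical sizes)."""
    def __init__(self, dim: int = 512, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, dim)    # project back up
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

# Stand-in for a pre-trained speech encoder (a real one would be Whisper,
# Wav2Vec2, HuBERT, etc.); all of its weights are frozen.
backbone = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))
for p in backbone.parameters():
    p.requires_grad = False

adapter = Adapter()  # only these parameters would be updated during fine-tuning
trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
total = sum(p.numel() for p in backbone.parameters()) + trainable
print(f"trainable params: {trainable} / {total} ({100 * trainable / total:.1f}%)")
# → trainable params: 16912 / 542224 (3.1%)
```

Even in this toy setup, the adapter accounts for only about 3% of all parameters, which is the core appeal of the approach when the backbone has hundreds of millions of weights.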
Papers
Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models
Hanin Atwany, Abdul Waheed, Rita Singh, Monojit Choudhury, Bhiksha Raj
MBZUAI ● Carnegie Mellon University

On the Robust Approximation of ASR Metrics
Abdul Waheed, Hanin Atwany, Rita Singh, Bhiksha Raj
Carnegie Mellon University ● MBZUAI
Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection

Beomseok Lee, Marco Gaido, Ioan Calapodescu, Laurent Besacier, Matteo Negri

Towards a Speech Foundation Model for Singapore and Beyond
Muhammad Huzaifah, Tianchi Liu, Hardik B. Sailor, Kye Min Tan, Tarun K. Vangani, Qiongqiong Wang, Jeremy H. M. Wong, Nancy F. Chen, Ai Ti Aw