Speech Recognition
Automatic speech recognition (ASR) aims to transcribe spoken language into text, with current research heavily focused on improving accuracy and robustness across diverse conditions. This work explores a range of model architectures, including transformers, conformers, and large language models (LLMs), often incorporating techniques such as connectionist temporal classification (CTC), attention mechanisms, and audio-visual multimodal integration. Significant effort is also dedicated to the challenges of low-resource languages and noisy environments, as well as to enhancing accessibility for individuals with speech impairments. Advances in ASR have broad implications for numerous applications, from virtual assistants and transcription services to accessibility tools for people with disabilities and cross-lingual communication.
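Since several of the papers below build on CTC-based training, a minimal sketch may help make the technique concrete: the snippet applies PyTorch's nn.CTCLoss to hypothetical frame-level encoder outputs. All shapes, sizes, and random tensors here are placeholders chosen for illustration, not details drawn from any of the listed papers.

```python
import torch
import torch.nn as nn

# Illustrative sizes only: T input frames, N utterances, C output
# symbols (index 0 reserved for the CTC blank), S target tokens.
T, N, C, S = 50, 4, 32, 12

# Stand-in for a speech encoder's frame-level outputs; a real model
# (e.g., a transformer or conformer) would produce these from audio.
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=2)            # shape (T, N, C)

targets = torch.randint(1, C, (N, S))            # label sequences, no blanks
input_lengths = torch.full((N,), T)              # frames per utterance
target_lengths = torch.full((N,), S)             # tokens per utterance

# CTC marginalizes over all monotonic alignments between frames and
# labels, so no frame-level alignment supervision is needed.
ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

The key property shown here is that the loss only needs per-utterance input and target lengths, not a frame-to-label alignment, which is what makes CTC attractive for end-to-end ASR training.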
Papers
Personalized Speech Recognition for Children with Test-Time Adaptation
Zhonghao Shi, Harshvardhan Srivastava, Xuan Shi, Shrikanth Narayanan, Maja J. Matarić
Hidden in Plain Sound: Environmental Backdoor Poisoning Attacks on Whisper, and Mitigations
Jonatan Bartolini, Todor Stoyanov, Alberto Giaretta
Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition
Chien-Chun Wang, Li-Wei Chen, Cheng-Kang Chou, Hung-Shin Lee, Berlin Chen, Hsin-Min Wang
Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
Yihan Wu, Yifan Peng, Yichen Lu, Xuankai Chang, Ruihua Song, Shinji Watanabe
Large Language Models Are Strong Audio-Visual Speech Recognition Learners
Umberto Cappellazzo, Minsu Kim, Honglie Chen, Pingchuan Ma, Stavros Petridis, Daniele Falavigna, Alessio Brutti, Maja Pantic
Simulating Native Speaker Shadowing for Nonnative Speech Assessment with Latent Speech Representations
Haopeng Geng, Daisuke Saito, Nobuaki Minematsu
M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses
Yufeng Yang, Desh Raj, Ju Lin, Niko Moritz, Junteng Jia, Gil Keren, Egor Lakomkin, Yiteng Huang, Jacob Donley, Jay Mahadeokar, Ozlem Kalinli
WER We Stand: Benchmarking Urdu ASR Models
Samee Arif, Aamina Jamal Khan, Mustafa Abbas, Agha Ali Raza, Awais Athar
Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models
Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul
Speech Recognition for Analysis of Police Radio Communication
Tejes Srivastava, Ju-Chieh Chou, Priyank Shroff, Karen Livescu, Christopher Graziul
Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?
Yiwen Guan, Viet Anh Trinh, Vivek Voleti, Jacob Whitehill
CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments
Ahmed Adel Attia, Dorottya Demszky, Tolulope Ogunremi, Jing Liu, Carol Espy-Wilson
WhisperNER: Unified Open Named Entity and Speech Recognition
Gil Ayache, Menachem Pirchi, Aviv Navon, Aviv Shamsian, Gill Hetz, Joseph Keshet
The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language
Michael Ong, Sean Robertson, Leo Peckham, Alba Jorquera Jimenez de Aberasturi, Paula Arkhangorodsky, Robin Huo, Aman Sakhardande, Mark Hallap, Naomi Nagy, Ewan Dunbar
Full-text Error Correction for Chinese Speech Recognition with Large Language Model
Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang