Speech Recognition
Automatic speech recognition (ASR) transcribes spoken language into text. Current research focuses on improving accuracy and robustness across diverse conditions. Work in this area explores model architectures such as transformers and conformers, increasingly combined with large language models (LLMs), and uses techniques such as connectionist temporal classification (CTC), attention mechanisms, and audio-visual multimodal integration. Significant effort also goes into low-resource languages, noisy environments, and recognition of speech from people with speech impairments. Advances in ASR benefit applications ranging from virtual assistants and transcription services to accessibility tools and cross-lingual communication.
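As a rough illustration of one technique named above, CTC-based models emit one label distribution per audio frame, and the simplest way to turn those into text is greedy decoding: take the best label per frame, merge consecutive repeats, then drop the blank symbol. The sketch below assumes a toy vocabulary and hand-written frame scores; the names `ctc_greedy_decode`, `BLANK`, and `vocab` are illustrative and not taken from any of the listed papers.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_scores, vocab):
    """Greedy CTC decoding over per-frame score lists (toy sketch)."""
    # 1. Pick the highest-scoring label index for every frame.
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    # 2. Merge runs of identical consecutive labels.
    collapsed = [label for label, _ in groupby(best)]
    # 3. Remove blanks; what remains is the transcript.
    return "".join(vocab[i] for i in collapsed if i != BLANK)


# Hypothetical vocabulary and frame scores, for illustration only.
vocab = ["_", "a", "c", "t"]
frames = [
    [0.1, 0.1, 0.7, 0.1],  # c
    [0.1, 0.1, 0.7, 0.1],  # c (repeat, merged)
    [0.7, 0.1, 0.1, 0.1],  # blank
    [0.1, 0.7, 0.1, 0.1],  # a
    [0.1, 0.7, 0.1, 0.1],  # a (repeat, merged)
    [0.1, 0.1, 0.1, 0.7],  # t
]
print(ctc_greedy_decode(frames, vocab))  # prints "cat"
```

The blank symbol is what lets a CTC model output a genuinely doubled letter: two identical labels separated by a blank survive the merge step, whereas two adjacent identical labels collapse into one. Practical systems usually replace the greedy step with beam search, often combined with a language model.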
Papers
Discrete Speech Unit Extraction via Independent Component Analysis
Tomohiko Nakamura, Kwanghee Choi, Keigo Hojo, Yoshiaki Bando, Satoru Fukayama, Shinji Watanabe
Speech Recognition for Automatically Assessing Afrikaans and isiXhosa Preschool Oral Narratives
Christiaan Jacobs, Annelien Smith, Daleen Klop, Ondřej Klejch, Febe de Wet, Herman Kamper
Data-Centric Improvements for Enhancing Multi-Modal Understanding in Spoken Conversation Modeling
Maximillian Chen, Ruoxi Sun, Sercan Ö. Arık
TouchASP: Elastic Automatic Speech Perception that Everyone Can Touch
Xingchen Song, Chengdong Liang, Binbin Zhang, Pengshen Zhang, ZiYu Wang, Youcheng Ma, Menglong Xu, Lin Wang, Di Wu, Fuping Pan, Dinghao Zhou, Zhendong Peng