Speech Recognition
Automatic speech recognition (ASR) aims to transcribe spoken language into text, and current research focuses heavily on improving accuracy and robustness across diverse conditions. This involves exploring model architectures such as transformers, conformers, and large language models (LLMs), often combined with techniques like connectionist temporal classification (CTC), attention mechanisms, and audio-visual (multimodal) integration. Significant effort also goes toward low-resource languages and noisy environments, and toward accessibility for people with speech impairments. Advances in ASR have broad implications for applications ranging from virtual assistants and transcription services to assistive technology for people with disabilities and cross-lingual communication.
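The overview names connectionist temporal classification (CTC) as a core ASR technique. As a concrete illustration, below is a minimal pure-Python sketch of the CTC forward algorithm, the dynamic program underlying the CTC loss. The toy frame probabilities and two-label target are invented for illustration; real systems compute this in log space over neural-network outputs.

```python
from itertools import product

def ctc_forward(probs, target, blank=0):
    """Total probability that frame-wise label posteriors emit some
    alignment which collapses (merge repeats, drop blanks) to `target`."""
    T = len(probs)
    # Extended label sequence with blanks interleaved: ^ a ^ b ^
    ext = [blank]
    for lbl in target:
        ext += [lbl, blank]
    S = len(ext)
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                      # stay on same state
            if s >= 1:
                total += alpha[s - 1]             # advance one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]             # skip a blank
            new[s] = total * probs[t][ext[s]]
        alpha = new
    # Valid alignments end on the last label or the trailing blank.
    return alpha[S - 1] + (alpha[S - 2] if S > 1 else 0.0)

def collapse(path, blank=0):
    """CTC collapse rule: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out

def brute_force(probs, target, blank=0):
    """Exhaustive check: sum probability of every alignment that collapses
    to `target` (feasible only for tiny examples)."""
    T, V = len(probs), len(probs[0])
    total = 0.0
    for path in product(range(V), repeat=T):
        if collapse(path, blank) == list(target):
            p = 1.0
            for t, sym in enumerate(path):
                p *= probs[t][sym]
            total += p
    return total

# Toy posteriors (illustrative numbers): 3 frames over vocabulary
# [blank, "a", "b"]; each row sums to 1.
probs = [[0.6, 0.3, 0.1],
         [0.2, 0.5, 0.3],
         [0.1, 0.2, 0.7]]
target = [1, 2]  # "ab"

p_dp = ctc_forward(probs, target)
p_bf = brute_force(probs, target)
print(p_dp)  # ≈ 0.429, agreeing with exhaustive enumeration of all 27 alignments
```

The dynamic program runs in O(T·|target|) time, versus the exponential cost of enumerating alignments, which is why CTC training is tractable.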
Papers
From Statistical Methods to Pre-Trained Models: A Survey on Automatic Speech Recognition for Resource Scarce Urdu Language
Muhammad Sharif, Zeeshan Abbas, Jiangyan Yi, Chenglin Liu
Towards Advanced Speech Signal Processing: A Statistical Perspective on Convolution-Based Architectures and its Applications
Nirmal Joshua Kapu, Raghav Karan
ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams
Srija Anand, Praveen Srinivasa Varadhan, Mehak Singal, Mitesh M. Khapra
VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning
Yifan Peng, Krishna C. Puvvada, Zhehuai Chen, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, Boris Ginsburg