Automatic Speech Recognition
Automatic Speech Recognition (ASR) aims to accurately transcribe spoken language into text, driving research into robust and efficient models. Current efforts focus on improving accuracy and robustness through techniques like consistency regularization in Connectionist Temporal Classification (CTC), leveraging pre-trained multilingual models for low-resource languages, and integrating Large Language Models (LLMs) for enhanced contextual understanding and improved handling of diverse accents and speech disorders. These advancements have significant implications for accessibility, enabling applications in diverse fields such as healthcare, education, and human-computer interaction.
Papers
Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens
Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg
An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition
Yi-Cheng Wang, Li-Ting Pai, Bi-Cheng Yan, Hsin-Wei Wang, Chi-Han Lin, Berlin Chen
Keyword-Aware ASR Error Augmentation for Robust Dialogue State Tracking
Jihyun Lee, Solee Im, Wonjun Lee, Gary Geunbae Lee
Retrieval Augmented Correction of Named Entity Speech Recognition Errors
Ernest Pusateri, Anmol Walia, Anirudh Kashi, Bortik Bandyopadhyay, Nadia Hyder, Sayantan Mahinder, Raviteja Anantha, Daben Liu, Sashank Gondala
Evaluation of real-time transcriptions using end-to-end ASR models
Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso
Findings of the 2024 Mandarin Stuttering Event Detection and Automatic Speech Recognition Challenge
Hongfei Xue, Rong Gong, Mingchen Shao, Xin Xu, Lezhi Wang, Lei Xie, Hui Bu, Jiaming Zhou, Yong Qin, Jun Du, Ming Li, Binbin Zhang, Bin Jia
Probing self-attention in self-supervised speech models for cross-linguistic differences
Sai Gopinath, Joselyn Rodriguez
Quantification of stylistic differences in human- and ASR-produced transcripts of African American English
Annika Heuser, Tyler Kendall, Miguel del Rio, Quinten McNamara, Nishchal Bhandari, Corey Miller, Migüel Jetté
LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization
Zengrui Jin, Yifan Yang, Mohan Shi, Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Liyong Guo, Lingwei Meng, Long Lin, Yong Xu, Shi-Xiong Zhang, Daniel Povey
Comparing Discrete and Continuous Space LLMs for Speech Recognition
Yaoxun Xu, Shi-Xiong Zhang, Jianwei Yu, Zhiyong Wu, Dong Yu
Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error Classifications
Korbinian Kuhn, Verena Kersken, Gottfried Zimmermann
Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models
Yiyang Zhao, Shuai Wang, Guangzhi Sun, Zehua Chen, Chao Zhang, Mingxing Xu, Thomas Fang Zheng