Automatic Speech Recognition
Automatic Speech Recognition (ASR) aims to accurately transcribe spoken language into text, driving research into robust and efficient models. Current efforts focus on improving accuracy and robustness through techniques like consistency regularization in Connectionist Temporal Classification (CTC), leveraging pre-trained multilingual models for low-resource languages, and integrating Large Language Models (LLMs) for enhanced contextual understanding and improved handling of diverse accents and speech disorders. These advancements have significant implications for accessibility, enabling applications in diverse fields such as healthcare, education, and human-computer interaction.
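Several of the papers below build on the Connectionist Temporal Classification (CTC) objective mentioned above, which trains an acoustic model without frame-level alignments. The following is a minimal, illustrative sketch of the base CTC loss in PyTorch; all shapes and values are made up for demonstration, and it shows the plain objective rather than any paper's specific regularization scheme.

```python
import torch
import torch.nn as nn

T, N, C = 50, 4, 20  # time steps, batch size, vocabulary size (incl. blank)

# Stand-in for acoustic-model outputs: log-probabilities over the vocabulary
# at each time step (shape required by nn.CTCLoss: (T, N, C)).
log_probs = torch.randn(T, N, C).log_softmax(dim=2)

# Stand-in target transcripts: labels 1..C-1, since index 0 is the CTC blank.
targets = torch.randint(1, C, (N, 10), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```

The loss is the negative log-likelihood of the target sequence summed over all valid blank-augmented alignments, so no per-frame labels are needed; consistency-regularization methods typically add a term penalizing divergence between CTC outputs under different input augmentations.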
Papers
Cleanformer: A multichannel array configuration-invariant neural enhancement frontend for ASR in smart speakers
Joseph Caroselli, Arun Narayanan, Nathan Howard, Tom O'Malley
Speech Detection For Child-Clinician Conversations In Danish For Low-Resource In-The-Wild Conditions: A Case Study
Sneha Das, Nicole Nadine Lønfeldt, Anne Katrine Pagsberg, Line H. Clemmensen
Understanding Audio Features via Trainable Basis Functions
Kwan Yee Heung, Kin Wai Cheuk, Dorien Herremans
Disappeared Command: Spoofing Attack On Automatic Speech Recognition Systems with Sound Masking
Jinghui Xu, Jifeng Zhu, Yong Yang
Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation
Keqi Deng, Shinji Watanabe, Jiatong Shi, Siddhant Arora
An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition
Niko Moritz, Frank Seide, Duc Le, Jay Mahadeokar, Christian Fuegen
HuBERT-EE: Early Exiting HuBERT for Efficient Speech Recognition
Ji Won Yoon, Beom Jun Woo, Nam Soo Kim
Self-critical Sequence Training for Automatic Speech Recognition
Chen Chen, Yuchen Hu, Nana Hou, Xiaofeng Qi, Heqing Zou, Eng Siong Chng
A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes
Shaojin Ding, Weiran Wang, Ding Zhao, Tara N. Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman
Large-Scale Streaming End-to-End Speech Translation with Neural Transducers
Jian Xue, Peidong Wang, Jinyu Li, Matt Post, Yashesh Gaur
Building an ASR Error Robust Spoken Virtual Patient System in a Highly Class-Imbalanced Scenario Without Speech Data
Vishal Sunder, Prashant Serai, Eric Fosler-Lussier
Fusion of Self-supervised Learned Models for MOS Prediction
Zhengdong Yang, Wangjin Zhou, Chenhui Chu, Sheng Li, Raj Dabre, Raphael Rubino, Yi Zhao