Automatic Speech Recognition
Automatic Speech Recognition (ASR) transcribes spoken language into text, and research in the area centers on making models more accurate, robust, and efficient. Current efforts include consistency regularization for Connectionist Temporal Classification (CTC), pre-trained multilingual models for low-resource languages, and the integration of Large Language Models (LLMs) to improve contextual understanding and the handling of diverse accents and speech disorders. These advances make speech technology more accessible and support applications in healthcare, education, and human-computer interaction.
Papers
Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model
Jaeyoung Huh, Sangjoon Park, Jeong Eun Lee, Jong Chul Ye
A low latency attention module for streaming self-supervised speech representation learning
Jianbo Ma, Siqi Pan, Deepak Chandran, Andrea Fanelli, Richard Cartwright
Improving Massively Multilingual ASR With Auxiliary CTC Objectives
William Chen, Brian Yan, Jiatong Shi, Yifan Peng, Soumi Maiti, Shinji Watanabe
Ensemble knowledge distillation of self-supervised speech models
Kuan-Po Huang, Tzu-hsun Feng, Yu-Kuan Fu, Tsu-Yuan Hsu, Po-Chieh Yen, Wei-Cheng Tseng, Kai-Wei Chang, Hung-yi Lee
Factual Consistency Oriented Speech Recognition
Naoyuki Kanda, Takuya Yoshioka, Yang Liu
Gradient Remedy for Multi-Task Learning in End-to-End Noise-Robust Speech Recognition
Yuchen Hu, Chen Chen, Ruizhe Li, Qiushi Zhu, Eng Siong Chng
MADI: Inter-domain Matching and Intra-domain Discrimination for Cross-domain Speech Recognition
Jiaming Zhou, Shiwan Zhao, Ning Jiang, Guoqing Zhao, Yong Qin
Federated Learning for ASR based on Wav2vec 2.0
Tuan Nguyen, Salima Mdhaffar, Natalia Tomashenko, Jean-François Bonastre, Yannick Estève
A Sidecar Separator Can Convert a Single-Talker Speech Recognition System to a Multi-Talker One
Lingwei Meng, Jiawen Kang, Mingyu Cui, Yuejiao Wang, Xixin Wu, Helen Meng
Adaptable End-to-End ASR Models using Replaceable Internal LMs and Residual Softmax
Keqi Deng, Philip C. Woodland
Speaker Change Detection for Transformer Transducer ASR
Jian Wu, Zhuo Chen, Min Hu, Xiong Xiao, Jinyu Li
Stabilising and accelerating light gated recurrent units for automatic speech recognition
Adel Moumen, Titouan Parcollet
QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion
Houjian Guo, Chaoran Liu, Carlos Toshinori Ishi, Hiroshi Ishiguro