Automatic Speech Recognition
Automatic Speech Recognition (ASR) aims to transcribe spoken language into text accurately, driving research into robust and efficient models. Current efforts focus on improving accuracy and robustness through techniques such as consistency regularization for Connectionist Temporal Classification (CTC) training, leveraging pre-trained multilingual models for low-resource languages, and integrating Large Language Models (LLMs) for better contextual understanding and handling of diverse accents and disordered speech. These advances have significant implications for accessibility, enabling applications in healthcare, education, and human-computer interaction.
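To make the CTC consistency-regularization idea concrete, below is a minimal sketch (not drawn from any of the listed papers) of how such a training objective is commonly assembled in PyTorch: the standard CTC loss is combined with a symmetric KL term that pulls together the frame-level output distributions of two stochastic forward passes (e.g., different dropout masks or augmentation views). All names here (the model call, BLANK_ID, alpha) are illustrative assumptions, not an implementation from the papers.

import torch
import torch.nn.functional as F

BLANK_ID = 0   # assumed CTC blank index
alpha = 0.3    # assumed weight of the consistency term

def cr_ctc_loss(model, feats, feat_lens, targets, target_lens):
    # Two stochastic passes over the same batch; they differ only through
    # dropout / augmentation randomness, so the model must be in train mode.
    log_probs_a = model(feats)  # assumed shape (T, B, V), log-softmax outputs
    log_probs_b = model(feats)

    # Standard CTC loss on each view.
    ctc_a = F.ctc_loss(log_probs_a, targets, feat_lens, target_lens,
                       blank=BLANK_ID, zero_infinity=True)
    ctc_b = F.ctc_loss(log_probs_b, targets, feat_lens, target_lens,
                       blank=BLANK_ID, zero_infinity=True)

    # Symmetric KL divergence between the two frame-level distributions
    # acts as the consistency regularizer.
    kl_ab = F.kl_div(log_probs_a, log_probs_b, log_target=True, reduction="batchmean")
    kl_ba = F.kl_div(log_probs_b, log_probs_a, log_target=True, reduction="batchmean")
    consistency = 0.5 * (kl_ab + kl_ba)

    return 0.5 * (ctc_a + ctc_b) + alpha * consistency

In practice the weight alpha and the source of stochasticity (dropout, SpecAugment, or both) vary across papers; this sketch only illustrates the general shape of the objective.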
Papers
Building Accurate Low Latency ASR for Streaming Voice Search
Abhinav Goyal, Nikesh Garera
HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition
Florian Mai, Juan Zuluaga-Gomez, Titouan Parcollet, Petr Motlicek
Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning
Xuankai Chang, Brian Yan, Yuya Fujita, Takashi Maekaku, Shinji Watanabe
Can We Trust Explainable AI Methods on ASR? An Evaluation on Phoneme Recognition
Xiaoliang Wu, Peter Bell, Ajitha Rajan
Speech and Noise Dual-Stream Spectrogram Refine Network with Speech Distortion Loss for Robust Speech Recognition
Haoyu Lu, Nan Li, Tongtong Song, Longbiao Wang, Jianwu Dang, Xiaobao Wang, Shiliang Zhang
Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator
Lingwei Meng, Jiawen Kang, Mingyu Cui, Haibin Wu, Xixin Wu, Helen Meng
ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition
Yuanchao Li, Zeyu Zhao, Ondrej Klejch, Peter Bell, Catherine Lai
INTapt: Information-Theoretic Adversarial Prompt Tuning for Enhanced Non-Native Speech Recognition
Eunseop Yoon, Hee Suk Yoon, John Harvill, Mark Hasegawa-Johnson, Chang D. Yoo
Svarah: Evaluating English ASR Systems on Indian Accents
Tahir Javed, Sakshi Joshi, Vignesh Nagarajan, Sai Sundaresan, Janki Nawale, Abhigyan Raman, Kaushal Bhogale, Pratyush Kumar, Mitesh M. Khapra
RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models
David Qiu, David Rim, Shaojin Ding, Oleg Rybakov, Yanzhang He
Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR
Kaushal Santosh Bhogale, Sai Sundaresan, Abhigyan Raman, Tahir Javed, Mitesh M. Khapra, Pratyush Kumar
Iteratively Improving Speech Recognition and Voice Conversion
Mayank Kumar Singh, Naoya Takahashi, Onoe Naoyuki
InterFormer: Interactive Local and Global Features Fusion for Automatic Speech Recognition
Zhi-Hao Lai, Tian-Hao Zhang, Qi Liu, Xinyuan Qian, Li-Fang Wei, Song-Lu Chen, Feng Chen, Xu-Cheng Yin
Evaluating OpenAI's Whisper ASR for Punctuation Prediction and Topic Modeling of life histories of the Museum of the Person
Lucas Rafael Stefanel Gris, Ricardo Marcacini, Arnaldo Candido Junior, Edresson Casanova, Anderson Soares, Sandra Maria Aluísio
On the Transferability of Whisper-based Representations for "In-the-Wild" Cross-Task Downstream Speech Applications
Vamsikrishna Chemudupati, Marzieh Tahaei, Heitor Guimaraes, Arthur Pimentel, Anderson Avila, Mehdi Rezagholizadeh, Boxing Chen, Tiago Falk
Rethinking Speech Recognition with A Multimodal Perspective via Acoustic and Semantic Cooperative Decoding
Tian-Hao Zhang, Hai-Bo Qin, Zhi-Hao Lai, Song-Lu Chen, Qi Liu, Feng Chen, Xinyuan Qian, Xu-Cheng Yin
SE-Bridge: Speech Enhancement with Consistent Brownian Bridge
Zhibin Qiu, Mengfan Fu, Fuchun Sun, Gulila Altenbek, Hao Huang
Personalized Predictive ASR for Latency Reduction in Voice Assistants
Andreas Schwarz, Di He, Maarten Van Segbroeck, Mohammed Hethnawi, Ariya Rastrow
Detection of Cross-Dataset Fake Audio Based on Prosodic and Pronunciation Features
Chenglong Wang, Jiangyan Yi, Jianhua Tao, Chuyuan Zhang, Shuai Zhang, Xun Chen