End-to-End Speech Recognition

End-to-end speech recognition transcribes speech directly into text with a single model, rather than through the separate acoustic, pronunciation, and language models of traditional hybrid systems, improving efficiency and potentially accuracy. Current research targets remaining weaknesses such as robustness to noise and handling of unseen words, often using transformer-based architectures and connectionist temporal classification (CTC), together with techniques like data augmentation and speaker adaptation. These advances make speech recognition more accurate and more widely usable across diverse accents, languages, and noisy environments, with applications ranging from voice assistants to healthcare.
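To make the CTC idea mentioned above concrete, here is a minimal sketch of the collapsing rule used in greedy CTC decoding: consecutive repeated frame labels are merged, then blank symbols are dropped. The function name `ctc_greedy_collapse`, the `_` blank symbol, and the example frame sequence are assumptions made for illustration; a real system would take the per-frame labels from the argmax of a model's output distribution.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol chosen for this sketch


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Collapse a per-frame label sequence into an output string."""
    # Step 1: merge runs of identical consecutive labels.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: drop blanks, which only separate genuine repeats.
    return "".join(label for label in merged if label != blank)


# Illustrative per-frame labels (made up, not from a real model).
frames = list("hh_e_ll_lo__")
print(ctc_greedy_collapse(frames))  # hello
```

The blank between the two `l` runs is what lets the output keep a double letter; without it, the repeated frames would merge into a single `l`.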

Papers