Speech to Text
Speech-to-text (STT) research aims to accurately and efficiently convert spoken language into written text, encompassing tasks like automatic speech recognition and speech translation. Current efforts focus on improving model robustness and accuracy, particularly for low-resource languages and challenging audio conditions, often leveraging large language models (LLMs) and transformer-based architectures like Whisper and Conformer, alongside techniques like data augmentation and transfer learning. These advancements have significant implications for accessibility, enabling improved human-computer interaction and facilitating the development of more inclusive and versatile applications across various fields.
Papers
Optimizing Rare Word Accuracy in Direct Speech Translation with a Retrieval-and-Demonstration Approach
Siqi Li, Danni Liu, Jan Niehues
Text-To-Speech Synthesis In The Wild
Jee-weon Jung, Wangyou Zhang, Soumi Maiti, Yihan Wu, Xin Wang, Ji-Hoon Kim, Yuta Matsunaga, Seyun Um, Jinchuan Tian, Hye-jin Shim, Nicholas Evans, Joon Son Chung, Shinnosuke Takamichi, Shinji Watanabe