Speech to Text

Speech-to-text (STT) research aims to convert spoken language into written text accurately and efficiently, and covers tasks such as automatic speech recognition and speech translation. Current work focuses on improving model robustness and accuracy, particularly for low-resource languages and challenging audio conditions, often using large language models (LLMs) and transformer-based architectures such as Whisper and Conformer together with data augmentation and transfer learning. These advances matter for accessibility: they improve human-computer interaction and support more inclusive and versatile applications across many fields.

Papers

February 28, 2023

ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus
Ajinkya Kulkarni, Atharva Kulkarni, Sara Abedalmonem Mohammad Shatnawi, Hanan Aldarmaki
End to End Speech to Text Arabic Speech Arabic Text to Speech

February 27, 2023

Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model
Jaeyoung Huh, Sangjoon Park, Jeong Eun Lee, Jong Chul Ye
Automatic Speech Recognition Full Model Clinical Text Speech to Text

February 25, 2023

Prompt-based Learning for Text Readability Assessment
Bruce W. Lee, Jason Hyung-Jong Lee
Speech to Text Seq2seq Model Prompt Based Learning Readability Assessment Readability Control Pre Trained Seq2seq Model

February 3, 2023

PSST! Prosodic Speech Segmentation with Transformers
Nathan Roll, Calbert Graham, Simon Todd
Transformer Megatron Decepticons Prosodic Feature Speech to Text Prosody Modeling Teaching Intonation Assessment

January 10, 2023

UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion
Haogeng Liu, Tao Wang, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Jianhua Tao
Unified Framework Voice Conversion Speech to Text Zero Shot Text to Speech Anti Unification Speaker Modeling

December 8, 2022

Learning to Dub Movies via Hierarchical Prosody Models
Gaoxiang Cong, Liang Li, Yuankai Qi, Zhengjun Zha, Qi Wu, Wenyu Wang, Bin Jiang, Ming-Hsuan Yang, Qingming Huang
Learning Abstract Prosodic Feature Speech to Text High Fidelity Vocoder Visual Speech Movie Dubbing

December 3, 2022

UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis
Yi Lei, Shan Yang, Xinsheng Wang, Qicong Xie, Jixun Yao, Lei Xie, Dan Su
End to End Text to Speech Speech Synthesis Speech to Text Singing Voice Singing Voice Synthesis Conditional Variational

November 17, 2022

Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models
Minki Kang, Dongchan Min, Sung Ju Hwang
Diffusion Model Speech to Text Adaptive Text to Speech

November 14, 2022

SNIPER Training: Single-Shot Sparse Training for Text-to-Speech
Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman, Dorien Herremans
Training Data Text to Speech Speech to Text Sparse Training High Sparsity Dense Training Sparsity Aware

November 12, 2022

A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units
Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky
Voice Conversion Prosodic Feature Speech to Text One Shot Voice Conversion

November 4, 2022

Wireless Deep Speech Semantic Transmission
Zixuan Xiao, Shengshi Yao, Jincheng Dai, Sixian Wang, Kai Niu, Ping Zhang
Speech to Text Joint Source Channel Coding Semantic Transmission

October 27, 2022

Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech
Takaaki Saeki, Heiga Zen, Zhehuai Chen, Nobuyuki Morioka, Gary Wang, Yu Zhang, Ankur Bapna, Andrew Rosenberg, Bhuvana Ramabhadran
Speech to Text Self Supervised Speech Representation High Quality Speech Speech Text Multilingual Tt

October 26, 2022

Four-in-One: A Joint Approach to Inverse Text Normalization, Punctuation, Capitalization, and Disfluency for Automatic Speech Recognition
Sharman Tan, Piyush Behre, Nick Kibre, Issac Alphonso, Shuangyu Chang
Automatic Speech Recognition Speech to Text Punctuation Mark Disfluent Speech Spoken Text Inverse Text Normalization

October 21, 2022

Named Entity Detection and Injection for Direct Speech Translation
Marco Gaido, Yun Tang, Ilia Kulikov, Rongqing Huang, Hongyu Gong, Hirofumi Inaguma
Entity Recognition Person Name Entity Mention Neural Model Speech to Text High Accuracy Direct Speech to Speech Translation Injection Drug Use

October 5, 2022

JoeyS2T: Minimalistic Speech-to-Text Modeling with JoeyNMT
Mayumi Ohta, Julia Kreutzer, Stefan Riezler
Speech Translation Speech to Text NMT System Speech Translation Benchmark

August 28, 2022

Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks
Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, Tom Kenter, Alexey Petelin, Jonathan Shen, Vincent Wan, Yu Zhang, Yonghui Wu, Rob Clark
Synthetic Data Text to Speech Task Transferability Speech to Text Practical Approach Accent Transfer

July 1, 2022

Swiss German Speech to Text system evaluation
Yanick Schraner, Christian Scheller, Michel Plüss, Manfred Vogel
Text to Speech Evaluation Benchmark Speech to Text Text Evaluation Parliamentary Corpus Swiss German

June 27, 2022

Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding
Wei-Ping Huang, Po-Chun Chen, Sung-Feng Huang, Hung-yi Lee
Transfer Learning Text to Speech Speech to Text Text to Speech Model Shot Training Cross Lingual Text to Speech

June 5, 2022

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech
Ziyue Jiang, Zhe Su, Zhou Zhao, Qian Yang, Yi Ren, Jinglin Liu, Zhenhui Ye
End to End Text to Speech Speech to Text Semantic Prior Geographic Feature Pronunciation Polyphone Disambiguation

May 25, 2022

Semantic-preserved Communication System for Highly Efficient Speech Transmission
Tianxiao Han, Qianqian Yang, Zhiguo Shi, Shibo He, Zhaoyang Zhang
Semantic Communication Speech to Text Semantic Communication System Semantic Transmission Efficient Transmission
