Speech to Text

Speech-to-text (STT) research aims to convert spoken language into written text accurately and efficiently, covering tasks such as automatic speech recognition and speech translation. Current work focuses on improving model robustness and accuracy, particularly for low-resource languages and difficult audio conditions, often using large language models (LLMs) and transformer-based architectures such as Whisper and Conformer together with data augmentation and transfer learning. These advances matter for accessibility: they improve human-computer interaction and support more inclusive, versatile applications across many fields.

Papers

June 19, 2024

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding
Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang
Pre Trained Speech to Text Audio Captioning Audio Token Audio Encoder Audio Coding

June 13, 2024

On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models
Jinchuan Tian, Yifan Peng, William Chen, Kwanghee Choi, Karen Livescu, Shinji Watanabe
Mixed Effect Speech to Text Speech Foundation Model Open Whisper Style Speech Model Open Large Language Model Heterogeneous Data Source

June 11, 2024

A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation
Zhengrui Ma, Qingkai Fang, Shaolei Zhang, Shoutao Guo, Yang Feng, Min Zhang
Speech to Text Non Autoregressive Text to Speech Model Simultaneous Machine Translation Speech to Speech Translation End to End Speech Translation Simultaneous Speech Translation

June 10, 2024

Synthetic Query Generation using Large Language Models for Virtual Assistants
Sonal Sannigrahi, Thiago Fraga-Silva, Youssef Oualil, Christophe Van Gysel
Speech Recognition System Speech to Text SAM Prior Virtual Assistant Synthetic Query Generation Synthetic Query

June 7, 2024

TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking
Junzuo Zhou, Jiangyan Yi, Tao Wang, Jianhua Tao, Ye Bai, Chu Yuan Zhang, Yong Ren, Zhengqi Wen
Text to Speech Synthesized Speech Speech to Text Agnostic Watermarking Imperceptible Watermark

June 6, 2024

A Human-in-the-Loop Approach to Improving Cross-Text Prosody Transfer
Himanshu Maurya, Atli Sigurgeirsson
Human in the Loop Speech to Text Prosody Modeling

June 3, 2024

Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training
Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans
Adversarial Training Variational Autoencoder Speech to Text Speech Synthesizer Accent Conversion

May 16, 2024

Building a Luganda Text-to-Speech Model From Crowdsourced Data
Sulaiman Kagumire, Andrew Katumba, Joyce Nakatumba-Nabende, John Quinn
Speech to Text Text to Speech Model Crowd Sourced Data Phonetic Convergence

May 13, 2024

Semantic MIMO Systems for Speech-to-Text Transmission
Zhenzi Weng, Zhijin Qin, Huiqiang Xie, Xiaoming Tao, Khaled B. Letaief
Semantic Communication Multiple Input Multiple Output Speech to Text Semantic Communication System Semantic Network

April 10, 2024

Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness
Xincan Feng, Akifumi Yoshimoto
End to End Text to Speech Large Scale Language Model Speech to Text

February 8, 2024

Named Entity Recognition for Address Extraction in Speech-to-Text Transcriptions Using Synthetic Data
Bibiána Lajčinová, Patrik Valábek, Michal Spišiak
Synthetic Data Entity Recognition Named Entity Recognition Speech to Text Bidirectional Encoder Representation Address Parsing

January 22, 2024

Benchmarking Large Multimodal Models against Common Corruptions
Jiawei Zhang, Tianyu Pang, Chao Du, Yi Ren, Bo Li, Min Lin
Large Multimodal Model Text to Image Speech to Text Cross Modal Interaction Image to Text Common Corruption

January 18, 2024

Communication-Efficient Personalized Federated Learning for Speech-to-Text Tasks
Yichao Du, Zhirui Zhang, Linan Yue, Xu Huang, Yuqing Zhang, Tong Xu, Linli Xu, Enhong Chen
Automatic Speech Recognition Speech Translation Personalized Federated Learning Speech to Text Whisper Model

December 28, 2023

Accent-VITS: accent transfer for end-to-end TTS
Linhan Ma, Yongmao Zhang, Xinfa Zhu, Yi Lei, Ziqian Ning, Pengcheng Zhu, Lei Xie
End to End Speech to Text Accent Transfer

October 22, 2023

An overview of text-to-speech systems and media applications
Mohammad Reza Hasanabadi
Text to Speech System Description Speech to Text Text to Speech Model Synthetic Voice

September 27, 2023

Developing automatic verbatim transcripts for international multilingual meetings: an end-to-end solution
Akshat Dewan, Michal Ziemski, Henri Meylan, Lorenzo Concina, Bruno Pouliquen
Machine Translation End to End Speech to Text Automatic Transcription Meeting Transcript International Conference