Speech Processing

Speech processing research aims to enable computers to understand, interpret, and generate human speech, focusing on tasks such as speech recognition, synthesis, and enhancement. Current efforts concentrate on improving model efficiency (e.g., using linear-complexity attention mechanisms) and robustness across diverse languages and acoustic conditions, often leveraging large language models and self-supervised learning techniques. These advances are crucial for making speech technology more broadly accessible, with applications ranging from healthcare (e.g., depression screening) to assistive technologies and human-computer interaction.

Papers

December 6, 2023

Evaluating Self-supervised Speech Models on a Taiwanese Hokkien Corpus
Yi-Hui Chou, Kalvin Chang, Meng-Ju Wu, Winston Ou, Alice Wen-Hsin Bi, Carol Yang, Bryan Y. Chen, Rong-Wei Pai, Po-Yen Yeh, Jo-Peng Chiang, Iu-Tshian Phoann, Winnie Chang, Chenxuan Cui, Noel Chen, Jiatong Shi
Self Supervised Learning Low Resource Language Speech Representation Speech Processing Speech Model Chinese Language

November 14, 2023

ChoralSynth: Synthetic Dataset of Choral Singing
Jyoti Narang, Viviana De La Vega, Xavier Lizarraga, Oscar Mayor, Hector Parra, Jordi Janer, Xavier Serra
Synthetic Dataset Speech Processing Music Information Retrieval Vocal Performance

November 12, 2023

AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs
Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Ke Li, Junteng Jia, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer
Medical LLM Speech Translation Speech Processing Tuned Llama Model Non Speech Audio

November 9, 2023

Whisper in Focus: Enhancing Stuttered Speech Classification with Encoder Layer Optimization
Huma Ameer, Seemab Latif, Rabia Latif, Sana Mukhtar
Speech Processing Human Driving Focus Deeper Network State of the Art Whisper

October 27, 2023

TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch
Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis
Self Supervised Learning Speech Recognition Pytorch Model Speech Processing Audio Processing Speech Recognition Model

October 25, 2023

ArTST: Arabic Text and Speech Transformer
Hawau Olamide Toyin, Amirbek Djanibekov, Ajinkya Kulkarni, Hanan Aldarmaki
Language Model Speech Processing Text to Speech Synthesis Speech Transformer Low Resource Text to Speech

October 19, 2023

Exploring In-Context Learning of Textless Speech Language Model for Speech Classification Tasks
Ming-Hao Hsu, Kai-Wei Chang, Shang-Wen Li, Hung-yi Lee
Context Learning Speech Processing Speech Classification Task

October 16, 2023

SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT
Cheol Jun Cho, Abdelrahman Mohamed, Shang-Wen Li, Alan W Black, Gopala K. Anumanchipalli
Sentence Level Speech Processing Speech Language Model Syllable Discovery

October 15, 2023

Homophone Disambiguation Reveals Patterns of Context Mixing in Speech Transformers
Hosein Mohebbi, Grzegorz Chrupała, Willem Zuidema, Afra Alishahi
Speech Processing Complex Pattern Encoder Decoder Model Textual Model Speech Transformer

October 5, 2023

Challenges and Insights: Exploring 3D Spatial Features and Complex Networks on the MISP Dataset
Yiwen Shao
Technical Challenge DCU Insight AQ Speech Processing Complex Network Automatic Speech Recognition Model Multi Talker 3D Spatial

September 29, 2023

Low-Resource Self-Supervised Learning with SSL-Enhanced TTS
Po-chun Hsu, Ali Elkahky, Wei-Ning Hsu, Yossi Adi, Tu Anh Nguyen, Jade Copet, Emmanuel Dupoux, Hung-yi Lee, Abdelrahman Mohamed
Self Supervised Text to Speech Low Resource Synthesized Speech Speech Processing

September 25, 2023

DDTSE: Discriminative Diffusion Model for Target Speech Extraction
Leying Zhang, Yao Qian, Linfeng Yu, Heming Wang, Hemin Yang, Long Zhou, Shujie Liu, Yanmin Qian
High Efficiency Pre Trained Conditional Diffusion Model Speech Processing Speech Quality Target Speech Extraction

September 18, 2023

Do learned speech symbols follow Zipf's law?

Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech

Spiking-LEAF: A Learnable Auditory front-end for Spiking Neural Networks

September 16, 2023

Unifying Robustness and Fidelity: A Comprehensive Study of Pretrained Generative Methods for Speech Enhancement in Adverse Conditions
Heming Wang, Meng Yu, Hao Zhang, Chunlei Zhang, Zhongweiyang Xu, Muqiao Yang, Yixuan Zhang, Dong Yu
Pre Trained Speech Enhancement Speech Processing Speech Signal Speech Quality Pre Trained Generative Model Adverse Condition High Quality Speech Fidelity Reward General Robustness

September 15, 2023

Syn-Att: Synthetic Speech Attribution via Semi-Supervised Unknown Multi-Class Ensemble of CNNs
Md Awsafur Rahman, Bishmoy Paul, Najibul Haque Sarker, Zaber Ibn Abdul Hakim, Shaikh Anowarul Fattah, Mohammad Saquib
Convolutional Neural Network Semi Supervised Speech Synthesis Speech Processing Synthetic Voice

September 14, 2023

FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec
Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng
Open Source Speech Processing Neural Speech Reproducible Research Codec Model

September 11, 2023

LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech
Titouan Parcollet, Ha Nguyen, Solene Evain, Marcely Zanon Boito, Adrien Pupier, Salima Mdhaffar, Hang Le, Sina Alisamir, Natalia Tomashenko, Marco Dinarelli, Shucong Zhang, Alexandre Allauzen, Maximin Coavoux, Yannick Esteve, Mickael Rouvier, Jerome Goulian, Benjamin Lecouteux, Francois Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, Laurent Besacier
Self Supervised Speech Processing

September 6, 2023

Addressing the Blind Spots in Spoken Language Processing
Amit Moryossef
Natural Language Processing Sign Language Spoken Language Understanding Speech Processing Co Speech Gesture Blind Spot Non Verbal Cue