Speech Model

Speech models represent and process spoken language computationally, enabling applications such as automatic speech recognition (ASR) and text-to-speech (TTS). Current research emphasizes robustness (e.g., to noise and accents), fairness (mitigating biases against marginalized language varieties), and efficiency (through techniques such as knowledge distillation and low-rank adaptation), often using transformer-based architectures and self-supervised learning. These advances have implications for healthcare (e.g., voice disorder detection and mental health assessment), language preservation, and human-computer interaction.

Papers

December 6, 2023

Evaluating Self-supervised Speech Models on a Taiwanese Hokkien Corpus
Yi-Hui Chou, Kalvin Chang, Meng-Ju Wu, Winston Ou, Alice Wen-Hsin Bi, Carol Yang, Bryan Y. Chen, Rong-Wei Pai, Po-Yen Yeh, Jo-Peng Chiang, Iu-Tshian Phoann, Winnie Chang, Chenxuan Cui, Noel Chen, Jiatong Shi
Self Supervised Learning Low Resource Language Speech Representation Speech Processing Speech Model Chinese Language

November 22, 2023

Efficient Deep Speech Understanding at the Edge
Rongxiang Wang, Felix Xiaozhu Lin
Automatic Speech Recognition Extreme Edge Spoken Language Understanding Speech Model Connectionist Temporal Classification Inference Framework Attention Based Encoder

November 21, 2023

Adapting pretrained speech model for Mandarin lyrics transcription and alignment
Jun-You Wang, Chon-In Leong, Yu-Chen Lin, Li Su, Jyh-Shing Roger Jang
Alignment Problem Speech Model Singing Voice Automatic Lyric Transcription Lyric Transcription

November 17, 2023

A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness
Mathias Vogel
Latent Space Study Feature Text to Speech Speech Model Text to Speech Model Latent Speech 1 WL Expressiveness

November 16, 2023

Latent Feature-based Data Splits to Improve Generalisation Evaluation: A Hate Speech Detection Case Study
Maike Züfle, Verna Dankers, Ivan Titov
Latent Space Hate Speech Detection Speech Model Intermediate Latent Hidden Representation Data Splitting

October 29, 2023

Pre-trained Speech Processing Models Contain Human-Like Biases that Propagate to Speech Emotion Recognition
Isaac Slaughter, Craig Greenberg, Reva Schwartz, Aylin Caliskan
Language Model Speech Emotion Recognition Speech Model Human Bias Intrinsic Bias

October 26, 2023

Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in Ghana
Owen Henkel, Hannah Horne-Robinson, Libby Hills, Bill Roberts, Joshua McGrane
Speech Model Speech Transcription Oral Reading Foundational Literacy

September 25, 2023

Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe
Open Source Speech Model Available Datasets Open Whisper Style Speech Model

September 18, 2023

Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech
Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, Hung-yi Lee
Speech Analysis Instruction Tuning Speech Processing Speech Model Natural Language Model Universal Speech Model Universal Performance Benchmark

September 14, 2023

Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features
Eliana Pastor, Alkis Koudounas, Giuseppe Attanasio, Dirk Hovy, Elena Baralis
Explainable Artificial Intelligence Spoken Language Understanding Speech Model Classification Model Paralinguistic Feature Audio Segmentation

August 30, 2023

LLaSM: Large Language and Speech Model
Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, Yemin Shi
Language Model Vision Language Model Multi Modal Large Language Model Large Language Speech Model Multimodal Instruction

August 17, 2023

Decoding Emotions: A comprehensive Multilingual Study of Speech Models for Speech Emotion Recognition
Anant Singh, Akshat Gupta
Speech Emotion Recognition Speech Representation Speech Model Cross Linguistic Emotion Understanding

August 14, 2023

Text Injection for Capitalization and Turn-Taking Prediction in Speech Models
Shaan Bijwadia, Shuo-yiin Chang, Weiran Wang, Zhong Meng, Hao Zhang, Tara N. Sainath
Automatic Speech Recognition Speech Model Text Block Turn Taking Prediction

July 11, 2023

On the Effectiveness of Speech Self-supervised Learning for Music
Yinghao Ma, Ruibin Yuan, Yizhi Li, Ge Zhang, Xingran Chen, Hanzhi Yin, Chenghua Lin, Emmanouil Benetos, Anton Ragni, Norbert Gyenge, Ruibo Liu, Gus Xia, Roger Dannenberg, Yike Guo, Jie Fu
Self Supervised Speech Model Music Industry Music Information Retrieval Speech Supervised Learning Model Recent Language Model

June 30, 2023

What Do Self-Supervised Speech Models Know About Words?
Ankita Pasad, Chung-Ming Chien, Shane Settle, Karen Livescu
Word List Speech Model Self Supervised Speech Model Word Segmentation

June 14, 2023

SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge?
Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka, Yusuke Ijima, Taichi Asami, Marc Delcroix, Yukinori Honma
Self Supervised Speech Representation Spoken Language Understanding Speech Model Speech Supervised Learning Model Linguistic Knowledge

June 13, 2023

Efficient Adapters for Giant Speech Models
Nanxin Chen, Izhak Shafran, Yu Zhang, Chung-Cheng Chiu, Hagen Soltau, James Qin, Yonghui Wu
Fine Tuning Self Supervised Pre Trained Model Speech Model Speech Task Efficient Adapter

June 8, 2023

Latent Phrase Matching for Dysarthric Speech
Colin Lea, Dianna Yee, Jaya Narain, Zifang Huang, Lauren Tooley, Jeffrey P. Bigham, Leah Findlater
Speech Recognition Speech Model Dysarthric Speech Atypical Speech

May 30, 2023

MiniSUPERB: Lightweight Benchmark for Self-supervised Speech Models
Yu-Hsiang Wang, Huang-Yu Chen, Kai-Wei Chang, Winston Hsu, Hung-yi Lee
Self Supervised Learning Speech Model Self Supervised Speech Model Speech Supervised Learning Model

May 23, 2023

Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation
Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino
Self Supervised Speech Analysis Audio Representation Speech Model
