Codec Language Model

Codec language models (CLMs) represent a significant advancement in speech synthesis, aiming to generate high-quality speech from text prompts by treating the task as a language modeling problem using discrete audio tokens. Current research focuses on improving CLM robustness, addressing issues like inconsistent token representations and recency bias, often through multi-scale coding and generation, chain-of-thought prompting, and incorporating human feedback for improved alignment with user preferences. These models hold considerable promise for applications such as text-to-speech, voice conversion, and even speech enhancement and anonymization, offering improvements in naturalness, speaker similarity, and efficiency compared to previous methods.

Papers

July 20, 2023

SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer
Daegyeom Kim, Seongho Hong, Yong-Hoon Choi
Text Modality Synthesized Speech Expressive Speech Synthesis Codec Language Model Speech Synthesizer

May 25, 2023

VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu, Shujie Liu, Yashesh Gaur, Zhuo Chen, Jinyu Li, Furu Wei
Speech Recognition View Translation Critical Synthesis Speech to Text Codec Language Model Decoder Only Model Auto Regressive Transformer

March 7, 2023

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling
Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei
Speech Synthesis Zero Shot Cross Lingual Codec Language Model Zero Shot Speech

January 5, 2023

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei
Text Modality Speech Synthesis Neural Audio Zero Shot Text to Speech Codec Language Model Speech Synthesizer

Codec Language Model

Papers

SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer

VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers