Codec Language Model
Codec language models (CLMs) represent a significant advancement in speech synthesis, aiming to generate high-quality speech from text prompts by treating the task as a language modeling problem using discrete audio tokens. Current research focuses on improving CLM robustness, addressing issues like inconsistent token representations and recency bias, often through multi-scale coding and generation, chain-of-thought prompting, and incorporating human feedback for improved alignment with user preferences. These models hold considerable promise for applications such as text-to-speech, voice conversion, and even speech enhancement and anonymization, offering improvements in naturalness, speaker similarity, and efficiency compared to previous methods.
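The core idea above — speech generation as next-token prediction over discrete codec tokens — can be sketched as follows. This is a minimal illustrative sample loop, not any specific model from the papers below: `next_token_logits` is a hypothetical stand-in for a trained transformer, and the vocabulary size and EOS token are assumed values.

```python
import numpy as np

VOCAB_SIZE = 1024  # assumed codec codebook size (hypothetical)
EOS = 0            # assumed end-of-sequence token (hypothetical)

rng = np.random.default_rng(0)

def next_token_logits(prefix):
    """Stand-in for a trained CLM: returns random logits.
    A real model would condition on the text prompt and past audio tokens."""
    return rng.normal(size=VOCAB_SIZE)

def generate(prompt_tokens, max_len=20, temperature=1.0):
    """Autoregressively sample discrete codec tokens until EOS or max_len,
    exactly as a text language model samples word tokens."""
    tokens = list(prompt_tokens)
    for _ in range(max_len):
        logits = next_token_logits(tokens) / temperature
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        tok = int(rng.choice(VOCAB_SIZE, p=probs))
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens

out = generate([17, 42], max_len=10)
print(len(out))
```

In a full system, the sampled token sequence would be passed to the codec's decoder to reconstruct a waveform; multi-scale approaches extend this loop to predict several codebook levels per audio frame.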
Papers
Towards General-Purpose Text-Instruction-Guided Voice Conversion
Chun-Yi Kuan, Chen An Li, Tsu-Yuan Hsu, Tse-Yang Lin, Ho-Lam Chung, Kai-Wei Chang, Shuo-yiin Chang, Hung-yi Lee
Speaker anonymization using neural audio codec language models
Michele Panariello, Francesco Nespoli, Massimiliano Todisco, Nicholas Evans