Codec Language Model

Codec language models (CLMs) represent a significant advance in speech synthesis: they generate high-quality speech from text prompts by treating the task as a language modeling problem over discrete audio tokens produced by a neural audio codec. Current research focuses on improving CLM robustness, addressing issues such as inconsistent token representations and recency bias, often through multi-scale coding and generation, chain-of-thought prompting, and human feedback for better alignment with user preferences. These models hold considerable promise for applications such as text-to-speech, voice conversion, and even speech enhancement and anonymization, offering gains in naturalness, speaker similarity, and efficiency over previous methods.
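
To make the "language modeling over discrete audio tokens" framing concrete, below is a minimal sketch in PyTorch: an autoregressive Transformer predicts audio-codec token IDs conditioned on text tokens, and the generated IDs would then be passed to the codec's decoder to reconstruct a waveform. The model, vocabulary sizes, special tokens, and greedy decoding loop are illustrative assumptions, not the implementation of any particular system.

```python
# Minimal codec-language-model TTS sketch (illustrative only):
# an autoregressive Transformer predicts discrete codec token IDs from text
# token IDs; a separate neural codec decoder (not shown) would turn the IDs
# back into a waveform. All names and sizes are assumptions.
import torch
import torch.nn as nn

TEXT_VOCAB = 256          # e.g. byte-level text tokens (assumption)
CODEC_VOCAB = 1024        # codebook size of the neural codec (assumption)
BOS, EOS = CODEC_VOCAB, CODEC_VOCAB + 1   # special start/stop audio tokens

class CodecLM(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.text_emb = nn.Embedding(TEXT_VOCAB, d_model)
        self.audio_emb = nn.Embedding(CODEC_VOCAB + 2, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, CODEC_VOCAB + 2)

    def forward(self, text_ids, audio_ids):
        # Text is encoded once and attended to via cross-attention;
        # audio tokens are modeled left-to-right under a causal mask.
        memory = self.text_emb(text_ids)
        tgt = self.audio_emb(audio_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(audio_ids.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(hidden)

@torch.no_grad()
def generate_codec_tokens(model, text_ids, max_len=200):
    """Greedy autoregressive decoding of codec token IDs (illustrative)."""
    audio = torch.tensor([[BOS]])
    for _ in range(max_len):
        logits = model(text_ids, audio)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        audio = torch.cat([audio, next_id], dim=1)
        if next_id.item() == EOS:
            break
    return audio[:, 1:]  # strip BOS; feed to the codec decoder for audio

if __name__ == "__main__":
    model = CodecLM()
    text = torch.randint(0, TEXT_VOCAB, (1, 16))   # stand-in "text prompt"
    tokens = generate_codec_tokens(model, text, max_len=32)
    print("generated codec token IDs:", tokens.shape)
```

Real systems typically predict several residual codebook levels per audio frame (the multi-scale coding mentioned above); this single-codebook sketch omits that detail for brevity.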

Papers