Independent Phone to Audio Alignment

Independent phone-to-audio alignment aims to map each phonetic unit to its corresponding segment of an audio recording without relying on pre-aligned text transcriptions. Current research draws on self-supervised learning, diffusion models, and large language models to improve alignment accuracy, often combined with cross-attention mechanisms and dynamic programming for optimal sequence partitioning. These advances matter for speech processing applications such as speech synthesis, keyword spotting, and multimodal sentiment analysis, particularly when data are noisy or low-resource.
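
To make the dynamic-programming step concrete, here is a minimal sketch of monotonic alignment in Python. It is an illustration under stated assumptions, not the method of any particular paper: it assumes the phone sequence is already known and that some acoustic model has produced a per-frame score (for example a log-probability) for each phone in that sequence. The names `align_phones` and `frame_scores` are hypothetical.

```python
from typing import List, Tuple

NEG_INF = float("-inf")


def align_phones(frame_scores: List[List[float]]) -> List[Tuple[int, int]]:
    """Split T frames into N contiguous, non-empty segments, one per phone.

    frame_scores[t][n] is an assumed (finite) score for frame t belonging
    to the n-th phone of the sequence. Returns (start, end) frame spans
    with end exclusive, chosen to maximize the summed score.
    """
    T = len(frame_scores)
    N = len(frame_scores[0]) if T else 0
    if N == 0 or T < N:
        raise ValueError("need at least one frame per phone")

    # best[t][n]: best total score for frames 0..t with frame t in phone n.
    best = [[NEG_INF] * N for _ in range(T)]
    # advanced[t][n]: True if frame t is the first frame of phone n.
    advanced = [[False] * N for _ in range(T)]
    best[0][0] = frame_scores[0][0]

    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = best[t - 1][n]
            advance = best[t - 1][n - 1] if n > 0 else NEG_INF
            if advance > stay:
                best[t][n] = advance + frame_scores[t][n]
                advanced[t][n] = True
            else:
                best[t][n] = stay + frame_scores[t][n]

    # Backtrace from the last frame, which must belong to the last phone.
    spans = []
    n, end = N - 1, T
    for t in range(T - 1, -1, -1):
        if advanced[t][n]:
            spans.append((t, end))
            end = t
            n -= 1
    spans.append((0, end))
    spans.reverse()
    return spans


# Toy example: 6 frames, 3 phones, made-up scores.
scores = [
    [-0.1, -2.0, -3.0],
    [-0.2, -1.5, -3.0],
    [-2.0, -0.1, -2.5],
    [-2.5, -0.2, -2.0],
    [-3.0, -0.3, -1.0],
    [-3.0, -2.0, -0.1],
]
print(align_phones(scores))  # [(0, 2), (2, 5), (5, 6)]
```

The table has T × N entries, so the search costs O(T·N) time and memory. Each frame either stays in the current phone or advances to the next one, which is what keeps the resulting segments contiguous and in order.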

Papers