Word Segmentation

Word segmentation is the task of dividing continuous text or speech into individual words. It is a prerequisite for many natural language processing applications, particularly for morphologically rich languages and for languages written without explicit word delimiters, such as those of East Asia. Current research emphasizes unsupervised methods that combine self-supervised speech models (such as HuBERT and wav2vec 2.0) with dynamic programming algorithms to discover word boundaries in audio, often incorporating contextual information and visual grounding to improve accuracy. These advances improve performance in low-resource scenarios and support downstream applications such as speech recognition, machine translation, and sentiment analysis across diverse languages.
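
As a rough illustration of the dynamic programming idea mentioned above, the following minimal sketch segments an unspaced character string by choosing the split with the highest total log-probability under a unigram word model. It operates on text rather than audio and is not taken from any particular paper. The function name `segment`, the toy lexicon `TOY_LEXICON`, its probabilities, and the `unknown_logprob` penalty are all assumptions made for this example.

```python
import math

# Hypothetical toy lexicon: word -> log-probability (values are made up).
TOY_LEXICON = {
    "the": math.log(0.3),
    "cat": math.log(0.2),
    "sat": math.log(0.2),
    "at": math.log(0.1),
    "a": math.log(0.1),
    "he": math.log(0.1),
}


def segment(text, word_logprob, max_word_len=6, unknown_logprob=-20.0):
    """Return the highest-scoring split of `text` into words.

    best[i] holds the best total log-probability of any segmentation of
    text[:i]; back[i] holds the start index of the last word in it.
    """
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            piece = text[start:end]
            if piece in word_logprob:
                score = word_logprob[piece]
            elif len(piece) == 1:
                # Assumed fallback: any single character may stand alone
                # at a heavy penalty, so every input has a segmentation.
                score = unknown_logprob
            else:
                continue
            total = best[start] + score
            if total > best[end]:
                best[end] = total
                back[end] = start

    # Follow the back-pointers from the end of the string to recover words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


if __name__ == "__main__":
    print(segment("thecatsat", TOY_LEXICON))
```

The table `best` is filled left to right, so each prefix is scored once per candidate last word, giving a cost proportional to the input length times `max_word_len`. Unsupervised systems would have to learn the word scores instead of reading them from a fixed lexicon; that part is outside the scope of this sketch.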

Papers