Grapheme Based Encoding

Grapheme-based encoding represents words as sequences of graphemes—the smallest units of writing corresponding to sounds—offering an alternative to subword-based methods in natural language processing. Current research focuses on improving the robustness and fairness of grapheme encoding across diverse languages, particularly those with complex writing systems, often employing neural network architectures like CNNs and transformers, and exploring techniques like grapheme pair encoding (GPE) to enhance performance. This approach holds significant promise for advancing language modeling, speech recognition, and optical character recognition, particularly for low-resource languages and those with significant orthographic variation.

Papers