Paper ID: 2404.01953

Classifying Graphemes in English Words Through the Application of a Fuzzy Inference System

Samuel Rose, Chandrasekhar Kambhampati

In Linguistics, a grapheme is a written unit of a writing system corresponding to a phonological sound. In Natural Language Processing tasks, written language is analysed through two different mediums, word analysis, and character analysis. This paper focuses on a third approach, the analysis of graphemes. Graphemes have advantages over word and character analysis by being self-contained representations of phonetic sounds. Due to the nature of splitting a word into graphemes being based on complex, non-binary rules, the application of fuzzy logic would provide a suitable medium upon which to predict the number of graphemes in a word. This paper proposes the application of a Fuzzy Inference System to split words into their graphemes. This Fuzzy Inference System results in a correct prediction of the number of graphemes in a word 50.18% of the time, with 93.51% being within a margin of +- 1 from the correct classification. Given the variety in language, graphemes are tied with pronunciation and therefore can change depending on a regional accent/dialect, the +- 1 accuracy represents the impreciseness of grapheme classification when regional variances are accounted for. To give a baseline of comparison, a second method involving a recursive IPA mapping exercise using a pronunciation dictionary was developed to allow for comparisons to be made.

Submitted: Apr 2, 2024