Various Number-Specific Tokenization Schemes

Various number-specific tokenization schemes are being investigated to improve the performance of large language models (LLMs) on numerical reasoning tasks and across different languages. Research focuses on comparing the effectiveness of tokenization algorithms (such as Byte Pair Encoding) and vocabulary sizes, analyzing the impact of morphological awareness and of tokenization direction for digit sequences (left-to-right vs. right-to-left), and exploring novel encoding methods such as continuous number representations. These efforts aim to improve the accuracy and efficiency of LLMs, particularly in scientific applications and for under-resourced languages, by addressing limitations in existing tokenization approaches.
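To make the tokenization-direction idea concrete, the sketch below shows right-to-left digit grouping, one scheme studied in this line of work: digit runs are chunked into groups of three starting from the rightmost digit, so chunks align with place value (e.g., 1234567 becomes 1 | 234 | 567 rather than 123 | 456 | 7). This is a minimal, hypothetical illustration, not the implementation from any particular paper; the function name and interface are assumptions.

```python
import re


def chunk_digits_rtl(text: str, group: int = 3) -> list[str]:
    """Split each digit run into groups of `group` digits, counting from the
    right, so that chunks align with place value (thousands, millions, ...).

    Hypothetical illustration of right-to-left number tokenization; not taken
    from any specific tokenizer implementation.
    """
    tokens: list[str] = []
    # The capturing group makes re.split keep the digit runs in the output.
    for piece in re.split(r"(\d+)", text):
        if piece.isdigit():
            # The leftmost chunk absorbs the remainder, which is exactly
            # what grouping from the right produces.
            head = len(piece) % group or group
            tokens.append(piece[:head])
            tokens.extend(piece[i:i + group] for i in range(head, len(piece), group))
        elif piece:
            tokens.append(piece)
    return tokens


if __name__ == "__main__":
    print(chunk_digits_rtl("The distance is 1234567 km"))
    # ['The distance is ', '1', '234', '567', ' km']
    print(chunk_digits_rtl("9876"))
    # ['9', '876']  (left-to-right grouping would instead give '987', '6')
```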

Papers