Various Number-Specific Tokenization Schemes
Various number-specific tokenization schemes are being investigated to improve the performance of large language models (LLMs) on numerical reasoning tasks and across languages. Research compares the effectiveness of tokenization algorithms (such as Byte Pair Encoding) and vocabulary sizes, analyzes the impact of morphological awareness and tokenization direction (left-to-right vs. right-to-left digit grouping), and explores novel encoding methods such as continuous number representations. These efforts aim to improve the accuracy and efficiency of LLMs, particularly in scientific applications and for under-resourced languages, by addressing limitations of existing tokenization approaches.
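To illustrate the tokenization-direction question, the sketch below chunks a digit string into fixed-size tokens either left-to-right or right-to-left. It is a hypothetical example rather than any specific paper's tokenizer; the function name and parameters are assumptions. Right-to-left grouping aligns tokens with place value (units, thousands, millions), which some of this work reports as beneficial for arithmetic.

```python
def chunk_digits(number: str, group: int = 3, right_to_left: bool = True) -> list[str]:
    """Split a digit string into fixed-size tokens (hypothetical sketch).

    Illustrates the difference between left-to-right and right-to-left
    digit grouping discussed in number-specific tokenization research.
    """
    if right_to_left:
        # Group from the right so tokens align with place value.
        head = len(number) % group
        return ([number[:head]] if head else []) + [
            number[i:i + group] for i in range(head, len(number), group)
        ]
    # Naive left-to-right grouping, as a plain frequency-based pass might produce.
    return [number[i:i + group] for i in range(0, len(number), group)]


print(chunk_digits("1234567", right_to_left=False))  # ['123', '456', '7']
print(chunk_digits("1234567", right_to_left=True))   # ['1', '234', '567']
```

On "1234567", left-to-right grouping leaves a dangling final token ('7'), whereas right-to-left grouping places the remainder at the front so every other token is a full place-value group.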