Subword Tokenization
Subword tokenization is a crucial preprocessing step in natural language processing: it represents text as sequences of subword units, balancing vocabulary size against the ability to handle unseen words. Current research focuses on improving the quality and efficiency of subword tokenization algorithms such as Byte Pair Encoding (BPE), the Unigram Language Model, and WordPiece, often within specific model architectures such as transformers, and on exploring their impact on downstream tasks across diverse languages and domains, including chemistry and biomedicine. These efforts matter because effective subword tokenization directly influences the performance and robustness of language models, affecting applications ranging from machine translation and named entity recognition to more specialized areas such as molecular design and scientific text analysis.
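To make the core idea concrete, here is a minimal sketch of the BPE training loop in Python: it repeatedly merges the most frequent adjacent pair of symbols until a merge budget is exhausted. The toy corpus, the `num_merges` budget, and the `</w>` end-of-word marker are illustrative assumptions, not taken from any of the listed papers or a particular library's implementation.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    a, b = pair
    merged = Counter()
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Start from characters, with an end-of-word marker so merges
    # cannot cross word boundaries.
    words = Counter(tuple(w) + ("</w>",) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # greedy: most frequent pair wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges

corpus = "low lower lowest newer newest wide wider widest"
print(train_bpe(corpus, 10))
```

Because frequent character sequences become single tokens while rare words fall back to smaller pieces, the learned vocabulary stays compact yet can still represent words never seen during training, which is the trade-off the research above aims to optimize.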
Papers
Words, Subwords, and Morphemes: What Really Matters in the Surprisal-Reading Time Relationship?
Sathvik Nair, Philip Resnik
PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications
Yang Tan, Mingchen Li, Pan Tan, Ziyi Zhou, Huiqun Yu, Guisheng Fan, Liang Hong