Unsupervised Tokenization

Unsupervised tokenization divides text into meaningful units (tokens) without relying on pre-existing dictionaries or labeled data, which makes it a crucial step in natural language processing for low-resource languages. Recent research explores metrics beyond traditional statistical measures, including ones based on transition probabilities and information-theoretic concepts, to improve tokenization accuracy across diverse languages. These advances could make language processing more robust and efficient, particularly for languages that lack extensive annotated resources, and may contribute to a better understanding of how languages evolve and structure information.
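To make the idea of a transition-based metric concrete, here is a minimal sketch of one classic approach: place a token boundary wherever the entropy of the next character, given the preceding characters, is high. It is not the method of any paper listed here; the function names, the context length `order`, and the `threshold` value are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def train_successors(corpus, order=3):
    """Count which character follows each context of 1..order characters."""
    successors = defaultdict(Counter)
    for text in corpus:
        for i in range(1, len(text)):
            for n in range(1, order + 1):
                if i - n >= 0:
                    successors[text[i - n:i]][text[i]] += 1
    return successors


def branching_entropy(context, successors):
    """Shannon entropy (bits) of the next-character distribution after `context`."""
    counts = successors.get(context)
    if not counts:
        return 0.0  # unseen context: treat as no evidence for a boundary
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def tokenize(text, successors, order=3, threshold=1.0):
    """Split `text` where the entropy of the following character exceeds `threshold`."""
    tokens, start = [], 0
    for i in range(1, len(text)):
        context = text[max(0, i - order):i]
        if branching_entropy(context, successors) > threshold:
            tokens.append(text[start:i])
            start = i
    if start < len(text):
        tokens.append(text[start:])
    return tokens


# Toy corpus without spaces (hypothetical data, for illustration only).
corpus = ["thecatsat", "thedogsat", "thecowran"]
model = train_successors(corpus)
print(tokenize("thecatsat", model, threshold=0.9))
```

The threshold is the kind of hyper-parameter that would need tuning per language and corpus; on a toy corpus this small the resulting splits are only suggestive.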

Papers

December 22, 2024

Enhancing Item Tokenization for Generative Recommendation through Self-Improvement
Runjin Chen, Mingxuan Ju, Ngoc Bui, Dimosthenis Antypas, Stanley Cai, Xiaopeng Wu, Leonardo Neves, Zhangyang Wang, Neil Shah, Tong Zhao
Self Improvement, Generative Recommendation, Adaptive Tokenization, Unsupervised Tokenization

March 4, 2023

Self-tuning hyper-parameters for unsupervised cross-lingual tokenization
Anton Kolonin
Cross Lingual, Hyper Parameter, F1 Score, Unsupervised Tokenization

May 23, 2022

Unsupervised Tokenization Learning
Anton Kolonin, Vignav Ramesh
Multilingual Corpus, Efficient Tokenization, Unsupervised Tokenization
