Vocabulary Size
Vocabulary size in large language models (LLMs) is a critical factor influencing model performance and efficiency, and recent research focuses on choosing the vocabulary size jointly with model parameters and the available compute budget. Studies across architectures, including BERT and other transformer-based models, show that larger vocabularies generally improve downstream-task performance, but only up to a threshold set by the model's size and the amount of training data. This research matters because it directly affects the cost-effectiveness and performance of LLMs across diverse applications, particularly in low-resource language settings where efficient vocabulary expansion is crucial.
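As a concrete illustration of the underlying trade-off, the sketch below (a hypothetical example, not drawn from any of the papers) uses the Hugging Face `tokenizers` library to train BPE tokenizers at several assumed vocabulary sizes and compares their fertility (tokens emitted per word) on held-out text: larger vocabularies compress text into fewer tokens, but each added token also adds an embedding row, so the benefit eventually flattens. The file names `corpus.txt` and `heldout.txt` are placeholders.

```python
# Hypothetical sketch: sweep BPE vocabulary sizes and measure fertility
# (tokens per whitespace-delimited word) on a held-out sample.
# Assumes local files "corpus.txt" (training text) and "heldout.txt".
from tokenizers import Tokenizer, models, pre_tokenizers, trainers


def train_bpe(files, vocab_size):
    """Train a plain BPE tokenizer with the requested vocabulary size."""
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tok.train(files, trainer)
    return tok


def fertility(tok, text):
    """Tokens per word: lower means the vocabulary compresses the text better."""
    words = text.split()
    return len(tok.encode(text).tokens) / max(len(words), 1)


if __name__ == "__main__":
    heldout = open("heldout.txt", encoding="utf-8").read()
    for vocab_size in (8_000, 32_000, 128_000):
        tok = train_bpe(["corpus.txt"], vocab_size)
        # Embedding parameters grow linearly with vocabulary size, so any
        # gain in compression has to be weighed against the added parameters.
        print(f"vocab={vocab_size:>7,}  tokens/word={fertility(tok, heldout):.3f}")
```

In practice, the curve of fertility versus vocabulary size flattens as the vocabulary grows, which is why the optimal size depends on model scale and training budget rather than being "the bigger, the better."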