Vocabulary Expansion

Vocabulary expansion for large language models (LLMs) aims to improve their performance on languages and domains underrepresented in their original training data by adding new tokens to the model's vocabulary, so that previously fragmented words and concepts can be represented directly. Current research focuses on efficient methods for adding new vocabulary items, including optimal subset selection, embedding initialization techniques (such as Constrained Word2Vec), and effective continual pre-training strategies that work even with limited data. These advances are crucial for improving the multilingual capabilities of LLMs and for enabling their use in specialized domains such as medicine, where existing models often struggle with domain-specific terminology. The resulting gains in efficiency and accuracy have significant implications for both fundamental NLP research and practical applications across many fields.
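To make the embedding-initialization step concrete, here is a minimal sketch of one widely used heuristic (not a method named in the papers above): initializing each new token's embedding as the mean of the embeddings of the subword pieces it previously tokenized into. The vocabulary, tokenizer, and arrays below are toy stand-ins, not any specific model's.

```python
import numpy as np

def expand_embeddings(emb, vocab, new_tokens, tokenize):
    """Append one row per new token to the embedding matrix `emb`,
    initializing each row as the mean of the embeddings of the
    existing subword pieces returned by `tokenize`."""
    new_rows = []
    for tok in new_tokens:
        piece_ids = [vocab[p] for p in tokenize(tok)]
        new_rows.append(emb[piece_ids].mean(axis=0))
    return np.vstack([emb, np.array(new_rows)])

# Toy setup: a 4-token vocabulary with 3-dimensional embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 3))
vocab = {"car": 0, "##dio": 1, "##logy": 2, "the": 3}

# Hypothetical domain word that the old tokenizer split into pieces.
tokenize = lambda w: ["car", "##dio", "##logy"]
new_emb = expand_embeddings(emb, vocab, ["cardiology"], tokenize)

# The new row is the mean of its subword embeddings.
assert new_emb.shape == (5, 3)
assert np.allclose(new_emb[4], emb[[0, 1, 2]].mean(axis=0))
```

Mean initialization gives the new token a starting point close to its compositional meaning, which tends to make subsequent continual pre-training converge faster than random initialization; the papers below explore more sophisticated alternatives.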

Papers