Vocabulary Expansion
Vocabulary expansion for large language models (LLMs) aims to improve their performance on tasks involving languages or domains beyond their initial training data, primarily by adding new vocabulary items so that the model can better represent new words and concepts. Current research focuses on efficient methods for adding these items, including selecting an optimal subset of new vocabulary, initializing the new embeddings (for example, with Constrained Word2Vec), and continual pre-training strategies that remain effective with limited data. These advances matter for improving the multilingual capabilities of LLMs and for applying them in specialized domains such as medicine, where existing models often struggle with domain-specific terminology. The resulting gains in efficiency and accuracy are relevant to both fundamental NLP research and practical applications.
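As a rough illustration of the embedding-initialization step mentioned above, the sketch below uses mean-of-subtoken initialization, a common baseline heuristic. It is not Constrained Word2Vec and not the method of any paper listed on this page. The function name `mean_init_embeddings`, the toy vocabulary, and the two-dimensional vectors are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def mean_init_embeddings(
    embeddings: List[List[float]],
    vocab: Dict[str, int],
    new_tokens: Dict[str, List[str]],
) -> None:
    """Append one embedding row per new token, in place.

    Each new row is the mean of the rows of the pieces that the ORIGINAL
    vocabulary would have used for that token. `new_tokens` maps a new
    token to those pieces (assumed input; a real tokenizer would supply it).
    """
    dim = len(embeddings[0])
    for token, pieces in new_tokens.items():
        if token in vocab:
            continue  # token already exists, nothing to add
        if not pieces:
            raise ValueError(f"no sub-token pieces given for {token!r}")
        rows = [embeddings[vocab[p]] for p in pieces]
        mean_row = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
        vocab[token] = len(embeddings)  # new id is the next free row index
        embeddings.append(mean_row)


# Toy example: a medical term that the original vocabulary splits in three.
vocab = {"cardio": 0, "myo": 1, "pathy": 2}
embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
mean_init_embeddings(
    embeddings, vocab, {"cardiomyopathy": ["cardio", "myo", "pathy"]}
)
print(vocab["cardiomyopathy"], embeddings[3])  # 3 [1.0, 1.0]
```

Starting a new token near the average of its old pieces gives continual pre-training a reasonable starting point instead of a random vector; in a real model the output (unembedding) matrix would need the same treatment.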
Papers
Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion
Jianqing Zhu, Huang Huang, Zhihang Lin, Juhao Liang, Zhengyang Tang, Khalid Almubarak, Abdulmohsen Alharthik, Bang An, Juncai He, Xiangbo Wu, Fei Yu, Junying Chen, Zhuoheng Ma, Yuhao Du, He Zhang, Emad A. Alghamdi, Lian Zhang, Ruoyu Sun, Haizhou Li, Benyou Wang, Jinchao Xu
Vocabulary Expansion of Chat Models with Unlabeled Target Language Data
Atsuki Yamaguchi, Terufumi Morishita, Aline Villavicencio, Nikolaos Aletras