Enhanced Vocabulary
Enhancing the vocabulary of large language models (LLMs) is an active research area aimed at improving model performance and efficiency across diverse tasks and languages. Current efforts focus on optimizing vocabulary size for different model scales, developing adaptive methods for selecting subsets of domain-specific terms, and exploring tokenization strategies beyond subwords, such as multi-word entities. These advances matter because vocabulary choices directly affect downstream performance, enabling more accurate and efficient language processing in applications ranging from machine translation to information retrieval.
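As a concrete illustration of the multi-word-entity direction, the sketch below shows one common recipe using the Hugging Face transformers API: register domain phrases as whole tokens, then resize the model's embedding matrix to match. The model name and example phrases here are placeholders; a real system would first score candidate phrases (e.g., by corpus frequency or subword fragmentation) before adding them.

```python
# A minimal sketch of vocabulary extension with multi-word entities,
# assuming the Hugging Face transformers library. Phrases are hypothetical.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Candidate domain phrases to treat as single tokens.
new_tokens = ["machine translation", "information retrieval"]

# add_tokens() skips strings already in the vocabulary and
# returns the number of tokens actually added.
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding table so the new token ids have vectors.
model.resize_token_embeddings(len(tokenizer))

# Each phrase now encodes as one token id instead of several subwords.
print(tokenizer.tokenize("machine translation"))
```

The newly added embedding rows are randomly initialized, so some fine-tuning is typically needed before the added tokens become useful to the model.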