Byte Pair Encoding
Byte Pair Encoding (BPE) is a subword tokenization algorithm initially designed for data compression, now widely used in natural language processing and increasingly applied to other modalities like symbolic music, molecular graphs, and images. Current research focuses on optimizing BPE for various data types, including exploring its effectiveness in multilingual settings and addressing issues like frequency imbalances and the optimal number of subword units for specific tasks. This adaptable technique significantly improves the handling of out-of-vocabulary words and enhances performance in diverse machine learning applications, from language modeling and machine translation to molecular generation and multimodal understanding.
Papers
November 15, 2024
November 13, 2024
November 10, 2024
November 8, 2024
October 31, 2024
October 3, 2024
October 2, 2024
September 29, 2024
July 26, 2024
April 27, 2024
March 15, 2024
January 28, 2024
June 29, 2023
May 4, 2023
January 27, 2023
August 10, 2022