Paper ID: 2304.14780

Training and Evaluation of a Multilingual Tokenizer for GPT-SW3

Felix Stollenwerk

This paper provides a detailed discussion of the multilingual tokenizer used for GPT-SW3. The tokenizer was trained on the Nordic Pile using the SentencePiece library and the BPE algorithm. We outline the tokenizer's most important features and share details on its learned vocabulary. In addition, we systematically analyze the properties and evaluate the performance of the tokenizer with regard to the different languages present in the data.

Submitted: Apr 28, 2023

Topics

Training Data
Global Evaluation
GPT 3
Online Tokenizer
Non Contiguous Piece
Multilingual Tokenizer
BPE Vocabulary
