Token Repetition

Token repetition in large language models (LLMs) and other transformer-based architectures is an active research area focused on identifying, mitigating, and in some cases exploiting repeated tokens and their effects on model performance and efficiency. Current efforts include methods to trace the source of repeated tokens, techniques that reduce redundancy through pruning and pooling, and novel decoding strategies such as parallel decoding that improve generation speed and suppress repetition. Addressing token repetition is crucial for improving the efficiency, reliability, and safety of LLMs, affecting both the development of more resource-efficient models and the trustworthiness of their outputs.
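
One of the mitigation directions above operates at decoding time. As a simplified illustration (not taken from any specific paper listed below), the sketch here applies a repetition penalty to the logits of tokens that have already been generated, making them less likely to be emitted again; the penalty value and toy vocabulary are illustrative assumptions.

```python
# Minimal sketch of decoding-time repetition mitigation via a logit penalty.
# The vocabulary size, penalty value, and toy history are illustrative only.
import numpy as np

def apply_repetition_penalty(logits: np.ndarray, generated_ids: list[int],
                             penalty: float = 1.2) -> np.ndarray:
    """Down-weight tokens that already appear in the generated sequence.

    Positive logits are divided by `penalty`, negative logits are multiplied,
    so previously emitted tokens become less attractive at the next step.
    """
    adjusted = logits.copy()
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty
        else:
            adjusted[token_id] *= penalty
    return adjusted

# Toy usage: an 8-token vocabulary where token 3 has already been generated twice.
rng = np.random.default_rng(0)
logits = rng.normal(size=8)
history = [3, 5, 3]
print("greedy pick before penalty:", int(np.argmax(logits)))
print("greedy pick after penalty: ", int(np.argmax(apply_repetition_penalty(logits, history))))
```

Production toolkits expose comparable controls, for example the `repetition_penalty` and `no_repeat_ngram_size` arguments of Hugging Face Transformers' `generate()`.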

Papers