Speculative Sampling
Speculative sampling accelerates autoregressive decoding in large language models (LLMs) by drafting several tokens ahead and verifying them together, rather than generating one token per forward pass. Current research focuses on improving the accuracy and efficiency of these drafted predictions, exploring techniques such as dynamic draft trees that adapt to context, optimized parallel processing on GPUs, and batched sampling across multiple sequences. These advances reduce inference latency and increase throughput, making LLMs more practical for real-world applications that require rapid text generation.
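To make the draft-then-verify idea concrete, here is a minimal sketch of the accept/reject step in its commonly described form, not the method of any particular paper in this topic. It assumes the draft and target distributions are already available as plain probability lists. The function name `speculative_step` and its parameters are hypothetical, and a real system would obtain these distributions from model forward passes.

```python
import random


def speculative_step(p_target, q_draft, drafted, rng):
    """Verify drafted tokens against the target model's distributions.

    p_target: list of len(drafted) + 1 probability lists from the target model
              (the extra one is used to sample a bonus token if every draft is accepted).
    q_draft:  list of len(drafted) probability lists from the draft model.
    drafted:  token ids sampled from q_draft, one per position.
    rng:      a random.Random instance.
    Returns the tokens to append to the sequence.
    """
    out = []
    for i, tok in enumerate(drafted):
        p, q = p_target[i], q_draft[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the residual max(0, p - q)
        # and discard all later drafted tokens.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        weights = residual if sum(residual) > 0 else p
        out.append(rng.choices(range(len(p)), weights=weights)[0])
        return out
    # Every drafted token was accepted: sample one extra token from the target.
    bonus = p_target[len(drafted)]
    out.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return out
```

When the draft distribution matches the target exactly, every drafted token is accepted and one verification pass yields `len(drafted) + 1` tokens. The more the two distributions disagree, the earlier a rejection tends to occur, which is why draft accuracy matters for the speedup.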