Draft Model
Draft models are lightweight, auxiliary models used to accelerate inference of large language models (LLMs) by pre-generating candidate tokens that a larger, more accurate target model then verifies, a scheme commonly known as speculative decoding. Current research focuses on improving the efficiency and accuracy of this process through techniques such as multi-candidate sampling, adaptive draft lengths, and context-aware model selection, often employing recurrent neural networks or simplified transformer architectures as draft models. These advances reduce the computational cost of LLM inference, benefiting research by enabling faster experimentation and practical applications by making LLMs viable in resource-constrained deployments.
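The sketch below illustrates the draft-then-verify loop in its simplest form: the draft model proposes a short block of tokens, the target model scores the same positions, and each proposal is accepted with probability min(1, p/q), with a resample from the residual distribution on the first rejection. The function names (`draft_model`, `target_model`, `speculative_step`), the toy next-token distributions, and the constants `VOCAB` and `GAMMA` are illustrative assumptions, not any particular paper's implementation.

```python
"""Minimal sketch of draft-then-verify (speculative) decoding.

Toy next-token distributions stand in for the real draft and target models;
all names and constants here are assumptions for illustration only.
"""
import numpy as np

VOCAB = 8   # toy vocabulary size (assumption)
GAMMA = 4   # number of tokens the draft model proposes per round (assumption)
rng = np.random.default_rng(0)

def toy_dist(context, temperature):
    """Deterministic toy next-token distribution derived from the context."""
    seed = (sum(context) + len(context)) % 1000
    logits = np.sin(np.arange(VOCAB) * (seed + 1)) / temperature
    e = np.exp(logits - logits.max())
    return e / e.sum()

def draft_model(context):   # cheap, lower-quality proposer (stand-in)
    return toy_dist(context, temperature=1.5)

def target_model(context):  # expensive, higher-quality verifier (stand-in)
    return toy_dist(context, temperature=0.8)

def speculative_step(context):
    """One round: draft proposes GAMMA tokens, target accepts or rejects them."""
    # 1. Draft model proposes GAMMA tokens autoregressively.
    proposed, q_probs = [], []
    ctx = list(context)
    for _ in range(GAMMA):
        q = draft_model(ctx)
        tok = int(rng.choice(VOCAB, p=q))
        proposed.append(tok)
        q_probs.append(q)
        ctx.append(tok)
    # 2. Target model scores every proposed position
    #    (a single batched forward pass in a real system).
    p_probs = [target_model(context + proposed[:i]) for i in range(GAMMA + 1)]
    # 3. Accept each token with probability min(1, p/q); stop at first rejection.
    accepted = []
    for i, tok in enumerate(proposed):
        p, q = p_probs[i][tok], q_probs[i][tok]
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            # Resample the rejected position from the residual max(0, p - q).
            residual = np.maximum(p_probs[i] - q_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(VOCAB, p=residual)))
            return accepted
    # 4. All drafts accepted: take one bonus token from the target distribution.
    accepted.append(int(rng.choice(VOCAB, p=p_probs[GAMMA])))
    return accepted

context = [1, 2, 3]
for _ in range(3):
    context += speculative_step(context)
print(context)
```

Because verification batches all proposed positions into one target-model pass, each round costs roughly one target forward pass while emitting up to GAMMA + 1 tokens, which is where the speedup comes from; the acceptance-and-residual rule keeps the output distribution matching the target model's.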