Large Language Model Inference

Large language model (LLM) inference research focuses on optimizing the speed and efficiency of generating text from LLMs, aiming to reduce computational cost and latency without sacrificing output quality. Current efforts concentrate on techniques such as quantization, model compression (including pruning and knowledge distillation), improved caching strategies (especially for the attention key-value (KV) cache), and novel decoding methods such as speculative decoding and multi-token generation. These advances are crucial for deploying LLMs on resource-constrained devices and for making large-scale LLM applications more economically and environmentally sustainable.
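
To make the first of these ideas concrete, the sketch below shows symmetric per-tensor int8 weight quantization in plain NumPy: weights are rescaled into the integer range [-127, 127] and stored as int8, cutting memory roughly 4x versus float32 at the cost of a small reconstruction error. This is a minimal illustration under simple assumptions (random stand-in weights, per-tensor scaling), not the method of any particular paper or library.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]."""
    scale = max(np.abs(weights).max() / 127.0, 1e-12)  # guard against all-zero tensors
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the int8 values and the scale."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(256, 256)).astype(np.float32)  # hypothetical weight matrix
    q, scale = quantize_int8(w)
    w_hat = dequantize(q, scale)
    # Storage drops from 4 bytes to 1 byte per weight; error stays small for
    # well-behaved (roughly Gaussian) weight distributions.
    print("max abs error:", np.abs(w - w_hat).max())
    print("bytes: float32 =", w.nbytes, " int8 =", q.nbytes)
```

Production systems typically refine this basic scheme, for example with per-channel or per-group scales and outlier handling, but the core trade-off of fewer bits per weight against a bounded reconstruction error is the same.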

Papers