Large Language Model Inference
Large language model (LLM) inference research focuses on making text generation from LLMs faster and more efficient, reducing computational cost and latency without sacrificing accuracy. Current efforts concentrate on quantization, model compression (including pruning and knowledge distillation), improved caching strategies (especially for the attention key-value (KV) cache), and novel decoding methods such as speculative decoding and multi-token generation. These advances are crucial for deploying LLMs on resource-constrained devices and for making large-scale LLM applications economically and environmentally sustainable.
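To make the decoding-side ideas concrete, below is a minimal pure-Python sketch of speculative decoding in the style of Leviathan et al. (2023): a cheap draft model proposes k tokens, the larger target model verifies them, and an accept/reject rule keeps the output distribution identical to decoding with the target model alone. The draft_model and target_model functions here are hypothetical toy distributions standing in for real LLMs, and names such as speculative_step are illustrative rather than taken from any library.

```python
import math
import random

# Toy vocabulary; in practice this would be the tokenizer's vocabulary.
VOCAB = list(range(8))

def _softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def target_model(ctx):
    # Hypothetical "large" model: strongly prefers token (last + 1) mod |V|.
    last = ctx[-1] if ctx else 0
    return _softmax([3.0 if v == (last + 1) % len(VOCAB) else 0.0 for v in VOCAB])

def draft_model(ctx):
    # Hypothetical cheap draft model: same preference, but less confident.
    last = ctx[-1] if ctx else 0
    return _softmax([1.5 if v == (last + 1) % len(VOCAB) else 0.0 for v in VOCAB])

def sample(dist):
    return random.choices(VOCAB, weights=dist, k=1)[0]

def speculative_step(ctx, k=4):
    """One round of speculative decoding: the draft model proposes k tokens,
    the target model verifies them, and each proposal is accepted with
    probability min(1, p_target / p_draft)."""
    proposals, draft_dists = [], []
    c = list(ctx)
    for _ in range(k):
        q = draft_model(tuple(c))
        tok = sample(q)
        proposals.append(tok)
        draft_dists.append(q)
        c.append(tok)

    accepted = []
    for i, tok in enumerate(proposals):
        prefix = tuple(ctx) + tuple(accepted)
        p = target_model(prefix)  # in a real system, all k checks share one batched forward pass
        q = draft_dists[i]
        if random.random() < min(1.0, p[tok] / max(q[tok], 1e-12)):
            accepted.append(tok)
        else:
            # Rejection: resample from the renormalized residual max(0, p - q),
            # then end the round; this correction preserves the target distribution.
            residual = [max(p[v] - q[v], 0.0) for v in VOCAB]
            z = sum(residual) or 1.0
            accepted.append(sample([r / z for r in residual]))
            return accepted

    # All k proposals accepted: draw one free "bonus" token from the target model.
    accepted.append(sample(target_model(tuple(ctx) + tuple(accepted))))
    return accepted

if __name__ == "__main__":
    print("accepted tokens:", speculative_step((0,), k=4))
```

The speed-up comes from the verification step: checking k drafted tokens can be batched into a single target-model forward pass, so each round costs roughly one large-model step while often emitting several tokens.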
Papers
(20 papers, dated February 29, 2024 through July 17, 2024.)