Large Language Model Inference
Large language model (LLM) inference research focuses on optimizing the speed and efficiency of generating text from LLMs, aiming to reduce computational cost and latency without sacrificing accuracy. Current efforts concentrate on techniques such as quantization, model compression (including pruning and knowledge distillation), improved caching strategies (especially key-value (KV) caches), and novel decoding methods such as speculative decoding and multi-token generation. These advances are crucial for deploying LLMs on resource-constrained devices and for making large-scale LLM applications more economically and environmentally sustainable.
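To make the decoding side concrete, below is a minimal sketch of speculative decoding: a cheap draft model proposes a few tokens autoregressively, and the target model verifies them with the standard accept/reject rule, so the output distribution matches the target model while amortizing its cost over several tokens per step. The toy models, vocabulary size, and helper names here are illustrative assumptions, not any particular paper's implementation; a real system would replace them with actual LLM forward passes.

```python
# Minimal speculative-decoding sketch. Toy "models" stand in for real
# draft/target LLMs (assumption for illustration only).
import numpy as np

VOCAB = 16  # toy vocabulary size (assumption)

def toy_model(seed):
    """Return a 'model': token context -> next-token distribution over VOCAB."""
    def next_token_probs(context):
        # Deterministic pseudo-random distribution keyed on the context.
        local = np.random.default_rng(hash(tuple(context)) % (2**32) + seed)
        logits = local.normal(size=VOCAB)
        e = np.exp(logits - logits.max())
        return e / e.sum()
    return next_token_probs

def speculative_step(target, draft, context, k, rng):
    """One round: the draft proposes k tokens, the target verifies them."""
    # 1) Draft model proposes k tokens autoregressively (cheap).
    proposed, q_probs = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft(ctx)
        t = int(rng.choice(VOCAB, p=q))
        proposed.append(t)
        q_probs.append(q)
        ctx.append(t)
    # 2) Target model scores all proposed positions. In a real system this
    #    is a single parallel forward pass, which is where the speedup lies.
    p_probs = []
    ctx = list(context)
    for t in proposed:
        p_probs.append(target(ctx))
        ctx.append(t)
    # 3) Accept each token with prob min(1, p/q); on the first rejection,
    #    resample from the residual distribution max(p - q, 0) and stop.
    accepted = []
    for t, p, q in zip(proposed, p_probs, q_probs):
        if rng.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(VOCAB, p=residual)))
            break
    else:
        # All k drafts accepted: take one bonus token from the target.
        accepted.append(int(rng.choice(VOCAB, p=target(ctx))))
    return accepted

rng = np.random.default_rng(0)
target, draft = toy_model(1), toy_model(2)
context = [0]
for _ in range(5):
    context += speculative_step(target, draft, context, k=4, rng=rng)
print("generated tokens:", context)
```

The accept/reject rule is what makes the method lossless: accepted tokens are provably distributed as if sampled from the target model alone, so the draft model only affects how many target-model calls are saved, not the quality of the output.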