Efficient Inference
Efficient inference for large language models (LLMs) aims to reduce the substantial computational cost and memory demands of LLM deployment, enabling wider accessibility and practical applications. Current research focuses on model compression (quantization, pruning, knowledge distillation), optimized decoding strategies (speculative decoding, early exiting), and alternative architectures (e.g., linear attention mechanisms, recurrent networks) that improve speed and resource efficiency; one of these decoding strategies is sketched below. These advances are crucial for deploying LLMs on resource-constrained devices and for reducing the environmental impact of their operation, with implications for both scientific research and industry.
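As an illustration of one of these decoding strategies, here is a minimal sketch of the accept/reject step used in speculative decoding, assuming toy next-token distributions in place of real draft- and target-model calls; the function name `speculative_step` and the array shapes are illustrative, not taken from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(probs):
    """Draw one token id from a categorical distribution."""
    return int(rng.choice(len(probs), p=probs))

def speculative_step(draft_probs, target_probs, k):
    """
    One round of speculative decoding: a small draft model proposes up to k
    tokens, and the large target model accepts each proposal with probability
    min(1, p_target / p_draft); on the first rejection it resamples from the
    normalized residual distribution and discards the remaining proposals.
    draft_probs[i] / target_probs[i] are the two models' next-token
    distributions at draft position i (toy stand-ins for real model calls).
    """
    accepted = []
    for i in range(k):
        q = draft_probs[i]          # draft distribution at position i
        p = target_probs[i]         # target distribution at position i
        token = sample(q)           # draft model's proposal
        if rng.random() < min(1.0, p[token] / q[token]):
            accepted.append(token)  # target agrees often enough: keep it
        else:
            # Rejected: resample from the residual max(p - q, 0), renormalized.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(sample(residual))
            break                   # later draft positions are now stale
    return accepted

# Toy usage: 3 draft positions over a 4-token vocabulary.
V, k = 4, 3
draft = rng.dirichlet(np.ones(V), size=k)
target = rng.dirichlet(np.ones(V), size=k)
print(speculative_step(draft, target, k))
```

Because every accepted token is distributed exactly as if it had been sampled from the target model, the speedup comes purely from verifying several draft tokens per target-model forward pass rather than from approximating the output distribution.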