Efficient Inference
Efficient inference for large language models (LLMs) aims to reduce the substantial computational cost and memory demands of LLM deployment, making these models more widely accessible and practical to use. Current research focuses on techniques such as model compression (quantization, pruning, knowledge distillation), optimized decoding strategies (speculative decoding, early exiting), and novel architectures (e.g., linear attention mechanisms, recurrent networks) that improve speed and resource efficiency. These advances are crucial for deploying LLMs on resource-constrained devices and for reducing the environmental footprint of their operation, with implications for both scientific research and industry.
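To make one of the techniques named above concrete, below is a minimal sketch of speculative decoding: a cheap draft model proposes several tokens, and the expensive target model verifies them with an accept/reject rule so that the resulting samples follow the target model's distribution. The `draft_probs` and `target_probs` functions are toy categorical distributions standing in for real models; all names, parameters, and the vocabulary are illustrative assumptions, not taken from any particular library or paper implementation.

```python
import random

# Toy vocabulary. In practice the "draft" would be a small LLM and the
# "target" the full LLM; here both are fixed categorical distributions.
VOCAB = ["a", "b", "c", "d"]

def draft_probs(context):
    # Cheap proposal distribution (hypothetical stand-in for a small model).
    return {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

def target_probs(context):
    # Expensive distribution we actually want to sample from (stand-in for the large model).
    return {"a": 0.35, "b": 0.25, "c": 0.25, "d": 0.15}

def sample(probs):
    return random.choices(list(probs), weights=list(probs.values()))[0]

def speculative_step(context, k=4):
    """Propose k tokens with the draft model, then verify them against the
    target model, so accepted output is distributed as if sampled from the
    target model alone."""
    # 1) Draft k tokens autoregressively (cheap).
    drafted, ctx = [], list(context)
    for _ in range(k):
        tok = sample(draft_probs(ctx))
        drafted.append(tok)
        ctx.append(tok)

    # 2) Verify: accept each drafted token with probability min(1, p_target / p_draft).
    accepted, ctx = [], list(context)
    for tok in drafted:
        p_t, p_d = target_probs(ctx), draft_probs(ctx)
        if random.random() < min(1.0, p_t[tok] / p_d[tok]):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Rejected: resample from the residual max(0, p_target - p_draft), normalized.
            residual = {v: max(0.0, p_t[v] - p_d[v]) for v in VOCAB}
            correction = sample(residual) if sum(residual.values()) > 0 else sample(p_t)
            accepted.append(correction)
            return accepted  # stop at the first rejection
    return accepted

if __name__ == "__main__":
    print(speculative_step(["<bos>"], k=4))
```

Because several draft tokens can be verified in a single pass of the target model, acceptance of even a few tokens per step reduces the number of expensive target-model forward passes, which is where the speedup comes from.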