Inference Time
Inference time, the time a model takes to process an input and produce an output, is a critical factor in the performance and scalability of large language models (LLMs) and other deep learning systems. Current research focuses on improving inference efficiency through techniques such as adaptive sampling, architecture search targeting inference-time efficiency, and model compression, aiming to reduce computational cost without sacrificing accuracy. These advances are crucial for deploying LLMs in resource-constrained environments and for improving the responsiveness of AI applications, affecting both the efficiency of AI systems and their accessibility to a wider range of users.
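To make the metric concrete, here is a minimal sketch of how inference time is typically measured, along with the effect of one compression technique (dynamic int8 quantization). PyTorch, the toy model, and all shapes and run counts are illustrative assumptions, not details drawn from the papers surveyed here.

```python
# Minimal sketch: measure mean per-request inference latency, then
# compare against a dynamically quantized copy of the same model.
import time
import torch
import torch.nn as nn

def mean_latency_ms(model: nn.Module, x: torch.Tensor, n_runs: int = 100) -> float:
    """Average wall-clock time per forward pass, after warmup."""
    model.eval()
    with torch.no_grad():
        for _ in range(10):           # warmup: exclude one-time setup costs
            model(x)
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
    return 1000 * (time.perf_counter() - start) / n_runs

# Stand-in model and input; a real deployment would use an actual LLM.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
x = torch.randn(1, 512)               # a single request

print(f"fp32: {mean_latency_ms(model, x):.3f} ms")

# Model compression example: dynamic quantization stores Linear weights
# in int8 and quantizes activations on the fly, which usually shrinks
# the model and speeds up CPU inference.
q_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(f"int8: {mean_latency_ms(q_model, x):.3f} ms")
```

The warmup loop matters in practice: the first forward passes absorb one-time costs (kernel selection, memory allocation, caching) that would otherwise inflate the measured latency.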