Inference Speed
Inference speed, the time a machine learning model takes to process input and produce output, is a critical factor limiting the deployment of powerful models in resource-constrained environments and real-time applications. Current research focuses on optimizing architectures such as transformers and diffusion models through techniques like knowledge distillation, model pruning, parallel decoding, and early exiting, aiming to reduce latency substantially without sacrificing accuracy. These advances are crucial for bringing large language models, computer vision systems, and other computationally intensive AI systems to diverse platforms, from smartphones to embedded devices.
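To make one of these techniques concrete, the sketch below illustrates the early-exit idea in PyTorch: each encoder layer is paired with a lightweight classification head, and inference stops at the first layer whose prediction confidence clears a threshold, so easy inputs skip the deeper (and slower) layers. This is a minimal illustrative sketch, not the method of any particular paper; the model sizes, the mean-pooling choice, and the confidence threshold are all assumptions.

```python
import torch
import torch.nn as nn

class EarlyExitClassifier(nn.Module):
    """Hypothetical transformer classifier with an exit head after
    every encoder layer (sizes and threshold are illustrative)."""

    def __init__(self, d_model=256, n_layers=6, n_classes=10, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        ])
        self.exits = nn.ModuleList([
            nn.Linear(d_model, n_classes) for _ in range(n_layers)
        ])
        self.threshold = threshold  # confidence required to stop early

    @torch.no_grad()
    def forward(self, x):
        # x: (batch=1, seq_len, d_model); single-example exit for simplicity
        for depth, (layer, exit_head) in enumerate(zip(self.layers, self.exits), 1):
            x = layer(x)
            logits = exit_head(x.mean(dim=1))      # pool tokens -> class logits
            confidence = logits.softmax(-1).max().item()
            if confidence >= self.threshold:       # confident enough: skip remaining layers
                return logits, depth
        return logits, depth                       # fell through: used all layers

model = EarlyExitClassifier().eval()
logits, layers_used = model(torch.randn(1, 32, 256))
print(f"exited after {layers_used} of 6 layers")
```

The threshold exposes the latency/accuracy trade-off the summary describes: lowering it makes more inputs exit early and cuts average latency, but risks accepting less reliable predictions from the shallow heads.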