Fast Inference
Fast inference in machine learning aims to reduce the time and compute needed to obtain predictions from complex models, a bottleneck that hinders the deployment of powerful models such as large language models and vision transformers. Current research focuses on speculative decoding, model compression (including pruning and quantization), and architectural changes such as mixture-of-experts and hierarchical attention mechanisms. These advances matter for deploying sophisticated AI models in resource-constrained environments and real-time applications, with impact on fields ranging from natural language processing and computer vision to astrophysics and robotics.
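As a rough illustration of one compression technique named above, the sketch below applies symmetric per-tensor int8 quantization to a small list of weights in plain Python. The function names (`quantize_int8`, `dequantize`) and the sample weights are hypothetical and chosen only for illustration; production systems typically rely on a framework's quantization tooling and often use per-channel scales or calibration data, which this sketch omits.

```python
def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] using one shared scale.

    Illustrative assumption: symmetric, per-tensor quantization with
    round-to-nearest. Real toolchains offer more elaborate schemes.
    """
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any positive scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code, then clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


# Hypothetical example weights.
weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Worst-case rounding error is about half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as a one-byte code plus a single shared scale, which cuts memory traffic relative to 32-bit floats at the cost of a bounded rounding error. Very small weights, such as 0.003 here, collapse to zero, which is one reason finer-grained scales are common in practice.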