Fast Inference
Fast inference in machine learning aims to accelerate the process of obtaining predictions from complex models, addressing the computational bottlenecks that hinder deployment of powerful models such as large language models and vision transformers. Current research focuses on techniques such as speculative decoding, model compression (including pruning and quantization), and architectural innovations like mixture-of-experts and hierarchical attention mechanisms. These advances are crucial for deploying sophisticated AI models in resource-constrained environments and real-time applications, with impact on fields ranging from natural language processing and computer vision to astrophysics and robotics.
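As a concrete illustration of one of these techniques, below is a minimal sketch of the speculative decoding accept/reject loop: a cheap draft model proposes several tokens, and the target model verifies them so that the accepted output still follows the target distribution exactly. The stand-in `draft_probs` and `target_probs` functions, the toy vocabulary size, and the draft length `k` are all illustrative assumptions, not any particular paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16  # toy vocabulary size (assumption for illustration)

def draft_probs(context):
    """Stand-in for a small, fast draft model: returns a next-token
    distribution over the toy vocabulary given the context."""
    h = hash(tuple(context)) % (2**32)
    p = np.random.default_rng(h).random(VOCAB)
    return p / p.sum()

def target_probs(context):
    """Stand-in for the large target model (notionally expensive;
    in practice its k verification calls are batched into one pass)."""
    h = hash(tuple(context)) % (2**32)
    p = np.random.default_rng(h + 1).random(VOCAB) ** 2
    return p / p.sum()

def speculative_step(context, k=4):
    """One round of speculative decoding: the draft model proposes k
    tokens, the target model verifies them, and each token is accepted
    with probability min(1, p_target / p_draft), which makes the output
    distribution match the target model exactly."""
    proposed, draft_dists = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        t = rng.choice(VOCAB, p=q)
        proposed.append(t)
        draft_dists.append(q)
        ctx.append(t)

    accepted = []
    ctx = list(context)
    for t, q in zip(proposed, draft_dists):
        p = target_probs(ctx)
        if rng.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)
            ctx.append(t)
        else:
            # On rejection, resample from the residual distribution
            # max(0, p - q), renormalized, then end this round.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(VOCAB, p=residual))
            break
    else:
        # All k drafts accepted: take one bonus token from the target.
        accepted.append(rng.choice(VOCAB, p=target_probs(ctx)))
    return accepted

print(speculative_step([1, 2, 3]))
```

Each round emits between one and k+1 tokens while invoking the expensive model only once per round (batched over the draft positions), which is the source of the speedup when the draft model's proposals are frequently accepted.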