Fast Inference
Fast inference in machine learning aims to reduce the time and compute needed to obtain predictions from complex models, addressing a key bottleneck in deploying powerful models such as large language models and vision transformers. Current research focuses on techniques such as speculative decoding, model compression (including pruning and quantization), and architectural innovations like mixture-of-experts and hierarchical attention mechanisms. These advances are crucial for running sophisticated AI models in resource-constrained environments and real-time applications, with impact on fields ranging from natural language processing and computer vision to astrophysics and robotics.
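To make the speculative decoding idea concrete, the sketch below implements its draft-then-verify loop with toy next-token distributions standing in for real models: a cheap draft model proposes k tokens, and the target model accepts each with probability min(1, p/q), resampling from the residual distribution on the first rejection. This is a minimal illustration of the general technique, not code from any of the papers listed; the names (toy_dist, speculative_step, VOCAB) and the toy distributions are assumptions for the example.

```python
# Minimal sketch of speculative decoding with toy distributions.
# Assumption: toy_dist stands in for a real model's next-token distribution.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size (illustrative assumption)

def toy_dist(ctx, seed):
    """Deterministic toy next-token distribution over VOCAB for a context."""
    g = np.random.default_rng(hash((tuple(ctx), seed)) % (2**32))
    logits = g.standard_normal(VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def draft_dist(ctx):   # stand-in for the small, fast draft model
    return toy_dist(ctx, seed=1)

def target_dist(ctx):  # stand-in for the large, accurate target model
    return toy_dist(ctx, seed=2)

def speculative_step(ctx, k=4):
    """One round: draft k tokens cheaply, verify with the target model.
    Returns the tokens emitted this round (always at least one)."""
    # 1) Draft: autoregressively sample k tokens from the draft model.
    drafted, q_dists, c = [], [], list(ctx)
    for _ in range(k):
        q = draft_dist(c)
        t = int(rng.choice(VOCAB, p=q))
        drafted.append(t)
        q_dists.append(q)
        c.append(t)
    # 2) Verify: the target scores all k positions (one forward pass in
    #    a real model); accept token t with probability min(1, p[t]/q[t]).
    emitted, c = [], list(ctx)
    for t, q in zip(drafted, q_dists):
        p = target_dist(c)
        if rng.random() < min(1.0, p[t] / q[t]):
            emitted.append(t)   # accepted: keep the drafted token
            c.append(t)
        else:
            # Rejected: resample from the residual max(0, p - q), renormalized.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            emitted.append(int(rng.choice(VOCAB, p=residual)))
            return emitted
    # All k drafted tokens accepted: sample one bonus token from the target.
    emitted.append(int(rng.choice(VOCAB, p=target_dist(c))))
    return emitted

print("tokens emitted this round:", speculative_step([0]))
```

The accept/reject rule is what makes the method attractive: the emitted sequence is provably distributed exactly as if sampled from the target model alone, so the speedup from verifying several drafted tokens per target-model pass comes with no change in output quality.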
Papers
ODEFormer: Symbolic Regression of Dynamical Systems with Transformers
Stéphane d'Ascoli, Sören Becker, Alexander Mathis, Philippe Schwaller, Niki Kilbertus
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun