Efficient Inference
Efficient inference for large language models (LLMs) aims to reduce the compute and memory cost of running a deployed model, making LLMs accessible on a wider range of hardware and practical for more applications. Current research focuses on model compression (quantization, pruning, knowledge distillation), optimized decoding strategies (speculative decoding, early exiting), and alternative architectures (e.g., linear attention mechanisms, recurrent networks) that improve speed and resource efficiency. These advances matter for deploying LLMs on resource-constrained devices and for reducing the environmental impact of their operation, in both scientific research and industry.
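As a rough illustration of one compression technique named above, the sketch below shows symmetric per-tensor int8 quantization on a handful of made-up weights. It is a minimal sketch under stated assumptions, not the method of any paper listed here: the function names `quantize_int8` and `dequantize`, the sample values, and the choice of a single scale for the whole tensor are all illustrative.

```python
def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] using one shared scale.

    Assumption: symmetric, per-tensor quantization (a single scale, no zero point).
    """
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.5, 0.33, 1.0, -0.75]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Worst-case rounding error; it is bounded by about half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored as one small integer code instead of a 32-bit float, at the price of a rounding error of at most about half the scale per weight. Practical schemes often use per-channel or per-group scales to keep that error small when weight magnitudes vary widely.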
Papers
Computation-Aware Kalman Filtering and Smoothing
Marvin Pförtner, Jonathan Wenger, Jon Cockayne, Philipp Hennig
Addressing Misspecification in Simulation-based Inference through Data-driven Calibration
Antoine Wehenkel, Juan L. Gamella, Ozan Sener, Jens Behrmann, Guillermo Sapiro, Marco Cuturi, Jörn-Henrik Jacobsen