Fast Inference
Fast inference in machine learning aims to reduce the time and compute needed to obtain predictions from complex models, addressing a key bottleneck in deploying powerful models such as large language models and vision transformers. Current research focuses on techniques such as speculative decoding, model compression (including pruning and quantization), and architectural innovations like mixture-of-experts and hierarchical attention mechanisms. These advances are crucial for running sophisticated AI models in resource-constrained environments and real-time applications, with impact on fields ranging from natural language processing and computer vision to astrophysics and robotics.
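To make the speculative decoding idea concrete, the sketch below implements its draft-then-verify loop with toy next-token distributions standing in for real models: a cheap draft model proposes k tokens, and the target model accepts each with probability min(1, p/q), resampling from the residual distribution on the first rejection. This is a minimal illustration of the general technique, not code from any of the papers listed; the names (toy_dist, speculative_step, VOCAB) and the toy distributions are assumptions for the example.

```python
# Minimal sketch of speculative decoding with toy distributions.
# Assumption: toy_dist stands in for a real model's next-token distribution.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size (illustrative assumption)

def toy_dist(ctx, seed):
    """Deterministic toy next-token distribution over VOCAB for a context."""
    g = np.random.default_rng(hash((tuple(ctx), seed)) % (2**32))
    logits = g.standard_normal(VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def draft_dist(ctx):   # stand-in for the small, fast draft model
    return toy_dist(ctx, seed=1)

def target_dist(ctx):  # stand-in for the large, accurate target model
    return toy_dist(ctx, seed=2)

def speculative_step(ctx, k=4):
    """One round: draft k tokens cheaply, verify with the target model.
    Returns the tokens emitted this round (always at least one)."""
    # 1) Draft: autoregressively sample k tokens from the draft model.
    drafted, q_dists, c = [], [], list(ctx)
    for _ in range(k):
        q = draft_dist(c)
        t = int(rng.choice(VOCAB, p=q))
        drafted.append(t)
        q_dists.append(q)
        c.append(t)
    # 2) Verify: the target scores all k positions (one forward pass in
    #    a real model); accept token t with probability min(1, p[t]/q[t]).
    emitted, c = [], list(ctx)
    for t, q in zip(drafted, q_dists):
        p = target_dist(c)
        if rng.random() < min(1.0, p[t] / q[t]):
            emitted.append(t)   # accepted: keep the drafted token
            c.append(t)
        else:
            # Rejected: resample from the residual max(0, p - q), renormalized.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            emitted.append(int(rng.choice(VOCAB, p=residual)))
            return emitted
    # All k drafted tokens accepted: sample one bonus token from the target.
    emitted.append(int(rng.choice(VOCAB, p=target_dist(c))))
    return emitted

print("tokens emitted this round:", speculative_step([0]))
```

The accept/reject rule is what makes the method attractive: the emitted sequence is provably distributed exactly as if sampled from the target model alone, so the speedup from verifying several drafted tokens per target-model pass comes with no change in output quality.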
Papers
ODEFormer: Symbolic Regression of Dynamical Systems with Transformers
Stéphane d'Ascoli, Sören Becker, Alexander Mathis, Philippe Schwaller, Niki Kilbertus
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun