Inference Speed
Inference speed, the time a machine learning model takes to process input and produce output, is a critical factor limiting the deployment of powerful models in resource-constrained environments and real-time applications. Current research focuses on optimizing architectures such as transformers and diffusion models through techniques like knowledge distillation, model pruning, parallel decoding, and early exiting, aiming to reduce latency substantially without sacrificing accuracy. These advances are crucial for bringing large language models, computer vision systems, and other computationally intensive AI systems to diverse platforms, from smartphones to embedded devices.
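To make one of these techniques concrete, the sketch below illustrates the early-exit idea in PyTorch: each encoder layer is paired with a lightweight classification head, and inference stops at the first layer whose prediction confidence clears a threshold, so easy inputs skip the deeper (and slower) layers. This is a minimal illustrative sketch, not the method of any particular paper; the model sizes, the mean-pooling choice, and the confidence threshold are all assumptions.

```python
import torch
import torch.nn as nn

class EarlyExitClassifier(nn.Module):
    """Hypothetical transformer classifier with an exit head after
    every encoder layer (sizes and threshold are illustrative)."""

    def __init__(self, d_model=256, n_layers=6, n_classes=10, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        ])
        self.exits = nn.ModuleList([
            nn.Linear(d_model, n_classes) for _ in range(n_layers)
        ])
        self.threshold = threshold  # confidence required to stop early

    @torch.no_grad()
    def forward(self, x):
        # x: (batch=1, seq_len, d_model); single-example exit for simplicity
        for depth, (layer, exit_head) in enumerate(zip(self.layers, self.exits), 1):
            x = layer(x)
            logits = exit_head(x.mean(dim=1))      # pool tokens -> class logits
            confidence = logits.softmax(-1).max().item()
            if confidence >= self.threshold:       # confident enough: skip remaining layers
                return logits, depth
        return logits, depth                       # fell through: used all layers

model = EarlyExitClassifier().eval()
logits, layers_used = model(torch.randn(1, 32, 256))
print(f"exited after {layers_used} of 6 layers")
```

The threshold exposes the latency/accuracy trade-off the summary describes: lowering it makes more inputs exit early and cuts average latency, but risks accepting less reliable predictions from the shallow heads.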