Efficient Inference

Efficient inference for large language models (LLMs) aims to reduce the substantial computational cost and memory demands of LLM deployment, enabling wider accessibility and practical applications. Current research focuses on techniques like model compression (quantization, pruning, knowledge distillation), optimized decoding strategies (speculative decoding, early exiting), and novel architectures (e.g., linear attention mechanisms, recurrent networks) to improve speed and resource efficiency. These advancements are crucial for deploying LLMs on resource-constrained devices and reducing the environmental impact of their operation, impacting both scientific research and various industries.

Papers

December 5, 2023

Sample-based Dynamic Hierarchical Transformer with Layer and Head Flexibility via Contextual Bandit
Fanfei Meng, Lele Zhang, Yu Chen, Yuxin Wang
Transformer Based Multi Layer Thompson Sampling Efficient Inference Hierarchical Transformer Head Motion Dynamic Transformer

December 4, 2023

Signed Binarization: Unlocking Efficiency Through Repetition-Sparsity Trade-Off
Sachit Kuhar, Yash Jain, Alexey Tumanov
Deep Neural Network DNN Framework Efficient Inference Binarization Method Enhancing Efficiency Partial Binarization

November 30, 2023

Semiparametric Efficient Inference in Adaptive Experiments
Thomas Cook, Alan Mishler, Aaditya Ramdas
Efficient Inference Semi Parametric Adaptive Experiment Policy Estimation Sequential Experiment Sequential Inference

October 29, 2023

SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
Zhixu Du, Shiyu Li, Yuhao Wu, Xiangyu Jiang, Jingwei Sun, Qilin Zheng, Yongkai Wu, Ang Li, Hai "Helen" Li, Yiran Chen
High Efficiency Mixture of Expert Potential Scalability Efficient Inference Efficient Model Model Capacity Sparsity Aware

October 18, 2023

Iterative Methods for Vecchia-Laplace Approximations for Latent Gaussian Process Models
Pascal Kündig, Fabio Sigrist
Gaussian Process Efficient Inference Iterative Method

October 14, 2023

Edge-InversionNet: Enabling Efficient Inference of InversionNet on Edge Devices
Zhepeng Wang, Isaacshubhanand Putla, Weiwen Jiang, Youzuo Lin
Edge Device Efficient Inference Full Waveform Inversion

October 4, 2023

xVal: A Continuous Numerical Tokenization for Scientific Language Models
Siavash Golkar, Mariel Pettee, Michael Eickenberg, Alberto Bietti, Miles Cranmer, Geraud Krawezik, Francois Lanusse, Michael McCabe, Ruben Ohana, Liam Parker, Bruno Régaldo-Saint Blancard, Tiberiu Tesileanu, Kyunghyun Cho, Shirley Ho
Efficient Inference Data Encoding Adaptive Token Various Number Specific Tokenization Scheme

October 2, 2023

Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications
Duc N. M Hoang, Minsik Cho, Thomas Merth, Mohammad Rastegari, Zhangyang Wang
Large Language Model Knowledge Based Efficient Inference Future Implication Experimental Study Inference Latency LLM Compression Dynamic Prompting

September 23, 2023

Early-Exit with Class Exclusion for Efficient Inference of Neural Networks
Jingcun Wang, Bing Li, Grace Li Zhang
Neural Network Deep Neural Network Efficient Inference Early Exit Early eXit Dynamic Inference Class Interference

September 22, 2023

Big model only for hard audios: Sample dependent Whisper model selection for efficient inferences
Hugo Malard, Salah Zaiem, Robin Algayres
Automatic Speech Recognition Large Model Efficient Inference Automatic Speech Recognition Model Inference Cost Whisper Encoder

July 27, 2023

Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs
Or Sharir, Anima Anandkumar
Deep Learning Transformer Architecture Incremental Learning Efficient Inference Iterative Inference Incremental Algorithm Computable Learning Partial Input

July 26, 2023

DPBERT: Efficient Inference for BERT based on Dynamic Planning
Weixin Wu, Hankz Hankui Zhuo
Ticket BERT Efficient Inference Scale Pre Trained Language Model Adaptive Inference Original BERT Dynamic Planning

July 1, 2023

Q-YOLO: Efficient Inference for Real-time Object Detection
Mingze Wang, Huixin Sun, Jun Shi, Xuhui Liu, Baochang Zhang, Xianbin Cao
Object Detection Model Efficient Inference Real Time Object Activation Quantization

June 13, 2023

E2E-LOAD: End-to-End Long-form Online Action Detection
Shuqiang Cao, Weixin Luo, Bairui Wang, Wei Zhang, Lin Ma
Spatio Temporal Efficient Inference Online Action Detection

June 4, 2023

Modular Transformers: Compressing Transformers into Modularized Layers for Flexible Efficient Inference
Wangchunshu Zhou, Ronan Le Bras, Yejin Choi
Transformer Megatron Decepticons Transformer Model Multi Layer Model Compression Efficient Inference Sequence to Sequence Model Encoder Decoder Framework Transformer Module

May 24, 2023

PruMUX: Augmenting Data Multiplexing with Model Compression
Yushan Su, Vishvak Murahari, Karthik Narasimhan, Kai Li
Language Model Model Compression Efficient Inference Model Pruning Worst Case User Throughput Data Multiplexing BERT Base

May 23, 2023

NeuralMatrix: Compute the Entire Neural Networks with Linear Matrix Operations for Efficient Inference
Ruiqi Sun, Siwei Ye, Jie Zhao, Xin He, Yiran Li, An Zou
Neural Network Deep Neural Network Efficient Inference Matrix Multiplication

May 17, 2023

PaLM 2 Technical Report
Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, Yonghui Wu
Reasoning Task Reasoning Capability Technical Report Efficient Inference Palm ID State of the Art Language Robust Reasoning

April 17, 2023

Compositional Probabilistic and Causal Inference using Tractable Circuit Models
Benjie Wang, Marta Kwiatkowska
Causal Inference Efficient Inference Compositional Data Probabilistic Circuit Tractable Probabilistic Model Probabilistic Interpretation Circuit Complexity

March 24, 2023

EdgeTran: Co-designing Transformers for Efficient Inference on Mobile Edge Platforms
Shikhar Tuli, Niraj K. Jha
Transformer Architecture Transformer Model Efficient Inference Efficient Transformer Large Transformer Model Transformer Performance Edge Platform