Inference Cost

Inference cost, the computational expense of running a trained machine learning model, is a central concern, especially for large language models (LLMs) and other resource-intensive architectures. Current research focuses on reducing this cost through model compression (e.g., pruning, quantization, low-rank decomposition), efficient model architectures (e.g., Mixture-of-Experts, sparse networks), and optimized inference strategies (e.g., early exiting, model cascading, and specialized prompt handling). Lowering inference cost matters for broader deployment of advanced AI models: it widens accessibility and reduces the environmental footprint of AI computation.
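
As a minimal sketch of one compression technique named above, the example below applies post-training dynamic quantization to a small PyTorch model, storing linear-layer weights as 8-bit integers to cut memory footprint and CPU inference cost. The toy model and layer sizes are illustrative assumptions, not drawn from any particular paper.

```python
import torch
import torch.nn as nn

# Toy model standing in for a larger network (sizes are illustrative).
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 128),
)
model.eval()

# Post-training dynamic quantization: nn.Linear weights are stored as
# int8 and dequantized on the fly during matmul, reducing memory use
# and often speeding up CPU inference with little accuracy loss.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 128])
```

Dynamic quantization is just one point in the design space; static quantization, pruning, and low-rank decomposition trade accuracy, latency, and engineering effort differently depending on the model and hardware.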

Papers