Inference Cost
Inference cost, the computational expense of running a trained machine learning model, is a central concern, especially for large language models (LLMs) and other resource-intensive architectures. Current research focuses on reducing this cost through model compression (e.g., pruning, quantization, low-rank decomposition), efficient architectures (e.g., Mixture-of-Experts, sparse networks), and optimized inference strategies (e.g., early exiting, model cascading, and specialized prompt handling). Lowering inference cost is crucial for deploying advanced AI models more broadly, widening access and reducing the environmental impact of AI computation.
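To make one of these techniques concrete, below is a minimal sketch of low-rank decomposition applied to a single weight matrix via truncated SVD. The matrix shape, rank, and random weights are arbitrary assumptions for illustration, not taken from any paper listed here.

```python
import numpy as np

# Hypothetical dense weight matrix from a pretrained layer (shape is arbitrary).
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024))

# Truncated SVD: keep only the top-r singular components.
r = 64
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]   # (1024, r): left factors scaled by singular values
B = Vt[:r, :]          # (r, 1024): right factors

# The layer y = W @ x is replaced by y ~= A @ (B @ x).
# Multiply-adds per input drop from 1024*1024 (~1.05M) to 2*1024*r (~0.13M),
# an ~8x reduction, traded against some approximation error.
x = rng.standard_normal(1024)
err = np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x)
print(f"relative error at rank {r}: {err:.3f}")
```

The error printed at the end shows the trade-off directly: a larger rank r lowers the approximation error but recovers less of the cost savings.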
Papers
Basis Selection: Low-Rank Decomposition of Pretrained Large Language Models for Target Applications
Yang Li, Changsheng Zhao, Hyungtak Lee, Ernie Chang, Yangyang Shi, Vikas Chandra
Model Cascading for Code: Reducing Inference Costs with Model Cascading for LLM Based Code Generation
Boyuan Chen, Mingzhi Zhu, Brendan Dolan-Gavitt, Muhammad Shafique, Siddharth Garg
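The cascading idea behind the second paper can be sketched generically: route each query to a cheap model first and escalate to an expensive one only when a confidence signal falls below a threshold. The models, the confidence heuristic, and the threshold below are hypothetical placeholders for illustration, not the paper's actual method.

```python
from typing import Callable, Tuple

def cascade(
    query: str,
    small_model: Callable[[str], Tuple[str, float]],  # returns (answer, confidence)
    large_model: Callable[[str], str],
    threshold: float = 0.8,
) -> str:
    """Answer with the small model when it is confident; otherwise escalate.

    Average inference cost approaches the small model's cost when most
    queries clear the threshold, since the large model runs only on the rest.
    """
    answer, confidence = small_model(query)
    if confidence >= threshold:
        return answer
    return large_model(query)

# Toy stand-ins so the sketch runs end to end (placeholders, not real LLMs).
small = lambda q: ("small-answer", 0.9 if len(q) < 40 else 0.3)
large = lambda q: "large-answer"

print(cascade("short question", small, large))  # handled by the small model
print(cascade("a much longer and harder question than before", small, large))  # escalated
```

In practice the confidence signal might come from token-level probabilities or, for code generation, from running the candidate output against test cases; the threshold sets the cost/quality trade-off.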