Inference Cost
Inference cost, the computational expense of running a trained machine learning model to produce predictions, is a central concern for large language models (LLMs) and other resource-intensive architectures. Current research attacks this cost from three directions: model compression (e.g., pruning, quantization, low-rank decomposition), efficient model architectures (e.g., Mixture-of-Experts, sparse networks), and optimized inference strategies (e.g., early exiting, model cascading, and prompt or token compression). Lowering inference cost is essential for broader deployment of advanced AI models: it widens accessibility and reduces the energy and environmental footprint of AI computation.
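To make the compression idea concrete, the sketch below shows one of the simplest techniques named above: symmetric per-tensor int8 post-training quantization of a weight matrix. It is a minimal illustration, not the method of any paper listed here; the function names and the toy matrix are assumptions made for the example.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 weight matrix from int8 codes."""
    return q.astype(np.float32) * scale

# Toy weight matrix standing in for one layer of a model.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

# int8 storage is 4x smaller than float32; the price is a small
# reconstruction error, which production schemes reduce with
# per-channel scales, clipping calibration, or quantization-aware training.
print("max abs error:", np.abs(w - w_hat).max())
```

Smaller weights mean less memory traffic per token, which is often the binding constraint on LLM inference throughput; the same trade-off between fidelity and cost motivates the other techniques surveyed above.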
Papers
AGaLiTe: Approximate Gated Linear Transformers for Online Reinforcement Learning
Subhojeet Pramanik, Esraa Elelimy, Marlos C. Machado, Adam White
TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction
Junyi Liu, Liangzhi Li, Tong Xiang, Bowen Wang, Yiming Qian