Efficient Large Language Models
Research on efficient large language models (LLMs) aims to reduce the substantial computational and memory costs of current LLMs while maintaining high performance. Work in this area focuses on optimizing model architectures (e.g., exploring Transformer alternatives such as linear attention mechanisms and state space models), developing efficient training and inference techniques (such as knowledge distillation, pruning, quantization, and parameter-efficient fine-tuning), and leveraging hardware optimizations (including specialized hardware and heterogeneous GPU allocation). These advances are crucial for making LLMs deployable in resource-constrained environments and for broadening their applicability across scientific domains and practical applications.
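To make one of these techniques concrete, the following is a minimal sketch of symmetric per-tensor int8 weight quantization, one of the inference-efficiency methods named above. It is an illustrative toy in NumPy, not code from any of the listed papers; the function names quantize_int8 and dequantize are placeholders chosen here.

import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: map float weights onto the int8 range [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximate float tensor from the int8 weights and the stored scale.
    return q.astype(np.float32) * scale

# Example: quantize a random weight matrix and measure the reconstruction error.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"mean absolute quantization error: {err:.5f}")

Storing weights as int8 with a single float scale cuts weight memory roughly fourfold relative to float32; production schemes typically refine this with per-channel or per-group scales and calibration data.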
Papers
DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models
Wei He, Kai Han, Yehui Tang, Chengcheng Wang, Yujie Yang, Tianyu Guo, Yunhe Wang
LLM Inference Unveiled: Survey and Roofline Model Insights
Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra
RelayAttention for Efficient Large Language Model Serving with Long System Prompts
Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W. H. Lau