LLM Quantization
LLM quantization reduces the substantial memory and computational demands of large language models (LLMs) by representing their weights and activations with lower-precision numbers. Current research focuses on efficient quantization algorithms, notably post-training quantization (PTQ) methods such as vector quantization and layer-wise quantization with varying bit-widths, often combined with adaptive strategies that minimize the loss in model quality. These advances are crucial for deploying LLMs on resource-constrained devices and for speeding up inference, improving both the accessibility of LLMs and the sustainability of AI infrastructure.
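For illustration, the sketch below shows the simplest form of post-training quantization: round-to-nearest, per-tensor symmetric quantization of a weight matrix. The 4-bit setting, the single per-tensor scale, and the function names are illustrative assumptions, not the method of any particular paper listed below.

```python
# Minimal sketch of post-training round-to-nearest quantization
# (symmetric, per-tensor). Illustrative only.
import numpy as np

def quantize_rtn(weights: np.ndarray, bits: int = 4):
    """Quantize a float weight tensor to signed integers with one shared scale."""
    qmax = 2 ** (bits - 1) - 1               # e.g. 7 for 4-bit signed values
    scale = np.abs(weights).max() / qmax     # map the largest magnitude to qmax
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(4, 8).astype(np.float32)
    q, scale = quantize_rtn(w, bits=4)
    w_hat = dequantize(q, scale)
    print("max abs error:", np.abs(w - w_hat).max())
```

More sophisticated PTQ schemes differ mainly in how the scales are chosen (per-channel or per-group), how rounding is adapted to calibration data, and how bit-widths vary across layers, but the quantize/dequantize structure above stays the same.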
Papers
Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels
Razvan-Gabriel Dumitru, Vikas Yadav, Rishabh Maheshwary, Paul-Ioan Clotan, Sathwik Tejaswi Madhusudhan, Mihai Surdeanu
T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge
Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang
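To make the table-lookup idea behind low-bit inference concrete, the sketch below replaces the multiply-adds of a low-bit dot product with table reads: for each small group of activations it precomputes the dot product with every possible low-bit weight pattern, so the quantized weight codes simply index into that table. The group size, the 2-bit unsigned weight coding, and the NumPy formulation are illustrative assumptions; T-MAC's actual kernels are optimized CPU implementations and are more elaborate than this illustration.

```python
# Hedged sketch of lookup-table-based low-bit dot products. Illustrative only.
import itertools
import numpy as np

BITS = 2            # 2-bit weight codes, values 0..3 (a real kernel adds offsets/scales)
GROUP = 2           # activations per lookup group
PATTERNS = list(itertools.product(range(2 ** BITS), repeat=GROUP))  # 16 weight patterns

def build_lut(activations: np.ndarray) -> np.ndarray:
    """For each activation group, tabulate its dot product with every weight pattern."""
    groups = activations.reshape(-1, GROUP)               # (n_groups, GROUP)
    codes = np.array(PATTERNS, dtype=np.float32)          # (16, GROUP)
    return groups @ codes.T                               # (n_groups, 16)

def lut_dot(weight_codes: np.ndarray, lut: np.ndarray) -> float:
    """Dot product via table lookups: weight_codes[g] indexes the g-th group's row."""
    return float(lut[np.arange(lut.shape[0]), weight_codes].sum())

if __name__ == "__main__":
    x = np.random.randn(8).astype(np.float32)
    w = np.random.randint(0, 2 ** BITS, size=8)           # quantized weight codes
    # Encode each weight group as an index into the 16-entry pattern table.
    codes = np.array([PATTERNS.index(tuple(g)) for g in w.reshape(-1, GROUP)])
    lut = build_lut(x)
    print(lut_dot(codes, lut), "vs direct:", float(x @ w))
```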