LLM Quantization
LLM quantization reduces the memory and compute cost of large language models (LLMs) by representing their weights and activations with lower-precision numbers. Current research focuses on efficient quantization algorithms, notably post-training quantization (PTQ) methods such as vector quantization and layer-wise quantization with varying bit-widths, often combined with adaptive strategies that minimize the loss in model performance. These advances matter for deploying LLMs on resource-constrained devices and for making inference more efficient, which affects both the accessibility of large language models and the sustainability of AI infrastructure.
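To make the idea of lower-precision weights concrete, here is a minimal sketch of the simplest form of post-training quantization: symmetric round-to-nearest with one scale per weight-matrix row. It is not taken from any of the papers listed on this page, and it is not vector quantization; the function names `quantize_symmetric` and `dequantize` and the `num_bits` parameter are illustrative assumptions. Varying `num_bits` from layer to layer is one crude way to picture layer-wise bit-width selection.

```python
import numpy as np

def quantize_symmetric(weights, num_bits=8):
    """Round-to-nearest symmetric quantization, one scale per row.

    Hypothetical helper for illustration; supports num_bits <= 8 because
    the integer codes are stored in an int8 array.
    """
    qmax = 2 ** (num_bits - 1) - 1              # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = np.max(np.abs(weights), axis=1, keepdims=True)
    # Guard against all-zero rows, which would otherwise give a zero scale.
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate floating-point weights."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 16)).astype(np.float32)  # stand-in for a layer's weights
    for bits in (8, 4):
        q, s = quantize_symmetric(w, num_bits=bits)
        err = np.abs(w - dequantize(q, s)).max()
        print(f"{bits}-bit: max abs reconstruction error = {err:.5f}")
```

With this scheme each reconstructed weight differs from the original by at most half a quantization step (`scale / 2`), so fewer bits mean a coarser step and a larger error. Note that the 4-bit codes here still occupy one byte each; a real implementation would pack two codes per byte to realize the memory saving.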