Weight-Only Quantization
Weight-only quantization aims to reduce the memory footprint and computational cost of large language models (LLMs) and other deep learning models by representing their weights using fewer bits, typically 2-4 bits, without retraining. Current research focuses on techniques like vector quantization, adaptive quantization strategies (e.g., per-channel or per-group), and optimized matrix multiplication kernels to minimize accuracy loss at extremely low bit-widths, often employing lookup tables for efficient dequantization. This research is significant because it enables the deployment of increasingly large models on resource-constrained devices, improving both the efficiency and accessibility of advanced AI applications.
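The core idea of per-group quantization mentioned above can be illustrated with a short sketch. The snippet below is a minimal, illustrative example (not taken from any particular paper or library): it assumes a symmetric 4-bit scheme, a group size of 128, and hypothetical helper names `quantize_per_group` / `dequantize_per_group`. Each group of weights shares one floating-point scale, and the stored integer codes are expanded back to floats at inference time.

```python
# Minimal sketch of per-group, weight-only quantization (assumed symmetric
# 4-bit scheme and group size; for illustration only).
import numpy as np

def quantize_per_group(w, bits=4, group_size=128):
    """Quantize a 2-D weight matrix row-wise, one scale per group of columns."""
    qmax = 2 ** (bits - 1) - 1                # e.g. 7 for symmetric 4-bit
    rows, cols = w.shape
    assert cols % group_size == 0
    groups = w.reshape(rows, cols // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / qmax   # per-group scale
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize_per_group(q, scales):
    """Reconstruct an approximate float matrix from integer codes and scales."""
    groups = q.astype(np.float32) * scales
    return groups.reshape(groups.shape[0], -1)

# Usage: quantize a random weight matrix and measure the reconstruction error.
w = np.random.randn(8, 256).astype(np.float32)
q, s = quantize_per_group(w)
w_hat = dequantize_per_group(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```

Lookup-table approaches follow the same pattern, except the integer codes index into a small per-group codebook instead of being multiplied by a single scale, which is what makes dequantization cheap inside optimized matrix-multiplication kernels.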