Per Tensor

Per-tensor quantization aims to improve the efficiency of large language models (LLMs) by representing model weights and activations in low-bit formats with a single quantization scale per tensor, thereby reducing memory footprint and accelerating inference. Current research focuses on mitigating the accuracy loss associated with this compression, employing techniques such as low-rank matrix decompositions, outlier detection, and pre-processing to optimize quantization schemes (e.g., per-tensor static quantization) and achieve accuracy comparable to or exceeding that of dynamic quantization methods. These advancements are significant for deploying LLMs on resource-constrained devices and accelerating their practical applications, particularly in areas like natural language processing and reinforcement learning.
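
To make the idea concrete, the sketch below shows a minimal symmetric int8 per-tensor quantizer using abs-max scaling: one scale is computed for the entire tensor, every value is rounded against that scale, and dequantization multiplies back. The function names and the toy tensor are illustrative only, not taken from any particular paper or library; the outlier in the example shows why a single per-tensor scale can hurt accuracy and why outlier handling and low-rank corrections are studied.

```python
import numpy as np

def quantize_per_tensor(x: np.ndarray, num_bits: int = 8):
    """Symmetric per-tensor quantization: a single scale for the whole tensor."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for int8
    scale = float(np.max(np.abs(x))) / qmax   # one scale from the tensor's abs-max
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

# Toy example (hypothetical values): one large outlier inflates the shared scale,
# so the small weights lose most of their resolution -- the accuracy issue that
# outlier pre-processing and low-rank decompositions aim to mitigate.
w = np.array([0.01, -0.02, 0.03, 5.0], dtype=np.float32)
q, s = quantize_per_tensor(w)
print("quantized:", q)
print("scale:", s)
print("dequantized:", dequantize(q, s))
```

In a static per-tensor scheme, the scale would be calibrated offline (e.g., from activation statistics on a calibration set) and reused at inference time, whereas dynamic quantization recomputes it per input; the research summarized above tries to close the accuracy gap between the two while keeping the runtime cost of the static approach.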

Papers