Block Quantization
Block quantization is a technique for reducing the memory footprint and computational cost of large language models (LLMs) and other deep learning models: tensor elements are grouped into small blocks, and each block is stored with low-bit values that share a common scale or exponent. Current research focuses on improving the accuracy of block quantization methods, particularly in handling outliers and in adapting the approach to specific architectures such as transformers and graph neural networks, often using block floating point formats and cross-block reconstruction. These advances are crucial for efficient training and deployment of increasingly large models, improving both the scalability of research and the accessibility of powerful AI applications.
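To make the blockwise idea concrete, below is a minimal sketch of shared-scale block quantization in NumPy. The block size, bit width, function names, and the optional power-of-two scale (a block floating point style constraint) are illustrative assumptions, not the specific formats proposed in any particular paper.

```python
import numpy as np

def quantize_blockwise(x: np.ndarray, block_size: int = 64, bits: int = 8,
                       power_of_two_scale: bool = False):
    """Quantize a 1-D tensor in fixed-size blocks, each with its own scale."""
    pad = (-x.size) % block_size                      # pad so the size divides evenly
    xp = np.pad(x, (0, pad)).reshape(-1, block_size)  # (num_blocks, block_size)
    qmax = 2 ** (bits - 1) - 1                        # e.g. 127 for 8-bit codes
    scales = np.abs(xp).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                         # avoid division by zero in all-zero blocks
    if power_of_two_scale:
        # Block floating point style: round each block's scale up to a power of two,
        # so the shared factor can be stored as an exponent.
        scales = 2.0 ** np.ceil(np.log2(scales))
    q = np.clip(np.round(xp / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales, x.size                          # keep original length for dequantization

def dequantize_blockwise(q: np.ndarray, scales: np.ndarray, size: int) -> np.ndarray:
    """Reconstruct an approximation of the original tensor."""
    return (q.astype(np.float32) * scales).reshape(-1)[:size]

if __name__ == "__main__":
    w = np.random.randn(1000).astype(np.float32)
    q, s, n = quantize_blockwise(w)
    w_hat = dequantize_blockwise(q, s, n)
    print("max abs error:", np.abs(w - w_hat).max())  # bounded by half a quantization step per block
```

Storing one scale per small block, rather than one per tensor, limits how far a single outlier can inflate the quantization step for unrelated values, which is why block-level formats are a common way to handle outliers.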