Different Quantization
Quantization techniques reduce the computational cost and memory footprint of large language models (LLMs) and other deep neural networks by storing and processing values at lower numerical precision, ideally without significant accuracy loss. Current research develops new quantization methods, including mixed-precision approaches that assign different bit-widths to different parts of a model (e.g., weights, activations, and key/value caches), and explores quantization strategies tailored to specific architectures such as Vision Transformers and LLMs. These advances matter for deploying large models on resource-constrained devices and for improving the efficiency of AI applications across domains.
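To make the mixed-precision idea concrete, here is a minimal sketch in plain Python. It assumes uniform symmetric per-tensor quantization, which is only one of many possible schemes and is not taken from any specific paper in this topic. The function names, the synthetic tensors, and the bit-width assignment (4-bit weights, 8-bit activations, 2-bit key/value cache) are all hypothetical and chosen for illustration only.

```python
import random


def quantize_symmetric(values, bits):
    """Uniform symmetric quantization to signed integer codes in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    # One scale per tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


def mean_abs_error(original, restored):
    """Average absolute reconstruction error introduced by quantization."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Synthetic stand-ins for three kinds of tensors (assumed distributions).
random.seed(0)
tensors = {
    "weights": [random.gauss(0.0, 0.05) for _ in range(1024)],
    "activations": [random.gauss(0.0, 1.0) for _ in range(1024)],
    "kv_cache": [random.gauss(0.0, 0.5) for _ in range(1024)],
}

# Hypothetical mixed-precision policy: a different bit-width per tensor kind.
bit_widths = {"weights": 4, "activations": 8, "kv_cache": 2}

report = {}
for name, values in tensors.items():
    bits = bit_widths[name]
    codes, scale = quantize_symmetric(values, bits)
    restored = dequantize(codes, scale)
    report[name] = {
        "bits": bits,
        "scale": scale,
        "mean_abs_error": mean_abs_error(values, restored),
        # Storage for the codes alone, ignoring the per-tensor scale.
        "bytes": len(values) * bits / 8,
    }

for name, stats in report.items():
    print(name, stats)
```

Printing the report shows the trade-off described above: fewer bits shrink the storage needed for the codes but coarsen the quantization grid, so the reconstruction error grows relative to the tensor's value range. Practical methods typically choose bit-widths per layer or per channel using a sensitivity measure rather than a fixed table like the one assumed here.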