Quantization Operator
Quantization is a model compression technique that reduces the precision of numerical representations in neural networks, aiming to decrease computational costs and memory footprint while preserving model accuracy. Current research focuses on applying quantization to various deep learning architectures, including Vision Transformers (ViTs), large language models (LLMs), and diffusion models, often employing post-training quantization (PTQ) methods to avoid retraining the entire model. This work is significant because it enables the deployment of large, computationally expensive models on resource-constrained devices, impacting fields like healthcare, edge computing, and natural language processing by making advanced AI more accessible and efficient.
Papers
Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets
Lu Zeng, Sree Hari Krishnan Parthasarathi, Yuzong Liu, Alex Escott, Santosh Kumar Cheekatmalla, Nikko Strom, Shiv Vitaladevuni
DiverGet: A Search-Based Software Testing Approach for Deep Neural Network Quantization Assessment
Ahmed Haj Yahmed, Houssem Ben Braiek, Foutse Khomh, Sonia Bouzidi, Rania Zaatour