Model Compression
Model compression aims to reduce the size and computational cost of large deep learning models, particularly large language models (LLMs) and vision transformers, without significant loss of performance. Current research focuses on techniques such as pruning (structured and unstructured), quantization, knowledge distillation, and neural architecture search, often applied to models such as BERT, Llama, and ViT. These efforts are crucial for deploying advanced models on resource-constrained devices and for making them more energy-efficient and accessible, with impact on both scientific research and real-world applications. The field also emphasizes evaluation beyond raw accuracy, taking factors such as safety and robustness into account.
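Since the summary above names pruning and quantization only at a high level, the sketch below illustrates the two simplest variants in plain NumPy: unstructured magnitude pruning and symmetric per-tensor int8 post-training quantization. It operates on a random weight matrix for demonstration; function names are illustrative and are not drawn from any of the papers listed below.

```python
# Minimal sketch of two compression primitives: magnitude pruning and
# symmetric int8 post-training quantization. Real pipelines apply these to
# trained model checkpoints layer by layer; here a random matrix stands in.
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization; returns codes and scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to approximate float weights."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((256, 256)).astype(np.float32)

    pruned = magnitude_prune(w, sparsity=0.5)  # ~50% of weights set to zero
    q, scale = quantize_int8(pruned)           # int8 storage is 4x smaller than fp32
    recon = dequantize(q, scale)

    print("sparsity:", np.mean(pruned == 0))
    print("max quantization error:", np.abs(recon - pruned).max())
```

Methods in the papers below go well beyond this baseline (e.g. mixed-precision schemes that assign different bit widths per layer), but the compress-then-reconstruct structure is the same.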
Papers
Efficient Model Compression Techniques with FishLeg
Jamie McGowan, Wei Sheng Lai, Weibin Chen, Henry Aldridge, Jools Clarke, Jezabel Garcia, Rui Xia, Yilei Liang, Guillaume Hennequin, Alberto Bernacchia
CPTQuant -- A Novel Mixed Precision Post-Training Quantization Techniques for Large Language Models
Amitash Nanda, Sree Bhargavi Balija, Debashis Sahoo