Model Compression
Model compression aims to reduce the size and computational cost of large deep learning models, particularly large language models (LLMs) and vision transformers, without significant loss in performance. Current research focuses on techniques such as pruning (structured and unstructured), quantization, knowledge distillation, and neural architecture search, often applied to models like BERT, Llama, and ViT. These efforts are crucial for deploying advanced AI models on resource-constrained devices and for making them more energy-efficient and accessible, with impact on both scientific research and real-world applications. The field also emphasizes the need for evaluation metrics beyond raw accuracy, including safety and robustness.
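To make two of the techniques above concrete, the following is a minimal sketch of unstructured magnitude pruning followed by post-training dynamic quantization, assuming PyTorch. The toy model, 50% sparsity level, and layer sizes are illustrative assumptions and are not taken from any of the listed papers.

```python
# Sketch: magnitude pruning + dynamic int8 quantization on a toy classifier.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Small feed-forward model standing in for a larger network (hypothetical sizes).
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Unstructured magnitude pruning: zero out the 50% smallest-magnitude
# weights in each Linear layer, then make the pruning permanent.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")

# Post-training dynamic quantization: store Linear weights as int8 and
# quantize activations on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Sanity check: the compressed model still produces outputs of the expected shape.
x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 10])
```

In practice, pruning is usually followed by fine-tuning to recover accuracy, and quantization choices (dynamic vs. static, int8 vs. lower bit widths) depend on the target hardware; the sketch only shows the basic API flow.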
Papers
An Efficient Real-Time Object Detection Framework on Resource-Constricted Hardware Devices via Software and Hardware Co-design
Mingshuo Liu, Shiyi Luo, Kevin Han, Bo Yuan, Ronald F. DeMara, Yu Bai
Tensor Train Low-rank Approximation (TT-LoRA): Democratizing AI with Accelerated LLMs
Afia Anjum, Maksim E. Eren, Ismael Boureima, Boian Alexandrov, Manish Bhattarai
Exploring compressibility of transformer based text-to-music (TTM) models
Vasileios Moschopoulos, Thanasis Kotsiopoulos, Pablo Peso Parada, Konstantinos Nikiforidis, Alexandros Stergiadis, Gerasimos Papakostas, Md Asif Jalal, Jisi Zhang, Anastasios Drosou, Karthikeyan Saravanan
Speeding Up Image Classifiers with Little Companions
Yang Liu, Kowshik Thopalli, Jayaraman Thiagarajan