LLM Compression

LLM compression aims to reduce the substantial computational and memory demands of large language models (LLMs) while preserving their performance. Current research focuses on techniques such as pruning, quantization, and low-rank decomposition, often applied to models like LLaMA; it explores the trade-off between compression ratio and accuracy on downstream tasks and evaluates the impact of compression on model safety and fairness. This work is crucial for deploying LLMs on resource-constrained devices and for improving their accessibility and efficiency in real-world applications.
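
To make these techniques concrete, the sketch below (a toy NumPy example, not drawn from any particular paper or model) applies the three approaches named above to a single random weight matrix: unstructured magnitude pruning, symmetric round-to-nearest int8 quantization, and truncated-SVD low-rank decomposition. The matrix size, sparsity level, bit width, and rank are illustrative assumptions.

```python
import numpy as np


def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Unstructured pruning: zero the smallest-magnitude weights."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    pruned = w.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned


def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor round-to-nearest int8 quantization."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def low_rank_approx(w: np.ndarray, rank: int) -> np.ndarray:
    """Truncated-SVD low-rank approximation of a weight matrix."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(512, 512)).astype(np.float32)   # toy weight matrix

    w_lr = low_rank_approx(w, rank=64)            # keep 64 of 512 singular values
    w_pruned = magnitude_prune(w, sparsity=0.5)   # zero 50% of the weights
    q, scale = quantize_int8(w)                   # ~4x smaller storage than fp32
    w_hat = q.astype(np.float32) * scale          # dequantize for comparison

    print("low-rank error:", np.abs(w - w_lr).mean())
    print("pruning  error:", np.abs(w - w_pruned).mean())
    print("int8     error:", np.abs(w - w_hat).mean())
```

Real compression methods typically operate layer by layer, use a small calibration set, and compensate for the error each step introduces; the random matrix and single per-tensor scale here are only the simplest possible baseline for illustrating the compression-versus-accuracy trade-off discussed above.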

Papers