Depth Pruning
Depth pruning is a model compression technique that removes entire layers or blocks from deep learning models, primarily large language models (LLMs) and convolutional neural networks (CNNs), to reduce computational cost and memory footprint. Current research focuses on efficient pruning strategies, including those based on global performance metrics and inference-aware criteria, and on retraining or reconstruction methods that mitigate the performance degradation pruning causes. This work is significant because it enables powerful models to be deployed on resource-constrained devices, improving efficiency and accessibility across applications, while also offering insight into the internal structure and relative importance of a model's components.
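
As a concrete illustration, the sketch below shows one simple depth-pruning strategy in PyTorch: score each residual block by how much it transforms its input on a small calibration batch, then drop the lowest-scoring blocks. This is a minimal, hypothetical example, not a specific published method; the names (Block, DeepModel, block_importance, prune_depth) are invented for illustration, and the cosine-similarity criterion is just one common heuristic among the many scoring strategies the literature explores.

```python
# Minimal depth-pruning sketch (illustrative, not a specific published method).
# Heuristic assumption: blocks whose output stays close to their input
# (high cosine similarity) change the representation little and are
# candidates for removal.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """Stand-in for a transformer block or CNN stage (hypothetical)."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return x + self.ff(self.norm(x))  # residual connection

class DeepModel(nn.Module):
    """A plain stack of residual blocks."""
    def __init__(self, dim: int = 64, depth: int = 12):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(depth)])

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

@torch.no_grad()
def block_importance(model: DeepModel, calib: torch.Tensor) -> list[float]:
    """Score each block by how much it changes its input on calibration data."""
    scores = []
    x = calib
    for block in model.blocks:
        y = block(x)
        # Mean cosine similarity between block input and output per sample.
        sim = F.cosine_similarity(x.flatten(1), y.flatten(1), dim=1).mean()
        scores.append(1.0 - sim.item())  # low score => block barely transforms x
        x = y
    return scores

def prune_depth(model: DeepModel, calib: torch.Tensor, n_remove: int) -> DeepModel:
    """Remove the n_remove lowest-importance blocks in place."""
    scores = block_importance(model, calib)
    drop = set(sorted(range(len(scores)), key=lambda i: scores[i])[:n_remove])
    model.blocks = nn.ModuleList(
        [b for i, b in enumerate(model.blocks) if i not in drop]
    )
    return model

model = DeepModel(dim=64, depth=12)
calib = torch.randn(32, 16, 64)          # (batch, seq, dim) calibration batch
model = prune_depth(model, calib, n_remove=4)
print(f"blocks after pruning: {len(model.blocks)}")  # 8
```

In practice, the pruned model would then be briefly retrained, or its remaining layers reconstructed, to recover the accuracy lost by removing blocks, in line with the retraining and reconstruction methods described above.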