Data Pruning

Data pruning is a technique for training machine learning models more efficiently by selectively removing less informative data points from large datasets. Current research focuses on pruning metrics and algorithms, often built on language models, importance sampling, or clustering, that identify and remove redundant or noisy data while preserving model accuracy and robustness. These methods have been applied to tasks including image classification, natural language processing, and molecular modeling. By shrinking the training set, data pruning can substantially reduce training time and computational cost, which benefits both the scalability of deep learning research and deployment in resource-constrained settings.
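
As a rough illustration of metric-based pruning, the sketch below keeps only the highest-scoring fraction of a dataset. It is a minimal example under stated assumptions, not any specific published method: the function name `prune_by_score`, the `keep_fraction` parameter, and the score values are all hypothetical, and it assumes that a higher score means a more informative example. In practice the scores might come from a model's per-example loss, an importance-sampling weight, or a distance to a cluster centroid.

```python
def prune_by_score(scores, keep_fraction):
    """Return the sorted indices of the examples to keep.

    Assumption: a higher score marks a more informative example.
    Both the name and the signature are illustrative only.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Keep at least one example, truncating any fractional count.
    n_keep = max(1, int(keep_fraction * len(scores)))
    # Rank example indices from the highest score to the lowest.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return the retained indices in their original dataset order.
    return sorted(ranked[:n_keep])


# Hypothetical per-example informativeness scores for six examples.
scores = [0.05, 0.90, 0.40, 0.02, 0.75, 0.10]
kept = prune_by_score(scores, keep_fraction=0.5)
print(kept)
```

The retained indices can then be used to subset the training data before training starts. Ranking by a single score is the simplest design; clustering-based approaches instead select representatives per cluster so that rare regions of the data are not pruned away entirely.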

Papers