Coreset Selection

Coreset selection aims to identify a small, representative subset of a large dataset that preserves the information a machine learning task needs, so that models can be trained at lower computational cost. Current research focuses on coreset selection algorithms tailored to specific model architectures (e.g., LLMs, GNNs, CNNs) and learning paradigms (e.g., federated learning, self-supervised learning), often employing techniques such as gradient clustering, spectral embeddings, and Wasserstein distance minimization. The field matters because efficient coreset selection can accelerate training, improve model robustness, and make machine learning feasible on massive datasets that were previously intractable, with applications ranging from natural language processing to medical image analysis.
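To make the idea of a "representative subset" concrete, here is a minimal sketch of k-center greedy (farthest-point) selection, a common geometric baseline for coreset selection. It is not one of the techniques named above, and it is not taken from any specific paper listed here. The function name `k_center_greedy` and its parameters are illustrative assumptions; the input is assumed to be a NumPy array with one feature vector (for example, an embedding) per row.

```python
import numpy as np

def k_center_greedy(features, k, seed=0):
    """Return k row indices chosen by farthest-point (k-center greedy) selection.

    Illustrative baseline only: it repeatedly adds the point that is farthest
    from everything selected so far, so the subset covers the feature space.
    """
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    # Assumption: the first center is chosen uniformly at random.
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(n))]

    # Distance from every point to its nearest selected center so far.
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)

    while len(selected) < k:
        # The next center is the point worst covered by the current subset.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)

    return selected
```

A call such as `k_center_greedy(embeddings, k=1000)` would return the indices of the rows to keep, and a model could then be trained on `embeddings[indices]` and the matching labels. Gradient-based or distribution-matching methods replace the Euclidean coverage criterion with a different notion of representativeness, but the interface (full dataset in, subset indices out) is typically the same.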

Papers