Training Datasets

Training datasets are crucial for developing effective machine learning models, particularly large language and vision models, but their size and quality significantly impact model performance, cost, and security. Current research focuses on optimizing dataset size and composition through techniques like dataset distillation, pruning, and automated data generation, as well as mitigating issues arising from memorization of biased or sensitive information within existing datasets via methods such as machine unlearning. These advancements are vital for improving model efficiency, robustness, and ethical considerations across diverse applications, from medical image analysis to natural language processing.

Papers