Training Datasets
Training datasets are central to developing effective machine learning models, particularly large language and vision models, and their size and quality directly affect model performance, cost, and security. Current research focuses on optimizing dataset size and composition through techniques such as dataset distillation, pruning, and automated data generation, and on mitigating the memorization of biased or sensitive information in existing datasets through methods such as machine unlearning. These advances aim to improve model efficiency, robustness, and ethical soundness across diverse applications, from medical image analysis to natural language processing.
Papers
ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions
Tomáš Souček, Prajwal Gatti, Michael Wray, Ivan Laptev, Dima Damen, Josef Sivic
FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration
Hao Li, Xiang Chen, Jiangxin Dong, Jinhui Tang, Jinshan Pan