Training Data
Training data is central to machine learning model development. Current research focuses on improving data quality and efficiency and on mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns, developing algorithms that optimize data selection and usage (e.g., self-paced learning, active learning), and mitigating data contamination and imbalance through techniques such as data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly affect model performance, generalization, and robustness across applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
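As one concrete illustration of data selection, the sketch below shows entropy-based uncertainty sampling, a common active learning heuristic: the examples on which a model's predicted class distribution is least certain are chosen for labeling first. It is a minimal sketch, not taken from any of the papers listed here. The function names `entropy` and `select_most_uncertain` and the toy `pool_probs` values are hypothetical and introduced only for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(pool_probs, k):
    """Return indices of the k pool examples with the highest predictive entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical model outputs for four unlabeled examples (three classes each).
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.40, 0.35, 0.25],  # fairly uncertain
    [0.70, 0.20, 0.10],  # moderately confident
    [0.34, 0.33, 0.33],  # nearly uniform, highest entropy
]

chosen = select_most_uncertain(pool_probs, k=2)
print(chosen)  # [3, 1]
```

In a real pipeline the probabilities would come from a trained model evaluated on the unlabeled pool, and the selected examples would be labeled and added to the training set before retraining.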
Papers
Stability of Accuracy for the Training of DNNs Via the Uniform Doubling Condition
Yitzchak Shmalo
LAION-5B: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, Jenia Jitsev
$\alpha$QBoost: An Iteratively Weighted Adiabatic Trained Classifier
Salvatore Certo, Andrew Vlasic, Daniel Beaulieu
Style Transfer as Data Augmentation: A Case Study on Named Entity Recognition
Shuguang Chen, Leonardo Neves, Thamar Solorio
When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture
Yichuan Mo, Dongxian Wu, Yifei Wang, Yiwen Guo, Yisen Wang