Training Data
Training data is crucial for machine learning model development, and current research focuses on improving data quality and efficiency and on mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns, developing algorithms that optimize data selection and usage (e.g., self-paced learning, active learning), and mitigating issues such as data contamination and class imbalance through techniques like data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data significantly affect model performance, generalization, and robustness, influencing applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
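To make the data-selection theme concrete, here is a minimal sketch of pool-based active learning with uncertainty (margin) sampling, one of the selection strategies mentioned above. It assumes scikit-learn is available; the function name query_batch, the synthetic data, and all parameters are illustrative, not drawn from any paper listed below.

import numpy as np
from sklearn.linear_model import LogisticRegression

def query_batch(model, X_pool, k=10):
    # Pick the k pool points the model is least certain about:
    # smallest margin between the top-two class probabilities.
    proba = model.predict_proba(X_pool)
    top2 = np.sort(proba, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    return np.argsort(margin)[:k]

# Illustrative synthetic task: 1000 points, 5 features, linear label rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

labeled = list(range(20))                      # small seed set
pool = [i for i in range(1000) if i not in labeled]

model = LogisticRegression().fit(X[labeled], y[labeled])
for _ in range(5):                             # five acquisition rounds
    picks = query_batch(model, X[pool], k=10)  # indices into the pool subset
    labeled += [pool[i] for i in picks]        # map back to original indices
    pool = [i for i in pool if i not in labeled]
    model.fit(X[labeled], y[labeled])          # retrain on the enlarged set
print(f"labeled set size: {len(labeled)}")

The design choice here, querying where the classifier's top-two class probabilities are closest, is one common uncertainty heuristic; entropy- or committee-based criteria slot into query_batch the same way.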
Papers
Interactive Image Selection and Training for Brain Tumor Segmentation Network
Matheus A. Cerqueira, Flávia Sprenger, Bernardo C. A. Teixeira, Alexandre X. Falcão
A Human-Annotated Video Dataset for Training and Evaluation of 360-Degree Video Summarization Methods
Ioannis Kontostathis, Evlampios Apostolidis, Vasileios Mezaris
Model for Peanuts: Hijacking ML Models without Training Access is Possible
Mahmoud Ghorbel, Halima Bouzidi, Ioan Marius Bilasco, Ihsen Alouani
SAVA: Scalable Learning-Agnostic Data Valuation
Samuel Kessler, Tam Le, Vu Nguyen
Cohort Squeeze: Beyond a Single Communication Round per Cohort in Cross-Device Federated Learning
Kai Yi, Timur Kharisov, Igor Sokolov, Peter Richtárik
Frequency Enhanced Pre-training for Cross-city Few-shot Traffic Forecasting
Zhanyu Liu, Jianrong Ding, Guanjie Zheng
Advancing DRL Agents in Commercial Fighting Games: Training, Integration, and Agent-Human Alignment
Chen Zhang, Qiang He, Zhou Yuan, Elvis S. Liu, Hong Wang, Jian Zhao, Yang Wang
How In-Context Learning Emerges from Training on Unstructured Data: On the Role of Co-Occurrence, Positional Information, and Noise Structures
Kevin Christian Wibisono, Yixin Wang
Training on the Edge of Stability Is Caused by Layerwise Jacobian Alignment
Mark Lowell, Catharine Kastner
Latent Intrinsics Emerge from Training to Relight
Xiao Zhang, William Gao, Seemandhar Jain, Michael Maire, David A. Forsyth, Anand Bhattad
Conditioning GAN Without Training Dataset
Kidist Amde Mekonnen
Mitigating the Impact of Labeling Errors on Training via Rockafellian Relaxation
Louis L. Chen, Bobbie Chern, Eric Eckstrand, Amogh Mahapatra, Johannes O. Royset
SPOT: Text Source Prediction from Originality Score Thresholding
Edouard Yvinec, Gabriel Kasser
Improving the Training of Rectified Flows
Sangyun Lee, Zinan Lin, Giulia Fanti
Can the accuracy bias by facial hairstyle be reduced through balancing the training data?
Kagan Ozturk, Haiyu Wu, Kevin W. Bowyer
Improving Object Detector Training on Synthetic Data by Starting With a Strong Baseline Methodology
Frank A. Ruis, Alma M. Liezenga, Friso G. Heslinga, Luca Ballan, Thijs A. Eker, Richard J. M. den Hollander, Martin C. van Leeuwen, Judith Dijk, Wyke Huizinga
Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning
Everlyn Asiko Chimoto, Jay Gala, Orevaoghene Ahia, Julia Kreutzer, Bruce A. Bassett, Sara Hooker
Deep Positive-Unlabeled Anomaly Detection for Contaminated Unlabeled Data
Hiroshi Takahashi, Tomoharu Iwata, Atsutoshi Kumagai, Yuuki Yamanaka