Training Data
Training data is central to machine learning model development, and current research focuses on improving data quality and efficiency and on mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns; developing algorithms that optimize data selection and usage, such as self-paced learning and active learning (a minimal sketch of the latter follows below); and mitigating issues like data contamination and class imbalance through techniques such as data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly influence model performance, generalization, and robustness across applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
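To make one of these data-selection strategies concrete, here is a minimal sketch of uncertainty-based active learning: train on a small labeled pool, then repeatedly move the unlabeled points the model is least confident about into that pool. The synthetic dataset, the seed-pool size of 20, and the query budget of 10 per round are illustrative assumptions, not details drawn from any paper listed here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Start with a small randomly labeled pool; everything else is "unlabeled".
labeled = rng.choice(len(X), size=20, replace=False)
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)

for round_ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[unlabeled])
    # Uncertainty = how far the top-class probability is from 1.0.
    uncertainty = 1.0 - probs.max(axis=1)
    # Query the 10 most uncertain points and add them to the labeled pool.
    query = unlabeled[np.argsort(uncertainty)[-10:]]
    labeled = np.concatenate([labeled, query])
    unlabeled = np.setdiff1d(unlabeled, query)
    print(f"round {round_}: labeled={len(labeled)}, "
          f"accuracy on all data={model.score(X, y):.3f}")
```

The same loop structure accommodates other selection criteria (e.g., margin sampling or expected gradient length) by swapping out the uncertainty score.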
Papers
Fairness Feedback Loops: Training on Synthetic Data Amplifies Bias
Sierra Wyllie, Ilia Shumailov, Nicolas Papernot
Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost
Oana Ignat, Longju Bai, Joan Nwatu, Rada Mihalcea
IM-Unpack: Training and Inference with Arbitrarily Low Precision Integers
Zhanpeng Zeng, Karthikeyan Sankaralingam, Vikas Singh
Gaussian Loss Smoothing Enables Certified Training with Tight Convex Relaxations
Stefan Balauca, Mark Niklas Müller, Yuhao Mao, Maximilian Baader, Marc Fischer, Martin Vechev
Re-Simulation-based Self-Supervised Learning for Pre-Training Foundation Models
Philip Harris, Michael Kagan, Jeffrey Krupa, Benedikt Maier, Nathaniel Woodward
Transfer Learning for Security: Challenges and Future Directions
Adrian Shuai Li, Arun Iyengar, Ashish Kundu, Elisa Bertino
Flatten Long-Range Loss Landscapes for Cross-Domain Few-Shot Learning
Yixiong Zou, Yicong Liu, Yiman Hu, Yuhua Li, Ruixuan Li
Robust Deep Reinforcement Learning Through Adversarial Attacks and Training: A Survey
Lucas Schott, Josephine Delas, Hatem Hajri, Elies Gherbi, Reda Yaich, Nora Boulahia-Cuppens, Frederic Cuppens, Sylvain Lamprier
Benchmarking zero-shot stance detection with FlanT5-XXL: Insights from training data, prompting, and decoding strategies into its near-SoTA performance
Rachith Aiyappa, Shruthi Senthilmani, Jisun An, Haewoon Kwak, Yong-Yeol Ahn
GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning
Aivin V. Solatorio
A Survey on Data Selection for Language Models
Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang
EEG classifier cross-task transfer to avoid training sessions in robot-assisted rehabilitation
Niklas Kueper, Su Kyoung Kim, Elsa Andrea Kirchner