Training Data
Training data is crucial to machine learning model development; current research focuses on improving data quality and efficiency and on mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns; developing algorithms that optimize data selection and usage (e.g., self-paced learning and active learning; see the sketch below); and mitigating issues such as data contamination and class imbalance through techniques like data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly influence model performance, generalization, and robustness across applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
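To make the data-selection theme concrete, here is a minimal sketch of uncertainty-based active learning, one common selection strategy: a model is trained on a small labeled seed set and repeatedly queries labels for the pool points it is least certain about. This is an illustrative toy on synthetic data, not the method of any paper listed below; the dataset, model choice, and query budget are all assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative sketch only: synthetic 2-D pool with a linear ground-truth label.
rng = np.random.default_rng(0)
X_pool = rng.normal(size=(1000, 2))
y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)

# Seed the labeled set with a few points from each class; the rest is the
# "unlabeled" pool (labels are hidden until queried).
pos = np.where(y_pool == 1)[0][:5]
neg = np.where(y_pool == 0)[0][:5]
labeled = list(pos) + list(neg)
unlabeled = [i for i in range(len(X_pool)) if i not in set(labeled)]

model = LogisticRegression()
for _ in range(5):  # five acquisition rounds, 10 queries each (arbitrary budget)
    model.fit(X_pool[labeled], y_pool[labeled])
    # Uncertainty = how close the predicted positive-class probability is to 0.5.
    probs = model.predict_proba(X_pool[unlabeled])[:, 1]
    uncertainty = -np.abs(probs - 0.5)
    # Query the 10 most uncertain pool points and move them to the labeled set
    # (popping in descending index order so earlier pops don't shift later ones).
    query = np.argsort(uncertainty)[-10:]
    for i in sorted(query, reverse=True):
        labeled.append(unlabeled.pop(i))

print(f"labeled set size after querying: {len(labeled)}")

The same loop structure accommodates other acquisition scores (e.g., margin or entropy) by swapping the uncertainty line; the point is that labeling effort is spent where the current model is least confident rather than uniformly at random.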
Papers
Cultural Fidelity in Large-Language Models: An Evaluation of Online Language Resources as a Driver of Model Performance in Value Representation
Sharif Kazemi, Gloria Gerhardt, Jonty Katz, Caroline Ida Kuria, Estelle Pan, Umang Prabhakar
Sharpness-Aware Minimization Efficiently Selects Flatter Minima Late in Training
Zhanpeng Zhou, Mingze Wang, Yuchen Mao, Bingrui Li, Junchi Yan
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Jacob Haimes, Cenny Wenner, Kunvar Thaman, Vassil Tashev, Clement Neo, Esben Kran, Jason Schreiber
Training on Fake Labels: Mitigating Label Leakage in Split Learning via Secure Dimension Transformation
Yukun Jiang, Peiran Wang, Chengguo Lin, Ziyue Huang, Yong Cheng
Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training
Sara Sarto, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Investigating Cost-Efficiency of LLM-Generated Training Data for Conversational Semantic Frame Analysis
Shiho Matta, Yin Jou Huang, Fei Cheng, Hirokazu Kiyomaru, Yugo Murawaki
Training on more Reachable Tasks for Generalisation in Reinforcement Learning
Max Weltevrede, Caroline Horsch, Matthijs T.J. Spaan, Wendelin Böhmer
Should Cross-Lingual AMR Parsing go Meta? An Empirical Assessment of Meta-Learning and Joint Learning AMR Parsing
Jeongwoo Kang, Maximin Coavoux, Cédric Lopez, Didier Schwab
How much can we forget about Data Contamination?
Sebastian Bordt, Suraj Srinivas, Valentyn Boreiko, Ulrike von Luxburg