Training Data
Training data is central to machine learning model development, and current research focuses on improving data quality and efficiency and on mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns, developing algorithms that optimize data selection and usage (e.g., self-paced learning, active learning), and countering problems such as data contamination and class imbalance through data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly affect model performance, generalization, and robustness, with consequences for applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
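As a concrete illustration of the data-selection strategies mentioned above, the following minimal sketch implements pool-based active learning with uncertainty sampling in Python. The synthetic dataset, seed-set size, and query budget are illustrative assumptions, not details drawn from any paper listed below.

    # Minimal sketch of pool-based active learning (uncertainty sampling).
    # Assumption: synthetic binary-classification data stands in for a real
    # unlabeled pool; in practice a human annotator would supply the labels.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(X), size=20, replace=False))  # seed set
    pool = [i for i in range(len(X)) if i not in labeled]       # unlabeled pool

    model = LogisticRegression(max_iter=1000)
    for _ in range(10):  # illustrative query budget of 10 examples
        model.fit(X[labeled], y[labeled])
        probs = model.predict_proba(X[pool])
        uncertainty = 1.0 - probs.max(axis=1)  # least-confident scoring
        # Move the single pool example the model is least sure about
        # into the labeled set (here its label is already known).
        query = pool.pop(int(np.argmax(uncertainty)))
        labeled.append(query)

    print(f"labeled set grew to {len(labeled)} examples")

Each round retrains on the current labeled set and queries the example the model is least confident about; this query-then-retrain loop is the core of the active-learning approaches the overview refers to.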
Papers
Training Compute-Optimal Protein Language Models
Xingyi Cheng, Bo Chen, Pan Li, Jing Gong, Jie Tang, Le Song
Training on test proteins improves fitness, structure, and function prediction
Anton Bushuiev, Roman Bushuiev, Nikola Zadorozhny, Raman Samusevich, Hannes Stärk, Jiri Sedlar, Tomáš Pluskal, Josef Sivic
Minder: Faulty Machine Detection for Large-scale Distributed Model Training
Yangtao Deng, Xiang Shi, Zhuo Jiang, Xingjian Zhang, Lei Zhang, Zhang Zhang, Bo Li, Zuquan Song, Hang Zhu, Gaohong Liu, Fuliang Li, Shuguang Wang, Haibin Lin, Jianxi Ye, Minlan Yu
Generative AI-based Pipeline Architecture for Increasing Training Efficiency in Intelligent Weed Control Systems
Sourav Modak, Anthony Stein
Generalizability of Memorization Neural Networks
Lijia Yu, Xiao-Shan Gao, Lijun Zhang, Yibo Miao
TextDestroyer: A Training- and Annotation-Free Diffusion Method for Destroying Anomal Text from Images
Mengcheng Li, Mingbao Lin, Fei Chao, Chia-Wen Lin, Rongrong Ji
Training and Evaluating Causal Forecasting Models for Time-Series
Thomas Crasson, Yacine Nabet, Mathias Lécuyer
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong
Global Convergence in Training Large-Scale Transformers
Cheng Gao, Yuan Cao, Zihao Li, Yihan He, Mengdi Wang, Han Liu, Jason Matthew Klusowski, Jianqing Fan
Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation
Chenyang An, Shima Imani, Feng Yao, Chengyu Dong, Ali Abbasi, Harsh Shrivastava, Samuel Buss, Jingbo Shang, Gayathri Mahalingam, Pramod Sharma, Maurice Diesendruck
Kinetix: Investigating the Training of General Agents through Open-Ended Physics-Based Control Tasks
Michael Matthews, Michael Beukman, Chris Lu, Jakob Foerster
On Memorization of Large Language Models in Logical Reasoning
Chulin Xie, Yangsibo Huang, Chiyuan Zhang, Da Yu, Xinyun Chen, Bill Yuchen Lin, Bo Li, Badih Ghazi, Ravi Kumar
Toxicity of the Commons: Curating Open-Source Pre-Training Data
Catherine Arnett, Eliot Jones, Ivan P. Yamshchikov, Pierre-Carl Langlais
Vertical Federated Learning with Missing Features During Training and Inference
Pedro Valdeira, Shiqiang Wang, Yuejie Chi
Dreaming Out Loud: A Self-Synthesis Approach For Training Vision-Language Models With Developmentally Plausible Data
Badr AlKhamissi, Yingtian Tang, Abdülkadir Gökce, Johannes Mehrer, Martin Schrimpf