Training Data
Training data is central to machine learning model development, and current research focuses on improving its quality and efficiency and on mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns; developing algorithms that optimize data selection and usage (e.g., self-paced learning and active learning); and countering problems such as data contamination and class imbalance through data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly affect model performance, generalization, and robustness across applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
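As a concrete illustration of one augmentation technique mentioned above, the sketch below implements mixup-style blending of example pairs. The array shapes, alpha value, and helper name are illustrative assumptions for this page, not the method of any specific paper listed here.

```python
# Minimal sketch of mixup-style data augmentation: blend each example in a
# batch with a randomly paired example, mixing both inputs and one-hot labels.
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Return convex combinations of example pairs from a batch.

    x: (batch, ...) feature array; y: (batch, num_classes) one-hot labels.
    alpha controls the Beta distribution that draws the mixing coefficient.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))        # random pairing of examples
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y + (1.0 - lam) * y[perm]
    return x_mixed, y_mixed

# Toy usage: augment a batch of 4 two-dimensional inputs over 3 classes.
x = np.random.rand(4, 2)
y = np.eye(3)[np.array([0, 2, 1, 0])]
x_aug, y_aug = mixup_batch(x, y)
```

The blended examples act as a simple regularizer: the model is trained on interpolated inputs and correspondingly interpolated label distributions rather than on the raw batch alone.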
Papers
Quantifying the Importance of Data Alignment in Downstream Model Performance
Krrish Chawla, Aryan Sahai, Mario DePavia, Sudharsan Sundar, Brando Miranda
Towards Best Practices for Open Datasets for LLM Training
Stefan Baack, Stella Biderman, Kasia Odrozek, Aviya Skowron, Ayah Bdeir, Jillian Bommarito, Jennifer Ding, Maximilian Gahntz, Paul Keller, Pierre-Carl Langlais, Greg Lindahl, Sebastian Majstorovic, Nik Marda, Guilherme Penedo, Maarten Van Segbroeck, Jennifer Wang, Leandro von Werra, Mitchell Baker, Julie Belião, Kasia Chmielinski, Marzieh Fadaee, Lisa Gutermuth, Hynek Kydlíček, Greg Leppert, EM Lewis-Jong, Solana Larsen, Shayne Longpre, Angela Oduor Lungati, Cullen Miller, Victor Miller, Max Ryabinin, Kathleen Siminyu, Andrew Strait, Mark Surman, Anna Tumadóttir, Maurice Weber, Rebecca Weiss, Lee White, Thomas Wolf
Linearly Convergent Mixup Learning
Gakuto Obi, Ayato Saito, Yuto Sasaki, Tsuyoshi Kato
Stronger Than You Think: Benchmarking Weak Supervision on Realistic Tasks
Tianyi Zhang, Linrong Cai, Jeffrey Li, Nicholas Roberts, Neel Guha, Jinoh Lee, Frederic Sala
Derivation of effective gradient flow equations and dynamical truncation of training data in Deep Learning
Thomas Chen
FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering
Erik Henriksson, Otto Tarkka, Filip Ginter
Breaking Memory Limits: Gradient Wavelet Transform Enhances LLMs Training
Ziqing Wen, Ping Luo, Jiahuan Wang, Xiaoge Deng, Jinping Zou, Kun Yuan, Tao Sun, Dongsheng Li
Investigating the Impact of Observation Space Design Choices On Training Reinforcement Learning Solutions for Spacecraft Problems
Nathaniel Hamilton, Kyle Dunlap, Kerianne L Hobbs
TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos
Korawat Charoenpitaks, Van-Quang Nguyen, Masanori Suganuma, Kentaro Arai, Seiji Totsuka, Hiroshi Ino, Takayuki Okatani
Utility-inspired Reward Transformations Improve Reinforcement Learning Training of Language Models
Roberto-Rafael Maura-Rivero, Chirag Nagpal, Roma Patel, Francesco Visin
TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training
Felix Krause, Timy Phan, Vincent Tao Hu, Björn Ommer
Open set label noise learning with robust sample selection and margin-guided module
Yuandi Zhao, Qianxi Xia, Yang Sun, Zhijie Wen, Liyan Ma, Shihui Ying
MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation
Siddharth Joshi, Besmira Nushi, Vidhisha Balachandran, Varun Chandrasekaran, Vibhav Vineet, Neel Joshi, Baharan Mirzasoleiman
mFabric: An Efficient and Scalable Fabric for Mixture-of-Experts Training
Xudong Liao, Yijun Sun, Han Tian, Xinchen Wan, Yilun Jin, Zilong Wang, Zhenghang Ren, Xinyang Huang, Wenxue Li, Kin Fai Tse, Zhizhen Zhong, Guyue Liu, Ying Zhang, Xiaofeng Ye, Yiming Zhang, Kai Chen
Investigating the Impact of Data Selection Strategies on Language Model Performance
Jiayao Gu, Liting Chen, Yihong Li
An Empirical Study of Accuracy-Robustness Tradeoff and Training Efficiency in Self-Supervised Learning
Fatemeh Ghofrani, Pooyan Jamshidi