Training Data
Training data is crucial to machine learning model development, and current research focuses on improving data quality and efficiency and on mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns; developing algorithms that optimize data selection and usage, such as self-paced learning and active learning; and mitigating issues like data contamination and class imbalance through techniques such as data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly affect model performance, generalization, and robustness, with consequences for applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
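As a concrete illustration of one of the data-selection strategies mentioned above, the sketch below implements pool-based active learning with uncertainty sampling: the model is retrained on a small labeled set and repeatedly queries the unlabeled pool point it is least confident about. The synthetic dataset, logistic-regression model, and query budget are illustrative assumptions, not taken from any of the papers listed here.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling.
# Assumes scikit-learn and NumPy; all choices here (dataset, model, number
# of acquisition rounds) are hypothetical and for illustration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic unlabeled pool plus a small labeled seed set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
labeled = list(rng.choice(len(X), size=20, replace=False))
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(10):  # ten acquisition rounds
    model.fit(X[labeled], y[labeled])
    # Uncertainty sampling: query the pool point whose top predicted
    # class probability is lowest (the least confident prediction).
    probs = model.predict_proba(X[pool])
    uncertainty = 1.0 - probs.max(axis=1)
    query = pool[int(np.argmax(uncertainty))]
    labeled.append(query)  # an oracle would supply the label y[query] here
    pool.remove(query)

print(f"labeled set size after querying: {len(labeled)}")
```

In a real setting the oracle call replaces the direct use of y[query], and the acquisition loop stops when the labeling budget is exhausted or validation accuracy plateaus.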
Papers
Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
Fei Yang, Shuang Peng, Ning Sun, Fangyu Wang, Yuanyuan Wang, Fu Wu, Jiezhong Qiu, Aimin Pan
PneumoLLM: Harnessing the Power of Large Language Model for Pneumoconiosis Diagnosis
Meiyue Song, Zhihua Yu, Jiaxin Wang, Jiarui Wang, Yuting Lu, Baicun Li, Xiaoxu Wang, Qinghua Huang, Zhijun Li, Nikolaos I. Kanellakis, Jiangfeng Liu, Jing Wang, Binglu Wang, Juntao Yang
Data is Overrated: Perceptual Metrics Can Lead Learning in the Absence of Training Data
Tashi Namgyal, Alexander Hepburn, Raul Santos-Rodriguez, Valero Laparra, Jesus Malo
Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction
Zilin Du, Haoxin Li, Xu Guo, Boyang Li
BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks
Stephanie Milani, Anssi Kanervisto, Karolis Ramanauskas, Sander Schulhoff, Brandon Houghton, Rohin Shah
RINAS: Training with Dataset Shuffling Can Be General and Fast
Tianle Zhong, Jiechen Zhao, Xindi Guo, Qiang Su, Geoffrey Fox
Assessing Neural Network Representations During Training Using Noise-Resilient Diffusion Spectral Entropy
Danqi Liao, Chen Liu, Benjamin W. Christensen, Alexander Tong, Guillaume Huguet, Guy Wolf, Maximilian Nickel, Ian Adelstein, Smita Krishnaswamy
A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval
Matthew Gwilliam, Michael Cogswell, Meng Ye, Karan Sikka, Abhinav Shrivastava, Ajay Divakaran
Continuous 16-bit Training: Accelerating 32-bit Pre-Trained Neural Networks
Juyoung Yun
TLDR: Text Based Last-layer Retraining for Debiasing Image Classifiers
Juhyeon Park, Seokhyeon Jeong, Taesup Moon
Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training
Max Milkert, David Hyde, Forrest Laine
End-to-end Joint Punctuated and Normalized ASR with a Limited Amount of Punctuated Training Data
Can Cui (MULTISPEECH), Imran Ahamad Sheikh, Mostafa Sadeghi (MULTISPEECH), Emmanuel Vincent (MULTISPEECH)
UniIR: Training and Benchmarking Universal Multimodal Information Retrievers
Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, Wenhu Chen
Scalable Extraction of Training Data from (Production) Language Models
Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee
Training Chain-of-Thought via Latent-Variable Inference
Du Phan, Matthew D. Hoffman, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, Rif A. Saurous