Training Data
Training data is central to machine learning model development, and current research focuses on improving data quality and efficiency and on mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns; developing algorithms that optimize data selection and usage, such as self-paced learning and active learning; and countering data contamination and class imbalance through data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly affect model performance, generalization, and robustness across applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
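To make the data-selection strategies mentioned above concrete, below is a minimal sketch of pool-based active learning with least-confidence sampling: the model is trained on a small labeled seed set, then repeatedly queries the unlabeled pool for the examples it is least sure about. The synthetic dataset, the choice of scikit-learn's LogisticRegression, and the query budget of 10 examples per round are illustrative assumptions, not details taken from any of the listed papers.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Hypothetical setup: a small labeled seed set and a large unlabeled pool.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    labeled = np.arange(20)           # indices of the labeled seed set
    pool = np.arange(20, len(X))      # indices of the unlabeled pool

    model = LogisticRegression(max_iter=1000)

    for round_ in range(5):
        model.fit(X[labeled], y[labeled])
        # Least-confidence sampling: pick the pool points whose top
        # predicted class probability is lowest (most uncertain).
        proba = model.predict_proba(X[pool])
        uncertainty = 1.0 - proba.max(axis=1)
        query = pool[np.argsort(-uncertainty)[:10]]   # 10 queries per round
        labeled = np.concatenate([labeled, query])    # "label" the queries
        pool = np.setdiff1d(pool, query)

    print(f"labeled set grew to {len(labeled)} examples")

The same loop structure accommodates other acquisition functions (margin or entropy sampling, for instance) by swapping out how the uncertainty score is computed.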
Papers
Training and inference of large language models using 8-bit floating point
Sergio P. Perez, Yan Zhang, James Briggs, Charlie Blake, Josh Levy-Kramer, Paul Balanca, Carlo Luschi, Stephen Barlow, Andrew William Fitzgibbon
Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training
Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, Jun Wang
Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning
William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, Shinji Watanabe
Boosting High Resolution Image Classification with Scaling-up Transformers
Yi Wang
Cross-Validation for Training and Testing Co-occurrence Network Inference Algorithms
Daniel Agyapong, Jeffrey Ryan Propster, Jane Marks, Toby Dylan Hocking
Fixing the problems of deep neural networks will require better training data and learning algorithms
Drew Linsley, Thomas Serre
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He
Physics of Language Models: Part 3.1, Knowledge Storage and Extraction
Zeyuan Allen-Zhu, Yuanzhi Li
REPA: Client Clustering without Training and Data Labels for Improved Federated Learning in Non-IID Settings
Boris Radovič, Veljko Pejović
Semantically Redundant Training Data Removal and Deep Model Classification Performance: A Study with Chest X-rays
Sivaramakrishnan Rajaraman, Ghada Zamzmi, Feng Yang, Zhaohui Liang, Zhiyun Xue, Sameer Antani
Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher LLMs
Jonas Golde, Patrick Haller, Felix Hamborg, Julian Risch, Alan Akbik