Training Data
Training data is central to machine learning model development, and current research focuses on improving data quality and efficiency and on mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns; developing algorithms that optimize data selection and usage (e.g., self-paced learning, active learning); and countering problems such as data contamination and class imbalance through techniques including data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly affect model performance, generalization, and robustness across applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
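To make one of the data-selection techniques named above concrete, here is a minimal sketch of uncertainty-based active learning. It assumes a generic scikit-learn-style classifier exposing predict_proba; the names (model, pool_X, budget) are illustrative and not drawn from any of the papers listed below.

```python
import numpy as np

def select_most_uncertain(model, pool_X, budget):
    """Pick the `budget` unlabeled pool examples the model is least confident about.

    model:  a fitted classifier with a scikit-learn-style predict_proba method
    pool_X: array of unlabeled candidate examples, shape (n_pool, n_features)
    budget: number of examples to send for annotation
    """
    probs = model.predict_proba(pool_X)      # class probabilities, (n_pool, n_classes)
    confidence = probs.max(axis=1)           # top-class probability per example
    return np.argsort(confidence)[:budget]   # indices of the least-confident examples
```

In a typical loop, one would train on a small labeled seed set, label the selected indices, retrain, and repeat, so that annotation effort concentrates where the model is most uncertain.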
Papers
Quantifying Memorization Across Neural Language Models
Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, Chiyuan Zhang
NeuPL: Neural Population Learning
Siqi Liu, Luke Marris, Daniel Hennes, Josh Merel, Nicolas Heess, Thore Graepel
Don't stop the training: continuously-updating self-supervised algorithms best account for auditory responses in the cortex
Pierre Orhan, Yves Boubenec, Jean-Rémi King
Multi-style Training for South African Call Centre Audio
Walter Heymans, Marelie H. Davel, Charl van Heerden
Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data
Yaoqing Yang, Ryan Theisen, Liam Hodgkinson, Joseph E. Gonzalez, Kannan Ramchandran, Charles H. Martin, Michael W. Mahoney
Learning to be a Statistician: Learned Estimator for Number of Distinct Values
Renzhi Wu, Bolin Ding, Xu Chu, Zhewei Wei, Xiening Dai, Tao Guan, Jingren Zhou
On Smart Gaze based Annotation of Histopathology Images for Training of Deep Convolutional Neural Networks
Komal Mariam, Osama Mohammed Afzal, Wajahat Hussain, Muhammad Umar Javed, Amber Kiyani, Nasir Rajpoot, Syed Ali Khurram, Hassan Aqeel Khan
The CORAL++ Algorithm for Unsupervised Domain Adaptation of Speaker Recognition
Rongjin Li, Weibin Zhang, Dongpeng Chen
3PC: Three Point Compressors for Communication-Efficient Distributed Training and a Better Theory for Lazy Aggregation
Peter Richtárik, Igor Sokolov, Ilyas Fatkhullin, Elnur Gasanov, Zhize Li, Eduard Gorbunov