Training Data
Training data is crucial to machine learning model development, and current research focuses on improving data quality and efficiency and on mitigating biases. Active areas include generating synthetic data to address scarcity or privacy concerns, developing algorithms that optimize data selection and usage (e.g., self-paced learning, active learning), and mitigating issues such as data contamination and class imbalance through techniques like data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly affect model performance, generalization, and robustness, with consequences across applications from natural language processing and image recognition to scientific computing and medical diagnosis. A minimal sketch of one of the data-selection strategies mentioned here, uncertainty-based active learning, follows; it is an illustrative toy on synthetic data, not a method from any of the papers below, and all names and parameters in it are assumptions.
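```python
# Minimal sketch of uncertainty-sampling active learning (assumed setup:
# a synthetic 2-D pool with an oracle providing labels on request).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Unlabeled pool of 1,000 points; labels exist but are "hidden" (the oracle).
X_pool = rng.normal(size=(1000, 2))
y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)

# Start from a small randomly labeled seed set.
labeled = list(rng.choice(len(X_pool), size=20, replace=False))

model = LogisticRegression()
for _ in range(5):
    model.fit(X_pool[labeled], y_pool[labeled])
    # Uncertainty sampling: query the points whose predicted class
    # probability is closest to 0.5 (where the model is least confident).
    probs = model.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(probs - 0.5)          # larger = more uncertain
    ranked = np.argsort(uncertainty)[::-1]       # most uncertain first
    already = set(labeled)
    new = [int(i) for i in ranked if i not in already][:10]
    labeled.extend(new)                          # oracle labels the queries

print(f"Labeled set size after 5 query rounds: {len(labeled)}")
```
The design choice to rank by distance from 0.5 is the simplest uncertainty criterion for a binary classifier; margin- or entropy-based scores generalize the same loop to multi-class settings.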
Papers
Fair multilingual vandalism detection system for Wikipedia
Mykola Trokhymovych, Muniza Aslam, Ai-Jou Chou, Ricardo Baeza-Yates, Diego Saez-Trumper
VoteTRANS: Detecting Adversarial Text without Training by Voting on Hard Labels of Transformations
Hoang-Quoc Nguyen-Son, Seira Hidano, Kazuhide Fukushima, Shinsaku Kiyomoto, Isao Echizen
Inconsistency, Instability, and Generalization Gap of Deep Neural Network Training
Rie Johnson, Tong Zhang
A Bayesian Approach To Analysing Training Data Attribution In Deep Learning
Elisa Nguyen, Minjoon Seo, Seong Joon Oh
FedCSD: A Federated Learning Based Approach for Code-Smell Detection
Sadi Alawadi, Khalid Alkharabsheh, Fahed Alkhabbas, Victor Kebande, Feras M. Awaysheh, Fabio Palomba, Mohammed Awad
End-to-end Training of Deep Boltzmann Machines by Unbiased Contrastive Divergence with Local Mode Initialization
Shohei Taniguchi, Masahiro Suzuki, Yusuke Iwasawa, Yutaka Matsuo
Transfer Learning for Power Outage Detection Task with Limited Training Data
Olukunle Owolabi
Repeated Random Sampling for Minimizing the Time-to-Accuracy of Learning
Patrik Okanovic, Roger Waleffe, Vasilis Mageirakos, Konstantinos E. Nikolakakis, Amin Karbasi, Dionysis Kalogerias, Nezihe Merve Gürel, Theodoros Rekatsinas
Adaptive Sparsity Level during Training for Efficient Time Series Forecasting with Transformers
Zahra Atashgahi, Mykola Pechenizkiy, Raymond Veldhuis, Decebal Constantin Mocanu
Double Descent and Overfitting under Noisy Inputs and Distribution Shift for Linear Denoisers
Chinmaya Kausik, Kashvi Srivastava, Rishi Sonthalia
Training Socially Aligned Language Models on Simulated Social Interactions
Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M. Dai, Diyi Yang, Soroush Vosoughi
An Investigation of Noise in Morphological Inflection
Adam Wiemerslage, Changbing Yang, Garrett Nicolai, Miikka Silfverberg, Katharina Kann