Training Data
Training data is crucial for machine learning model development, and current research focuses on improving data quality and efficiency while mitigating biases. Active areas include generating synthetic data to address scarcity or privacy concerns; developing algorithms that optimize data selection and usage (e.g., self-paced learning, active learning); and mitigating issues such as data contamination and class imbalance through data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly affect model performance, generalization, and robustness, with consequences for applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
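To make one of the techniques named above concrete, here is a minimal sketch of data augmentation for class imbalance: the minority class is oversampled by adding small Gaussian jitter to existing feature vectors. The function `augment_minority` and its parameters are illustrative assumptions, not taken from any of the papers listed below; real pipelines use richer augmentations (e.g., SMOTE, image transforms, back-translation for text).

```python
import random

def augment_minority(samples, labels, target_label, factor=2, noise=0.05, seed=0):
    """Oversample a minority class by jittering its feature vectors.

    samples: list of numeric feature vectors
    labels:  parallel list of class labels
    factor:  how many jittered copies to add per existing minority sample
    """
    rng = random.Random(seed)
    minority = [x for x, y in zip(samples, labels) if y == target_label]
    aug_x, aug_y = list(samples), list(labels)
    for _ in range(factor * len(minority)):
        base = rng.choice(minority)
        # Add zero-mean Gaussian noise to each feature of a random minority sample.
        aug_x.append([v + rng.gauss(0.0, noise) for v in base])
        aug_y.append(target_label)
    return aug_x, aug_y

# Toy imbalanced dataset: four majority samples (label 0), one minority (label 1).
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.15, 0.25], [0.9, 0.8]]
y = [0, 0, 0, 0, 1]
X_aug, y_aug = augment_minority(X, y, target_label=1, factor=2)
print(len(X_aug), y_aug.count(1))  # 7 3 (two jittered minority copies added)
```

Jittered oversampling only interpolates around existing minority points, so it can overfit to them; the papers in this collection explore more principled alternatives such as learned augmentation policies and loss reweighting.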
Papers
Addressing Bias in Visualization Recommenders by Identifying Trends in Training Data: Improving VizML Through a Statistical Analysis of the Plotly Community Feed
Allen Tu, Priyanka Mehta, Alexander Wu, Nandhini Krishnan, Amar Mujumdar
Training from a Better Start Point: Active Self-Semi-Supervised Learning for Few Labeled Samples
Ziting Wen, Oscar Pizarro, Stefan Williams
Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation
Hemlata Tak, Massimiliano Todisco, Xin Wang, Jee-weon Jung, Junichi Yamagishi, Nicholas Evans
Clarifying MCMC-based training of modern EBMs : Contrastive Divergence versus Maximum Likelihood
Léo Gagnon, Guillaume Lajoie
Spanish and English Phoneme Recognition by Training on Simulated Classroom Audio Recordings of Collaborative Learning Environments
Mario Esparza
Items from Psychometric Tests as Training Data for Personality Profiling Models of Twitter Users
Anne Kreuter, Kai Sassenberg, Roman Klinger
A new data augmentation method for intent classification enhancement and its application on spoken conversation datasets
Zvi Kons, Aharon Satt, Hong-Kwang Kuo, Samuel Thomas, Boaz Carmeli, Ron Hoory, Brian Kingsbury
BERT WEAVER: Using WEight AVERaging to enable lifelong learning for transformer-based models in biomedical semantic search engines
Lisa Kühnel, Alexander Schulz, Barbara Hammer, Juliane Fluck
Enabling On-Device Smartphone GPU based Training: Lessons Learned
Anish Das, Young D. Kwon, Jagmohan Chauhan, Cecilia Mascolo