Training Data
Training data is central to machine learning model development, and current research focuses on improving data quality, using data more efficiently, and mitigating biases. Active areas include generating synthetic data to address scarcity or privacy concerns, developing algorithms that optimize data selection and usage (e.g., self-paced learning, active learning), and mitigating issues such as data contamination and imbalance through techniques including data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data significantly affect model performance, generalization, and robustness across applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
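As a rough illustration of one data selection idea mentioned above, the sketch below shows entropy-based uncertainty sampling, a common active learning heuristic. It is a minimal example under stated assumptions, not the method of any paper listed here: the function names `entropy` and `select_for_labeling`, the `budget` parameter, and the toy probabilities are all hypothetical, and the predictions are assumed to come from some already-trained classifier.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(predictions, budget):
    """Return indices of the `budget` most uncertain unlabeled examples.

    `predictions` is a list of per-example class-probability lists,
    assumed to be produced by an existing model (hypothetical here).
    """
    ranked = sorted(
        range(len(predictions)),
        key=lambda i: entropy(predictions[i]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:budget]


# Toy predicted probabilities for four unlabeled examples (made-up values).
pool = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # nearly uniform, so highly uncertain
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
]

chosen = select_for_labeling(pool, budget=2)
print(chosen)
```

The selected examples would then be sent for labeling and added to the training set before retraining. Entropy is only one possible uncertainty score; margin-based or ensemble-disagreement scores are common alternatives.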
Papers
Training and Meta-Evaluating Machine Translation Evaluation Metrics at the Paragraph Level
Daniel Deutsch, Juraj Juraska, Mara Finkelstein, Markus Freitag
On the Impact of Language Selection for Training and Evaluating Programming Language Models
Jonathan Katzy, Maliheh Izadi, Arie van Deursen
A Generic Machine Learning Framework for Fully-Unsupervised Anomaly Detection with Contaminated Data
Markus Ulmer, Jannik Zgraggen, Lilach Goren Huber
Training normalizing flows with computationally intensive target probability distributions
Piotr Bialas, Piotr Korcyl, Tomasz Stebel
Stabilizing Training with Soft Dynamic Time Warping: A Case Study for Pitch Class Estimation with Weakly Aligned Targets
Johannes Zeitler, Simon Deniffel, Michael Krause, Meinard Müller
OpenProteinSet: Training data for structural biology at scale
Gustaf Ahdritz, Nazim Bouatta, Sachin Kadyan, Lukas Jarosch, Daniel Berenberg, Ian Fisk, Andrew M. Watkins, Stephen Ra, Richard Bonneau, Mohammed AlQuraishi
Can We Transfer Noise Patterns? A Multi-environment Spectrum Analysis Model Using Generated Cases
Haiwen Du, Zheng Ju, Yu An, Honghui Du, Dongjie Zhu, Zhaoshuo Tian, Aonghus Lawlor, Ruihai Dong
SYNAuG: Exploiting Synthetic Data for Data Imbalance Problems
Moon Ye-Bin, Nam Hyeon-Woo, Wonseok Choi, Nayeong Kim, Suha Kwak, Tae-Hyun Oh