Training Data
Training data is central to machine learning model development, and current research focuses on improving data quality and efficiency and on mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns; developing algorithms that optimize data selection and usage (e.g., self-paced learning, active learning); and mitigating issues such as data contamination and class imbalance through techniques like data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly affect model performance, generalization, and robustness, with consequences for applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
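As a concrete illustration of the data-selection ideas mentioned above, the sketch below shows least-confidence uncertainty sampling, one common active-learning strategy: unlabeled examples whose predicted class probabilities are least confident are prioritized for labeling. This is a minimal, generic sketch, not taken from any of the papers listed here; the `toy_predict` function and the numeric pool are hypothetical stand-ins for a real model and dataset.

```python
def least_confidence(probs):
    """Uncertainty score: 1 minus the top class probability (higher = less sure)."""
    return 1.0 - max(probs)

def select_for_labeling(pool, predict, budget):
    """Rank unlabeled examples by uncertainty and return the `budget` most uncertain."""
    scored = [(least_confidence(predict(x)), x) for x in pool]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [x for _, x in scored[:budget]]

# Hypothetical binary classifier: inputs near 0.5 get uncertain predictions.
def toy_predict(x):
    p = min(max(x, 0.0), 1.0)
    return [p, 1.0 - p]

pool = [0.05, 0.48, 0.95, 0.52, 0.10]
print(sorted(select_for_labeling(pool, toy_predict, budget=2)))  # → [0.48, 0.52]
```

The same skeleton accommodates other acquisition functions (e.g., margin or entropy sampling) by swapping out `least_confidence`, which is why uncertainty-based selection is often the first baseline tried when labeling budgets are tight.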
Papers
Explaining the effects of non-convergent sampling in the training of Energy-Based Models
Elisabeth Agoritsas, Giovanni Catania, Aurélien Decelle, Beatriz Seoane
AttMEMO: Accelerating Transformers with Memoization on Big Memory Systems
Yuan Feng, Hyeran Jeon, Filip Blagojevic, Cyril Guyot, Qing Li, Dong Li
CiT: Curation in Training for Effective Vision-Language Data
Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
Deep Learning for Breast MRI Style Transfer with Limited Training Data
Shixing Cao, Nicholas Konz, James Duncan, Maciej A. Mazurowski
Accuracy and Fidelity Comparison of Luna and DALL-E 2 Diffusion-Based Image Generation Systems
Michael Cahyadi, Muhammad Rafi, William Shan, Jurike Moniaga, Henry Lucky