Training Data
Training data is crucial to machine-learning model development, and current research focuses on improving data quality and efficiency and on mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns; developing algorithms that optimize data selection and usage (e.g., self-paced learning, active learning); and mitigating issues such as data contamination and class imbalance through techniques like data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data significantly affect model performance, generalization, and robustness across applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
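One of the data-selection strategies mentioned above, active learning, is often driven by uncertainty sampling: the model requests labels for the pool examples it is least confident about. A minimal sketch of that acquisition step, assuming a hypothetical `pool_probs` array of per-example predicted class distributions (the function and variable names here are illustrative, not from any specific paper in this list):

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(pool_probs, k):
    """Return indices of the k unlabeled examples whose predicted
    distributions have the highest entropy (uncertainty sampling)."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

# Hypothetical model outputs for four unlabeled examples
pool = [
    [0.98, 0.02],  # confident
    [0.55, 0.45],  # uncertain
    [0.50, 0.50],  # maximally uncertain
    [0.90, 0.10],  # fairly confident
]
print(select_most_uncertain(pool, 2))  # → [2, 1]
```

In a full active-learning loop, the selected examples would be sent for annotation, added to the training set, and the model retrained before the next round of selection.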
Papers
Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs
Shadi Iskander, Nachshon Cohen, Zohar Karnin, Ori Shapira, Sofia Tolmach
Federated Large Language Models: Current Progress and Future Directions
Yuhang Yao, Jianyi Zhang, Junda Wu, Chengkai Huang, Yu Xia, Tong Yu, Ruiyi Zhang, Sungchul Kim, Ryan Rossi, Ang Li, Lina Yao, Julian McAuley+2
Measuring Copyright Risks of Large Language Model via Partial Information Probing
Weijie Zhao, Huajie Shao, Zhaozhuo Xu, Suzhen Duan, Denghui Zhang
Data Diet: Can Trimming PET/CT Datasets Enhance Lesion Segmentation?
Alexander Jaus, Simon Reiß, Jens Kleesiek, Rainer Stiefelhagen
Validity of Feature Importance in Low-Performing Machine Learning for Tabular Biomedical Data
Youngro Lee, Giacomo Baruzzo, Jeonghwan Kim, Jongmo Seo, Barbara Di Camillo
Revisiting Semi-supervised Adversarial Robustness via Noise-aware Online Robust Distillation
Tsung-Han Wu, Hung-Ting Su, Shang-Tse Chen, Winston H. Hsu
Domain-stratified Training for Cross-organ and Cross-scanner Adenocarcinoma Segmentation in the COSAS 2024 Challenge
Huang Jiayan, Ji Zheng, Kuang Jinbo, Xu Shuoyu
Extracting Memorized Training Data via Decomposition
Ellen Su, Anu Vellore, Amy Chang, Raffaele Mura, Blaine Nelson, Paul Kassianik, Amin Karbasi
Promise and Peril of Collaborative Code Generation Models: Balancing Effectiveness and Memorization
Zhi Chen, Lingxiao Jiang
Accelerating the Training and Improving the Reliability of Machine-Learned Interatomic Potentials for Strongly Anharmonic Materials through Active Learning
Kisung Kang, Thomas A. R. Purcell, Christian Carbogno, Matthias Scheffler
Augment, Drop & Swap: Improving Diversity in LLM Captions for Efficient Music-Text Representation Learning
Ilaria Manco, Justin Salamon, Oriol Nieto
Training Datasets Generation for Machine Learning: Application to Vision Based Navigation
Jérémy Lebreton, Ingo Ahrns, Roland Brochard, Christoph Haskamp, Hans Krüger, Matthieu Le Goff, Nicolas Menga, Nicolas Ollagnier, Ralf Regele+2
Linear Recency Bias During Training Improves Transformers' Fit to Reading Times
Christian Clark, Byung-Doh Oh, William Schuler
Volvo Discovery Challenge at ECML-PKDD 2024
Mahmoud Rahat, Peyman Sheikholharam Mashhadi, Sławomir Nowaczyk, Shamik Choudhury, Leo Petrin, Thorsteinn Rognvaldsson, Andreas Voskou+2