Training Data
Training data is central to machine learning model development, and current research focuses on improving data quality and efficiency and on mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns; developing algorithms that optimize data selection and usage (e.g., self-paced learning, active learning); and mitigating issues such as data contamination and class imbalance through data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data strongly affect model performance, generalization, and robustness across applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
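As a toy illustration of the uncertainty-based active learning mentioned above (and not taken from any paper listed below), the following minimal sketch selects the unlabeled pool examples a model is least certain about for annotation. The function names and the synthetic probabilities are hypothetical; it assumes only NumPy.

```python
import numpy as np

def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Per-example entropy of predicted class probabilities; higher means less certain."""
    eps = 1e-12  # guard against log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_for_labeling(pool_probs: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` most uncertain pool examples to send for annotation."""
    scores = predictive_entropy(pool_probs)
    return np.argsort(scores)[::-1][:budget]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical predicted probabilities for a 1000-example, 3-class unlabeled pool.
    logits = rng.normal(size=(1000, 3))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    chosen = select_for_labeling(probs, budget=32)
    print("Indices queued for annotation:", chosen[:10])
```

In practice the selection criterion varies (margin sampling, ensemble disagreement, expected model change), but the loop is the same: score the unlabeled pool with the current model, label the most informative examples, retrain, and repeat.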
Papers
Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs
Shadi Iskander, Nachshon Cohen, Zohar Karnin, Ori Shapira, Sofia Tolmach
Federated Large Language Models: Current Progress and Future Directions
Yuhang Yao, Jianyi Zhang, Junda Wu, Chengkai Huang, Yu Xia, Tong Yu, Ruiyi Zhang, Sungchul Kim, Ryan Rossi, Ang Li, Lina Yao, Julian McAuley, Yiran Chen, Carlee Joe-Wong
Measuring Copyright Risks of Large Language Model via Partial Information Probing
Weijie Zhao, Huajie Shao, Zhaozhuo Xu, Suzhen Duan, Denghui Zhang
Data Diet: Can Trimming PET/CT Datasets Enhance Lesion Segmentation?
Alexander Jaus, Simon Reiß, Jens Kleesiek, Rainer Stiefelhagen
Validity of Feature Importance in Low-Performing Machine Learning for Tabular Biomedical Data
Youngro Lee, Giacomo Baruzzo, Jeonghwan Kim, Jongmo Seo, Barbara Di Camillo
Revisiting Semi-supervised Adversarial Robustness via Noise-aware Online Robust Distillation
Tsung-Han Wu, Hung-Ting Su, Shang-Tse Chen, Winston H. Hsu
Domain-stratified Training for Cross-organ and Cross-scanner Adenocarcinoma Segmentation in the COSAS 2024 Challenge
Huang Jiayan, Ji Zheng, Kuang Jinbo, Xu Shuoyu
Extracting Memorized Training Data via Decomposition
Ellen Su, Anu Vellore, Amy Chang, Raffaele Mura, Blaine Nelson, Paul Kassianik, Amin Karbasi
Promise and Peril of Collaborative Code Generation Models: Balancing Effectiveness and Memorization
Zhi Chen, Lingxiao Jiang
Accelerating the Training and Improving the Reliability of Machine-Learned Interatomic Potentials for Strongly Anharmonic Materials through Active Learning
Kisung Kang, Thomas A. R. Purcell, Christian Carbogno, Matthias Scheffler
Augment, Drop & Swap: Improving Diversity in LLM Captions for Efficient Music-Text Representation Learning
Ilaria Manco, Justin Salamon, Oriol Nieto
Training Datasets Generation for Machine Learning: Application to Vision Based Navigation
Jérémy Lebreton, Ingo Ahrns, Roland Brochard, Christoph Haskamp, Matthieu Le Goff, Nicolas Menga, Nicolas Ollagnier, Ralf Regele, Francesco Capolupo, Massimo Casasco
Linear Recency Bias During Training Improves Transformers' Fit to Reading Times
Christian Clark, Byung-Doh Oh, William Schuler
Volvo Discovery Challenge at ECML-PKDD 2024
Mahmoud Rahat, Peyman Sheikholharam Mashhadi, Sławomir Nowaczyk, Shamik Choudhury, Leo Petrin, Thorsteinn Rognvaldsson, Andreas Voskou, Carlo Metta, Claudio Savelli