Training Data
Training data is crucial for machine learning model development, with current research focusing on improving data quality, efficiency, and mitigating biases. Active areas include generating synthetic data to address scarcity or privacy concerns, developing algorithms to optimize data selection and usage (e.g., self-paced learning, active learning), and mitigating issues like data contamination and imbalance through techniques such as data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data significantly impact model performance, generalization, and robustness, influencing various applications from natural language processing and image recognition to scientific computing and medical diagnosis.
Papers
Regurgitative Training: The Value of Real Data in Training Large Language Models
Jinghui Zhang, Dandan Qiao, Mochen Yang, Qiang Wei
ITEM: Improving Training and Evaluation of Message-Passing based GNNs for top-k recommendation
Yannis Karmim, Elias Ramzi, Raphaël Fournier-S'niehotta, Nicolas Thome
PII-Compass: Guiding LLM training data extraction prompts towards the target PII via grounding
Krishna Kanth Nakka, Ahmed Frikha, Ricardo Mendes, Xue Jiang, Xuebing Zhou
SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs
Xin Su, Man Luo, Kris W Pan, Tien Pei Chou, Vasudev Lal, Phillip Howard
FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models
Saeed Rashidi, William Won, Sudarshan Srinivasan, Puneet Gupta, Tushar Krishna
Data Generation Using Large Language Models for Text Classification: An Empirical Case Study
Yinheng Li, Rogerio Bonatti, Sara Abdali, Justin Wagle, Kazuhito Koishida
High-resolution segmentations of the hypothalamus and its subregions for training of segmentation models
Livia Rodrigues, Martina Bocchetta, Oula Puonti, Douglas Greve, Ana Carolina Londe, Marcondes França, Simone Appenzeller, Leticia Rittner, Juan Eugenio Iglesias
AutoPureData: Automated Filtering of Web Data for LLM Fine-tuning
Praneeth Vadlapati