Training Data
Training data is crucial to machine learning model development, and current research focuses on improving data quality and efficiency and on mitigating bias. Active areas include generating synthetic data to address scarcity or privacy concerns, developing algorithms that optimize data selection and usage (e.g., self-paced learning, active learning), and mitigating issues such as data contamination and class imbalance through techniques like data augmentation, selective parameter merging, and novel loss functions. The quality and characteristics of training data significantly affect model performance, generalization, and robustness across applications ranging from natural language processing and image recognition to scientific computing and medical diagnosis.
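To make one of these data-selection ideas concrete, the sketch below shows uncertainty-based active learning: a model is trained on a small labeled pool, and the unlabeled examples it is least confident about are queried for labeling next. This is a minimal illustration, not drawn from any paper listed here; it assumes scikit-learn and its bundled digits dataset, and the pool size, query batch size, and number of rounds are arbitrary choices.

```python
# Minimal sketch of uncertainty-based active learning (illustrative only).
# Assumes scikit-learn; dataset, pool sizes, and query size are arbitrary.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)

# Start with a small labeled pool; treat the remaining points as unlabeled.
labeled = rng.choice(len(X), size=50, replace=False)
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)

model = LogisticRegression(max_iter=1000)
for rnd in range(5):
    model.fit(X[labeled], y[labeled])
    # Uncertainty sampling: query points whose top class probability is lowest.
    probs = model.predict_proba(X[unlabeled])
    uncertainty = 1.0 - probs.max(axis=1)
    query = unlabeled[np.argsort(uncertainty)[-25:]]  # 25 least-confident points
    labeled = np.concatenate([labeled, query])
    unlabeled = np.setdiff1d(unlabeled, query)
    print(f"round {rnd}: {len(labeled)} labeled, acc={model.score(X, y):.3f}")
```

The same loop structure accommodates other query strategies (margin or entropy sampling, for instance) by swapping out the uncertainty score.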
Papers
SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs
Xin Su, Man Luo, Kris W Pan, Tien Pei Chou, Vasudev Lal, Phillip Howard
FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models
Saeed Rashidi, William Won, Sudarshan Srinivasan, Puneet Gupta, Tushar Krishna
Data Generation Using Large Language Models for Text Classification: An Empirical Case Study
Yinheng Li, Rogerio Bonatti, Sara Abdali, Justin Wagle, Kazuhito Koishida
High-resolution segmentations of the hypothalamus and its subregions for training of segmentation models
Livia Rodrigues, Martina Bocchetta, Oula Puonti, Douglas Greve, Ana Carolina Londe, Marcondes França, Simone Appenzeller, Leticia Rittner +1
AutoPureData: Automated Filtering of Undesirable Web Data to Update LLM Knowledge
Praneeth Vadlapati
Aligning Teacher with Student Preferences for Tailored Training Data Generation
Yantao Liu, Zhao Zhang, Zijun Yao, Shulin Cao, Lei Hou, Juanzi Li
Time Matters: Scaling Laws for Any Budget
Itay Inbar, Luke Sernau
Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training
Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang
Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels
Nicholas Pangakis, Samuel Wolken
Distributed Training of Large Graph Neural Networks with Variable Communication Rates
Juan Cervino, Md Asadullah Turja, Hesham Mostafa, Nageen Himayat, Alejandro Ribeiro
DataFreeShield: Defending Adversarial Attacks without Training Data
Hyeyoon Lee, Kanghyun Choi, Dain Kwon, Sunjong Park, Mayoore Selvarasa Jaiswal, Noseong Park, Jonghyun Choi, Jinho Lee
DEM: Distribution Edited Model for Training with Mixed Data Distributions
Dhananjay Ram, Aditya Rawal, Momchil Hardalov, Nikolaos Pappas, Sheng Zha
Towards Exact Gradient-based Training on Analog In-memory Computing
Zhaoxian Wu, Tayfun Gokmen, Malte J. Rasch, Tianyi Chen
Extracting Training Data from Unconditional Diffusion Models
Yunhao Chen, Xingjun Ma, Difan Zou, Yu-Gang Jiang
Enhancing Spatio-temporal Quantile Forecasting with Curriculum Learning: Lessons Learned
Du Yin, Jinliang Deng, Shuang Ao, Zechen Li, Hao Xue, Arian Prabowo, Renhe Jiang, Xuan Song, Flora Salim
The Heterophilic Snowflake Hypothesis: Training and Empowering GNNs for Heterophilic Graphs
Kun Wang, Guibin Zhang, Xinnan Zhang, Junfeng Fang, Xun Wu, Guohao Li, Shirui Pan, Wei Huang, Yuxuan Liang